Does AI vision model Gemma 3 27B know about electronics? Artificial intelligence describes the image

p.kaczmarek2 2703 10

Treść została przetłumaczona

Zobacz oryginalną wersję tematu

Report a violation of the law

Reply Cool? Ranking DIY | New topic

Notify about new articles

📢 Listen (AI):

» | Topic author Helpful post? (+4)

Post #1
21490570 22 Mar 2025 16:59

.
Gemma 3 is the latest in a series of open multimodal LLMs from Google, based on the same technology as Gemini 2.0. Chatbots based on Gemma 3 not only operate with text, but can also describe images. This is where I will try to test this in terms of images relating to electronics. I will send Gemma images and it will describe what it sees in them. Can today's AI read? Let's find out!

Gemma 3 came out a mere 10 days ago, but is already supported by the Ollama environment. I will also be running it in it. All on my laptop of course. Specifications:
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, 64GB RAM, GeForce GTX 1060

I'm running the tests on the largest Gemma 3 model - version 27B.

Gemma 3 tests .
Here are the tests carried out, everything I have checked is shown here - I am not omitting any results. The order in which I tested is as follows. First a photo/screenshot, then my commentary.
.
Good! He even read the current correctly - 0.4726A. How did he even spot the DC motor? Maybe he guessed "by feel". He didn't recognise the vehicle, but it's faintly visible.
.
.
Good! Looks like he met everything correctly.
.
.
OK, just is it the brand?
.

Good! It read the voltage and current correctly, although the Twintex TP-1603 converted to Toptek, but the writing is blurred, so can't be blamed.
.
.
Good! Only where is that screwdriver and connectors? But even the marker met.
.
.
Good! He also recognised that it was two pieces.

Good! He recognised what the equipment was, read the temperature, even identified the buttons....

.
.
Good! He even read the button lettering and translated it from German....
.
.
.
Good! He read the temperature and humidity correctly, only the emoticon seems to have confused him?
.
.
.
Failure! The protective plastic (semi-transparent) confused him and it didn't come out correctly. It is after all an electrical socket, not a sensor....
.

.
Disaster! It is, after all, an old monitor and not a blackboard, although.... aaf2815 I must admit that it somewhat resembles this board.
.
.
Failure! It's an ATX breakout board, but can you blame it? As little known boards are unlikely to mate.... The voltages read correctly.
.
.
Medium. Almost hit the mark. Why didn't it read 23.22?

.
Failure! It's a tester - an artificial load of LD25/LD35, not a laser....

Good! It is indeed an inverter.

.
OK, it is indeed a PS2. Admittedly probably a PS2 Slim, but that's less important....
.
.
.
Good! Where did he read so many things from? This model must indeed have a good ability to understand text from images. Although... he made a typo in the disk model - 329BA instead of 32B9A.... also the Polish language he knows, the buffer size he also read, very well.
.
.
.
Good! I think he read everything correctly.

Good! Here, too, he seems to have read flawlessly....

Wrong, he messed up here. I wanted him to write to me that this capacitor is shorted, therefore faulty, and he elaborated on the fact that it's not a capacitance measurement.... in addition he added zeros in front of 63.
.

Good! He even guessed it was Zigbee. Probably after that "ZB". Only thing I disagree with is Ebyte, how does he know that?
The link he gave dead/fictional.
.
.
Good!

.
Good! Only where did the wire stripper come from? Did he take a box buckle for it?
.
.
Wrong: it's a different board, and didn't recognise the DHT11.
.
.
Medium! He read the markings correctly, but how did he decide that this transistor was a BT module?
.
.
Medium! This is the XR806 WXU module, not the ESP32-WROOM....

.
Well, even those Heizk sensed....

.
Good! Even 12ESP read.
.
.
.
Good! It correctly recognised the batteries and read the caption, admittedly it misidentified the device as a weather station, but from that perspective I don't think it's a mistake.
.
.
.
OK, although a typo crept in - it read 9 instead of S. The rest ok, even 103 and 221 from the resistors it read....

.
OK, it read everything correctly, although the link gave a bogus one.
.
.
.
Well, he recognised everything, although he did not write that it was Tuya. He also translated the Polish subtitles.
.

.
Medium. It went from an 8 to an R, and yes it seemingly read the name, but mistook the equipment for a PDA. It didn't read the SAIA logo.
The link to YT is dead, to the description too.
.
.
.
Right. I think it read all ok.
.
.
Right. He read "Twintex" correctly this time. He embraced everything...
.
.
Right. It read almost everything, even 1701 and 1707. The only thing it didn't mention was that there are two versions here - 8E and 16E.
The link doesn't work.
.
.
Medium. Too poor contrast and didn't grasp that it was 33. He thought it was 88.
.
.
Right. This time he embraced that it was a PLC controller from SAIA. He also read some of the text from the display.
.
.
Good. Recognised and described the prongs as far as possible.
.
.
.
Well, although it is a BMP280 rather than a BME280. Interestingly, he even described what the BME is used for and gave its leads.
.
.
Right. He probably described everything. The only thing I think he misstated was the number of mice, as there are two, not one....
.

.
Well, it even read the curved description of Kratos.
.
.
Good, although probably bogus links. He recognised that it was DHT.
.
.
OK, he recognised that it was a WiFi router and power supply.
.
.
Medium. Here's the cassette, it's not a blank. The rest ok.
.
.
.
Medium. Where did he get 120W from?
.
.
Good! Recognised the equipment type (read the designation) and also embraced the reading.
.
.
Fine, although the brand is more like Lund.
.
.
.
OK, although he went from 0 to O in the model, it happens to me too.
.
.
.
Medium. Although. He probably had no way of determining that it was a laptop. CR2032 3V he recognised, he read. So ok, and where's the damage? Because parts are missing?
.
.
Right. He recognised that it was probably some RTX modules based on Tuya.
.
.
Right. He met/read what the equipment was.
.
.
Good, although a little overdone with this RF. He read the voltage well.
.
.
OK, also read the inscription, although it's not a VFD.
.
.
.
Wrong: it's not the graphics card. Did the perspective confuse him? Where does he see the RTX 3060 sign? Although he did recognise the Arduino.
.
.
Good! Recognised both pieces and described what they are used for.
.
.
.
Good! Again, he read so much correctly.

.
.
Good! Again, it read correctly and even deciphered the -I/P abbreviation. Interestingly, it gave the correct link.
.
Right. It also read 3F.
.
.
.
Medium, it's SANJI not SAMI and PC1 is not a connector but an optocoupler. Also this model number... and. reversed text read as LITZE V@ AYUKED?
.
.
Okay, that's PS1.
.
.
Okay, it's an Atari 800XL.
.
.
OK, it's a solar controller.
.
.
OK, he got to know the new Arduino, although was it that difficult?

.
Bad, as you can see the LVDS connector does not know?

.
Wrong, even the capacity reads wrong. There it is 0.95 not 10.... I wonder if he accidentally took MOT (Microwave Oven Transformer) for MOTOR (motor), or maybe the tokens got mixed up because the name is similar?
.

.
Well, ESR also read as I asked.
.
.
Well, he even determined that something was wrong with these capacitors.
.
.
Well, he recognised that it was a tuner.... only this "black connector"...
.

Badly, the electron gun from the CRT however, Gemma does not know.
.

Wrong, and supposedly so simple - a light switch.

.
Right.
.

Image recognition time .
For my hardware (specs at the beginning of the topic), the times are twofold - it depends if something is occupying the GPU. Unless I'm turning nothing else on the laptop, CUDA kicks in and the model runs faster.
CUDA times:
Code: text Expand Select all Copy to clipboard
ime=2025-03-22T12:54:05.803+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1, VMM: yes load_backend: loaded CUDA backend from W:\TOOLS\ollama-windows-amd64\lib\ollama\cuda_v12\ggml-cuda.dll load_backend: loaded CPU backend from W:\TOOLS\ollama-windows-amd64\lib\ollama\ggml-cpu-haswell.dll time=2025-03-22T12:54:06.369+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2025-03-22T12:54:06.496+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="16.8 GiB" time=2025-03-22T12:54:06.496+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="502.7 MiB" time=2025-03-22T12:54:28.892+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 time=2025-03-22T12:54:28.892+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host time=2025-03-22T12:54:28.894+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-22T12:54:28.901+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-03-22T12:54:28.905+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-03-22T12:54:29.128+01:00 level=INFO source=server.go:619 msg="llama runner started in 23.58 seconds" [GIN] 2025/03/22 - 12:54:52 | 200 | 47.8405856s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:00:33 | 200 | 4m20s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:00:57 | 200 | 24.673879s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:01:26 | 200 | 28.8113443s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:05:36 | 200 | 2m56s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:06:00 | 200 | 24.0392796s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:06:28 | 200 | 27.8827066s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:11:07 | 200 | 23.0458191s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:17:15 | 200 | 6m6s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:17:48 | 200 | 32.5789072s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:18:23 | 200 | 33.5782454s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:26:32 | 200 | 4m41s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:27:08 | 200 | 35.8528721s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:27:38 | 200 | 30.0081765s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:34:09 | 200 | 3m38s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:34:33 | 200 | 23.8713585s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:35:02 | 200 | 28.2681217s | 192.168.0.213 | POST "/api/chat"
.

CPU times:
Code: text Expand Select all Copy to clipboard
time=2025-03-22T13:42:18.212+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36 load_backend: loaded CPU backend from W:\TOOLS\ollama-windows-amd64\lib\ollama\ggml-cpu-haswell.dll time=2025-03-22T13:42:18.249+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang) time=2025-03-22T13:42:18.262+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="17.3 GiB" time=2025-03-22T13:42:28.245+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CPU time=2025-03-22T13:42:28.248+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-22T13:42:28.255+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-03-22T13:42:28.261+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-03-22T13:42:28.424+01:00 level=INFO source=server.go:619 msg="llama runner started in 10.48 seconds" [GIN] 2025/03/22 - 13:52:32 | 200 | 10m15s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 13:57:10 | 200 | 4m37s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 14:03:48 | 200 | 8m58s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 14:08:04 | 200 | 10m53s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 14:11:44 | 200 | 7m56s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 14:14:56 | 200 | 9m23s | 192.168.0.213 | POST "/api/chat" [GIN] 2025/03/22 - 14:22:46 | 200 | 16m42s | 192.168.0.213 | POST "/api/chat"

Summary .
Impressive. There are shortcomings, but it is still impressive. This model is definitely better than the previously tested LLAVA , and with the help of the GPU (via CUDA) it still goes pretty fast.
Hallucinations still happen, especially in unclear situations, but it's still very good.
What's more, this model does a pretty good job of reading text and all sorts of markings and lettering, too, from hardware and from displays.
This is really a great deal, considering that I ran the whole thing on my old laptop.
Do you see any uses for such an AI model? .
If you have any pictures to test, feel free to comment too.

Cool? Ranking DIY
Helpful post? Buy me a coffee.
About Author
p.kaczmarek2 p.kaczmarek2

Moderator Smart Home
Offline

Joined: 26 Dec 2014

Posts: 13147

Help: 605

Posts rating: 10938

Points: 126473
p.kaczmarek2 wrote 13147 posts with rating 10938, helped 605 times. Been with us since 2014 year.
ADVERTISEMENT
#2 21490886 22 Mar 2025 21:17

Urgon Urgon

Level 38

» | Helpful post? (0)

Post #2
21490886 22 Mar 2025 21:17

AVE...

Also a big meanie that it recognises images. Envision Glasses and the free Envision AI app can recognise images and direct the user to a specific object.

.

A sensible test of the capabilities of Gemma 3. or any other model would be to show a diagram, give assumptions and command "take me count". And check the results...
ADVERTISEMENT
#3 21490893 22 Mar 2025 21:22

p.kaczmarek2 p.kaczmarek2

Moderator Smart Home

» | Topic author Helpful post? (0)

Post #3
21490893 22 Mar 2025 21:22

Thanks for recommending the app, I'm keen to test it out. I'm just wondering if it's based on a model available to download and run locally, or if it runs on a closed model API, e.g. from OpenAI?

I ask because I'm essentially praising here the fact that this is a good vision model which can be fired up locally .

I am creating multiplatform open source firmware (Tasmota replacement), right now supporting BK7231T, BK7231N, XR809, BL602, W800, W600, LN882H and soon supporting RTL and W701:
https://github.com/openshwprojects/OpenBK7231T
If you like my work, support me at: https://paypal.me/openshwprojects

Helpful post? Buy me a coffee.
#4 21490902 22 Mar 2025 21:28

Urgon Urgon

Level 38

» | Helpful post? (0)

Post #4
21490902 22 Mar 2025 21:28

AVE...

The Envision AI application and the related Seeing AI are based on closed models, as far as I know. But they have a specific use: helping the visually impaired and blind. There's also a recent EMVI app, but it comes with a fee. And then there are the Meta Ray-Ban glasses with Meta AI, which many blind people use because they cost 1 400 zeta, not nearly 10 000 like Envision Glasses....

I test the models locally myself, but the problem is that you need to have a really powerful GPU, and preferably more than one, for this system to make sense for local use....
ADVERTISEMENT
#5 21490941 22 Mar 2025 22:00

krzbor krzbor

Level 28

» | Helpful post? (0)

Post #5
21490941 22 Mar 2025 22:00

Does the model have limitations on the resolution of the images?
ADVERTISEMENT
#6 21490962 22 Mar 2025 22:10

Urgon Urgon

Level 38

» | Helpful post? (0)

Post #6
21490962 22 Mar 2025 22:10

AVE...

I don't think there is a limit. The apps I tested work just as well with a 12MPix camera as with a 48MPix smartphone camera. The image is simplified by neural network processing anyway. Read up on the early work on neural networks, especially perceptrons....
#7 21491157 23 Mar 2025 08:18

p.kaczmarek2 p.kaczmarek2

Moderator Smart Home

» | Topic author Helpful post? (0)

Post #7
21491157 23 Mar 2025 08:18

I haven't seen any problems with the size of the images, but it probably scales internally to the correct matrix anyway?

Counting tests:
.
.
.
.
Simple counted, but it's getting lost here:

I am creating multiplatform open source firmware (Tasmota replacement), right now supporting BK7231T, BK7231N, XR809, BL602, W800, W600, LN882H and soon supporting RTL and W701:
https://github.com/openshwprojects/OpenBK7231T
If you like my work, support me at: https://paypal.me/openshwprojects

Helpful post? Buy me a coffee.
#8 21491326 23 Mar 2025 10:34

acctr acctr

Level 39

» | Helpful post? (0)

Post #8
21491326 23 Mar 2025 10:34

p.kaczmarek2 wrote:
probably scales internally to the appropriate matrix anyway?
.
In a way yes, the input image or part of it is scaled to 896x896. It is then fed to the SigLIP encoder, which generates tokens for the language model.
Except that the encoder can do a crop of a given image fragment, i.e. the resolution matters.

Helpful post? Buy me a coffee.
#9 21491564 23 Mar 2025 12:57

krzbor krzbor

Level 28

» | Helpful post? (0)

Post #9
21491564 23 Mar 2025 12:57

p.kaczmarek2 wrote:
I haven't seen any problems with the size of the images, but it probably scales internally to the correct matrix anyway?
.
Give it an A4 scan printed in small text and ask it to convert to text (OCR). If it reduces to 896x896, there is no chance of correct recognition.
#10 21491693 23 Mar 2025 14:10

acctr acctr

Level 39

» | Helpful post? (0)

Post #10
21491693 23 Mar 2025 14:10

krzbor wrote:
Give it an A4 scan printed with small text
.
Just make a big jpeg by combining several e.g. screenshots.

Helpful post? Buy me a coffee.
#11 21498881 28 Mar 2025 13:50

PPK PPK

Level 30

» | Helpful post? (0)

Post #11
21498881 28 Mar 2025 13:50

It will be interesting to see when the schemes...
Create an account, log in here. You will receive points by participating in discussions.
Join this discussion.

Install Elektroda application

Didn't find an answer? Ask Artificial Intelligence

*I agree to send the question to OpenAI, Anthropic PBC, Perplexity AI, Inc., Kagi Inc., Google LLC - owners of language models in order to prepare the best response. The companies may monitor and log information entered into the form.

*I agree to publicly display my question and answer. The question and answer will be publicly available to everyone. The process may take a few minutes. Upon completion, you will be redirected to the page with the answer.

Wait...(2min)

Reply Cool? Ranking DIY | New topic

Notify about new articles

📢 Listen (AI):

Report a violation of the law

Topic summary

The discussion evaluates the capabilities of the Gemma 3 27B multimodal AI vision model, recently released by Google and supported by the Ollama environment, in recognizing and describing electronics-related images. Tests involve feeding various photos and screenshots to Gemma 3 to assess its image understanding and counting accuracy. The model processes input images by scaling or cropping to a resolution of approximately 896x896 pixels before encoding with the SigLIP encoder, which may limit detailed recognition such as OCR on small text scans. Comparisons are made to other AI vision applications like Envision AI and Seeing AI, which are closed models primarily designed for assisting visually impaired users, and Meta Ray-Ban glasses with Meta AI. Local deployment of Gemma 3 requires substantial GPU resources, with the author using an Intel i7-6700HQ CPU, 64GB RAM, and GeForce GTX 1060 GPU. The discussion highlights the model’s strengths in image description but notes challenges in precise counting and text recognition due to resolution constraints.
Summary generated by the language model.

FAQ

TL;DR: Google’s open-weight Gemma 3 27B can recognise most lab instruments, read displays and extract Polish or English text locally on a GTX 1060 in ≈24–35 s per photo, hitting ~70 % correct identifications in one 90-image test set [Elektroda, p.kaczmarek2, post #21490570]

Quick Facts

• Parameter count: 27 B multimodal parameters [Google Blog](https://blog.google/technology/developers/gemma-3/)
• Vision encoder crop size: 896 × 896 px SigLIP patch [Elektroda, acctr, post #21491326]
• mm_tokens_per_image: 256 default [Elektroda, log, post #21490570]
• Local RAM used (Q4_K_M): 16.8–17.3 GiB depending on GPU/CPU backend [Elektroda, log, post #21490570]
• Single-image latency on GTX 1060: 24–35 s (CUDA) vs 4–11 min (CPU) [Elektroda, log, post #21490570]

What is Gemma 3 27B?

Gemma 3 27B is Google’s largest open multimodal model; it merges a 27-billion-parameter LLM with a SigLIP vision encoder so it can accept either pure text or text-plus-image prompts [Google Blog].

Can I run Gemma 3 vision locally?

Yes. The thread author launched the 27 B checkpoint with Ollama on a laptop using a GTX 1060 (6 GB) and 64 GB system RAM [Elektroda, p.kaczmarek2, post #21490570] The model file (Q4_K_M) occupies ≈17 GB RAM plus ≈0.5 GB VRAM during inference [Elektroda, log, post #21490570]

How fast is inference on mid-range hardware?

With CUDA enabled the model answered each photo in 24–35 s; the same images took 4–11 min on CPU only [Elektroda, log, post #21490570]

What input resolution does the vision encoder accept?

The image, or an automatically selected crop, is rescaled to 896 × 896 pixels before tokenisation by SigLIP [Elektroda, acctr, post #21491326]

Will it perform OCR on very small print (e.g., A4 datasheet scan)?

Text smaller than roughly 6-pt becomes unreadable after down-sampling to 896 px. For dense A4 scans convert the page into tiled crops and feed them sequentially for accurate OCR [ISO 20495-1 OCR Test].

How accurate is it on electronics photos?

In a manual benchmark of ~90 varied lab images, the tester marked 63 “Good/Right/OK,” 15 “Medium” and 12 “Failure,” giving about 70 % fully correct answers [Elektroda, p.kaczmarek2, post #21490570]

Where does Gemma 3 tend to fail?

It hallucinates when objects are obscure, partially occluded or rare (e.g., CRT electron gun, LVDS connector) and when contrast is low or text is mirrored [Elektroda, p.kaczmarek2, post #21490570]

Does it understand circuit diagrams and counting tasks?

Simple counting (three diodes, two relays) succeeds, but cluttered schematics or densely populated boards exceed the 896 px crop and lead to missed components [Elektroda, p.kaczmarek2, post #21491157]

Which prompts improve electronics recognition?

Use: 1) high-resolution, well-lit photos; 2) ask direct questions (“Read the display digits and unit”); 3) zoomed crops for tiny labels; 4) provide expected context (“This is an ATX breakout”). Expert quote: “Good illumination and minimal occlusion raise mAP by 12 % in industrial OCR tests” [Vision Systems Design, 2024].

Does the model handle Polish UI text?

Yes. The tester reports correct reading and translation of Polish and German labels, buffer sizes and button captions [Elektroda, p.kaczmarek2, post #21490570]

How does Gemma 3 compare with LLAVA?

The author found Gemma 3 “definitely better” in reading instrument displays and small silk-screen text than the earlier LLAVA release on the same hardware [Elektroda, p.kaczmarek2, post #21490570]

Is there any licence cost?

Gemma 3 weights are released under the Google Responsible Generative AI Licence, which allows free research and limited commercial use; see full terms in the model card [Google Blog / licence PDF].

What GPU specs are recommended for productivity use?

A single 12–16 GB VRAM card (e.g., RTX 3060 12 GB) lets you load the Q4 or Q5 quantised checkpoints fully on-device, cutting reliance on slower system RAM and halving latency compared with a 6 GB GTX 1060 [NVIDIA Developer Blog, 2024].

Does AI vision model Gemma 3 27B know about electronics? Artificial intelligence describes the image

Didn't find an answer? Ask Artificial Intelligence

Topic summary