
Practical tests of Gemma 4 and comparison with Gemini 2.5 - image tagging and OCR

p.kaczmarek2

TL;DR

  • Gemma 4, gemini-2.5-pro, and gemini-2.5-flash were tested on Elektroda forum photos for generic tagging and OCR, alongside older local models like qwen3.5, llava, and minicpm-v.
  • The test used two prompts: one demanding more than 25 tags sorted by match, and one asking to transcribe only visible text from each image.
  • On an Intel Core i7-6700HQ laptop with 48GB RAM and GeForce GTX 1060, gemma4:e2b averaged 37.21s per image, while gemini-2.5-flash averaged 3.76s.
  • Simple labels and OCR often worked, but models still misread parts like 25Q32CS1G, hallucinated tags such as SMD or IGBT, and no model was clearly reliable.
Generated by the language model.
This content has been translated from Polish to English.
    Can modern LLMs run locally on an old gaming laptop tag photos meaningfully? Are they suitable for OCR, and do they correctly recognise electronic circuits? I invite you to another Elektroda test of artificial intelligence, this time extended with the locally run Gemma 4 and the paid gemini-2.5-pro and gemini-2.5-flash models called via the API.

    I'll check out a wide selection of newer and older LLM models here, based on two prompts - one for tagging and one for OCR.

    Let's start with the definitions.
    Tagging is the process of automatically assigning to an image a set of descriptive keywords (tags) that define what is in the image.
    Prompt used for tagging:
    
    Choose more than 25 generic tags for this image, sorted from most matching to less matching. Reply just with tags, separated by ;
    


    OCR (Optical Character Recognition) is a technique for recognising text in an image. An algorithm analyses the graphic and attempts to read the visible characters, converting them into further processable text.
    Prompt used for OCR:
    
    Detect text on the image and write it down. Do not write anything else.
    



    Tested models with number of described images (at the time of topic publication): gemini-2.5-pro (4370), gemini-2.5-flash (4111), gemma4:e2b (1349), gemma3:4b (1146), gemma3:12b (1073), minicpm-v:latest (1073), llava:latest (1058), gemma4:e4b (874), qwen3.5:2b (870), qwen3.5:4b (858), qwen3.5:0.8b (836), llava (379), minicpm-v (379), qwen3.5:9b (300), qwen3.5:27b (45), qwen3.5:35b (16).

    The photo database will keep being updated, so the number of described photos will also grow. Naturally, the larger models take longer to process each photo, so they have described fewer images.

    Previous presentations in the series:
    Intelligence has described over 1000 images from the Elektroda forum. How do you assess the results?
    Is Qwen3.5 suitable for image description and OCR? Practical tests on your own computer

    Image database preview.
    Old UI version: https://openshwprojects.github.io/IndexingElektrodaImages/search.html
    New UI version: https://openshwprojects.github.io/IndexingElektrodaImages/search2.html

    You can now move on to the results.

    OCR - something simple - Sonoff packaging:
    Every model tested decoded the main inscription, "Wireless Door/Window Sensor": gemma4, the professional gemini 2.5, and the slightly older qwen 3.5 alike. "Sonoff" also went well, although gemma3 4b missed it. The remaining inscriptions were also read reasonably well, although the eWeLink logo came out as just the letter "e".

    OCR - CC2530 chip:
    Gemini coped with this one. The other models had problems: Gemma 4 was close with CO2330, and so was qwen with G02530. The photo quality is probably too poor, or these smaller models internally operate on images that are too small.

    OCR - 25Q32CSIG memory on the board:
    Most models turned this into 25Q32CS1G, i.e. the letter "I" became the digit "1". Gemini 2.5 Flash did even worse, and so did the older gemma 3, which returned "25032CS1G". Many models also read the board's silkscreen layer, and qwen 3.5 0.8b started adding its own descriptions, contrary to the prompt.

    OCR - the name of the switch:
    The product name is M5-3C-80W, and every model decoded it. Not bad! The models also decoded the smaller print, such as "SwitchMan".

    OCR - IRFP460LC transistor:
    Every model correctly decoded the IRFP460L part of the marking; only gemma4 e2b lost the final "C".

    OCR - TDA2822M audio amplifier:
    Virtually every model read TDA2822M. The exception was gemini 2.5 pro, which for some reason started listing tags instead of reading the text. Many models also read further markings from the board, such as the RXD and TXD pads.

    OCR - electrolytic capacitors:
    The values 4.7 and 50 were read correctly, but without units. In addition, gemma4, for example, misread one of the values as 5.0. All in all, though, the lack of units is understandable, as the photo does not show them either.

    OCR - SA612AN with NXP logo:
    It went quite well, although there are some misreadings, e.g. qwen3.5 turned it into 5A612AN. Gemini 2.5 Flash was the only one to decode the NXP logo.

    Tags - board:
    Here you can see how newer models keep doing better. The old minicpm-v doesn't produce precise keywords, but the new gemma does. The only pity is the forced-in keywords; for example, "heat gun" should not be here, but then again that comes from an older model, llava.

    Tags - IRFP460LC:
    This time the prompt asked for tags, but some models still worked out that it was an IRFP460 and even added MOSFET and IGBT tags. This is an N-channel MOSFET, so IGBT is not correct here, which leaves me unsure how to judge it. I was also surprised by the 600V and 30A tags from gemma3. These are not from the datasheet, so the model must have made them up. Unfortunately qwen3.5 also guessed and even added IRF540. Another qwen added the word Infineon, but that isn't the manufacturer, is it?


    Tags - LED:
    This was fairly straightforward, although surprisingly some of the models did not produce the word LED. That's a pity, especially as two of them are from the newer Gemma 4 family. What's more, Gemma4 produced the term SMD, which is total nonsense here. This raises some doubts about using these models for parts sorting.

    Tags - microswitch button:
    Same here: the tags are seemingly related, but some are meaningless. The term resistor appears in gemma 3, and LED appears in qwen 3.5. "Switch" also appears, but with a lot of noise.

    Tags - USB:
    A similar situation, although here it looks like it's gemma4 that doesn't recognise the USB connector. The other models recognised it.

    Tags - battery:
    Not bad, although there are too many tags. I think the prompt needs to be changed. Even gemini 2.5 returned "still life"? It's interesting that gemma3 added the 1.5V tag while Gemini did not. Qwen3.5, on the other hand, caught the expiry date: 2036.

    Tags - TL431:
    Some models read the marking, but not all. In addition, some of them specified the TO-92 package. Once again, one of them included some form of "thought" in its response, and I quote: "Operational amplifier (Wait; text says TL431A which is a logic trans). Stick with Transistor or Logic IC.". This is also incorrect: the TL431 is neither an amplifier nor a transistor.

    Tags - remote control:
    The models agree on the "remote control" tag, and then the difficulties begin. Gemini 2.5 Flash detected the colour orange and gave the tag "orange". It even described the mat as "bamboo". The other models are also fine, although some tags, such as "text display", don't seem all that practical; it doesn't fit, in my opinion. Interestingly, only qwen3.5 2b decoded the Natec logo.

    Tags - OBK simulator:
    Screenshot UI with a circuit diagram and multiple columns of LLM-generated tags on the right
    They did pretty well here, but where did qwen3.5 4b get ESP32 from? The 2b version correctly referred to it as "openbeken simulator", not bad.


    Finally, a few words about performance. The hardware used was a laptop with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, 48GB RAM, GeForce GTX 1060.
    I have collected tagging times for Gemma4 version e2b and for models from Google called by the API:
    
    gemma4:e2b
      Images: 120
      Min:    23,42s
      Max:    313,52s
      Avg:    37,21s
    
    === Model Stats ===
    
    gemini-2.5-flash
      Images: 175
      Min:    0,57s
      Max:    40,42s
      Avg:    3,76s
    
    gemini-2.5-pro
      Images: 442
      Min:    1,99s
      Max:    89,52s
      Avg:    12,44s
    

    The API is quite fast, although a single request can take 10 seconds or more (the longest one logged above took 89.52s for gemini-2.5-pro). Tagging locally on my hardware averages just under 40 seconds per image with the model used. As you can see, with a large database of images this can drag on, although the computer is perfectly usable for tagging. It clearly won't be suitable for more hardware-intensive activities at the same time, but you can browse the internet while it runs.

    I could go on for a long time, but everyone has access to the results on GitHub, so I'll get to the conclusions. It seems that modern models perform moderately well at both photo tagging and simple OCR tasks. Interestingly, I did not feel that the closed models available through the API (gemini 2.5 flash and gemini 2.5 pro) were significantly better at tagging my photos. They too made occasional errors or omitted something, although with more testing one would probably have to concede their superiority. The biggest problem with such tagging and OCR, in my opinion, is still the uncertainty of the results and the unpredictability of the generated tags. It seems to me that we will have to wait a few more LLM generations for more reliable results.

    I invite you to evaluate the results yourself on my page on GitHub:
    https://openshwprojects.github.io/IndexingElektrodaImages/search.html
    https://openshwprojects.github.io/IndexingElektrodaImages/search2.html

    Have you tested Gemma 4 in practice yet?


FAQ

TL;DR: On an i7-6700HQ laptop, local Gemma 4 e2b averaged 37.21s/image; the core finding is that results are still uncertain. This FAQ helps electronics users compare local Gemma 4 with Gemini 2.5 for photo tagging, OCR, and part-mark reading on real component and PCB images. [#21894362]

Why it matters: If you index electronics photos or read part markings from images, model speed matters less than error patterns such as missed labels, wrong tags, and invented part identities.

Model | Use mode | Avg tagging time | Practical result in thread
Gemma 4 e2b | Local | 37.21s/image | Usable for tagging, but slower and still inconsistent
Gemini 2.5 Flash | API | 3.76s/image | Fast and strong on simple OCR, with occasional odd tags
Gemini 2.5 Pro | API | 12.44s/image | Strong OCR overall, but one OCR prompt returned tags instead

Key insight: Newer vision LLMs can tag electronics photos and read simple markings, but reliability is still the bottleneck. In this thread, closed API models were not dramatically better than local models for tagging, because both still produced noise and hallucinated details.

Quick Facts

  • Test hardware for local runs: Intel Core i7-6700HQ @ 2.60GHz, 48 GB RAM, GeForce GTX 1060; the author describes the machine as usable for tagging while still browsing the web. [#21894362]
  • Measured tagging speed shows a large gap: Gemma 4 e2b 37.21s avg, Gemini 2.5 Flash 3.76s avg, Gemini 2.5 Pro 12.44s avg. That difference strongly affects large image databases. [#21894362]
  • The local Gemma 4 timing range was 23.42s to 313.52s over 120 images, so worst-case latency can be far higher than the average. [#21894362]
  • The tested model list included database counts from 16 to 4,370 described images, with gemini-2.5-pro (4,370) and gemini-2.5-flash (4,111) leading the sample size at publication time. [#21894362]
  • Several hallucinated tags were explicitly noted, including IGBT, SMD, heat gun, resistor, and ESP32, showing that electronics tagging errors are often semantic, not just cosmetic. [#21894362]

1. What is image tagging in the context of LLM vision models, and how is it used to describe electronics photos?

Image tagging is automatic keyword assignment for an image, used here to describe electronics photos with more than 25 generic tags sorted by relevance. "Image tagging" is an image-analysis task that assigns descriptive keywords to a photo, prioritizing the most visible objects and attributes. In this thread, the prompt asked for tags only, separated by semicolons, so the output could index boards, parts, tools, and packaging images in a searchable database. [#21894362]

2. What is OCR, and how does it differ from generic image tagging when analyzing component photos and PCB images?

OCR reads visible text from an image, while tagging names objects or attributes without requiring exact text transcription. "OCR" is an optical text-recognition technique that converts characters visible in an image into machine-readable text, preserving specific markings instead of broad visual labels. In this thread, OCR was tested with chip photos, package labels, and PCB markings, using a prompt that told the model to write only the detected text. [#21894362]

3. How do Gemma 4 and Gemini 2.5 compare for image tagging and OCR on electronics-related images?

Gemini 2.5 was generally stronger on hard OCR, but Gemma 4 stayed competitive on tagging and simple reads. Gemini handled the CC2530 chip best, while Gemma 4 was close but returned CO2330. For tagging, the author did not feel Gemini 2.5 Flash or Pro were significantly better than local models on his photo set, because both API and local models still made omissions and occasional false tags. [#21894362]

4. How can I test a local model like Gemma 4 on my own image database using prompts for tagging and OCR?

You can test it with two fixed prompts and compare outputs across a photo set.
1. Use a tagging prompt: ask for more than 25 generic tags, sorted from most to less matching, separated by semicolons.
2. Use an OCR prompt: ask the model to detect text on the image and write only that text.
3. Run both prompts on the same image database and review errors such as missed part names, wrong package text, and invented tags. [#21894362]
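
The steps above can be sketched as follows. This is a minimal sketch, not the author's tooling: it assumes the local models are served through Ollama's HTTP API (model names such as gemma4:e2b look like Ollama tags, but the thread does not name the runner). The endpoint URL, the helper names `build_request` and `ask`, and the image file name are assumptions for illustration; the two prompts are quoted from the thread.

```python
import base64
import json
import urllib.request

# Prompts quoted from the thread.
TAG_PROMPT = ("Choose more than 25 generic tags for this image, sorted from most "
              "matching to less matching. Reply just with tags, separated by ;")
OCR_PROMPT = "Detect text on the image and write it down. Do not write anything else."

# Assumption: a local Ollama server listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming generate request carrying one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask(model: str, prompt: str, image_path: str) -> str:
    """Send one image with one prompt to the server and return the model's text."""
    with open(image_path, "rb") as f:
        payload = build_request(model, prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, `ask("gemma4:e2b", OCR_PROMPT, "photo.jpg")` would return the raw reply for one image; calling it with `TAG_PROMPT` on the same file gives the tagging output to compare.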

5. Why do LLM vision models confuse similar part markings like CC2530, 25Q32CSIG, and SA612AN on low-quality chip photos?

They confuse similar markings because low-quality close-ups push small models into character substitution and guesswork. In the thread, CC2530 became CO2330 in Gemma 4 and G02530 in Qwen 3.5. The memory marking 25Q32CSIG often became 25Q32CS1G, replacing the letter I with the digit 1, and SA612AN was once changed to 5A612AN. The author also suggests smaller models may internally work on images that are too small. [#21894362]
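
When reviewing such misreads, it can help to separate look-alike substitutions (I read as 1) from genuinely wrong reads. The sketch below is one possible way to do that; the look-alike table is a small hand-picked assumption, not something taken from the thread, and `canonical` and `classify` are hypothetical helper names.

```python
# Assumption: a small, hand-picked table of look-alike characters; extend as needed.
CONFUSABLE = str.maketrans({"1": "I", "0": "O", "5": "S"})


def canonical(marking: str) -> str:
    """Upper-case a marking and fold look-alike digits into letters."""
    return marking.upper().translate(CONFUSABLE)


def classify(expected: str, read: str) -> str:
    """Return 'exact', 'lookalike' (differs only by confusable characters) or 'wrong'."""
    if expected == read:
        return "exact"
    if canonical(expected) == canonical(read):
        return "lookalike"
    return "wrong"


print(classify("25Q32CSIG", "25Q32CS1G"))  # lookalike
print(classify("SA612AN", "5A612AN"))      # lookalike
print(classify("CC2530", "CO2330"))        # wrong
```

A "lookalike" result is still an OCR error for part identification, but it marks cases that a part-number lookup could plausibly recover.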

6. Which models handled simple electronics OCR best in these tests, such as reading Sonoff packaging, switch labels, and transistor markings?

Gemini 2.5 and several other models handled simple OCR well, but the easiest cases were solved by almost every model tested. All models decoded the main Sonoff package text "Wireless Door/Window Sensor," and every model read the switch name M5-3C-80W. For IRFP460LC, every model got IRFP460L correct, while Gemma 4 e2b dropped only the final C. These results show simple, high-contrast text is already a strong case for current vision LLMs. [#21894362]

7. Why do some models hallucinate tags like IGBT, SMD, heat gun, resistor, or ESP32 when the image shows a different electronic part?

They hallucinate because the tagging prompt rewards broad guessing and the models over-associate visual patterns with common electronics terms. In the thread, a MOSFET photo triggered IGBT, an LED photo triggered SMD, a board image triggered heat gun, a microswitch image triggered resistor, and an OBK simulator image triggered ESP32. Those errors make the output less reliable for inventory, sorting, and precise search unless you manually review the tags. [#21894362]

8. What hardware is needed to run Gemma 4 locally for image tagging, and what performance can I expect from an i7-6700HQ with GTX 1060 and 48 GB RAM?

An older gaming laptop can run Gemma 4 locally, but throughput is modest. The tested machine used an Intel Core i7-6700HQ at 2.60GHz, 48 GB RAM, and a GeForce GTX 1060. On that hardware, Gemma 4 e2b averaged 37.21s per image for tagging, with a minimum of 23.42s and a maximum of 313.52s across 120 images. The author says the computer stays usable for light browsing during tagging. [#21894362]
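
Figures like these can be collected with a small timing wrapper. The following is a minimal sketch of how per-image durations might be measured and summarised into the same count/min/max/avg shape; `timed` and `summarize` are hypothetical helper names, and the sample durations are made up for illustration, not taken from the thread.

```python
import time
from statistics import mean


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def summarize(durations_s):
    """Return image count, min, max and average of per-image durations in seconds."""
    return {
        "images": len(durations_s),
        "min": min(durations_s),
        "max": max(durations_s),
        "avg": mean(durations_s),
    }


# Illustrative durations only (not measurements from the thread).
print(summarize([2.0, 3.0, 7.0]))
```

Keeping the maximum alongside the average matters here, since a single slow image can take many times longer than the mean.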

9. How much slower is local Gemma 4 tagging compared with Gemini 2.5 Flash or Gemini 2.5 Pro over API?

Local Gemma 4 was much slower than both Gemini API models in this test. Gemma 4 e2b averaged 37.21s per image, while Gemini 2.5 Flash averaged 3.76s and Gemini 2.5 Pro averaged 12.44s. That makes local Gemma about 9.9× slower than Flash and about 3× slower than Pro on average. For large archives, that delay becomes the main operational cost even before accuracy is judged. [#21894362]
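
The ratios above, and what they mean for a whole archive, can be checked with a few lines of arithmetic. This sketch uses the averages reported in the thread; the archive size of 4,370 images is only an illustrative choice (it matches the number of images described by gemini-2.5-pro at publication time), and the projection assumes images are processed one after another at the average speed.

```python
# Average tagging times in seconds per image, as reported in the thread.
AVG_S = {"gemma4:e2b": 37.21, "gemini-2.5-flash": 3.76, "gemini-2.5-pro": 12.44}


def slowdown(model: str, baseline: str) -> float:
    """How many times slower `model` is than `baseline`, on average."""
    return AVG_S[model] / AVG_S[baseline]


def projected_hours(model: str, n_images: int) -> float:
    """Estimated sequential processing time for n_images, in hours."""
    return AVG_S[model] * n_images / 3600


print(round(slowdown("gemma4:e2b", "gemini-2.5-flash"), 1))  # 9.9
print(round(slowdown("gemma4:e2b", "gemini-2.5-pro"), 1))    # 3.0
print(round(projected_hours("gemma4:e2b", 4370), 1))         # 45.2
print(round(projected_hours("gemini-2.5-flash", 4370), 1))   # 4.6
```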

10. What is the best prompt format for getting more reliable tags from LLMs on electronics photos without extra noise or invented details?

A strict, minimal prompt works best, but this thread shows it still does not eliminate noise. The author used: ask for more than 25 generic tags, sorted from most matching to less matching, and reply only with tags separated by semicolons. That format gave consistent, comparable outputs across models, yet models still added terms like still life, bamboo, IGBT, and SMD. The practical takeaway is to keep the prompt simple and then post-filter obvious false tags. [#21894362]
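
The post-filtering step could look roughly like this. It is a minimal sketch under the assumption that replies follow the semicolon-separated format requested by the prompt; `parse_tags` and `drop_blocked` are hypothetical helpers, and the blocklist has to be maintained by hand, for example per image category.

```python
def parse_tags(reply: str) -> list:
    """Split a semicolon-separated reply into lower-cased, de-duplicated tags, keeping order."""
    seen, tags = set(), []
    for raw in reply.split(";"):
        tag = raw.strip().lower()
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags


def drop_blocked(tags: list, blocked: set) -> list:
    """Remove tags that a human reviewer has marked as false for this kind of image."""
    return [t for t in tags if t not in blocked]


tags = parse_tags("LED; diode ;led;;Still Life")
print(tags)                                  # ['led', 'diode', 'still life']
print(drop_blocked(tags, {"still life"}))    # ['led', 'diode']
```

A blocklist only removes tags someone has already identified as wrong; it cannot catch plausible-looking errors such as IGBT on a MOSFET photo without a reviewer.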

11. How should I evaluate OCR accuracy when models read board silkscreen, package logos, or partial markings in addition to the main text?

Evaluate OCR by separating the primary target text from extra text the model also captured. In the thread, some models correctly read the main marking but also added board silkscreen such as RXD and TXD pads, or captured logos like NXP. That behavior is useful if you want richer extraction, but it is an OCR failure if your prompt requires only the main visible text. Score exact target accuracy first, then treat extras as a second metric. [#21894362]
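
One way to keep the two metrics apart is sketched below. It assumes the reply can be split on whitespace and that the target marking must match a token exactly; `score_ocr` is a hypothetical helper, not part of the thread's tooling.

```python
def score_ocr(reply: str, target: str) -> dict:
    """Report whether the target marking was read exactly, plus any extra tokens."""
    tokens = reply.split()
    return {
        "target_found": target in tokens,
        "extras": [t for t in tokens if t != target],
    }


print(score_ocr("TDA2822M RXD TXD", "TDA2822M"))
# {'target_found': True, 'extras': ['RXD', 'TXD']}
print(score_ocr("IRFP460L", "IRFP460LC"))
# {'target_found': False, 'extras': ['IRFP460L']}
```

Exact token matching is deliberately strict: a dropped final letter counts as a miss, which is what matters when the text is used to identify a part.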

12. Why might Gemini 2.5 Pro return tags instead of OCR text even when the prompt says to only detect text on the image?

It can happen because vision LLMs sometimes follow image-understanding habits instead of the narrow output format. In the TDA2822M test, Gemini 2.5 Pro unexpectedly listed tags rather than transcribing text, even though the OCR prompt explicitly said "Do not write anything else." That is a format-control failure, not just a recognition mistake. It shows that even strong API models can break instruction fidelity on simple extraction tasks. [#21894362]

13. What practical results have users seen when testing Gemma 4 locally for electronics image tagging and OCR?

The practical result is that Gemma 4 is usable locally for simple OCR and basic electronics tagging, but not yet dependable enough for unsupervised precision work. It read many easy markings, stayed close on some harder chips, and produced solid tags on boards and simulators. It also missed details like LED, misread characters such as I versus 1, and invented tags such as SMD. The author’s conclusion is clear: reliability remains the main weakness, despite useful real-world progress. [#21894362]

14. Which models are most useful for identifying parts like TL431, IRFP460LC, LEDs, microswitches, USB connectors, and remote controls from photos?

No single model dominated every electronics object class in this thread. Models often recognized the broad object, such as remote control, but differed on fine detail. Some models read TL431 markings and even inferred a TO-92 package, while others failed or produced incorrect reasoning. IRFP460LC was easy in OCR, but tagging still produced false terms like IGBT. USB was recognized by several models, yet Gemma 4 reportedly missed the USB connector in that example. [#21894362]

15. Where can I browse and compare the GitHub results database for Gemma, Gemini, Qwen, LLaVA, and MiniCPM-V image tagging tests?

You can browse the public results in two GitHub Pages interfaces linked in the thread. One is labeled the old UI version, and the other is the new UI version. The author also says everyone has access to the results there and invites readers to evaluate them directly. Those pages compare outputs across Gemma, Gemini, Qwen, LLaVA, and MiniCPM-V on the same electronics image database. [#21894362]
Generated by the language model.