Practical tests of Gemma 4 and comparison with Gemini 2.5 - image tagging and OCR

p.kaczmarek2

TL;DR

  • Gemma 4, gemini-2.5-pro, and gemini-2.5-flash were tested on Elektroda forum photos for generic tagging and OCR, alongside older local models like qwen3.5, llava, and minicpm-v.
  • The test used two prompts: one demanding more than 25 tags sorted by match, and one asking to transcribe only visible text from each image.
  • On an Intel Core i7-6700HQ laptop with 48GB RAM and GeForce GTX 1060, gemma4:e2b averaged 37.21s per image, while gemini-2.5-flash averaged 3.76s.
  • Simple labels and OCR often worked, but models still misread parts like 25Q32CS1G, hallucinated tags such as SMD or IGBT, and no model was clearly reliable.
Generated by the language model.
    Can modern LLMs, run locally on an old gaming laptop, tag photos in a meaningful way? Are they suitable for OCR, and do they correctly recognise electronic circuits? I invite you to another Elektroda test of artificial intelligence, this time extended with the locally run Gemma 4 model and the paid gemini-2.5-pro and gemini-2.5-flash models accessed via API.

    I test a wide selection of newer and older models here, using two prompts: one for tagging and one for OCR.

    Let's start with the definitions.
    Tagging is the process of automatically assigning to an image a set of descriptive keywords (tags) that define what is in the image.
    Prompt used for tagging:
    
    Choose more than 25 generic tags for this image, sorted from most matching to less matching. Reply just with tags, separated by ;
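
A reply to this prompt is a single line of text, so it has to be split back into individual tags before it can be indexed. The following is a minimal Python sketch of that step, not the code used for the tests; the function name `parse_tags` and the choice to lowercase and de-duplicate tags are assumptions made for illustration.

```python
def parse_tags(reply: str, min_tags: int = 26) -> tuple[list[str], bool]:
    """Split a ';'-separated model reply into cleaned, de-duplicated tags.

    Returns the tags in their original order, plus a flag that tells
    whether the model delivered "more than 25" tags as the prompt asked.
    """
    seen = set()
    tags = []
    for raw in reply.split(";"):
        # Collapse internal whitespace and lowercase (assumed normalisation).
        tag = " ".join(raw.split()).lower()
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags, len(tags) >= min_tags


if __name__ == "__main__":
    tags, enough = parse_tags("PCB; Circuit Board ;pcb;; LED")
    print(tags, enough)
```

Because the order is preserved, the first tags remain the ones the model considered the best match.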
    


    OCR (Optical Character Recognition) is a technique for recognising text in an image. An algorithm analyses the image and attempts to read the visible characters, converting them into text that can be processed further.
    Prompt used for OCR:
    
    Detect text on the image and write it down. Do not write anything else.
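
The post does not say which runtime served the local models. Model names such as gemma4:e2b and llava:latest look like Ollama tags, so the sketch below assumes Ollama and builds a request body in the shape of its `/api/generate` endpoint (image sent as base64, streaming disabled). The function name `build_ocr_request` is hypothetical, and the actual test harness may work differently.

```python
import base64
import json

OCR_PROMPT = "Detect text on the image and write it down. Do not write anything else."


def build_ocr_request(model: str, image_bytes: bytes) -> str:
    """Return a JSON body for an Ollama-style /api/generate call (assumed runtime)."""
    payload = {
        "model": model,
        "prompt": OCR_PROMPT,
        # Images are passed as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for one complete reply instead of a token stream.
        "stream": False,
    }
    return json.dumps(payload)


if __name__ == "__main__":
    body = build_ocr_request("gemma4:e2b", b"\x89PNG placeholder bytes")
    print(body[:80])
```

With a default local Ollama install, this body would be sent with an HTTP POST to `http://localhost:11434/api/generate`, and the transcription would come back in the `response` field of the returned JSON.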
    



    Tested models with number of described images (at the time of topic publication): gemini-2.5-pro (4370), gemini-2.5-flash (4111), gemma4:e2b (1349), gemma3:4b (1146), gemma3:12b (1073), minicpm-v:latest (1073), llava:latest (1058), gemma4:e4b (874), qwen3.5:2b (870), qwen3.5:4b (858), qwen3.5:0.8b (836), llava (379), minicpm-v (379), qwen3.5:9b (300), qwen3.5:27b (45), qwen3.5:35b (16).

    The photo database will keep being updated, so the number of described photos will also grow. Larger models inevitably take longer to process each photo, so they have described fewer images so far.

    Previous presentations in the series:
    Intelligence has described over 1000 images from the Elektroda forum. How do you assess the results?
    Is Qwen3.5 suitable for image description and OCR? Practical tests on your own computer

    Image database preview.
    Old UI version: https://openshwprojects.github.io/IndexingElektrodaImages/search.html
    New UI version: https://openshwprojects.github.io/IndexingElektrodaImages/search2.html

    You can now move on to the results.

    OCR - something simple - Sonoff packaging:
    Every model tested decoded the main inscription, "Wireless Door/Window Sensor": gemma4, the commercial gemini 2.5, and the slightly older qwen 3.5 alike. "Sonoff" also went well, although gemma3 in the 4b version missed it. The remaining inscriptions were transcribed reasonably well too, although the eWeLink logo came out as just the letter "e".

    OCR - CC2530 chip:
    Gemini coped with this one; the other models struggled. Gemma 4 was close with "CO2330", and so was qwen with "G02530". The photo quality is probably too poor, or these smaller models internally work on a downscaled image that is too small.
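
"Close" can be made measurable with an edit distance between the expected marking and what the model returned. The sketch below uses the standard Levenshtein distance; this is a common way to score OCR output and is offered as an illustration, not as the method used to judge the results in this post.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for guess in ("CC2530", "CO2330", "G02530"):
        print(guess, levenshtein("CC2530", guess))
```

Both wrong readings quoted above are two character edits away from CC2530, which matches the impression that the models were close but not correct.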

    OCR - 25Q32CSIG memory on the board:
    Most models read this as 25Q32CS1G, i.e. the letter "I" became the digit "1". Gemini 2.5 flash did even worse, and so did the older gemma 3 with "25032CS1G". Many models also read the silkscreen labels on the board, and qwen 3.5 in the 0.8b version started adding its own descriptions, contrary to the prompt.
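
Mix-ups such as "I" versus "1" or "Q" versus "0" involve characters that look alike, so a search index could treat them as equivalent when matching part markings. The sketch below folds a small, assumed set of look-alike characters before comparing; the table `CONFUSABLE` and the function names are illustrative and not taken from the project.

```python
# Assumed look-alike pairs; extend or shrink to taste.
CONFUSABLE = str.maketrans({"I": "1", "O": "0", "Q": "0", "S": "5"})


def fold(marking: str) -> str:
    """Uppercase a marking and map look-alike characters to one form."""
    return marking.upper().translate(CONFUSABLE)


def same_marking(expected: str, ocr_text: str) -> bool:
    """True if the two strings differ only by look-alike characters."""
    return fold(expected) == fold(ocr_text)


if __name__ == "__main__":
    print(same_marking("25Q32CSIG", "25Q32CS1G"))
    print(same_marking("25Q32CSIG", "25032CS1G"))
    print(same_marking("CC2530", "CO2330"))
```

The trade-off is that folding can also merge genuinely different part numbers, so it is better suited to flagging probable matches for review than to accepting them automatically.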

    OCR - the name of the switch:
    The product name is M5-3C-80W, and every model decoded it. Not bad! The models also decoded the smaller print, such as "SwitchMan".

    OCR - IRFP460LC transistor:
    Every model got IRFP460L right; only gemma4 in the e2b version lost the trailing "C" of IRFP460LC.

    OCR - TDA2822M audio amplifier:
    Virtually every model read TDA2822M. The exception was gemini 2.5 pro, which for some reason started listing tags instead of transcribing the text. Many models also read further markings from the board, such as the RXD and TXD pads.

    OCR - electrolytic capacitors:
    The values 4.7 and 50 were read correctly, but without units. In addition, gemma4, for example, misread one of the values as 5.0. All in all, the missing units are understandable, as the photo does not show them either.

    OCR - SA612AN with NXP logo:
    It went quite well, although there are some slips, e.g. qwen3.5 read it as 5A612AN. Gemini 2.5 Flash was the only one to decode the NXP logo.

    Tags - board:
    You can see here how each newer generation of models does better. The old minicpm-v does not produce precise keywords, but the new gemma does. It is only a pity about the far-fetched keywords: "heat gun", for example, does not belong here, although that one comes from an older model, llava.

    Tags - IRFP460LC:
    This time the prompt asked for tags, but some models still worked out that the part is an IRFP460 and even added MOSFET and IGBT tags. It is an N-channel MOSFET, so IGBT is not correct here, which makes it hard for me to judge the result. I was also surprised by the 600V and 30A tags from gemma3. These values are not from the datasheet, so they must have been made up. It is a pity that qwen3.5 guessed as well and even added an IRF540 tag. Another qwen variant added the word Infineon, but that is not the manufacturer, is it?


    Tags - LED:
    This one was fairly straightforward, although surprisingly some of the models did not produce the tag LED at all. That is a pity, especially as two of them belong to the newer Gemma 4 family. What is more, Gemma4 came up with the tag SMD, which is total nonsense here. This raises doubts about using these models for sorting parts.

    Tags - microswitch button:
    Same here: the tags are loosely related, but some are meaningless. gemma 3 produced "resistor", and qwen 3.5 produced "LED". "Switch" does appear, but buried in a lot of noise.

    Tags - USB:
    A similar situation, although here it is gemma4 that does not seem to know the USB connector. The other models recognised it.

    Tags - battery:
    Not bad, although there are too many tags. I think the prompt needs to be changed. Even gemini 2.5 produced "still life"? Interestingly, gemma3 added the 1.5V tag and Gemini did not. Qwen3.5, on the other hand, caught the expiry date, 2036.

    Tags - TL431:
    Some models read the marking, but not all. In addition, some of them specified the TO-92 package. Once again one model leaked a kind of "thought" into its reply, and I quote: "Operational amplifier (Wait; text says TL431A which is a logic trans). Stick with Transistor or Logic IC.". This is also incorrect: the TL431 is neither an amplifier nor a transistor, but an adjustable shunt voltage reference.

    Tags - remote control:
    The models agree on the "remote control" tag; beyond that, things get harder. Gemini 2.5 Flash detected the colour and gave the tag "orange". It even described the mat as "bamboo". The other models are also fine, although some tags do not seem very practical; "text display", for example, does not fit in my opinion. Interestingly, only qwen3.5 2b decoded the Natec logo.
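
Agreement between models can be computed rather than read off by eye: count how many models produced each tag and keep only those with enough votes. The sketch below is one possible way to do this. The model names and tag lists in the demo are made-up placeholders that only borrow a few tags mentioned above; they are not the real outputs from the database.

```python
from collections import Counter


def consensus_tags(tags_by_model: dict[str, list[str]], min_votes: int = 2) -> list[str]:
    """Return tags produced by at least min_votes models, most agreed first."""
    votes = Counter()
    for tags in tags_by_model.values():
        # A set gives each model at most one vote per tag.
        votes.update({t.strip().lower() for t in tags if t.strip()})
    kept = [t for t, n in votes.items() if n >= min_votes]
    # Sort by vote count (descending), then alphabetically for stable output.
    return sorted(kept, key=lambda t: (-votes[t], t))


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    sample = {
        "model_a": ["remote control", "orange", "bamboo"],
        "model_b": ["Remote Control", "text display"],
        "model_c": ["remote control", "orange"],
    }
    print(consensus_tags(sample))
```

A vote threshold removes one-off tags, but it also discards correct details that only a single model noticed, such as a logo read by just one of them.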

    Tags - OBK simulator:
    Screenshot UI with a circuit diagram and multiple columns of LLM-generated tags on the right
    They did pretty well here, but where did qwen3.5 4b get ESP32 from? The 2b version correctly called it an "openbeken simulator", not bad.


    Finally, a few words about performance. The hardware used was a laptop with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, 48GB RAM, GeForce GTX 1060.
    I have collected tagging times for Gemma4 version e2b and for models from Google called by the API:
    
    === Model Stats ===

    gemma4:e2b
      Images: 120
      Min:    23,42s
      Max:    313,52s
      Avg:    37,21s

    gemini-2.5-flash
      Images: 175
      Min:    0,57s
      Max:    40,42s
      Avg:    3,76s

    gemini-2.5-pro
      Images: 442
      Min:    1,99s
      Max:    89,52s
      Avg:    12,44s
    

    The API is quite fast on average, although the maximums above show that a single request can occasionally take tens of seconds. Tagging locally on my hardware averages just under 40 seconds per image with the model used. As you can see, with a large image database this can drag on, although the computer remains usable while tagging. It clearly won't be suitable for more demanding tasks at the same time, but you can browse the internet in the meantime.
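
To get a feel for how much this "drags on", the averages from the statistics can be scaled to a whole batch. The sketch below does that arithmetic; the batch size of 1000 images is an arbitrary example, and the estimate assumes images are processed one after another at the measured average, which is a simplification because the averages come from different sample sizes.

```python
# Average seconds per image, taken from the statistics in this post.
AVG_SECONDS = {
    "gemma4:e2b": 37.21,
    "gemini-2.5-flash": 3.76,
    "gemini-2.5-pro": 12.44,
}


def estimated_hours(model: str, image_count: int) -> float:
    """Rough sequential processing time in hours for image_count images."""
    return AVG_SECONDS[model] * image_count / 3600


if __name__ == "__main__":
    for name in AVG_SECONDS:
        print(f"{name}: {estimated_hours(name, 1000):.1f} h per 1000 images")
```

For 1000 images this works out to roughly 10.3 hours for gemma4:e2b on the laptop, compared with about 1 hour for gemini-2.5-flash and about 3.5 hours for gemini-2.5-pro via the API.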

    I could go on for a long time, but everyone has access to the results on GitHub, so I'll move on to the conclusions. Modern models seem to perform moderately well at both photo tagging and simple OCR tasks. Interestingly, I did not get the impression that the closed models available through the API (gemini 2.5 flash and gemini 2.5 pro) were significantly better at tagging my photos. They too made occasional errors or omitted something, although with more testing one would probably have to concede their superiority. The biggest problem with such tagging and OCR, in my opinion, is still the uncertainty of the results and the unpredictability of the generated tags. It seems to me that we have to wait a few more LLM generations for more reliable results.

    I invite you to evaluate the results yourself on my page on GitHub:
    https://openshwprojects.github.io/IndexingElektrodaImages/search.html
    https://openshwprojects.github.io/IndexingElektrodaImages/search2.html

    Have you tested Gemma 4 in practice yet?

    About Author
    p.kaczmarek2
    Moderator Smart Home
    p.kaczmarek2 wrote 14393 posts with rating 12313, helped 650 times. Been with us since 2014 year.