logo elektroda
logo elektroda
X
logo elektroda

Does AI vision model Gemma 3 27B know about electronics? Artificial intelligence describes the image

p.kaczmarek2 2190 10
ADVERTISEMENT
Treść została przetłumaczona polish » english Zobacz oryginalną wersję tematu
  • Graphic showing the Gemma 3 logo on a dark background. .
    Gemma 3 is the latest in a series of open multimodal LLMs from Google, based on the same technology as Gemini 2.0. Chatbots based on Gemma 3 not only operate with text, but can also describe images. This is where I will try to test this in terms of images relating to electronics. I will send Gemma images and it will describe what it sees in them. Can today's AI read? Let's find out!

    Gemma 3 came out a mere 10 days ago, but is already supported by the Ollama environment. I will also be running it in it. All on my laptop of course. Specifications:
    Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, 64GB RAM, GeForce GTX 1060

    I'm running the tests on the largest Gemma 3 model - version 27B.

    Gemma 3 tests .
    Here are the tests carried out, everything I have checked is shown here - I am not omitting any results. The order in which I tested is as follows. First a photo/screenshot, then my commentary.
    DIY motor control setup with a multimeter showing a current of 0.4726 Amps. .
    Good! He even read the current correctly - 0.4726A. How did he even spot the DC motor? Maybe he guessed "by feel". He didn't recognise the vehicle, but it's faintly visible.

    .
    The image shows a small tracked robot chassis and a motor driver board. .
    Good! Looks like he met everything correctly.

    .
    Boom box with blue accents on a tiled floor. .
    OK, just is it the brand?

    .
    TP-1603 DC power supply with a display showing voltage and current.
    Good! It read the voltage and current correctly, although the Twintex TP-1603 converted to Toptek, but the writing is blurred, so can't be blamed.

    .
    LED strip in an LCD screen backlight with a tester .
    Good! Only where is that screwdriver and connectors? But even the marker met.

    .
    Two Alfawise air purifiers on a tiled floor. .
    Good! He also recognised that it was two pieces.


    Digital display showing a temperature of 350°C on a hot air SMD rework station.
    Good! He recognised what the equipment was, read the temperature, even identified the buttons....


    Black digital radio with time display function and several control buttons. .
    Old digital LED clock radio by Crown with German labels. .
    Good! He even read the button lettering and translated it from German....

    .
    Digital thermometer and hygrometer on a white background. .
    Screenshot of a dialog with chatbot Gemma3:27b describing an emoticon next to the 64% reading. .
    Good! He read the temperature and humidity correctly, only the emoticon seems to have confused him?

    .
    Rectangular device with ribbed plastic cover. .
    Gemma 3 interface responding to an incorrect sensor interpretation. .
    Failure! The protective plastic (semi-transparent) confused him and it didn't come out correctly. It is after all an electrical socket, not a sensor....

    .
    Image of a monitor placed on a tiled floor.
    Photo of an old monitor on a tile floor. .
    Disaster! It is, after all, an old monitor and not a blackboard, although.... aaf2815 I must admit that it somewhat resembles this board.

    .
    CNC TB6600 controller on black background .
    Failure! It's an ATX breakout board, but can you blame it? As little known boards are unlikely to mate.... The voltages read correctly.

    .
    Digital display on a PCB showing voltage 23.22V and 0.4726A. .
    Medium. Almost hit the mark. Why didn't it read 23.22?


    Device with LED display and cylindrical sensor, described as a laser power meter. .
    Failure! It's a tester - an artificial load of LD25/LD35, not a laser....


    A small boost converter board placed on a wooden surface with wires attached.
    Good! It is indeed an inverter.


    Console with cabling on a wooden surface .
    OK, it is indeed a PS2. Admittedly probably a PS2 Slim, but that's less important....

    .
    Screenshot of CrystalDiskInfo showing details of the Hitachi HTS545032B9A300 drive. .
    Screenshot from CrystalDiskInfo tool showing the health status of a hard drive. .
    Good! Where did he read so many things from? This model must indeed have a good ability to understand text from images. Although... he made a typo in the disk model - 329BA instead of 32B9A.... also the Polish language he knows, the buffer size he also read, very well.

    .
    BIOS screen displaying system information of ASUS 1201N computer. .
    Screenshot of a computer system specification. .
    Good! I think he read everything correctly.


    Label from an Ambiano drip coffee maker.
    Good! Here, too, he seems to have read flawlessly....


    Multimeter displaying resistance 0.0063 Ω
    Wrong, he messed up here. I wanted him to write to me that this capacitor is shorted, therefore faulty, and he elaborated on the fact that it's not a capacitance measurement.... in addition he added zeros in front of 63.

    .
    Printed circuit board with markings EWL-ZBS02-LG V2.0 and date 2023/03/21, held by tweezers.
    Good! He even guessed it was Zigbee. Probably after that "ZB". Only thing I disagree with is Ebyte, how does he know that?
    The link he gave dead/fictional.

    .
    Four plastic solderless breadboards in packaging. .
    Good!


    Set of electronic test leads in a plastic box. .
    Good! Only where did the wire stripper come from? Did he take a box buckle for it?

    .
    Raspberry Pi Pico connected to a breadboard using a blue connector. .
    Wrong: it's a different board, and didn't recognise the DHT11.

    .
    Close-up of an electronic module with markings on a circuit board. .
    Medium! He read the markings correctly, but how did he decide that this transistor was a BT module?

    .
    ESP32-WROOM-32 module connected to a breadboard. .
    Medium! This is the XR806 WXU module, not the ESP32-WROOM....


    Box with a product label containing information about compatibility and manufacturer. .
    Well, even those Heizk sensed....


    Philips CD player with shock protection feature.
    Philips portable CD player on a wooden surface. .
    Good! Even 12ESP read.

    .
    Battery compartment with two red AA-R6 batteries, partially covered by an open flap. .
    Open battery compartment with two visible red AA batteries. .
    Good! It correctly recognised the batteries and read the caption, admittedly it misidentified the device as a weather station, but from that perspective I don't think it's a mistake.

    .
    Close-up of a green printed circuit board with various electronic components. .
    Section of a printed circuit board with various electronic components. .
    OK, although a typo crept in - it read 9 instead of S. The rest ok, even 103 and 221 from the resistors it read....


    MXCHIP EMW3072 WiFi and Bluetooth module on a wooden background. .
    OK, it read everything correctly, although the link gave a bogus one.

    .
    App screen showing the process of connecting a device with the Scanning section highlighted. .
    Smart home device setup screen with connection process. .
    Well, he recognised everything, although he did not write that it was Tuya. He also translated the Polish subtitles.

    .
    Control panel with WAIT displayed on the screen.
    Screenshot showing a description of the Sony PCD-DR1 Pocket Communicator by Gemma 3. .
    Medium. It went from an 8 to an R, and yes it seemingly read the name, but mistook the equipment for a PDA. It didn't read the SAIA logo.
    The link to YT is dead, to the description too.

    .
    Set of bits in two packages: one with PZ bits and the other with PH bits, by C.K Impact. .
    Two C.K. Impact screwdriver bit sets on a patterned table. .
    Right. I think it read all ok.

    .
    DC power supply with displays showing 0.11 A and 22.0 V. .
    Right. He read "Twintex" correctly this time. He embraced everything...

    .
    Two Atmel ATSAM4E8E and ATSAM4E16E microcontrollers on a white background. .
    Right. It read almost everything, even 1701 and 1707. The only thing it didn't mention was that there are two versions here - 8E and 16E.
    The link doesn't work.

    .
    Seven-segment display on a breadboard showing the number 88. .
    Medium. Too poor contrast and didn't grasp that it was 33. He thought it was 88.

    .
    Close-up of a SAIA PCD7.D81 programmable logic controller with a display and LED indicators. .
    Right. This time he embraced that it was a PLC controller from SAIA. He also read some of the text from the display.

    .
    Two smart plugs lying on a wooden surface. .
    Good. Recognised and described the prongs as far as possible.

    .
    BME280 module with pins on a wooden surface .
    BME280 environmental sensor module with visible markings and pins. .
    Well, although it is a BMP280 rather than a BME280. Interestingly, he even described what the BME is used for and gave its leads.

    .
    The image shows old computer components such as a keyboard, ball mice, RAM sticks, and cables. .
    Right. He probably described everything. The only thing I think he misstated was the number of mice, as there are two, not one....

    .
    Old slide projector with plug on a table.
    The image shows a slide projector with accessories on a table. .
    Well, it even read the curved description of Kratos.

    .
    DHT11 or DHT22 temperature and humidity sensors with wires. .
    Good, although probably bogus links. He recognised that it was DHT.

    .
    The image shows a white TP-Link router with two black antennas. .
    OK, he recognised that it was a WiFi router and power supply.

    .
    Empty VHS cassette shell with an open case. .
    Medium. Here's the cassette, it's not a blank. The rest ok.

    .
    Ceiling LED lamp with built-in Wi-Fi module and LED panels. .
    Back side of a LED ceiling light with electronic components. .
    Medium. Where did he get 120W from?

    .
    UNI-T power meter with a display showing values. .
    Good! Recognised the equipment type (read the designation) and also embraced the reading.

    .
    Image of a rotary tool with a flexible shaft and motor parts. .
    Fine, although the brand is more like Lund.

    .
    Samsung 850 EVO 4TB SSD lying on a wooden surface. .
    Image showing a Samsung 850 EVO 4TB SSD on a wooden surface. .
    OK, although he went from 0 to O in the model, it happens to me too.

    .
    Close-up of a motherboard with a CR2032 lithium battery and various electronic components. .
    Screenshot of a description of a circuit board from a smart card reader. .
    Medium. Although. He probably had no way of determining that it was a laptop. CR2032 3V he recognised, he read. So ok, and where's the damage? Because parts are missing?

    .
    Four boxes of RTX Smart Module lying on a wooden surface. .
    Right. He recognised that it was probably some RTX modules based on Tuya.

    .
    Vintage Unitra Diora stereo radio with control panels and station scale. .
    Right. He met/read what the equipment was.

    .
    Image of a digital multimeter with visible buttons and a display showing a reading of 3.3421 V. .
    Good, although a little overdone with this RF. He read the voltage well.

    .
    Vacuum fluorescent display showing the word ARDUINO. .
    OK, also read the inscription, although it's not a VFD.

    .
    Electronic circuit with a PCB and Arduino board. .
    Disassembled graphics card and development board on a table. .
    Wrong: it's not the graphics card. Did the perspective confuse him? Where does he see the RTX 3060 sign? Although he did recognise the Arduino.

    .
    Two MAX232ACPE integrated circuits on a white background. .
    Good! Recognised both pieces and described what they are used for.

    .
    Rear of Midland Alan 109 CB radio with labels and ports. .
    Midland ALAN 109 CB radio held in hand, with visible connectors and label. .
    Good! Again, he read so much correctly.


    Two Microchip chips with logo and designation PIC16F1719-I/P lying on a white background. .
    Two integrated circuits labeled as Microchip PIC16F1719-I/P on a piece of fabric. .
    Good! Again, it read correctly and even deciphered the -I/P abbreviation. Interestingly, it gave the correct link.
    Digital logic circuit simulation or timing diagram with clock signals and data. .
    Right. It also read 3F.

    .
    Circuit board with electronic components including capacitors, coils, and resistors. .
    Image with a description of the power supply board components. .
    Medium, it's SANJI not SAMI and PC1 is not a connector but an optocoupler. Also this model number... and. reversed text read as LITZE V@ AYUKED?

    .
    Gray Sony PlayStation (PS1) console with a controller and cables. .
    Okay, that's PS1.

    .
    Photo of an Atari 800XL home computer. .
    Okay, it's an Atari 800XL.

    .
    Solar charge controller mounted on a wall with a display and connected wires. .
    OK, it's a solar controller.

    .
    Arduino Uno R4 Minima board on a wooden background. .
    OK, he got to know the new Arduino, although was it that difficult?


    Ribbon cable connector with wires extending. .
    Bad, as you can see the LVDS connector does not know?


    Fan motor with a capacitor on a checkered fabric background. .
    Wrong, even the capacity reads wrong. There it is 0.95 not 10.... I wonder if he accidentally took MOT (Microwave Oven Transformer) for MOTOR (motor), or maybe the tokens got mixed up because the name is similar?

    .
    ESR meter atlas ESR+ model ESR70 by PEAK Electronic Design Ltd on cardboard background.
    Screenshot showing the ESR reading result using the Gemma 3 model. .
    Well, ESR also read as I asked.

    .
    Printed circuit board with electrolytic capacitors .
    Well, he even determined that something was wrong with these capacitors.

    .
    The image shows a tuner module from an analog TV or VCR on a wooden surface. .
    Well, he recognised that it was a tuner.... only this "black connector"...

    .
    Internal mechanism of a mechanical pencil
    Badly, the electron gun from the CRT however, Gemma does not know.

    .
    White plastic electronic component with mounting holes, placed on wooden background. Several wires extend from the back.
    Wrong, and supposedly so simple - a light switch.


    LED bulb with visible circuit board and LED diodes. .
    Right.

    .

    Image recognition time .
    For my hardware (specs at the beginning of the topic), the times are twofold - it depends if something is occupying the GPU. Unless I'm turning nothing else on the laptop, CUDA kicks in and the model runs faster.
    CUDA times:
    
    ime=2025-03-22T12:54:05.803+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
      Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1, VMM: yes
    load_backend: loaded CUDA backend from W:\TOOLS\ollama-windows-amd64\lib\ollama\cuda_v12\ggml-cuda.dll
    load_backend: loaded CPU backend from W:\TOOLS\ollama-windows-amd64\lib\ollama\ggml-cpu-haswell.dll
    time=2025-03-22T12:54:06.369+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
    time=2025-03-22T12:54:06.496+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="16.8 GiB"
    time=2025-03-22T12:54:06.496+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="502.7 MiB"
    time=2025-03-22T12:54:28.892+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
    time=2025-03-22T12:54:28.892+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
    time=2025-03-22T12:54:28.894+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
    time=2025-03-22T12:54:28.901+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
    time=2025-03-22T12:54:28.905+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
    time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
    time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
    time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
    time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
    time=2025-03-22T12:54:28.916+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
    time=2025-03-22T12:54:29.128+01:00 level=INFO source=server.go:619 msg="llama runner started in 23.58 seconds"
    [GIN] 2025/03/22 - 12:54:52 | 200 |   47.8405856s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:00:33 | 200 |         4m20s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:00:57 | 200 |    24.673879s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:01:26 | 200 |   28.8113443s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:05:36 | 200 |         2m56s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:06:00 | 200 |   24.0392796s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:06:28 | 200 |   27.8827066s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:11:07 | 200 |   23.0458191s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:17:15 | 200 |          6m6s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:17:48 | 200 |   32.5789072s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:18:23 | 200 |   33.5782454s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:26:32 | 200 |         4m41s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:27:08 | 200 |   35.8528721s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:27:38 | 200 |   30.0081765s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:34:09 | 200 |         3m38s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:34:33 | 200 |   23.8713585s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:35:02 | 200 |   28.2681217s |   192.168.0.213 | POST     "/api/chat"
    
    .

    CPU times:
    
    
    time=2025-03-22T13:42:18.212+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
    load_backend: loaded CPU backend from W:\TOOLS\ollama-windows-amd64\lib\ollama\ggml-cpu-haswell.dll
    time=2025-03-22T13:42:18.249+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
    time=2025-03-22T13:42:18.262+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="17.3 GiB"
    time=2025-03-22T13:42:28.245+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CPU
    time=2025-03-22T13:42:28.248+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
    time=2025-03-22T13:42:28.255+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
    time=2025-03-22T13:42:28.261+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
    time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
    time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
    time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
    time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
    time=2025-03-22T13:42:28.277+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
    time=2025-03-22T13:42:28.424+01:00 level=INFO source=server.go:619 msg="llama runner started in 10.48 seconds"
    [GIN] 2025/03/22 - 13:52:32 | 200 |        10m15s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 13:57:10 | 200 |         4m37s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 14:03:48 | 200 |         8m58s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 14:08:04 | 200 |        10m53s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 14:11:44 | 200 |         7m56s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 14:14:56 | 200 |         9m23s |   192.168.0.213 | POST     "/api/chat"
    [GIN] 2025/03/22 - 14:22:46 | 200 |        16m42s |   192.168.0.213 | POST     "/api/chat"
    



    Summary .
    Impressive. There are shortcomings, but it is still impressive. This model is definitely better than the previously tested LLAVA , and with the help of the GPU (via CUDA) it still goes pretty fast.
    Hallucinations still happen, especially in unclear situations, but it's still very good.
    What's more, this model does a pretty good job of reading text and all sorts of markings and lettering, too, from hardware and from displays.
    This is really a great deal, considering that I ran the whole thing on my old laptop.
    Do you see any uses for such an AI model? .
    If you have any pictures to test, feel free to comment too.

    Cool? Ranking DIY
    Helpful post? Buy me a coffee.
    Do you have a problem with Arduino? Ask question. Visit our forum Arduino.
    About Author
    p.kaczmarek2
    Moderator Smart Home
    Offline 
    p.kaczmarek2 wrote 11838 posts with rating 9933, helped 566 times. Been with us since 2014 year.
  • ADVERTISEMENT
  • #2 21490886
    Urgon
    Level 38  
    AVE...

    Also a big meanie that it recognises images. Envision Glasses and the free Envision AI app can recognise images and direct the user to a specific object.



    .

    A sensible test of the capabilities of Gemma 3. or any other model would be to show a diagram, give assumptions and command "take me count". And check the results...
  • ADVERTISEMENT
  • #3 21490893
    p.kaczmarek2
    Moderator Smart Home
    Thanks for recommending the app, I'm keen to test it out. I'm just wondering if it's based on a model available to download and run locally, or if it runs on a closed model API, e.g. from OpenAI?

    I ask because I'm essentially praising here the fact that this is a good vision model which can be fired up locally .
    Helpful post? Buy me a coffee.
  • #4 21490902
    Urgon
    Level 38  
    AVE...

    The Envision AI application and the related Seeing AI are based on closed models, as far as I know. But they have a specific use: helping the visually impaired and blind. There's also a recent EMVI app, but it comes with a fee. And then there are the Meta Ray-Ban glasses with Meta AI, which many blind people use because they cost 1 400 zeta, not nearly 10 000 like Envision Glasses....

    I test the models locally myself, but the problem is that you need to have a really powerful GPU, and preferably more than one, for this system to make sense for local use....
  • #5 21490941
    krzbor
    Level 27  
    Does the model have limitations on the resolution of the images?
  • ADVERTISEMENT
  • #6 21490962
    Urgon
    Level 38  
    AVE...

    I don't think there is a limit. The apps I tested work just as well with a 12MPix camera as with a 48MPix smartphone camera. The image is simplified by neural network processing anyway. Read up on the early work on neural networks, especially perceptrons....
  • ADVERTISEMENT
  • #7 21491157
    p.kaczmarek2
    Moderator Smart Home
    I haven't seen any problems with the size of the images, but it probably scales internally to the correct matrix anyway?

    Counting tests:
    Five small circuit boards on a white background. .
    Three ESP32-S3 development boards on a white background. .
    Two small electronic circuit boards on a white surface. .
    Three plastic boxes containing LILYGO ESP32 CAN RS485 PINMAP electronic modules. .
    Simple counted, but it's getting lost here:
    The image shows several blue USB cables next to five packaged in clear plastic bags with labels.
    Helpful post? Buy me a coffee.
  • #8 21491326
    acctr
    Level 38  
    p.kaczmarek2 wrote:
    probably scales internally to the appropriate matrix anyway?
    .
    In a way yes, the input image or part of it is scaled to 896x896. It is then fed to the SigLIP encoder, which generates tokens for the language model.
    Except that the encoder can do a crop of a given image fragment, i.e. the resolution matters.
    Helpful post? Buy me a coffee.
  • #9 21491564
    krzbor
    Level 27  
    p.kaczmarek2 wrote:
    I haven't seen any problems with the size of the images, but it probably scales internally to the correct matrix anyway?
    .
    Give it an A4 scan printed in small text and ask it to convert to text (OCR). If it reduces to 896x896, there is no chance of correct recognition.
  • #11 21498881
    PPK
    Level 29  
    It will be interesting to see when the schemes...

Topic summary

The discussion evaluates the capabilities of the Gemma 3 27B multimodal AI vision model, recently released by Google and supported by the Ollama environment, in recognizing and describing electronics-related images. Tests involve feeding various photos and screenshots to Gemma 3 to assess its image understanding and counting accuracy. The model processes input images by scaling or cropping to a resolution of approximately 896x896 pixels before encoding with the SigLIP encoder, which may limit detailed recognition such as OCR on small text scans. Comparisons are made to other AI vision applications like Envision AI and Seeing AI, which are closed models primarily designed for assisting visually impaired users, and Meta Ray-Ban glasses with Meta AI. Local deployment of Gemma 3 requires substantial GPU resources, with the author using an Intel i7-6700HQ CPU, 64GB RAM, and GeForce GTX 1060 GPU. The discussion highlights the model’s strengths in image description but notes challenges in precise counting and text recognition due to resolution constraints.
Summary generated by the language model.
ADVERTISEMENT