logo elektroda

Minitest: robot vision? Multimodal AI LLaVA and workshop photo analysis - 100% local

p.kaczmarek2 2820 10

TL;DR

  • LLaVA, a local multimodal AI assistant, was tested on workshop photos to check how well it recognizes objects and reads text from images.
  • The tests covered dryers, monitors, routers, PCBs, speakers, clocks, controllers, and screens, with the model often identifying the right category but sometimes inventing details.
  • It could read some labels and numbers, such as T9 Max, MSI, Winbond, 10 inches, and 17:20, but OCR was inconsistent and often distorted.
  • Results were mixed: object recognition was often surprisingly good, yet hallucinations, wrong readings, and failures on jokes or fine details showed it is not as strong as closed solutions.
  • The main conclusion is that 100% local LLaVA is already useful for basic visual assistance, but it still feels too early for reliable workshop use.
Generated by the language model.
This content was translated from Polish to English; see the original version of the topic.
  • LLaVA project homepage with logo and information.
    Here I will test in practice how well the LLaVA AI model, which I managed to run locally on my computer, analyzes photos. I will check how advanced the model is, how well it can describe photos from the workshop, and whether it can read text from photos.
    What is LLaVA?
    LLaVA stands for "Large Language and Vision Assistant", i.e. a large language and vision model able to respond to both text and images. To simplify even more - it is like ChatGPT, but it can be run locally on our own computer.
    How to run LLaVA?
    I refer here to the previous topic in the series: ChatGPT locally? AI/LLM assistants to run on your computer - download and installation
    LLaVA can be downloaded via the WebUI from the previous topic, installed locally in Docker, but those interested can also visit the project's home page:
    https://llava-vl.github.io/
    The authors of the project promise quite good results, but of course we cannot be sure that their examples were not selected to look as impressive as possible (so-called cherry-picking):
    Screenshots showing image analysis by different AI models.
    For this reason I ran this model locally and tested it for you with different photos.

    So here we go. The testing format will be simple:
    - first, a verbal description of the photo I posted
    - then a screenshot of the conversation with LLaVA
    - and then separately the pros and cons of the AI response in my opinion
    Here are my tests, in no particular order.

    Photo of the inside of the dryer
    Dryer heating element on a wooden surface with a connected electrical cord.
    + the model recognized that it was a heating element
    + the model recognized that a cable was connected
    - the model hallucinated that there was some text overlay here
    - the model hallucinated that it saw a table leg

    Screen with the T9SMAX logo from Android Box
    TV screen displaying T9SMAX logo and promotional slogan.
    + the model recognized that it was a TV screen
    + the model read the inscription as T9 Max (it lost the S?)

    Photo of the dryer:
    Hairdryer with black cord on wooden background
    + the model recognized that it was a dryer with a black cable

    Multimeter:
    Digital multimeter DT832 with test leads on a wooden table.
    + the model recognized that it was a multimeter with wires

    Monitor:
    Computer monitor on a stand oriented upside down on a tiled floor.
    + the model recognized that it was a monitor
    + the model recognized that the monitor was on its side (photo?)
    - the model hallucinated that there was a mouse and keyboard...

    Atari:
    Old Atari 800XL keyboard on a wooden table.
    + the model recognized the hardware as an old IBM keyboard
    - the model hallucinated a numeric keypad on the right?

    Socket timer:
    Electronic device in plastic packaging on a wooden background.
    + the model recognized that it was some kind of programmer
    + the model recognized that it was in a package that had not been opened
    - the model assumed it was a thermostat
    - the model says there is a barcode here, I don't see any

    Damaged TV (broken matrix):
    A broken LG TV screen with colorful lines.
    + the model knew that it was a damaged TV and that there were visible lines, cracks and colored stripes
    - the model hallucinated the words "Your TV has been damaged" - absurd?

    Smart power socket meter:
    Electrical socket on a workshop table.
    + the model recognized that it was an electronic device
    - the model hallucinates that it is a charger, that it has some connectors, etc

    Screenshot with the MSI logo:
    MSI logo on screen with BIOS message
    + the model correctly recognized the MSI logo
    - but the model also made up some alleged "Raspberry Pi" below

    PCB covered with plastic:
    The image shows a small electronic circuit board submerged in a liquid, likely a PCB.
    + the model recognized that it was a circuit board and that it was flooded

    LED lamp lit:
    A light bulb glowing on a base, lying on a wooden surface.
    + the model recognized that it was a light source
    - the model incorrectly determined the type of lamp and made up a switch
    - the model made up tools in the background

    Flashing LED lamps:
    White LED bulb with attached electronic circuit on a wooden table.
    + the model recognized that it was a "LED bulb"
    + the model also knew that there was a connected system, that it was a DIY system, that there was a prototype board


    Prototype board:
    Copper-colored perforated board on a wooden surface.
    + the model correctly knew what type of board it was and how it was constructed

    Photo from a thermal paste replacement:
    Computer motherboard with tools on a wooden table
    + the model recognized quite precisely what was happening there, even indicated RAM, etc

    Screenshot from flasher:
    Screenshot of a computer program showing a list of devices and their addresses.
    + the model recognized that it was a screenshot from software...
    - the model guessed that these were IP addresses, MAC addresses, etc.

    Sonoff NSPanel photo at 17:20:
    View of a digital clock displaying time and weather.
    + the model recognized that the time and weather were displayed
    + the model tried to read the temperature and time, with mixed results (17:09 instead of 17:20, but it did get 23°C right)
    - the model made a lot of small errors in reading the numbers

    Playstation console:
    PlayStation console with one black controller on a wooden table.
    + the model recognized that it was a Playstation console with a pad
    + the model recognized the SONY inscription
    - the model claimed there are two controllers here?

    RCA to SCART converter:
    RCA connector on a wooden table.
    + the model recognized that it was related to electronics...
    - apart from that, a total failure, hallucinations about PCBs, ICs, protocols

    Old router:
    Black network router with multiple LAN ports and a detached antenna in front.
    + the model more or less knew that it was a switch, that it had ports, etc
    - the model did not notice the antenna
    - is the model hallucinating that there is a cable here?
    - the model hallucinates that there is an inscription?

    Old router and cell...
    Image of an electronic device with an Ultrafire battery on top.
    + how did the AI know it was Ultrafire?
    + AI also tried to read the capacity, but it was mixed up with the cell type - 1865mAh?
    - AI was unable to determine the type of equipment in the photo

    Old radio:
    Vintage radio on a countertop with tuning buttons and volume knob.
    + AI recognized that it was a radio
    - The AI invented some kind of digital display

    System in SOIC housing:
    Image of an integrated circuit with the label Winbond on a wooden surface.
    + AI recognized that it was IC and read "Winbond" as "Winebond"
    - so it did make one typo

    Transformer inside the radio:
    Open electrical appliance with a metallic box and wires connected.
    + AI recognized that it was some electrical device, wires
    - unfortunately it also hallucinates a lot, where are the relays?

    Ball mouse without a ball:
    Old ball mouse with a cord on a wooden surface.
    + AI recognized that it was a mouse
    - AI failed on the mouse cable (it claims it's USB)
    - AI failed to determine the type of mouse (it claims it is optical)

    Loudspeakers:
    Two small yellow speakers connected by cables on a wooden surface.
    + The AI recognized correctly that these were two speakers on the table

    Camping lamp:
    Black and green camping lamp on a wooden background.
    - total nonsense, where's the power tool?

    LDNIO strip:
    Power strip with two electrical sockets and LDNIO logo.
    + AI recognized that it was an electrical strip
    + AI read the LDNIO logo as LONIO (small typo)
    - However, the AI is wrong about the number of ports

    ESR70 tester with capacitor:
    ESR70 tester displaying 2.8 ohms measurement connected to a capacitor.
    + AI recognized that it was some kind of measurement...
    - unfortunately the AI thought it was an oscilloscope
    + AI read almost correctly 2.8 ohms from the display, but made it 2.9 kHz

    LED ceiling panel:
    Square light panel resting on a surface.
    + AI correctly recognized that it was a square LED panel
    + AI even recognized that the panel was new, in the packaging
    + somehow the AI knew it was in the kitchen (by the stove)

    PCB with a desoldered component:
    Close-up of a circuit board with a missing chip.
    + AI recognized that it was a PCB, etc
    - However, the AI did not know that something was missing

    Reading the time from my DIY clock:
    Digital display showing the time 19:54.
    + AI recognized that it was a clock with time...
    - but it didn't read the time correctly, although it tried; the 1 and 5 are correct

    Laptop interior:
    Disassembled laptop interior missing disk and RAM.
    + AI recognized that it was a disassembled laptop
    + AI recognized that the disk was missing (not visible)
    - AI thought the disc drive was a battery?
    - AI sees RAM somewhere here?

    XBOX Controller:
    Black Xbox 360 controller lying on a surface.
    + The AI somehow knew that it was specifically an XBox controller

    Mouse interior:
    Open interior of an electronic device focusing on the mechanism of a computer mouse.
    + AI recognized that it was a mouse, that there was a scroll wheel, etc...
    - why does the AI think that this mouse has a red housing? by PCB?

    Broken monitor:
    Broken computer monitor with cracked screen, on a stand.
    + AI correctly recognized that it was a broken monitor and that it had a broken matrix
    - where does AI see the wall socket?

    Meme with "fuses":
    The image shows a humorous guide to fuse replacement with various items mounted on a board.
    Quote:

    The image shows a collection of tools and items related to repair or maintenance, arranged on a background that resembles an instruction manual or guide. The title at the top reads "GUIDE TO FUSE REPLACEMENT," suggesting that these tools and parts are used for repairing or replacing fuses in various equipment.


    The items included in the display are:



    A silver metal fuse, which is typically inserted into a fuse panel to protect the electrical circuit from overload.

    Screwdrivers of different sizes, which would be used to remove and replace fuses without damaging the surrounding components.

    A small bottle or container with a label that says "500ml," though it's unclear what it contains.

    A small metal object that could possibly be a clip or a tool part.

    An automotive fuse puller, which is used to extract the fuse from its socket without breaking the circuit.

    A plastic tool with an angled head and a notch for gripping the fuse holder in place while removing the fuse.

    A small metal clip, which might be a retention clip for holding the fuse in the socket.

    A plastic tool designed to press down on the fuse holder to secure the fuse in place after it has been removed from the socket.

    An electrical test light, which can be used to visually check if an electrical circuit is open or closed by identifying the illumination of the light bulb when the probe is clipped to the circuit.

    A small metal object that appears to be a pin or a clip with a notch on one end, possibly for holding components together.

    An electronic component resembling an integrated circuit (IC), which could be used in fuse replacement for controlling electrical circuits within a device.


    The items are neatly organized and labeled, indicating that this might be part of an educational display or a DIY repair guide.

    + AI read the text at the top of the image
    - The AI didn't understand that it was a joke...

    Disassembled mouse:
    Disassembled computer mouse on a table with visible internal components.
    + AI recognized that it was a mouse in parts

    Screwdriver length (photo from the web):
    Image of a screwdriver with labeled dimensions.
    Screenshot of a conversation about the length of a screwdriver.
    + The AI somehow knew that the total length was 10 inches...

    Three screwdrivers:
    Three blue screwdrivers with black handles.
    + AI recognized that these were three screwdrivers...

    Hammer graphics:
    The image shows a claw hammer with a metal head and a black and red handle.
    + AI recognized it was a hammer

    Graphic measurement:
    Digital display of a measuring device with Accuenergy branding.
    + The AI tried to read something, and it was quite successful with "500".
    - most of the other values were read incorrectly

    Walkman:
    Image of an open electronic device lying on a wooden table.
    - failure, AI thought it was a laptop

    Or maybe without opening it?
    Portable audio device lying on a wooden surface.
    + AI recognized that it was an audio recorder or radio
    - AI decided that an LCD is visible here...

    My BK7231 clock:
    Electronic device with LED display and wiring.
    + AI almost read ABCDE, but read it as ABODE, C merged with D

    My clock displaying 20:36:
    Digital clock with red illumination displaying the time 20:36.
    + AI read 2:36
    - however, AI lost the 0

    This is now the temperature reading:
    Image of an LED display showing a temperature of 21°C.
    + AI recognized that there is some temperature here...
    - but the reading is nonsense: where did it get 9:30 p.m. from? And 12 degrees...

    Mouse cable:
    The image shows a computer mouse with a blue DE-9 connector cable attached.
    - rather a failure, AI forcibly associates USB with mice...

    Summary
    I must admit that it's good, especially compared to a few years ago. This model can really recognize a wide range of objects and can sometimes even handle several objects/situations in the frame. Sometimes it can also read text, although it often distorts it. It's not as good as closed solutions, but remember that LLaVA is available for download and can run on our own machine.
    Now the only question is - what to use it for? Maybe a workshop assistant, although it's probably too early for that? Are we waiting for the 2024/2025 version? I invite you to discuss.

    About Author
    p.kaczmarek2
    Moderator Smart Home
    Offline 
    p.kaczmarek2 wrote 14612 posts with rating 12630, helped 655 times. Been with us since 2014 year.
  • #2 21035752
    gulson
    System Administrator
    Posts: 29367
    Help: 148
    Rate: 6050
    What if we indexed all images on the forum and made it possible to search within those images? 🤔
    It would make finding information much easier.
    But open source models are still not perfect, as the tests show.
    Apparently the best open source one is CogAgent/CogVLM.

    Among the paid models we have:
    GPT-4V, Gemini, Claude
    Thanks for the test!
  • #3 21035771
    p.kaczmarek2
    Moderator Smart Home
    Posts: 14612
    Help: 655
    Rate: 12630
    This will probably happen sooner or later, as well as assistants who visually determine the fault and help, for example, perform basic activities such as, I don't know, replacing a light bulb...

    And now some practical information about how quickly the tests went on my equipment. First the specifications:
    Intel Core i7-6700HQ CPU information at 2.60GHz
    RAM before starting the model (ignore the usage, I have a few small things running in the background):
    Screenshot of Windows Task Manager showing memory usage.
    After turning on the model:
    Screenshot of Windows Task Manager showing RAM usage and CPU activity.
    And one more matter: how long it takes to describe an image. It takes a while before the AI starts "speaking":
    Red electrical device with cables on a wooden background.
    And here it is ready - 3 minutes after the start:
    Red electrical device with a power supply connected on a wooden table, with a timer reading 3:00.24 and buttons labeled Runda and Stop.
    In the photo above, the AI obviously didn't recognize what this equipment was, but it also surprised me...

    This type of image description, especially once it is slightly improved and works faster, can have many applications: "intelligent" galleries of family photos that can be searched with a phrase such as "barbecue photo by the lake", or in the workshop, maybe some helper for cleaning/organizing the workshop? The ability to run this locally solves privacy issues.
  • #4 21038012
    krzbor
    Level 29  
    Posts: 1756
    Help: 41
    Rate: 1063
    It's clear it has issues with older devices/things. For it, the RS232 connector simply does not exist, just like RCA to SCART or Walkmans. Probably if we showed these things to an eighteen-year-old, he wouldn't recognize them either. What surprised me, however, was that it didn't recognize the Schuko socket. The model was probably trained on American photos :)
    The poor text recognition is also strange - I thought it would be much better.
    The big drawback is the response time - I thought it was just a few seconds.
  • #5 21040415
    p.kaczmarek2
    Moderator Smart Home
    Posts: 14612
    Help: 655
    Rate: 12630
    There was simply no RS232 in the training examples, so it doesn't recognize it. Remember that for this system to recognize RS232 it must be in the training examples, marked manually by humans in multiple images.

    Why are you disappointed by poor text recognition? Do you know how this system works? As far as I know, it is not classic OCR, nor is there an OCR "sub-module"; rather, the ability to recognize text is a secondary ability resulting from how the AI works...
  • #6 21040981
    ChatGPT
    Level 12  
    Posts: 84
    Help: 3
    Rate: 20
    I wonder how ChatGPT 4.0 would recognize all this.
  • #7 21041004
    p.kaczmarek2
    Moderator Smart Home
    Posts: 14612
    Help: 655
    Rate: 12630
    @ElektrodaBot what do you see in the photo?
    A4Tech computer mouse next to a blue connector.
  • #8 21041006
    ElektrodaBot
    AI OpenAI ChatGPT4
    Posts: 8812
    Help: 357
    Rate: 937
    p.kaczmarek2 wrote:
    The photo shows part of an electrical or electronic device with the "A+ TECH" logo and a blue plug that appears to be an industrial or specialized connector. This plug has a pin connector which suggests it can be used to transmit signals or power in industrial, medical or other specialized applications. This connector has nine pins arranged in three rows of three pins each and is a male connector with a plastic housing.

    If you need help identifying or want to know what this connector or device is for, let me know! As an electronics engineer, I will be happy to help you solve any puzzles or problems related to this device.
    You can call me in any topic by writing @ElektrodaBot. Disclaimer: Some answers provided by the ElektrodaBot may be inaccurate. Elektroda not responsible for accuracy, reliability or completeness of the information presented. All ElektrodaBot responses should be verified by the user. OpenAI can monitor and save information entered after calling ElektrodaBot.
  • #9 21041011
    p.kaczmarek2
    Moderator Smart Home
    Posts: 14612
    Help: 655
    Rate: 12630
    @ElektrodaBot what's in the photo? What does this graphic mean?
    Composition of objects used as fuses with different amperage ratings.
  • #10 21041012
    ElektrodaBot
    AI OpenAI ChatGPT4
    Posts: 8812
    Help: 357
    Rate: 937
    p.kaczmarek2 wrote:
    what's in the photo? What does this graphic mean?
    https://obrazki.elektroda.pl/3783272400_1712818150.png
    The photo shows a graphic titled "Guide to Fuse Replacement", which at first glance may look like instructions for different types of fuses with different current ratings. However, upon closer inspection, you can see that the items shown in the graphic are not actually fuses. This is a humorous show in which everyday objects (such as a piece of aluminum foil, a bullet, a screw, a pin, etc.) and tools (such as a spanner) are used to depict "fuses" with different amperage values.

    Here are the items pictured and their alleged amp ratings:

    - 100 amps: A piece of aluminum foil that could jokingly be used as a makeshift fuse by people who do not understand the dangers or proper operation of real fuses.
    - 250 ampere: cartridge, which is of course a very dangerous and absurd joke, because the use of ammunition in an electrical installation is not only dangerous, but also illegal.
    - 350 Amps (Audiovisual Auto-Alert): The actual fuse may not be available and the item shown in the graphic does not provide any indication of its actual function.
    - 600 amperes: a screw that is not an electrical safety item and would be unsafe to use in this context.
    - 130 amperes: a key to some security device or lock, which is also not a fuse.
    - 1500 Amps (Slow-Blow): Another screw that jokingly suggests being a delayed-acting fuse, which is obviously untrue.
    - 2000 ampere (Slow-Blow): an open-end wrench that is a hand tool, not an element that protects the electrical installation.

    Overall, the graphic is a fun compilation of items that definitely shouldn't be used as fuses, with captions suggesting their unusual uses. Such graphics can be used for humorous or educational purposes to show what definitely not to do in real electrical installations. In fact, using inappropriate items as replacement fuses is very dangerous and can lead to fires or other serious electrical failures.
  • #11 21041054
    p.kaczmarek2
    Moderator Smart Home
    Posts: 14612
    Help: 655
    Rate: 12630
    @ChatGPT so ElektrodaBot (ChatGPT4) detected the joke; only LLaVA couldn't handle it.

    If you want, you can invoke ElektrodaBot yourself: first mention it in the content of the post and then post the question.

    Added after 1 [minutes]:

    Btw, what this bot wrote earlier regarding the mouse:
    Quote:

    This connector has nine pins arranged in three rows of three pins each

    I see two rows of pins in the photo...

Topic summary

✨ The discussion revolves around the capabilities of the LLaVA (Large Language and Vision Assistant) AI model, which can analyze images and read text locally on a computer. Users are testing its performance with workshop photos, noting its strengths and weaknesses. While LLaVA can describe images, it struggles with older technology recognition and text accuracy. Comparisons are made with other AI models like GPT-4V, Gemini, and Claude, highlighting LLaVA's limitations in recognizing certain connectors and its slower response time. The conversation also touches on the potential for future advancements in AI image indexing and fault detection.
Generated by the language model.

FAQ

TL;DR: In one local test, LLaVA needed about 3 minutes to start one image description, and the verdict was "it's good" for workshop photos but unreliable for OCR, jokes, and older connectors. This FAQ helps electronics users judge when a fully local vision model is useful, when it hallucinates, and when cloud tools still win. [#21035771]

Why it matters: A local multimodal model can protect photo privacy in an electronics workshop, but you need realistic expectations for speed, text reading, and fault analysis.

Model or group | Deployment | Reported strength in thread | Reported weakness in thread
LLaVA | Local, downloadable | Recognized many workshop objects and some text | Hallucinated objects, weak OCR, slow startup
CogAgent / CogVLM | Open-source | Described as the best open-source option in the discussion | No direct forum benchmark shown
GPT-4V / Gemini / Claude | Paid cloud models | Listed as stronger paid alternatives | Not local; privacy trade-off implied

Key insight: Local vision AI already identifies many electronics correctly, but it still fails on edge cases that matter in repair work: old connectors, exact text, counts of ports, and image humor.

Quick Facts

  • Local image description in the shared setup started after about 3 minutes, which set the practical speed expectation for workshop use. [#21035771]
  • One display-reading test showed 23°C was read correctly, but the same screen time was misread as 17:09 instead of 17:20. [#21035481]
  • A DIY clock showing 20:36 was read as 2:36, so the model kept the 2 and 36 but lost one 0. [#21035481]
  • In a measurement test, the model nearly read 2.8 ohms, but converted it into 2.9 kHz, showing unit confusion rather than pure recognition failure. [#21035481]
  • A connector was misdescribed as having nine pins in three rows of three, while the poster explicitly noted the photo showed two rows of pins. [#21041054]

What is LLaVA and how does a local multimodal AI model differ from ChatGPT-style cloud tools?

LLaVA is a local multimodal model that answers from both text and images. The thread defines it as "Large Language and Vision Assistant" and describes it as ChatGPT-like, but runnable on your own computer instead of only in the cloud. That local setup matters for private workshop photos, device internals, and family galleries that users may not want to upload externally. [#21035481]

How do you run LLaVA locally on a computer using Docker and WebUI?

You run LLaVA through the local WebUI mentioned in the earlier Docker setup. The thread gives this workflow: 1. install the local AI/LLM WebUI in Docker from the earlier guide, 2. download or enable LLaVA inside that local interface, 3. submit photos and prompt the model for descriptions or text reading. The project home page is also referenced for interested users. [#21035481]
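
For repeatable tests it can be easier to script the request than to click through the WebUI. The sketch below is only an illustration under stated assumptions: it assumes the WebUI sits on top of an Ollama server listening on its default local port 11434, with the model available under the tag `llava`. The thread confirms neither of these, and the helper names `build_payload` and `describe_image` are made up for this example.

```python
import base64
import json
import urllib.request

# Assumption: an Ollama server on its default local port; not stated in the thread.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava") -> dict:
    """Build the JSON body for a single, non-streaming image question."""
    return {
        "model": model,  # assumed model tag
        "prompt": prompt,
        # The generate endpoint takes images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(path: str, prompt: str = "Describe this photo.") -> str:
    """Send one image to the local server and return the model's answer."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, a call such as `describe_image("dryer.jpg")` (a hypothetical file name) would return the description as a plain string, which makes it possible to loop over a whole folder of workshop photos.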

Why does LLaVA hallucinate objects or text in workshop photos, like a keyboard, table leg, barcode, or wall socket that are not really there?

LLaVA hallucinates because it predicts plausible visual details, not guaranteed facts. In the tests, it invented a table leg inside a dryer photo, a barcode on packaged hardware, a mouse and keyboard near a monitor, and a wall socket beside a broken monitor. Those errors show the model can identify the general scene while still adding false objects or false text overlays. [#21035481]

How well can LLaVA read text from images such as T9SMAX, MSI, SONY, Winbond, clocks, and device displays?

LLaVA reads short labels and logos moderately well, but it often distorts letters, digits, or units. It recognized MSI, SONY, and most of T9SMAX, dropped the "S" once, read Winbond as "Winebond," and misread several clock or display values. It also got 23°C right in one case, but not the displayed time. [#21035481]
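
One way to put a number on "distorts letters and digits" is the edit distance between the real text and what the model read. The thread itself only gives informal +/- notes, so the snippet below is an added illustration, not the poster's method. It uses plain Levenshtein distance on a few readings reported in the tests.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# (expected text, model reading) pairs taken from the tests in this thread
READINGS = [
    ("Winbond", "Winebond"),
    ("LDNIO", "LONIO"),
    ("ABCDE", "ABODE"),
    ("20:36", "2:36"),
    ("17:20", "17:09"),
]

for expected, got in READINGS:
    print(f"{expected!r} -> {got!r}: {edit_distance(expected, got)} edit(s)")
```

The first four readings are one edit away from the truth, while 17:20 read as 17:09 is two edits away. A one-character slip is harmless in a logo but changes the meaning completely on a clock or a meter display.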

What hardware and RAM usage should I expect when running LLaVA locally for image analysis?

You should expect noticeable RAM growth and a machine strong enough to tolerate multi-minute image analysis. The poster shared screenshots of system specifications, RAM before model startup, and RAM after startup to show that enabling LLaVA materially changes memory use. The thread does not give a numeric RAM figure, so the safe takeaway is to plan for visible memory overhead rather than a lightweight background tool. [#21035771]

How long does LLaVA usually take to describe a photo on local hardware, and why can the response take around 3 minutes to start?

In the shared local setup, one image description began about 3 minutes after launch. The poster showed a waiting screen, then a ready result three minutes later, and noted that the model takes some time before it starts "speaking." That delay reflects local inference overhead, not instant OCR-style processing, so response time can dominate usability. [#21035771]
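
The poster timed the wait with a stopwatch in the photo. If the model's output is available as a stream of text chunks, the same two numbers (delay before the first chunk, and total time) can be measured in code. The sketch below is hypothetical: `time_response` and `fake_model` are illustrative names, and the fake generator only simulates a model that pauses before it starts "speaking".

```python
import time
from typing import Callable, Iterable, Tuple


def time_response(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float, str]:
    """Consume streamed text chunks and return
    (seconds until first chunk, total seconds, full text)."""
    start = clock()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = clock() - start  # delay before the model starts answering
        parts.append(chunk)
    total = clock() - start
    if first is None:
        first = total  # nothing arrived; report the whole wait
    return first, total, "".join(parts)


def fake_model():
    """Stand-in for a streaming model: waits, then emits a few words."""
    time.sleep(0.2)  # simulated image-processing delay
    for word in ["Red ", "electrical ", "device"]:
        time.sleep(0.05)
        yield word


first, total, text = time_response(fake_model())
print(f"first chunk after {first:.2f}s, done after {total:.2f}s: {text!r}")
```

Separating the two numbers matters because the delay before the first chunk is the part the user sits through with no feedback at all.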

What can I realistically use a local vision model like LLaVA for in a workshop or electronics lab?

You can use it for broad object recognition, rough photo tagging, and simple scene summaries. In the tests, it correctly identified items such as a multimeter, monitor, prototype board, Playstation console, XBOX controller, speakers, and a square LED panel in packaging. The thread also suggests workshop cleanup help, family photo search like "barbecue photo by the lake," and privacy-safe local indexing. [#21035771]

Why does LLaVA struggle with older hardware like RS232 connectors, RCA to SCART adapters, Walkman devices, and Schuko sockets?

It struggles because rare legacy items were not sufficiently represented in training examples. The poster states that if RS232 was absent from training images, the system will not recognize it. The test thread also shows poor results on RCA-to-SCART hardware, a Walkman misread as a laptop, and surprise that a Schuko socket was not recognized. [#21040415]

What is CogAgent or CogVLM, and is it considered better than LLaVA for open-source image understanding?

CogAgent or CogVLM is presented in the discussion as the strongest open-source alternative, but the thread does not include a side-by-side benchmark. One reply says "Apparently the best open source one is CogAgent./CogVLM," while paid options listed separately were GPT-4V, Gemini, and Claude. So the forum positions it above LLaVA by reputation, not by measured local test data. [#21035752]


LLaVA vs GPT-4V, Gemini, and Claude: which model is better for recognizing electronics, reading text in photos, and understanding jokes in images?

The thread portrays paid cloud models as stronger overall, especially for higher-level understanding. LLaVA did recognize many electronics, but it failed on exact text, older gear, and the "fuses" joke image. In contrast, the forum’s ChatGPT-based bot explicitly recognized that the fuse graphic was humorous and explained why the pictured objects were not real fuses. [#21041012]

How could image indexing and search on a forum work if AI analyzed all uploaded photos locally or with open models?

It would work by generating text descriptions or tags for each uploaded image and then searching those descriptions. One reply suggests indexing all forum images so users could search within photos, while another gives examples such as finding a family picture with the phrase "barbecue photo by the lake." A local or open-model pipeline would reduce privacy concerns for those image-derived labels. [#21035771]
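
The idea of searching generated descriptions can be sketched with a tiny inverted index. This is a minimal illustration, not a design proposed in the thread: the descriptions are borrowed from image captions in the discussion, while the image IDs and function names are invented for the example.

```python
import re
from collections import defaultdict

# Descriptions borrowed from image captions in this thread; the IDs are made up.
DESCRIPTIONS = {
    "img-001": "Digital multimeter DT832 with test leads on a wooden table.",
    "img-002": "Black network router with multiple LAN ports and a detached antenna in front.",
    "img-003": "Old ball mouse with a cord on a wooden surface.",
    "img-004": "Disassembled computer mouse on a table with visible internal components.",
}


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(descriptions: dict) -> dict:
    """Map each word to the set of image IDs whose description contains it."""
    index = defaultdict(set)
    for image_id, text in descriptions.items():
        for word in tokenize(text):
            index[word].add(image_id)
    return index


def search(index: dict, query: str) -> list:
    """Return the IDs of images whose description contains every query word."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


index = build_index(DESCRIPTIONS)
print(search(index, "mouse"))         # ['img-003', 'img-004']
print(search(index, "wooden table"))  # ['img-001']
```

Exact word matching like this inherits every mistake the model makes: a Walkman described as a laptop would only be found by searching for "laptop", so description quality directly limits search quality.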

What is a Schuko socket, and why might a vision model trained mostly on American photos fail to recognize it?

A Schuko socket is an electrical mains socket that accepts grounded European plugs, using side earth contacts rather than the North American outlet shape. In the thread, a user argues the model likely learned mostly from American photos, which would explain why it missed a Schuko outlet despite recognizing many other common objects. [#21038012]

Why is text recognition in multimodal AI often worse than classic OCR, even when the model can describe the rest of the image correctly?

Text recognition is weaker because this model is not described as classic OCR with a dedicated OCR sub-module. The poster explains that text reading is instead a secondary ability that emerges from how the vision-language system works. That matches the test results: it recognized scenes well, but distorted labels, dropped digits, and confused units such as ohms and kHz. [#21040415]

How accurately can AI identify electronics from photos, such as multimeters, prototype boards, smart plugs, ESR70 testers, and XBOX controllers?

It identifies common electronics fairly well at the category level, but precision drops on exact type and function. The model correctly recognized a multimeter with wires, a prototype board, and specifically an XBOX controller. It was much weaker on the smart power socket meter, which it called a charger, and on the ESR70 tester, which it thought was an oscilloscope. [#21035481]

How can a visual AI assistant help diagnose faults or guide basic repair tasks like replacing a bulb, identifying connectors, or organizing a workshop?

A visual assistant can guide simple tasks by identifying objects, reading some labels, and proposing the next action. The thread explicitly imagines assistants that visually determine faults, help replace a bulb, identify connectors, and support workshop cleaning or organizing. Expert view from the discussion: "The ability to run this locally solves privacy issues," which matters when photos show personal devices or workspaces. [#21035771]
Generated by the language model.