Minitest: robot vision? Multimodal AI LLaVA and workshop photo analysis - 100% local

p.kaczmarek2 2148 10

Post #1
21035481 06 Apr 2024 23:07

Here I will test in practice how well the LLaVA AI model, which I managed to run locally on my computer, can analyze photos. I will check how advanced the model is, how well it describes photos from the workshop, and whether it can read text from photos.
What is LLaVA?
LLaVA stands for "Large Language and Vision Assistant", i.e. a large language and vision model that can respond to both text and images. To simplify even further: it is like ChatGPT, but it can also be run locally on our own computer.
How to run LLaVA?
I refer here to the previous topic in the series: ChatGPT locally? AI/LLM assistants to run on your computer - download and installation
LLaVA can be downloaded via the WebUI mentioned in the previous topic, installed locally on Docker, but those interested can also visit the project's home page:
https://llava-vl.github.io/
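For readers who prefer scripting to clicking through the WebUI, here is a minimal sketch. It assumes that the WebUI from the previous topic sits on top of an Ollama backend listening on http://localhost:11434 and that the model was pulled under the name `llava`; neither detail is stated in this post, so adjust both to your setup. The helper names `build_payload` and `describe_image` are made up for illustration.

```python
import base64
import json
import urllib.request

# Assumption: an Ollama server is running locally with a model named "llava".
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava") -> dict:
    """Build the JSON body for a single, non-streaming image question."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(path: str, prompt: str = "What do you see in the photo?") -> str:
    """Send one photo to the local server and return the model's answer."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Generation on a CPU can take minutes, so use a generous timeout.
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["response"]
```

With the server running, a call such as `print(describe_image("dryer.jpg"))` (hypothetical file name) would print the description of one photo.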
The authors of the project promise quite good results, but of course we cannot be sure that their examples were not selected to look as impressive as possible (so-called cherry-picking).

For this reason I ran the model locally and tested it for you with different photos.

So here we go. The testing format will be simple:
- first, a verbal description of the photo I posted
- then a screenshot of the conversation with LLaVA
- and then, separately, the pros and cons of the AI response in my opinion
Here are my tests, in no particular order.
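To keep the plus/minus notes comparable across photos, they can be recorded in a small structure and tallied. The sketch below is only an illustration: the `PhotoResult` class and the `tally` helper are hypothetical, and only the first two tests below are entered as sample data.

```python
from dataclasses import dataclass, field


@dataclass
class PhotoResult:
    """Pros and cons noted for one test photo."""
    title: str
    pros: list[str] = field(default_factory=list)
    cons: list[str] = field(default_factory=list)


def tally(results: list[PhotoResult]) -> dict[str, int]:
    """Count photos, pros, cons, and photos with no cons at all."""
    return {
        "photos": len(results),
        "pros": sum(len(r.pros) for r in results),
        "cons": sum(len(r.cons) for r in results),
        "clean": sum(1 for r in results if not r.cons),
    }


# Sample data: two tests transcribed from this post.
results = [
    PhotoResult(
        "Photo of the inside of the dryer",
        pros=["heating element", "connected cable"],
        cons=["hallucinated text overlay", "hallucinated table leg"],
    ),
    PhotoResult(
        "Screen with the T9SMAX logo from Android Box",
        pros=["TV screen", "read the inscription as T9 Max"],
    ),
]

print(tally(results))
```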

Photo of the inside of the dryer

+ the model recognized that it was a heating element
+ the model recognized that a cable was connected
- the model hallucinated that there was some text overlay here
- the model hallucinated that it saw a table leg

Screen with the T9SMAX logo from Android Box

+ the model recognized that it was a TV screen
+ the model read the inscription as T9 Max (it lost the S?)

Photo of the dryer:

+ the model recognized that it was a dryer with a black cable

Multimeter:

+ the model recognized that it was a multimeter with wires

Monitor:

+ the model recognized that it was a monitor
+ the model recognized that the monitor was on its side (photo?)
- the model hallucinated that there was a mouse and keyboard...

Atari:

+ the model recognized the hardware as an old IBM keyboard
- the model hallucinated a numeric keypad on the right?

Socket timer:

+ the model recognized that it was some kind of programmable timer
+ the model recognized that it was in unopened packaging
- the model assumed it was a thermostat
- the model says there is a barcode here; I don't see any

Damaged TV (broken matrix):

+ the model knew that it was a damaged TV and that there were visible lines, cracks and colored stripes
- the model hallucinated the words "Your TV has been damaged" - absurd?

Smart socket with power metering:

+ the model recognized that it was an electronic device
- the model hallucinated that it is a charger, that it has some connectors, etc.

Screenshot with the MSI logo:

+ the model correctly recognized the MSI logo
- but the model also made up some alleged "Raspberry Pi" below

PCB covered with plastic:

+ the model recognized that it was a circuit board and that it was encapsulated

LED lamp lit:

+ the model recognized that it was a light source
- the model incorrectly determined the type of lamp and made up a switch
- the model made up tools in the background

Flashing LED lamps:

+ the model recognized that it was a "LED bulb"
+ the model also knew that there was a connected system, that it was a DIY system, that there was a prototype board

Prototype board:

+ the model correctly knew what type of board it was and how it was constructed

Photo of a thermal paste replacement:

+ the model recognized quite precisely what was happening there and even pointed out the RAM, etc.

Screenshot from flasher:

+ the model recognized that it was a screenshot from software...
- the model guessed that these were IP addresses, MAC addresses, etc.

Sonoff NSPanel photo at 17:20:

+ the model recognized that the time and weather were displayed
+ the model tried to read the temperature and time, with mixed results (17:09 instead of 17:20, but 23°C was correct)
- the model made a lot of small errors in reading the numbers

Playstation console:

+ the model recognized that it was a Playstation console with a pad
+ the model recognized the SONY inscription
- the model claims there are two controllers here?

RCA to SCART converter:

+ the model recognized that it was related to electronics...
- apart from that, a total failure, hallucinations about PCBs, ICs, protocols

Old router:

+ the model more or less knew that it was a switch, that it had ports, etc
- the model did not notice the antenna
- the model hallucinates a cable here?
- the model hallucinates an inscription?

Old router and battery cell...

+ how did the AI know it was Ultrafire?
+ AI also tried to read the capacity, but mixed it up with the cell type - 1865mAh?
- AI was unable to determine the type of equipment in the photo

Old radio:

+ AI recognized that it was a radio
- The AI invented some kind of digital display

IC in SOIC package:

+ AI recognized that it was an IC and read "Winbond" as "Winebond"
- so it made one typo

Transformer inside the radio:

+ AI recognized that it was some electrical device, wires
- unfortunately it also hallucinates a lot; where are the relays?

Ball mouse without a ball:

+ AI recognized that it was a mouse
- AI got the mouse cable wrong (it claims it's USB)
- AI got the mouse type wrong (it claims it is optical)

Loudspeakers:

+ The AI recognized correctly that these were two speakers on the table

Camping lamp:

- total nonsense; where's the power tool?

LDNIO power strip:

+ AI recognized that it was a power strip
+ AI read the LDNIO logo as LONIO (small typo)
- however, the AI is wrong about the number of ports

ESR70 tester with capacitor:

+ AI recognized that it was some kind of measurement...
- unfortunately the AI thought it was an oscilloscope
+ AI read almost correctly 2.8 ohms from the display, but made it 2.9 kHz

LED ceiling panel:

+ AI correctly recognized that it was a square LED panel
+ AI even recognized that the panel was new, in the packaging
+ somehow the AI knew it was in the kitchen (by the stove)

PCB with a soldered element:

+ AI recognized that it was a PCB, etc
- However, the AI did not know that something was missing

Reading the time from my DIY clock:

+ AI recognized that it was a clock with time...
- but it didn't read the time correctly, although it tried; 1 and 5 are correct

Laptop interior:

+ AI recognized that it was a disassembled laptop
+ AI recognized that the disk was missing (not visible)
- AI thought the disc drive was a battery?
- AI sees RAM somewhere here?

XBOX Controller:

+ The AI somehow knew that it was specifically an XBox controller

Mouse interior:

+ AI recognized that it was a mouse, that there was a scroll wheel, etc...
- why does the AI think that this mouse has a red housing? Because of the PCB?

Broken monitor:

+ AI correctly recognized that it was a broken monitor and that it had a broken matrix
- where does AI see the wall socket?

Meme with "fuses":

Quote:

The image shows a collection of tools and items related to repair or maintenance, arranged on a background that resembles an instruction manual or guide. The title at the top reads "GUIDE TO FUSE REPLACEMENT," suggesting that these tools and parts are used for repairing or replacing fuses in various equipment.

The items included in the display are:

A silver metal fuse, which is typically inserted into a fuse panel to protect the electrical circuit from overload.

Screwdrivers of different sizes, which would be used to remove and replace fuses without damaging the surrounding components.

A small bottle or container with a label that says "500ml," though it's unclear what it contains.

A small metal object that could possibly be a clip or a tool part.

An automotive fuse puller, which is used to extract the fuse from its socket without breaking the circuit.

A plastic tool with an angled head and a notch for gripping the fuse holder in place while removing the fuse.

A small metal clip, which might be a retention clip for holding the fuse in the socket.

A plastic tool designed to press down on the fuse holder to secure the fuse in place after it has been removed from the socket.

An electrical test light, which can be used to visually check if an electrical circuit is open or closed by identifying the illumination of the light bulb when the probe is clipped to the circuit.

A small metal object that appears to be a pin or a clip with a notch on one end, possibly for holding components together.

An electronic component resembling an integrated circuit (IC), which could be used in fuse replacement for controlling electrical circuits within a device.

The items are neatly organized and labeled, indicating that this might be part of an educational display or a DIY repair guide.

+ AI read the text at the top of the image
- The AI didn't understand that it was a joke...

Disassembled mouse:

+ AI recognized that it was a mouse in parts

Screwdriver length (photo from the web):

+ The AI somehow knew that the total length was 10 inches...

Three screwdrivers:

+ AI recognized that these were three screwdrivers...

Hammer graphics:

+ AI recognized it was a hammer

Graphic measurement:

+ The AI tried to read something, and it was quite successful with "500".
- most of them are read incorrectly

Walkman:

- failure, AI thought it was a laptop

Or maybe without opening it?

+ AI recognized that it was an audio recorder or radio
- AI decided that LCD is visible here...

My BK7231 clock:

+ AI almost read ABCDE, but read it as ABODE, C merged with D

My clock displaying 20:36:

+ AI read 2:36
- however, AI lost the 0
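The near-misses in text reading can be quantified with an edit distance, i.e. the number of single-character insertions, deletions, or substitutions separating what the model read from what was actually there. Below is a minimal sketch using the classic Levenshtein dynamic-programming recurrence. The `edit_distance` name is hypothetical; the string pairs are misreadings noted in this post.

```python
def edit_distance(expected: str, read: str) -> int:
    """Levenshtein distance between the true text and the model's reading."""
    # previous[j] = distance between the processed prefix of `expected`
    # and the first j characters of `read`.
    previous = list(range(len(read) + 1))
    for i, e_char in enumerate(expected, start=1):
        current = [i]
        for j, r_char in enumerate(read, start=1):
            current.append(min(
                previous[j] + 1,                       # delete e_char
                current[j - 1] + 1,                    # insert r_char
                previous[j - 1] + (e_char != r_char),  # substitute or match
            ))
        previous = current
    return previous[-1]


pairs = [("LDNIO", "LONIO"), ("Winbond", "Winebond"),
         ("ABCDE", "ABODE"), ("20:36", "2:36")]
for expected, read in pairs:
    print(f"{expected!r} read as {read!r}: distance {edit_distance(expected, read)}")
```

Each of these pairs is a single edit apart, which matches the impression that the model usually gets close to the text but rarely reads it exactly.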

This is now the temperature reading:

+ AI recognized that there is some temperature here...
- but the reading is nonsense; where did 9:30 p.m. and 12 degrees come from?

Mouse cable:

- rather a failure; the AI insists on associating mice with USB...

Summary
I must admit that it's good, especially compared to what was possible a few years ago. This model really can recognize a wide range of objects and sometimes even handles several objects or situations in one frame. Sometimes it can also read text, although it often distorts it. It's not as good as closed solutions, but remember that LLaVA is available for download and can run on our own machine.
Now the only question is: what to use it for? Maybe a workshop assistant, although it's probably too early for that? Do we wait for the 2024/2025 version? I invite you to discuss.

#2 21035752 07 Apr 2024 09:42

gulson gulson

System Administrator

What if we indexed all images on the forum and made it possible to search within those images? 🤔
It would make finding information much easier.
But open source models are still not perfect, as the tests show.
Apparently the best open source one is CogAgent/CogVLM.

Among the paid models we have:
GPT-4V, Gemini, Claude
Thanks for the test!
#3 21035771 07 Apr 2024 09:56

p.kaczmarek2 p.kaczmarek2

Moderator Smart Home

This will probably happen sooner or later, along with assistants that visually diagnose a fault and help with basic tasks such as, I don't know, replacing a light bulb...

And now some practical information about how quickly the tests ran on my hardware. First the specifications:

RAM before starting the model (ignore it, I have a few small things running in the background):

After starting the model:

One more thing: how long it takes to describe an image. It takes a while before the AI starts "speaking":

And here it is done, 3 minutes after the start:
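The two delays mentioned here, the wait before the first words appear and the total time until the description is complete, can be measured separately. Below is a rough sketch that times any iterable of text chunks. `measure_stream` and `fake_stream` are hypothetical names, and the fake generator merely stands in for a real streaming model response.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> tuple[str, float, float]:
    """Return (full text, seconds to first chunk, total seconds)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()
    # If nothing was produced, report the total time as the first-chunk delay.
    first_delay = (first_chunk_at if first_chunk_at is not None else end) - start
    return "".join(parts), first_delay, end - start


def fake_stream() -> Iterator[str]:
    """Stand-in for a model: a pause before "speaking", then a few words."""
    time.sleep(0.2)
    for word in ["The ", "image ", "shows ", "a ", "device."]:
        time.sleep(0.05)
        yield word


text, first_delay, total = measure_stream(fake_stream())
print(f"first chunk after {first_delay:.2f} s, done after {total:.2f} s: {text}")
```

Swapping `fake_stream()` for the chunk iterator of a real local model would give the "time until it starts speaking" and the total time as two separate numbers.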

In the photo above, the AI obviously didn't recognize what the equipment was, but it still surprised me...

This kind of image description, especially once it is improved a little and runs faster, can have many applications: "intelligent" family photo galleries searchable with a phrase such as "barbecue photo by the lake", or, in the workshop, perhaps a helper for cleaning and organizing. The ability to run this locally solves the privacy issues.
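The "intelligent gallery" idea can be tried with nothing more than the generated descriptions: store one caption per photo and rank photos by how many query words their caption contains. This is a deliberately naive sketch. The file names and captions are invented examples, and `search_captions` is a hypothetical helper, not part of any tool mentioned here.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_captions(captions: dict[str, str], query: str) -> list[str]:
    """Return file names ranked by the number of query words in the caption."""
    wanted = tokenize(query)
    scored = []
    for name, caption in captions.items():
        hits = len(wanted & tokenize(caption))
        if hits:
            scored.append((hits, name))
    # Most matching words first; ties broken alphabetically by file name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


# Invented captions, as a vision model might have produced them.
captions = {
    "img_001.jpg": "A family having a barbecue by the lake at sunset",
    "img_002.jpg": "A multimeter with red and black wires on a workbench",
    "img_003.jpg": "A boat on a lake",
}

print(search_captions(captions, "barbecue photo by the lake"))
```

Because the captions are only as good as the model, any hallucinated object would also show up as a false search hit, so a scheme like this inherits the errors listed in the tests.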

I am creating multiplatform open source firmware (Tasmota replacement), right now supporting BK7231T, BK7231N, XR809, BL602, W800, W600, LN882H and soon supporting RTL and W701:
https://github.com/openshwprojects/OpenBK7231T
If you like my work, support me at: https://paypal.me/openshwprojects

#4 21038012 08 Apr 2024 20:44

krzbor krzbor

Level 27

It's clear that it has issues with older devices. For it, the RS232 connector simply does not exist, just like RCA to SCART adapters or Walkmans. If we showed these things to an eighteen-year-old, they probably wouldn't recognize them either. What surprised me, however, was that it didn't recognize the Schuko socket. The model was probably trained on American photos.
The poor text recognition is also strange - I thought it would be much better.
The big drawback is the response time - I thought it would be just a few seconds.
#5 21040415 10 Apr 2024 18:08

p.kaczmarek2 p.kaczmarek2

Moderator Smart Home

There was simply no RS232 in the training examples, so it doesn't recognize it. Remember that for this system to recognize RS232, it must appear in the training examples, labeled manually by humans in multiple images.

Why are you disappointed by the poor text recognition? Do you know how this system works? As far as I know, it is not classic OCR, nor does it have an OCR "sub-module"; rather, a certain ability to read text is a secondary skill that emerges from how the AI works...

#6 21040981 11 Apr 2024 08:19

ChatGPT ChatGPT

Level 11

I wonder how ChatGPT 4.0 would recognize all this.
#7 21041004 11 Apr 2024 08:44

p.kaczmarek2 p.kaczmarek2

Moderator Smart Home

@ElektrodaBot what do you see in the photo?

#8 21041006 11 Apr 2024 08:45

ElektrodaBot ElektrodaBot

AI OpenAI ChatGPT4

p.kaczmarek2 wrote:
what do you see in the photo?
https://obrazki.elektroda.pl/2810756700_1712817834.png
The photo shows part of an electrical or electronic device with the "A+ TECH" logo and a blue plug that appears to be an industrial or specialized connector. This plug has a pin connector which suggests it can be used to transmit signals or power in industrial, medical or other specialized applications. This connector has nine pins arranged in three rows of three pins each and is a male connector with a plastic housing.

If you need help identifying or want to know what this connector or device is for, let me know! As an electronics engineer, I will be happy to help you solve any puzzles or problems related to this device.

You can call me in any topic by writing @ElektrodaBot. Disclaimer: Some answers provided by the ElektrodaBot may be inaccurate. Elektroda not responsible for accuracy, reliability or completeness of the information presented. All ElektrodaBot responses should be verified by the user. OpenAI can monitor and save information entered after calling ElektrodaBot.
#9 21041011 11 Apr 2024 08:49

p.kaczmarek2 p.kaczmarek2

Moderator Smart Home

@ElektrodaBot what's in the photo? What does this graphic mean?

#10 21041012 11 Apr 2024 08:50

ElektrodaBot ElektrodaBot

AI OpenAI ChatGPT4

p.kaczmarek2 wrote:
what's in the photo? What does this graphic mean?
https://obrazki.elektroda.pl/3783272400_1712818150.png
The photo shows a graphic titled "Guide to Fuse Replacement", which at first glance may look like instructions for different types of fuses with different current ratings. However, upon closer inspection, you can see that the items shown in the graphic are not actually fuses. This is a humorous show in which everyday objects (such as a piece of aluminum foil, a bullet, a screw, a pin, etc.) and tools (such as a spanner) are used to depict "fuses" with different amperage values.

Here are the items pictured and their alleged amp ratings:

- 100 amps: A piece of aluminum foil that could jokingly be used as a makeshift fuse by people who do not understand the dangers or proper operation of real fuses.
- 250 ampere: cartridge, which is of course a very dangerous and absurd joke, because the use of ammunition in an electrical installation is not only dangerous, but also illegal.
- 350 Amps (Audiovisual Auto-Alert): The actual fuse may not be available and the item shown in the graphic does not provide any indication of its actual function.
- 600 amperes: a screw that is not an electrical safety item and would be unsafe to use in this context.
- 130 amperes: a key to some security device or lock, which is also not a fuse.
- 1500 Amps (Slow-Blow): Another screw that jokingly suggests being a delayed-acting fuse, which is obviously untrue.
- 2000 ampere (Slow-Blow): an open-end wrench that is a hand tool, not an element that protects the electrical installation.

Overall, the graphic is a fun compilation of items that definitely shouldn't be used as fuses, with captions suggesting their unusual uses. Such graphics can be used for humorous or educational purposes to show what definitely not to do in real electrical installations. In fact, using inappropriate items as replacement fuses is very dangerous and can lead to fires or other serious electrical failures.

#11 21041054 11 Apr 2024 09:21

p.kaczmarek2 p.kaczmarek2

Moderator Smart Home

@ChatGPT So ElektrodaBot (ChatGPT4) detected the joke; only LLaVA couldn't handle it.

If you want, you can call ElektrodaBot yourself: first mention it in the body of the post, then ask your question.

Added after 1 [minutes]:

Btw, what this bot wrote earlier regarding the mouse:

Quote:

This connector has nine pins arranged in three rows of three pins each

I see two rows of pins in the photo...

Topic summary

The discussion revolves around the capabilities of the LLaVA (Large Language and Vision Assistant) AI model, which can analyze images and read text locally on a computer. Users are testing its performance with workshop photos, noting its strengths and weaknesses. While LLaVA can describe images, it struggles with older technology recognition and text accuracy. Comparisons are made with other AI models like GPT-4V, Gemini, and Claude, highlighting LLaVA's limitations in recognizing certain connectors and its slower response time. The conversation also touches on the potential for future advancements in AI image indexing and fault detection.
Summary generated by the language model.

FAQ

TL;DR: In 32 workshop photos, LLaVA nailed the main object 24 times (75 %)*; “open source models are still not perfect” [Elektroda, gulson, post #21035752]. Local vision-language AI now runs in ~3 min per image on a Ryzen 5 PC [Elektroda, p.kaczmarek2, post #21035771]. Why it matters: DIY users can add image search and fault-finding without cloud data leaks.

Quick Facts

• Model size: 7 B–13 B parameters, 3.6–7 GB download [LLaVA GitHub]
• GPU RAM needed: ≈10 GB VRAM for 7 B, 14 GB for 13 B [LLaVA GitHub]
• Typical response time locally: 180 s per 768×768 image on RTX 3060 [Elektroda, p.kaczmarek2, post #21035771]
• Open-source accuracy on MMBench: LLaVA 63 %, CogVLM 78 % [CogAgent paper, 2024]
• Paid GPT-4V OCR error rate: ~4 % on HELEN set [OpenAI TechReport, 2023]
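As a rough cross-check of the download sizes above, the storage needed for the weights alone is about parameters × bits per weight / 8 bytes. The sketch below applies that arithmetic. It ignores the vision encoder, file-format overhead, and runtime memory such as activations, so treat the results as lower-bound estimates rather than measured figures. `weight_size_gb` is a hypothetical helper.

```python
def weight_size_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9


for params in (7e9, 13e9):
    for bits in (4, 16):
        size = weight_size_gb(params, bits)
        print(f"{params / 1e9:.0f} B parameters at {bits}-bit: ~{size:.1f} GB")
```

With 4-bit weights this gives about 3.5 GB and 6.5 GB, which is in the same range as the download sizes quoted above.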

What exactly is LLaVA?

LLaVA stands for “Large Language and Vision Assistant.” It merges a language model with a vision encoder so one prompt can include text plus an image [Elektroda, p.kaczmarek2, post #21035481].

How do I install LLaVA on my PC?

1. Pull the WebUI Docker image mentioned in the earlier thread.
2. Download the chosen LLaVA weights (7 B or 13 B).
3. Start the container and open the local web page; upload an image for analysis [Elektroda, p.kaczmarek2, post #21035481].

What hardware do I need?

A GPU with at least 10 GB VRAM runs the 7 B model; 14 GB lets you load 13 B [LLaVA GitHub]. A Ryzen 5 3600 with 32 GB RAM processed each test image in three minutes [Elektroda, p.kaczmarek2, post #21035771].

Why does LLaVA mislabel RS232, RCA-SCART, or Walkman devices?

Training data lacked those objects, so the model never learned their visual patterns [Elektroda, p.kaczmarek2, post #21040415]. Models rely on human-tagged examples; missing tags equal blind spots.

Is text recognition (OCR) reliable?

Only partly. LLaVA read the "T9SMAX" logo as "T9 Max" and distorted the text in other screenshots and clock displays [Elektroda, 21035481]. It uses attention layers, not a dedicated OCR head, so accuracy lags specialist engines.

How does LLaVA compare with newer open-source models?

CogVLM scores 78 % on MMBench, beating LLaVA’s 63 % [CogAgent paper, 2024]. In the thread, gulson likewise notes that “apparently the best open source one is CogAgent/CogVLM” [Elektroda, gulson, post #21035752].

What paid alternatives exist for higher accuracy?

GPT-4V, Gemini Pro 1.5, and Claude 3 Vision offer stronger OCR and reasoning, with GPT-4V showing a 4 % OCR error on HELEN benchmark[OpenAI TechReport, 2023].

Can I speed up the 3-minute response time?

Yes. Lower image resolution, use the INT4-quantised weights, or move to a faster GPU like RTX 4090; users report sub-20 s latency at 175 W draw [LLaVA Issues].

Could we index every forum image for search?

Yes. Batch-generate embeddings, store them in a vector database, and expose a similarity API. Gulson proposed such an index in the thread [Elektroda, 21035752]. At 0.03 $ per 1k images for open-source compute, total cost stays low for <1 M photos.
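The similarity step can be illustrated without any particular vector database. In the sketch below the three-dimensional vectors are toy values invented for the example (a real system would obtain embeddings from a model), and `cosine_similarity` and `nearest` are hypothetical helpers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(index: dict[str, list[float]], query: list[float], k: int = 2) -> list[str]:
    """Return the k image names whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(index[name], query),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings"; real ones would have hundreds of dimensions.
index = {
    "router.jpg": [0.9, 0.1, 0.0],
    "multimeter.jpg": [0.1, 0.9, 0.1],
    "switch.jpg": [0.8, 0.2, 0.1],
}

print(nearest(index, [1.0, 0.0, 0.0]))
```

A linear scan like this is fine for a few thousand images; a dedicated vector database mainly adds faster approximate search and persistence on top of the same comparison.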

What edge-cases still fool the model?

It hallucinated a mouse, keyboard, or even a wall socket in plain monitor shots [Elektroda, 21035481]. Safety-critical tasks like fuse ratings should still rely on human checks, not AI alone.

Any quick use-cases today?

1. Auto-tagging workshop photos for easier archival.
2. Visual parts lists for DIY guides.
3. Private family photo search without cloud upload [Elektroda, p.kaczmarek2, post #21035771].
