Image searches are great...
... until they don't work. Since skilled researchers use Search-by-Image a fair bit (at least *I* do), it's always useful to understand just how well it works. And since the LLMs have introduced multimodal capabilities, it's just as important to see how well the new search tools are working.
Last week I gave you 4 full-resolution images that you can download to your heart's delight (with links so you can get the originals, if you really want them). Here, taken on a recent trip, are 1. my hand; 2. a bottle of wine; 3. a piece of pastry; and 4. a beautiful flower. So... what ARE these things?
Our Challenges for this week are:
1. How good, really, are the different AI systems at telling you what each of these images is?
2. What kinds of questions can the AI systems answer reliably? What kinds of questions CAN you ask? (And how do you know that the answers you find are correct?)
I did several AI-tool "searches" with each of these images. For my testing, I used ChatGPT, Gemini 2.0 Flash, Meta's Llama, and Anthropic's Claude (3.7 Sonnet). I'm not using anything other than what comes up as the default when accessed over the web. (I didn't pay any additional money to get special super-service.)
I started with a simple [ please describe what you see in this image ], running this query for each image on each of the four LLMs. Here's what the first row of the resulting spreadsheet looks like (and here's a link so you can see the full details):
Click to expand to readable size, or click the link above to see the entire sheet.
Overall, the LLMs did better than I expected, but there are clear differences between them.
ChatGPT gave decent answers, getting the name of the pastry correct (including the spelling!), and getting much of the wine info correct. The flower's genus was given, but not the species.
Gemini gave the most details of all, often 3 or 4X the length of the other AIs' answers. The hand was described in excruciating detail ("no immediately obvious signs of deformity"), and Gemini also got the name of the pastry correct (although misspelled: it's Kremšnita, not Kremsnita). Again, immense amounts of detail in the description of the pastry, and definitely a ton of information about the wine photo. Oddly, while Gemini describes the flower, it does NOT identify it.
Llama does okay, but doesn't identify the pastry or the flower. For the wine image, it just extracts the text, with little other description.
Claude's performance is fairly similar to ChatGPT's, though with somewhat longer descriptions. It also doesn't identify the pastry or the flower.
You can see the differences in style between the systems by looking at this side-by-side of their answers. Gemini tends to go on-and-on-and-on...
Click to see at full-size. This is at the bottom of the sheet.
It's pretty clear that Gemini tries hard to be all-inclusive--a one-query stop for all your information needs.
Interestingly, if you ask follow-up questions about the flower, all of the systems will make a good effort at identifying it--they all agree it's a Helleborus, but disagree on the species (is it orientalis or niger?).
By contrast, regular Search-by-image does a good job with the flower (saying it's Helleborus niger), an okay job with the wine bottle, a good job with the pastry (identifying it as a "Bled cream cake," which is acceptable), and a miserable job with the hand.
On the other hand...asking an LLM to describe an image is a very different thing than doing Search-by-Image.
Asking for an image-description in an LLM is like asking different people on the street to describe a random image that you pop in front of them--you get very different answers depending on the person and what they think is a good answer.
Gemini does a good job on the wine image, telling us details about the wine labels and listing the prices shown on the list. By contrast, Claude gives much the same information, but somehow thinks the prices are in British pounds, quoting prices such as "prices ranging from approximately £12.50 to £36.00." (I assure you, the prices were in Swiss Francs, not pounds Sterling!) So that bit seems to be hallucinated.
I included the hand image to see what the systems would do with a very vanilla, ordinary image... and to their credit, they said just plain, vanilla, ordinary things without hallucinating much. (Although Claude did say "...The fingernails appear to have a purple or bluish tint, which could be nail polish or possibly a sign of cyanosis..." I assure you, I'm just fine and not growing cyanotic nor consorting with fingernail polish! It didn't seem to consider that the lighting might have had something to do with its perception.)
And, oddly enough, as Regular Reader Arthur Weiss pointed out, the AIs don't seem to know how to extract the EXIF metadata with GPS lat/long from the image. If you download the image, you can get that data yourself and find out that the pic of the pastry was in fact taken near Lake Bled in Slovenia. This isn't just a random cubical cake, but it is a Kremšnita!
Here's what GPS info I see when I download the photo and open it in Apple's Preview app, then ask for "More info."
Not so far from Lake Bled itself.
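If you'd rather pull the GPS data out with a script than click around in Preview, here's a minimal sketch using Python's Pillow library. The filename is just a placeholder for whichever downloaded image you want to check, and this only works on the original full-resolution files--re-compressed web copies usually have their EXIF stripped.

```python
# A minimal sketch: read GPS lat/long from a photo's EXIF data with Pillow.
# (pip install pillow; the filename below is a placeholder.)
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def gps_from_exif(path):
    """Return (latitude, longitude) in decimal degrees, or None if absent."""
    exif = Image.open(path)._getexif()
    if not exif:
        return None
    # Translate numeric EXIF tag IDs into readable names, then grab the GPS block.
    named = {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
    gps_raw = named.get("GPSInfo")
    if not gps_raw:
        return None
    gps = {GPSTAGS.get(tag_id, tag_id): value for tag_id, value in gps_raw.items()}

    def to_decimal(dms, ref):
        # EXIF stores degrees, minutes, seconds as three rational numbers.
        degrees = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -degrees if ref in ("S", "W") else degrees

    return (to_decimal(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_decimal(gps["GPSLongitude"], gps["GPSLongitudeRef"]))

print(gps_from_exif("pastry.jpg"))  # prints (lat, long) if the photo has GPS data
```

Paste the resulting lat/long into Google Maps and you'll land right by Lake Bled. The command-line tool exiftool will give you the same information (and a lot more) if you'd rather not write any code.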
SearchResearch Lessons
1. No surprise--but keep checking the details--they might be right, but maybe not. I was impressed with the overall accuracy, but errors do creep in. (The prices are nowhere noted in British pounds.)
2. If you're looking for something specific, you'll have to ask for it. The prompt I gave ("describe the image...") was intentionally generic, just to see what the top-level description would be. Overall I was impressed with the AI's ability to answer follow-up questions. I asked [what was the price of Riveria 2017] and got correct answers from all of them. That's a nice capability.
Overall, we now have another way to figure out what's in those photos beyond just Search-by-image. Try it out and let us know how well it works for you.
Keep searching!