I wrote a post in February of this year...
... saying that the citations generated by Google's Gemini were not to be believed.
Since I believe in second chances and redemption, I revisited that post and re-did the queries.
It pains me to say that Gemini has not improved--if anything, it's gotten worse.
On the other hand, other LLMs have really stepped up their games. Perplexity, Claude, and ChatGPT 4o have gotten significantly better. They're now to the point where I'm going to use them in my daily research. I'll still check their work, but in several cases, I've learned things that I wouldn't have found otherwise.
The Details
Yesterday, I redid my queries to Gemini, Perplexity, Claude, ChatGPT 4o, and Meta's AI (which uses Llama 3). The two queries were:
Q1: [ why have house sparrows expanded their range
dramatically since being introduced into the US,
while Eurasian tree sparrows have not?
They're so similar, you'd think they would
expand at a similar rate. ]
and...
Q2: [ can you suggest further reading in the scientific
literature about the differences in range expansion? ]
Here's the summary of what worked and what didn't.
Gemini: The answer to Q1 is short and correct. It did not go very deep into any reasons for the differences. It was a merely okay answer. Oddly, it listed 2 citations for its writing, but somehow neglected to actually give the citations! (That is, the text has reference numbers like this: [1] -- but there's no actual citation for [1]!)
But Gemini's answer to Q2 was terrible. It gave 3 suggestions for further reading, but instead of actually giving the citations or links to the papers, it totally punted! In two of the three suggestions, it says "[invalid URL removed]" -- what? In the 3rd citation, it shows blue text, as though it were a link to a paper, but there's no link there. It's just blue text. WTF?
TLDR--Gemini's answers were short and misleading. No actual citations were produced. A pretty bad shortfall.
Perplexity: Answered Q1 with 7 reasons, all with citations (that worked!) to reasonable literature. Best of all, Perplexity found an answer to the question that all of the other LLMs missed (House sparrows have a really robust immune system, letting them outcompete the Eurasians). I'm impressed.
Perplexity's answer to Q2 listed 4 excellent papers in the scientific literature. Well done.
Even better, Perplexity now has a "Pro Search" capability that allows it to dig more deeply into the literature. That is, it does a kind of "slow search" (a term coined by my friend and fellow SearchResearch scientist Jamie Teevan) digging into extra resources to give a better answer to the query.
Perplexity Pro Search found 5 additional papers, all of them real and all spot on. (What impressed me the most is that Pro Search found a great paper that had eluded me when using traditional search methods. Kudos.)
ChatGPT 4o: The overview answer to Q1 is fairly good--accurate in all details. Not especially deep, but fine as an introduction.
But ChatGPT's answer to Q2 wasn't as good as Perplexity's. The citations were often to book chapters that are VERY hard to access. One of them was written in 1951, which is fine, but nearly impossible to track down. (And, truthfully, the field has moved on from there!)
And, because hallucinations run deep, one of the citations is fictitious. Alas. It was going well, but then the LLM had to make up one of the papers. Dang.
Claude: Also had a good answer to Q1--deeper than Gemini, about the same as ChatGPT. No citations, but not bad.
The answer to Q2 was much like ChatGPT's answer--slightly dated, with one hallucinated citation. And one of the citations is to an actual chapter in a book, The Birds of North America -- but it's a massive work that comes in 18 volumes (and costs more than $2000), and the citation doesn't say which volume the chapter is in! So close.
Meta AI: A decent answer to Q1, but the answer to Q2 is a bit confused. The answer lists 7 factors that contribute to the difference in success (well done!), but the citations are deeply messed up. Three of the bullet points list citation #1 as their source, but each bullet point describes the citation in a different way! (As a book chapter, as a summary article, or as a journal paper.) It looks like there were supposed to be 3 different citations, but they somehow all got lumped together. Maybe there's good stuff in there, but it's hard to tell--the citations are missing and mixed up with each other.
Bottom line: If I were giving out grades, Perplexity would be at the top of the class and Gemini at the bottom.
Why am I so tough on Gemini? Mostly because, like many teachers, I want to say "you're not living up to expectations." Google has Scholar, for heaven's sake. It should be trivial to check their own results. There's no reason it should be producing "[invalid URL removed]" in the results. This text suggests that they ARE checking, but then not following through when the check fails.
By contrast, I am really impressed with Perplexity. I'll be using it more and more! (But, as always, I'll be double-checking everything. Great results on this short test, but I still have my doubts.)
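Since I keep saying "check their work," here's the kind of quick first pass I mean: take the links an LLM gives you and let a few lines of Python flag the ones that don't resolve. This is just a sketch -- the URLs below are placeholders for whatever links you actually got, and it only catches dead links, not hallucinated papers.

```python
# Quick sanity check for LLM-provided citation links: request each URL and
# flag anything that doesn't resolve. Replace the placeholder URLs with the
# links the LLM actually gave you.
import requests

citation_urls = [
    "https://example.com/paper-one",  # placeholder
    "https://example.com/paper-two",  # placeholder
]

for url in citation_urls:
    try:
        # Some servers reject HEAD requests, so retry with GET before
        # declaring the link broken.
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code >= 400:
            resp = requests.get(url, allow_redirects=True, timeout=10)
        status = "OK" if resp.status_code < 400 else f"BROKEN ({resp.status_code})"
    except requests.RequestException as err:
        status = f"BROKEN ({type(err).__name__})"
    print(f"{status:>12}  {url}")
```

A link that resolves is only step one, of course -- you still have to open the paper and make sure it's real and says what the LLM claims it says.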
For people who want to delve more deeply into the results, here's a link to a PDF with images of everything the LLMs returned. You can check them out if you'd like to see what I saw in my testing.
Keep searching. (And keep double checking those AI results!)
Dan, thank you for your continued testing of these tools! Very useful.
AI in Korea - a hallucination? verification is moot - or maybe it should be mute?
https://www.bbc.com/news/articles/c4ngr3r0914o
https://youtu.be/nFYwcndNuOY?si=YHSqj0keWswn26Sk
https://youtu.be/oE8hAQie8fw?si=j7lYuhtoCDK5HvKr
"Girl group Aespa, who have several AI members as well as human ones, also used the technology in their latest music video. "
https://www.npr.org/2023/11/04/1210649637/artificial-intelligence-is-being-used-to-id-goose-faces
https://www.theregister.com/2017/11/02/gisa_pyramid_void_muon/
a sampling (AI is geezer-inducing) -
https://www.threads.net/@angelicism/post/C57VP9Otfjq
Dan, I have been impressed with Consensus - https://consensus.app/ - when I need an AI to find and summarize research papers. I asked it "Have house sparrows in the US expanded their range more than Eurasian tree sparrows?" and it found and summarized several studies, with links to all the papers.
This is a great resource, thanks for bringing it to our attention! One of the papers that Consensus brought to the fore highlights the possible role of an immunological response as being partly responsible! Quote from the paper: "The more successful house sparrows now exhibit weaker inflammatory responses than the less successful tree sparrows, which supports the possibility that diminished investments in immune defense may have been conducive to the initial colonization by the more successful species." Fascinating.
https://www.alljournals.cn/relate_search.aspx?pcid=90BA3D13E7F3BC869AC96FB3DA594E3FE34FBF7B8BC0E591&aid=C6B0FCEF0FD427692CDA2D389D8D4920&language=1
https://pubmed.ncbi.nlm.nih.gov/20473621/
Sparrows and artificial intelligence build cozy nests in the mountain mist, bringing good luck to the diligent searcher...
(@Eric - interesting reads - https://www.controlaltachieve.com/)
I commented in this week's Challenge that, for me, the best way to know if LLMs work is searching for experts like Dr. Russell.
Today I found this video on YouTube, on a channel that I follow: BreakingVlad.
ChatGPT vs. chemistry, with an expert in the field. I'd love to know how others would answer. According to the expert, ChatGPT did better than expected.
In Spanish:
https://youtu.be/xL_5gb-zW-0?si=0Vbb1JnNayFshEaq