Wednesday, January 17, 2024

Answer: How do you use LLMs in your SearchResearch?


P/C Dalle3. [an evocative picture of data, data tables, line charts, histograms]

It looks like many of us are using LLMs (Bard, ChatGPT, etc.) to ask SRS-style questions, especially ones that are a little more difficult to shape into a simple Google-style query.  

That's why I asked this week:  

1.  How have you found yourself using an LLM system (or more generally, any GenAI system) to help solve a real SearchResearch question that you've had?  

I've also gotten LLM help in finding answers to SRS questions that I couldn't figure out how to frame as a "regular" Google search.  Bard has answered a couple such questions for me (which I then fact-checked immediately!).  While the answers might contain a few inaccuracies, they are often great lead-ins to regular SRS.  

For instance, I asked Pi (their home page) a question that has been in the news recently:  "What's the percentage of homeless people in the US?"  Here's what it told me: 

My first reaction was the obvious one--that can't possibly be correct!  18%???

The strangeness of the answer drove me to do a regular Google search for some data.  At the Housing and Urban Development website I found this: "The Department of Housing and Urban Development (HUD) counted around 582,000 Americans experiencing homelessness in 2022. That's about 18 per 10,000 people in the US, up about 2,000 people from 2020."  

Or, if you do the math, that's around 0.18% of the US population.  
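If you'd like to check that arithmetic yourself, here is a minimal Python sketch.  The function name per_n_to_percent is just something I made up for illustration; the only numbers in it are the "18 per 10,000" from the HUD quote above.

```python
def per_n_to_percent(count, per):
    """Convert a 'count per N people' rate into a percentage."""
    return count / per * 100


# HUD's figure as quoted above: 18 people per 10,000.
rate_percent = per_n_to_percent(18, 10_000)

# Format to two decimal places so the result is easy to compare with "18%".
print(f"{rate_percent:.2f}%")
```

The point of the sketch: "18 per 10,000" and "18%" (which is 18 per 100) differ by a factor of 100, so it pays to convert the units explicitly.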

When you look at multiple sites, you'll keep seeing that figure "18 per 10,000 people," which I could imagine an LLM rendering as "18%" in some twisted way.  Arggh!!  

Interestingly, when I did this on Bard and ChatGPT on the same day, they ALSO quoted 18% as the homeless rate. 

However... if you ask all of the LLMs this same question today (1 week after they told me it was 18%), they all get it right: the real percentage is 0.18%.  

That's great... but also worrying.  I'm glad they're in alignment with more official sources, but the underlying data didn't change--some training must have kicked in.  

You can't count on consistency with an LLM.  

But... I find myself increasingly turning to LLMs to answer questions that really are difficult to frame as a search query.  Previously I wrote about finding "Other words that are like cottagecore" as an example of the kind of task that an LLM can do well.  They can take fairly abstract language ideas (e.g., "what words are like these words...<insert list>") and come up with something fairly useful.  This is a great capability that will let us take SearchResearch questions in new directions.  

With an LLM, you can also ask follow-up questions without having to repeat everything you've found out thus far.  THAT is a pretty big advantage, especially for more complex tasks that take more than 2 or 3 queries to answer.  

This past week I found a really interesting paper on our topic by some folks at Microsoft Research.  In their arXiv paper, "Comparing Traditional and LLM-based Search for Consumer Choice: A Randomized Experiment," they compared ordinary search tasks where some people used a "traditional" search engine, while another set used an "LLM-based" search engine.  (The LLM was ChatGPT and was primed with an extensive prompt before people started their searching.) 

They found several intriguing things: 

* people spent less time doing LLM search than trad search,

* they did fewer, but more complex, searches than trad search,

* on the whole, the two conditions were just about as accurate (but see below), 

* the searchers' perceptions of accuracy were pretty high, even when the answers weren't great.

They then did a nice intervention: they color-coded uncertain results in the LLM's output to see if that would help people be more accurate.  The result?  Yes, significantly so.  (Although they still had to have a magic oracle know that the results were low-quality!)  

However, one of the challenges they had their participants do was intentionally a hard question, one for which they knew the LLM would give terrible answers. But it didn't matter much--people believe LLMs beyond what they should. 

That's not a huge surprise, but a good cautionary note for us.  You still need to check to make sure that the results you're getting actually jibe with reality. 

So... what do I use LLM search for these days?  I'm asking it a lot of knowledge summarization questions.  Things like "Give me an overview of how epigenetics works in humans.  Tell me assuming I have a college level degree of genetics knowledge."  

With this prompt, ChatGPT does a pretty good job.  Here's the opening paragraph of its response: 

Epigenetics is a fascinating field of study that explores the mechanisms by which gene expression is regulated without altering the underlying DNA sequence. It plays a crucial role in development, cellular differentiation, and the adaptation of organisms to their environment. Here's an overview of how epigenetics works in humans, tailored to someone with a college-level understanding of genetics... 

On the other hand, if you ask for an explanation at a 6th grade level of biology knowledge, you'll see: 

Alright, let's imagine our body is like a huge library. This library (our body) has lots of books (cells), and each book has the same set of instructions or recipes (DNA) for making you, well, you! But even though all the books have the same recipes, not all recipes are used in every book. Some recipes are for making eye color, some for hair color, and others for how tall you'll be. Now, epigenetics is like having special bookmarks or notes in those books that decide which recipes are going to be used or ignored....

These kinds of summaries are incredibly useful, and much more sophisticated than the abstracts / snippets beneath each web result on the SERP.  Being able to ask for more or less detail in the prompt makes them even more valuable.  

On the other hand... you still have to be careful about the simplest of things.  

If you ask Bard [ how many cups are in a gallon? ] it will reply: "There are 16 cups in a gallon. This applies to both US liquid gallons and imperial gallons, although they have slightly different volumes." 

Of course, that doesn't make sense. An imperial gallon is about 1.2 US gallons, so they can't have the same number of cups!  This is a classic measurement blunder: there are, in fact, 16 Imperial cups in an Imperial gallon, but about 19.2 US cups in an Imperial gallon.  As always, check your facts... and check your units!  (And for the record, ChatGPT explains this: "However, it's worth noting that the imperial cup and the US cup are not the same in terms of volume: 1 Imperial gallon = 16 Imperial cups,  1 Imperial cup = 10 Imperial fluid ounces.") 
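Here's a quick way to check those unit conversions, as a small Python sketch.  The liter values are the standard definitions of the US liquid gallon and the Imperial gallon (they aren't from Bard or ChatGPT), and the names below are my own, chosen just for this illustration.  I'm also assuming a US cup is 1/16 of a US gallon and an Imperial cup is 10 of the 160 Imperial fluid ounces in an Imperial gallon, which matches the ChatGPT quote above.

```python
# Assumed standard definitions, in liters.
US_GALLON_L = 3.785411784   # US liquid gallon (231 cubic inches)
IMPERIAL_GALLON_L = 4.54609  # Imperial gallon

# Cup sizes derived from the gallons.
US_CUP_L = US_GALLON_L / 16              # 16 US cups per US gallon
IMPERIAL_CUP_L = IMPERIAL_GALLON_L / 16  # 10 of 160 Imperial fluid ounces


def cups_in(gallon_liters, cup_liters):
    """Return how many cups of the given size fit in the given gallon."""
    return gallon_liters / cup_liters


gallon_ratio = IMPERIAL_GALLON_L / US_GALLON_L
us_cups_in_us_gallon = cups_in(US_GALLON_L, US_CUP_L)
imperial_cups_in_imperial_gallon = cups_in(IMPERIAL_GALLON_L, IMPERIAL_CUP_L)
us_cups_in_imperial_gallon = cups_in(IMPERIAL_GALLON_L, US_CUP_L)

print(f"Imperial gallon / US gallon: {gallon_ratio:.1f}")
print(f"US cups in a US gallon: {us_cups_in_us_gallon:.1f}")
print(f"Imperial cups in an Imperial gallon: {imperial_cups_in_imperial_gallon:.1f}")
print(f"US cups in an Imperial gallon: {us_cups_in_imperial_gallon:.1f}")
```

Notice that "16 cups in a gallon" only holds when the cup and the gallon come from the same system.  Mix the systems and the number changes.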

SearchResearch Lessons

1. It's worth reiterating that you shouldn't assume that the output of an LLM is accurate.  Think of it as incredibly handy and useful, but NOT authoritative. Tattoo that on your hand if you need to, but never forget it. 

2. Always double check.  As we saw, LLMs will make bone-headed mistakes that sound good... so since you're not assuming that the output of an LLM is accurate, make sure you double-or-triple source everything. Do it now, more than ever.  

Keep searching!   



  1. Hi Dr Russell.

    It's very interesting reading your answer.

    Do you think regular Search will be used less in the near future? People verify the answers they get with LLMs?

    You mentioned:" I find myself increasingly turning to LLMs to answer questions that really are difficult to frame as a search query. "
    But how you make the questions? I still have problems to find a way to find how to get what we want.

    Have you tried LLMs searching with images?

    In other topics, I feel bad because my comment in the Challenge was the only that survived the system deleting the approved comments.

  2. I've found Bard useful for suggesting lines of enquiry. For instance, I asked "When did people in China first realise dolphins are not fish?" and it suggested some ancient Chinese texts which apparently discuss dolphins. I don't know if this info is accurate, but at least it's a good starting point. ChatGPT couldn't help with this query, and conventional searching didn't get me very far. I also asked Bard "Which art historians believe that Johannes Vermeer painted St Praxedis, and which are sceptical?" and it gave me two lists of names - again a good place to start.

    1. OSS - old school search - not seeing a lot of advantage to LLMs…especially when EVERYTHING has to be verified (and seldom is)… what is the verification process? other than regular search… if there is such a thing.
      All are launching pads/initial probes
      tastes like chicken?
      Wuhan involved again -
      the problem with LLM images & videos - it's always a beach day in the North Sea - or is it the North Atlantic?…
      is that a UAP?

      elsewhere -

  3. didn't have you pegged as a tattoo advocate - need to quad source that…
    no room for the hand tattoo…

    1. Wow.... Zippy the Pinhead? Are we having fun yet?

    2. Zippy? I thought it was Zaggy or Ziggy… nuf fun.

  4. hybrid… on the fly, little research…
    "What is the difference between Perplexity AI and chat GPT?
    In this context, the fundamental difference between Perplexity AI and Chat GPT is that Chat GPT is a language model. Meanwhile, Perplexity AI combines Chat GPT and the Google search engine. In providing answers to questions, Perplexity surfs the internet and retrieves information from certain sources."
    pinhead bunny - different from Zippy, non-microcephalic