Thursday, May 15, 2025

Answer: How good are those AI summaries anyway?

 Summaries are THE current AI hotness... 

P/C [slide showing someone using AI for summarization] Gemini (May 1, 2025)

... you see the promotions for summarizing anything and everything with an LLM.  Various AI companies claim to be able to summarize meetings, emails, long boring white papers, financial statements--etc etc etc. 

I have my concerns about the onslaught of excessive summarization, but I'll save that for another day.    

This week we asked a very specific question: 

1. How has your experience of using AI for summarization worked out?  

An obvious first question:  What does it mean to summarize something?  Is it just making the text shorter, or does "summarizing" imply a kind of analysis to foreground the most important parts?  

And, "is there a method of summarizing that works for every kind of content?"  

I don't have a stake in this contest: if just shortening a text works, then I'm all for it.  But I kind of suspect that just shortening a book (rather than rewriting it) won't make for a great summary.  For example, just shortening "Moby Dick" would lose its commentary on race, its critiques of contemporary thought, and its reflections on the nature of knowledge. You know, all the stuff you had to learn about while reading it in high school.  

Summarizing is, I suspect, a grand art, much as creating an explanation is.  When I explain what a text is "about," the explanation will vary a great deal depending on what the purpose of the explanation is, and who I'm explaining it to--telling a 10-year-old about Moby Dick isn't the same as telling a 30-year-old.  Those explanations--or those summaries--will be very different. 

So when we prompt an LLM for a summary, it behooves us to provide a bit of context.  At the very least, say who the summary is for and what its purpose is.  A quick "summarize this for a busy PhD student" or "explain this to my grandma" will make a world of difference. 
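If you're calling a model through an API rather than typing into a chat box, you can bake that audience-and-purpose context into the prompt programmatically.  A minimal sketch in Python (the function name and template wording are my own invention, not from any particular LLM API):

```python
def build_summary_prompt(text, audience, purpose):
    """Wrap a document in audience/purpose context before sending it to an LLM."""
    return (
        f"I am summarizing this for {audience}. "
        f"The goal is {purpose}. "
        "Please summarize the following text accordingly:\n\n"
        f"{text}"
    )

# Example: the "busy PhD student" framing from above.
prompt = build_summary_prompt(
    text="[full paper text goes here]",
    audience="a busy PhD student",
    purpose="deciding whether the paper is worth a careful read",
)
```

The point isn't the code, it's the habit: every summarization request should carry a who and a why.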

To answer this Challenge, I did a bit of experimenting.  

Since I write professionally (mostly technical articles intended for publication in journals and conferences), I have a LOT of samples I can experiment with.  

For instance, I've recently been working on a paper with a colleague on the topic of "how people searched for COVID-19 information during the pandemic."  (Note that this was a 1-sentence summary of the paper. The length of a summary is another dimension to consider. Want a 1-sentence summary of War and Peace?  "It's about Russia.")  

Notice that all tech papers have an abstract, which is another kind of summary intended for the technical reader of the text.  I wrote an abstract for the paper and thus have something completely human-generated as a North Star.

I took my paper (6100 words long) and asked several LLMs to summarize it with this prompt: 

     [ I am a PhD computer scientist. Please summarize this paper for me. ]

I asked for summaries from Gemini 2.5 Pro, ChatGPT 4o, Claude 3.7 Sonnet, Grok, Perplexity, and NotebookLM.  (Those are links to each of their summaries.)  

Here are the top-left sections of each--arrayed so you can take a look at the differences between them.  (Remember you can click on the image to see it at full-resolution.)  


And... 

I took the time to read through each of the summaries, evaluating them for accuracy and for any differences between what we wrote in the paper and what the summaries said.  

The good news is that I didn't find any terrible errors--no hallucinations were obvious.  

But the difference in emphasis and quality was interesting.

The most striking thing is that different summaries put different findings at the top of their "Key Findings" lists.  If you ignore the typographic issues (too many bullets in Gemini, funky spacing and fonts for Perplexity, strange citation links for NotebookLM), you'll see that:  

1. Gemini writes a new summary that reads more like a newspaper account.  It's quite good, leading with an excellent set of Key Findings at the very top.  Of all the summaries I tested, this was by far the best, primarily for the quality of its synthesis and the clarity of its language. (The summary was 629 words.) 

2. ChatGPT is more prosaic--really a shortening rather than a significant rewriting. It didn't do a summary so much as an outline of the paper.  It was okay, but to understand a few of the sentences in the summary you need to have read the paper (which is NOT the point of a summary).  Note that ChatGPT's Key Findings are somewhat different from Gemini's.  (432 words) 

3. Claude also has different Main Findings, and moves Methodological Contributions near the top, which Gemini and ChatGPT do not.  But it did a good job of summarizing the key findings, and wrote good prose about each. (324 words) 

4. Grok buried the Key Findings under Sources and Methods, which is a bit like hiding the cake under the vegetables, but the text itself is decent. It had 4 key findings (others had more) and a decent, if short, discussion of what it all meant. (629 words)

5. Perplexity is similar, but gets confused when discussing the finding about Query Clusters. It was a bit sketchy on the details and gave a muddled account of how clustering was done in the paper.  (I suspect it got tripped up by one of the data tables.)  (256 words) 

6. NotebookLM uses much less formatting to highlight sections of the summary, and includes a bunch of sentence level citations.  (That's what the numbered gray circles are--a pointer to the place where each of the claims originates.)  NLM spent a lot of time up-front discussing the methods and not the outcomes. (1010 words)
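For a sense of scale, here's a quick back-of-the-envelope on how much each model condensed the paper, using the word counts reported above (the source paper is about 6,100 words):

```python
# Word counts from the summaries reviewed above; the source paper is ~6,100 words.
source_words = 6100
summaries = {
    "Gemini 2.5 Pro": 629,
    "ChatGPT 4o": 432,
    "Claude 3.7 Sonnet": 324,
    "Grok": 629,
    "Perplexity": 256,
    "NotebookLM": 1010,
}

# Compression ratio: what fraction of the original length each summary keeps.
ratios = {name: words / source_words for name, words in summaries.items()}
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} {ratio:5.1%}")
```

Run that and you'll see the spread: Perplexity keeps about 4% of the original length, while NotebookLM's "summary" is roughly a sixth of the paper--nearly four times longer than the tightest one.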

Overall, in this particular comparison, Gemini is the clear winner, with ChatGPT and Claude in second place.  Both Perplexity and NotebookLM seem to get lost in their summaries, kind of wandering off topic rather than being brief and to-the-point.  

This brings up a great point--when summarizing a technical article, do you (dear reader) want a structured document with section headings and bullet points?  Or do you want just a block of text to read?   

A traditional abstract is just a block of text that explains the paper.  In fact, the human-generated abstract that I wrote looks like this: 

The COVID-19 pandemic has had a dramatic effect on people’s lives, health outcomes, and their medical information-seeking behaviors.  People often turn to search engines to answer their medical questions. Understanding how people search for medical information about COVID-19 can tell us a great deal about their shifting interests and the conceptual categories of their search behavior.  We explore the public’s Google searches for COVID-19 information using both public data sources as well as a more complete data set from Google’s internal search logs.  Of interest is the way in which shifts in search terms reflect various trends of misinformation outbreaks, the beginning of public health information campaigns, and waves of COVID-19 infections.   This study aims to describe online behavior related to COVID-19 vaccine information from the beginning of the pandemic in the US (Q1 of 2020) until mid-2022. This study analyzes online search behavior in the US from searchers using Google to search for topics related to COVID-19.  We examine searches during this period after the initial identification of COVID-19 in the US, through emergency vaccine use authorizations, various misinformation eruptions, the start of public vaccination efforts, and several waves of COVID-19 infections. Google is the dominant search engine in the US accounting for approximately 89 percent of total search volume in the US as of January, 2022.  As such, search data from Google reflects the major interests of public health concerns about COVID, its treatments, and its issues.  

Interesting, eh?  Although written by a human (me!), the abstract doesn't pull out the Key Findings or Methods (although they're there) in the same way that the AIs do.  

But perhaps the structure of the LLM summaries is better than the traditional pure-text format.  When I asked LLMs for summaries of other papers (i.e., papers NOT written by me), the "outline-y," bullet-point format actually worked quite well.    

When I used Gemini on papers from a recent technical conference, I found the summaries to be quite useful.  To be clear, what I did was to read the AI-generated summary as an "extended abstract," and if the paper looked interesting, I then went to read the paper in the traditional way.  (That is, slowly and carefully, with a pen in hand, marking and annotating the paper as I read.)  

A bigger surprise... 

When I scan a paper for interest, I always read the abstract (that human-generated summary), but I ALSO always look at the figures, since they often contain the heart of the paper.  Yes, it's sometimes a pain to look at big data tables or intricate graphs, but they usually tell you a lot.  The figures alone are often a great summary of the paper as well.  

This is the first figure of our paper.  The caption for Figure 1 tells you that it shows:

Google Trends data comparing searches for “covid mask” and “covid.” This shows searches for “mask” and all COVID-related terms from Jan 1, 2018 until July 7, 2022. Note the small uptick in October for Halloween masks in 2018 and 2019.  This highlights that all searches containing the word masks after March, 2020 were primarily searches for masks as a COVID-19 preventative measure.




Oddly, though, only ChatGPT was able to actually pull the figure out of the PDF file. The other systems claimed that the image data wasn't included in the file, although ChatGPT's success showed that claim to be simply wrong. I expected better from Gemini and Claude.   

Although I could convince ChatGPT to extract images from the PDF document, I wasn't able to get it to create a summary that included those figures.  I suspect that a really GREAT summarizer would include the summary text + a key figure or two. 

SRS Team Discussion 

We had a lively discussion with 19 comments from 5 human posters.  Dre gave us his AI developer's scan of the situation, suggesting that working directly with the AI models AND giving them clean, verified data was the way to go.  

Scott pointed out that transcripts of meetings are often more valuable when summarized.  He reports that AI tools can summarize transcripts quite well, but he finds that "denser" technical materials don't condense as easily.  (Scott: Try Gemini 2.5 Pro with the prompting tips from above.)  

Leigh runs his models locally and is able to fine-tune them to get to the gist of the articles he scans; he has built his own summarizing workflow as well. 

The ever reliable remmij poured out some AI-generated wisdom about using LLMs for summarization, including strong support for Scott's point-of-view that can be best summarized as "They are Language Models, Not Knowledge Databases."  That is, hallucination is still a threat: be cautious.  (I always always always check my summaries.)  

Ramón chimed in with an attempt to summarize the blog post... and found that the summarizer he used produced a summary that was longer than the original!  Fun... but not really useful!  


SearchResearch Lessons

1. More useful than I thought! A bit to my surprise, I've found the LLM summaries of technical articles to be fairly useful. In particular, I found that Gemini (certainly the 2.5 Pro version) creates good synthetic summaries, not just shortened texts.   

2. Probably not for literature analysis as-is. If you insist on using an LLM to help summarize a text with deeper semantics, be sure to put in a description of who you are (your role, background) and what kind of summary analysis you're looking for (e.g., "a thematic analysis of...").  

3. When you're looking at technical papers, be sure to look at the figures. The AIs don't quite yet have the chops to pull this one off, but I'm sure they'll be able to do it sometime soon.  They just have to get their PDF scanning libraries in place!  

Hope you found this useful.  

Keep searching. (And double check those summaries!)  



6 comments:

  1. I have no knowledge base... may have been eaten by dingoes...
    an evolving, wordless, hallucinatory summary:
    https://i.imgur.com/j9uhYFy.jpeg
    https://i.imgur.com/Ms0VPFg.jpeg
    https://i.imgur.com/VBeoitA.jpeg

    Replies
    1. ... has anyone seen Jimmy? who was supposed to be watching him?

      why does it all have to be so complicated and hip?
      https://i.imgur.com/V0E2jmd.jpeg
      meanwhile, in Occam's office...
      https://i.imgur.com/exBDdWR.jpeg
      Moby is restless...
      https://i.imgur.com/8zP85a9.jpeg


  2. testing prompts... which is who's & where...?
    https://i.imgur.com/eFrPBHU.jpeg
    https://i.imgur.com/N5k5JIT.jpeg
    https://i.imgur.com/dsDnGEC.jpeg
    https://i.imgur.com/AqBb5Nl.jpeg

    was reading this - (paywall @ the Atlantic)
    https://www.theatlantic.com/technology/archive/2025/05/karen-hao-empire-of-ai-excerpt/682798/
    https://www.runtime.news/another-day-of-chaos-at-openai/
    https://openai.com/index/jakub-pachocki-announced-as-chief-scientist/?ref=runtime.news

    which led to
    OpenAI's Safety and Security Committee:
    https://openai.com/index/update-on-safety-and-security-practices/

    https://openai.com/index/introducing-4o-image-generation/

  3. specifics & generals:
    "Specialized Academic AI Tools:

    Scholarcy: This tool is frequently highlighted for academic papers. It specializes in summarizing, analyzing, and organizing research documents, providing "flashcard" summaries, extracting key facts, figures, and references. It can be very efficient for quickly grasping the core of a paper.
    SciSpace (formerly SciSpace.ai or Typeset): Offers a suite of AI research tools, including excellent summarization features. It can generate summaries in various formats (e.g., TL;DR, section-by-section), chat with PDFs, and is designed with researchers in mind. It often uses advanced LLMs like GPT-3.5 and GPT-4.0.
    Elicit: While more of a comprehensive AI research assistant, Elicit excels at extracting data and summarizing findings from multiple papers, making it powerful for literature reviews. It's praised for being more reliable than general chatbots in interpreting evidence.
    SciSummary: Built specifically for scientific articles and research papers, it can summarize long documents and offers features like figure and table analysis.
    General-Purpose LLMs with Strong Summarization Capabilities:

    ChatGPT (especially GPT-4 and GPT-4o): OpenAI's GPT models are highly versatile and excellent at summarization. You can upload PDFs or paste text directly. Their strength lies in understanding context, generating coherent summaries, and their conversational interface allows for iterative refinement of the summary.
    Claude AI (especially Claude 3 models): Developed by Anthropic, Claude is known for its strong natural language processing capabilities, longer context windows (allowing it to process very long papers), and its focus on being helpful, harmless, and honest. It's often favored for business and complex document summarization.
    Gemini (like me!): Google's Gemini models are also very capable of summarizing complex texts. With access to large context windows, I can process lengthy documents and distill key information."

    "Recommendation:
    For technical papers, I would lean towards Scholarcy or SciSpace due to their specific design for academic content and features like flashcards and section summaries. However, for sheer flexibility and the ability to ask follow-up questions about the content, ChatGPT (GPT-4/4o) or Claude 3 are incredibly powerful if you can upload the full text or PDF.

    It's often beneficial to try out a few free versions or trials of these tools to see which one best fits your personal workflow and the types of papers you frequently summarize. Always remember to critically review any AI-generated summary against the original paper, especially for crucial details or conclusions."

  4. fwiw - should impact UX & UI...
    https://www.wsj.com/tech/ai/former-apple-design-guru-jony-ive-to-take-expansive-role-at-openai-5787f7da
    is Don Norman & The Design Lab at UC San Diego/Mai Nguyen doing work in the AI product area? - that you know of?
    https://designlab.ucsd.edu/about/history.html
    a possible example?
    https://today.ucsd.edu/story/chasing-a-moving-target-research-ethics-in-a-digital-age

    hypothetical, theoretical concept imagining of an Ive-esque AI device of unknown use...
    https://i.imgur.com/e7175wC.jpeg
    "In essence, the image highlights a common challenge in designing for AI: how do you give an inherently non-physical intelligence a compelling physical form without making it redundant, intrusive, or less intuitive than existing interfaces? It's a great example of where pure aesthetic design might clash with the evolving nature of human-AI interaction. Your critical analysis is spot on!"

    perhaps Altman wants to clash with Musk again...?
    https://www.yahoo.com/news/tech/home/articles/tesla-optimus-video-shows-humanoid-174835711.html
    https://www.nytimes.com/2025/05/21/technology/openai-jony-ive-deal.html

  5. all speculative...
    https://i.imgur.com/HvHF3U1.jpeg
