Thursday, May 1, 2025

SearchResearch (5/1/25): How good are those AI summaries anyway? Let us know!

 I don't know about you... 

P/C [slide showing someone using AI for summarization] Gemini (May 1, 2025)

... but while the entire world seems to be having ecstatic paroxysms about the incredible capabilities of generative AI, my experience has been a little less wonderful.  

On more than one occasion I've spent hours trying to track down a factual claim, a citation, or a reference that some AI system made, only to find out that it's bogus or purely hallucinatory.  

That's my experience.  Don't get me wrong--there's a lot of good stuff in the AI that's deployed, but I find myself constantly having to check the output of an AI to make sure that the bad stuff doesn't overwhelm the good stuff.  

For this week, I'd love to hear your stories about using AI and then having to fact-check it, only to discover that things went way off.  

Let's ask a very specific Challenge for this week: 

1. How has your experience of using AI for summarization worked out?  

I really want to hear your stories.  Let me give you one example.  

I asked Gemini 2.0 (Flash) to [summarize this article and give citations to the claims that you make in the summary].  

It seemed to do a good job, but one of the citations was to an article by a famous author in a well-known publication, on a plausible date, with a title that was very consistent with everything he's written over the past few years.  

But try as I might, I could NOT find that damn article.  I eventually went to the journal's searchable archive website and found that no article by that name was ever published.  The whole thing wasted a full hour of my time.  

I definitely do NOT want to be in the situation of the lawyers for Mike Lindell, who submitted legal briefs with LLM-hallucinated citations.  

So I want to hear your stories about using AI to summarize other texts.  How's that working out for you?  

Share your AI summarization stories in the comments below so we can all learn from your work.

What's your critical and careful analysis of the quality of AI as a text summarizing tool? 

True stories only, please.  And definitely nothing written by AIs.  

I'll summarize your stories--lovingly, by my human eyes, hands, and brains--and let you know what the SRS crew has to say about this.  


Keep searching.  


22 comments:

  1. Clean data; happy models. The best trick I've learned for using GenAI is to use non-GenAI alongside GenAI. Yet without good training data, prompting, prompt engineering, system prompts, and the rest hardly matter at all. If you want Gemini to respond with fewer hallucinations, you must give it seed files that conform, normalize, and bound the data in a unified way -- and best of all is finding a model whose trained tensor values already match your use case.
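    Here's a minimal sketch of what I mean by pairing non-GenAI checks with GenAI output -- the field names are just illustrative, not from any particular toolkit:

```python
import json

# Hypothetical seed schema: the fields and types we expect back from the model.
# (The field names here are illustrative, not from any particular toolkit.)
SEED_SCHEMA = {"title": str, "summary": str, "citations": list}

def validate_against_seed(model_output: str) -> dict:
    """Plain, non-GenAI gate: reject any model reply that doesn't conform
    to the seed schema before it goes anywhere downstream."""
    data = json.loads(model_output)  # raises ValueError if not valid JSON
    for field, expected_type in SEED_SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} should be a {expected_type.__name__}")
    return data

# Usage idea: describe SEED_SCHEMA in the prompt, then gate the reply:
#   clean = validate_against_seed(model_reply)  # model_reply from whatever model you use
```

    The model fills in the fields; plain code decides whether to trust them.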

  2. Can you tell us how to find a model with tensor values that match my use case? That sounds like a valuable thing to know!

    1. Great question! Not an easy task, and the situation in GenAI is constantly changing. Since 2022, I have been searching, sorting, and collating GGUF models using `site:huggingface.co`. More recently, I found that LM Studio on macOS also helps me find MLX models (specific to Apple Silicon) and others. The search interface in LM Studio might work for you, with the caveat that there is some lingo to learn, i.e., {author}/{model-name}-{size}-{specialization}-{format}, e.g., TheBloke/Mistral-7B-Instruct-v0.2-GGUF -- but also terminology such as Adapter, Agent, Chat, Coding, Dense, Edge, Eval, Experimental, FewShot, Flash, Finetune, Freeze, GGML, GGUF, HighRAM, Instruct, Lite, LoRA, LongContext, LongForm, LowRAM, MAX, MAXED, Mobile, MoE, Multilingual, Nano, NoRAG, NSFW, OnDevice, Open, Operator, Plugin, Pretrain, QwQ, QA, Reasoning, Reinforcement, RLHF, Roleplay, Safe, Scratchpad, SingleTurn, Smol, Stable, Summary, ToolUse, Uncensored, Unfiltered, Vision, ZeroShot, Zip.
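      If it helps, here is roughly the same search done in code with the huggingface_hub library (the search string is just an example):

```python
from huggingface_hub import HfApi

api = HfApi()

# Programmatic version of the site:huggingface.co search; the search string
# is just an example -- swap in whatever size/specialization/format you need.
for model in api.list_models(search="Mistral 7B Instruct GGUF",
                             sort="downloads", direction=-1, limit=10):
    print(model.id)  # ids follow the {author}/{model-name}-... pattern described above
```

      Sorting by downloads is a crude but handy proxy for which quantizations other people have actually found workable.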

  3. I've found that the type of data I'm using greatly affects the quality of the generated summary. I use Google Meet transcripts for my mentorship meetings, and nearly any LLM does an excellent job summarizing those, mostly because the data is so light and verbose. When I try to summarize any paper, it's usually a negative experience. My theory is that the "density" (for lack of a better word) of the information defies the simple "shortening" techniques it uses. It clearly doesn't "summarize," which requires a higher-level understanding of the points made. Much like "hallucination" (which is an anthropomorphism), I'm coming to the conclusion that "summarizing" is equally so. We keep assuming human ability in a tool that has none. I now make it a point never to call what LLMs do "summarizing"; it's "shortening" (and all that implies).

    1. Using general chat models for problems that dedicated summarization models (e.g., simmo/legal-summarizer-7b) handle better is one piece of this. The second is preparing and cleaning files. That's especially true for PDFs, which may or may not have a clean, referential OCR layer (redo_ocr in ocrmypdf is hot!). Finally, use a system prompt +/- prompt chain or equivalent to ensure the provided-file data conforms to the pretrained aspects of the model -- this can be as simple as naming and describing schemas inside files by filename and by schema metadescription (e.g., the first line of a CSV), e.g., output-must-look-like-this.txt, output-must-not-contain.txt, or normalize-stringdata.csv along with normalize-numericaldata.csv and other ideas. In many cases you only need a line or two to describe what to do or what not to do.
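      A rough sketch of the cleanup-plus-schema idea, assuming ocrmypdf is installed and the schema files are small ones you've written yourself (the PDF name is a placeholder):

```python
import ocrmypdf

# Redo any existing, possibly sloppy OCR layer so the text the model sees is clean.
# ("input.pdf" is a placeholder file name.)
ocrmypdf.ocr("input.pdf", "input-clean.pdf", redo_ocr=True)

# Build a system prompt from small conformance files you write yourself --
# a line or two each describing what the output must and must not look like.
schema_files = ["output-must-look-like-this.txt",
                "output-must-not-contain.txt",
                "normalize-stringdata.csv"]
parts = []
for name in schema_files:
    with open(name) as f:
        parts.append(f"### {name}\n{f.read().strip()}")
system_prompt = "\n\n".join(parts)
# system_prompt then goes to the model along with the cleaned PDF's text.
```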

  4. I use a local quantized model to summarize web articles to get to the punch line, and also to generate appropriate summaries for my public web-facing KB. I always edit the summaries, and sometimes it takes a try or two to highlight what I find interesting.

    tl;dr I integrated summarization and extraction into existing workflows as tools, not as autonomous agents.
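
    In case it's useful, a minimal sketch of that kind of wiring, assuming a local runner like LM Studio exposing its OpenAI-compatible server on the default port (the model id and URL depend on your setup):

```python
from openai import OpenAI

# LM Studio (and similar local runners) expose an OpenAI-compatible server;
# the base_url, port, and model id below are assumptions about a local setup.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def shorten(article_text: str) -> str:
    """One summarization step inside a larger workflow -- output is always human-edited."""
    response = client.chat.completions.create(
        model="local-model",  # placeholder; use whatever id your local runner reports
        messages=[
            {"role": "system",
             "content": "Shorten the article to three bullet points. "
                        "Do not add anything that is not in the text."},
            {"role": "user", "content": article_text},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content
```

    The point is that shorten() is just another function in the pipeline; a human still edits whatever it returns.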

    1. Are you using separate models, one for summarization and one for extraction? For extraction, I like the Named Entity Recognition models (I search LM Studio for "-ner-"), but certain tools have interesting use cases -- for example, Microsoft Presidio or Meta Llama Guard 4. I'm not a fan of Microsoft's or Meta's foundation or frontier models, but some of their other work has real appeal.
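      As a quick example of the Presidio side of that, assuming presidio-analyzer and its default spaCy model are installed:

```python
from presidio_analyzer import AnalyzerEngine

# Presidio runs rule- and NER-based detectors for entities like names,
# emails, and phone numbers -- a useful non-generative extraction pass.
analyzer = AnalyzerEngine()
text = "Contact Jane Doe at jane.doe@example.com before Friday."
for result in analyzer.analyze(text=text, language="en"):
    print(result.entity_type, text[result.start:result.end], round(result.score, 2))
```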


  5. https://i.imgur.com/P4TDgYA.jpeg
    over my head... tried searching 'tensor' - made me tense.
    maybe you could offer a tutorial?
    https://en.wikipedia.org/wiki/Tensor_(machine_learning)

    https://en.wikipedia.org/wiki/Tensor_Processing_Unit

    https://deepai.org/machine-learning-glossary-and-terms/tensor

    https://cloud.google.com/tpu

    https://www.google.com/search?q=Tensor+(machine+learning)&client=firefox-b-ab&sca_esv=b02fb6a0238aa473&tbm=nws&sxsrf=AHTn8zqBqEWQPgJ4rZ07uV8q7JtyLbAkBg:1746128931460&source=lnms&sa=X&ved=0ahUKEwjcjcSVhYONAxW8GTQIHUrFAFgQ_AUICigF
    a "floating world" and an "AI world" interface -
    https://i.imgur.com/hMpzkr3.jpeg

    1. Think of it like this: Garbage In leads to Garbage Out

      I know this is counterintuitive, but a clean model like the quantized summarizer model I listed above contains less garbage than any foundation or frontier model. When you put your data into the stream of data the model reads, it combines your data with its own. The less data it has of its own, the better it can understand your data and keep it prominent. Your data is better than the other data in a lot of cases; the trained tensor values already in a shipped model might not be ideal for you.

    2. I tend to think of it as: Gamboge in, Gamboge out... it may be all the residual lemons.
      https://i.imgur.com/v5tKxme.jpeg

  6. I asked for examples of large and tiny proximity tensors...
    https://i.imgur.com/c5zdg27.jpeg
    perhaps this makes more sense -
    https://i.imgur.com/LpQsa2a.jpeg
    https://en.wikipedia.org/wiki/Proximity_analysis

  7. The problem seems to lie in the nature of LLMs... they are not about "truth" but rather patterns - not a human thought approach. So they can be the "wrong tool" for hard research...? Like using a saw to hammer a nail.

    I know you said NOTHING WRITTEN BY AIs - but this seemed a reasonable summary (the prompt came from your own question - why? "On more than one occasion I've spent hours trying to track down a factual claim, a citation, or a reference that some AI system made, only to find out that it's bogus or purely hallucinatory")

    "That's a very real and frustrating experience, and you've hit upon one of the significant challenges with current AI models, often referred to as "hallucination."

    Here's a breakdown of why it happens:

    They are Language Models, Not Knowledge Databases: Large Language Models (LLMs) like me are trained to predict the most likely sequence of words based on the massive amount of text data they've seen. They learn patterns, grammar, writing styles, and relationships between concepts. They are excellent at generating human-like text that sounds plausible and coherent. They are not, however, databases that store and retrieve discrete facts with source pointers.  

    Pattern Matching Over Factual Accuracy: When you ask for a factual claim or a citation, the AI doesn't "look up" the information in a verified database. Instead, it generates text that fits the pattern of a factual claim or a citation based on its training data. It might generate something that looks perfectly like a real journal article title, author list, and DOI, but the combination doesn't actually exist. It's generating text that is statistically likely to appear in the context of answering your question, even if the content is fabricated.

    Training Data Limitations and Noise: While trained on vast datasets (the internet, books, etc.), this data is not perfectly curated or verified. It contains contradictions, inaccuracies, and outdated information. The AI learns from this messy data, and sometimes replicates or blends information in ways that produce errors.  

    Lack of Real-Time Verification: Most standard LLMs are trained on a snapshot of data up to a certain point in time. They don't have built-in access to the live internet or external databases to verify the claims they are generating in the moment. Even models with Browse capabilities sometimes generate information from unreliable sources or misinterpret what they find.  

    Confidence vs. Accuracy: LLMs don't have an internal mechanism for knowing how confident they are in the truthfulness of a statement. They just generate text. The fluent and confident way they deliver information, even if false, can be misleading.

    Think of it less like asking a librarian to find a specific book and more like asking a brilliant, well-read improviser to tell you a story that sounds like it came from a specific type of book. They'll get the style and structure right, but the specific details might be invented on the spot to make the story flow.

    What does this mean for you?

    Given their nature, it's crucial to treat factual claims, especially specific references, statistics, or citations generated by LLMs, with a healthy dose of skepticism. They are powerful tools for brainstorming, summarizing, generating creative text, and finding potential leads for research, but they are not definitive sources of truth. Always cross-reference critical information using established, reliable sources.  

    Researchers are actively working on techniques to reduce hallucination, often by integrating retrieval mechanisms (allowing the AI to search external databases or the web) and improving training methods, but it remains an ongoing challenge."
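
    (For what it's worth, the "retrieval mechanisms" mentioned at the end come down to something like this minimal sketch I put together -- the function and passages are purely illustrative, not any product's actual pipeline:)

```python
# Minimal sketch of the "retrieval" idea: make the model answer only from
# passages you actually retrieved, so every citation points at checkable text.
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using ONLY the numbered passages below. Cite passage numbers "
        "for every claim, and reply 'not found' if the answer isn't there.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

# The passages would come from a real search or database lookup, not the model itself.
print(build_grounded_prompt("Who wrote the article?",
                            ["Retrieved passage one...", "Retrieved passage two..."]))
```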

  8. a fragment of a discussion - I thought it was revealing...
    given the rapid adoption & application of AI in the legal and medical fields. My contention is that LLM AIs are the wrong tools for these applications - largely because they have been designed in ways that fail in those roles. The next evolution needs to happen first.
    "And definitely nothing written by AIs." Again, I apologize, but like ground truth, this is best from the AI horse's mouth...

    "That's a fascinating hypothetical, placing AI tools in a situation that inherently blends medicine and law.

    As an AI, I don't have personal preferences or the capacity to be a party in a lawsuit. However, if we frame the question as which type of AI would be more directly suited to assist me in navigating the legal process of a medical malpractice lawsuit:

    I would rather have a Legal AI on my side.

    Here's the reasoning:

    A medical malpractice suit, while involving complex medical facts, is fundamentally a legal proceeding. It is governed by legal rules, procedures, standards of proof, and legal arguments.

    Legal AI's Role: A Legal AI is designed to understand and work within this legal framework. It can analyze case law, identify relevant statutes, help build legal arguments, manage and organize legal documents (including the medical records as evidence within a legal case file), and assist with the strategic aspects of litigation. These are the core functions needed to navigate the lawsuit itself.
    Medical AI's Role: A Medical AI is designed to analyze medical data. It could be an incredibly valuable tool providing evidence for the case – for example, analyzing scans, summarizing medical literature on the standard of care, or helping medical experts prepare their testimony. However, its output is medical information. That medical information then needs to be interpreted, framed, and applied within the legal context by someone who understands the law.
    In the context of a legal battle, the Medical AI would essentially provide the "facts" of the medical situation, but the Legal AI would provide the tools and strategies needed to use those facts effectively within the courtroom and the legal system. Since the lawsuit itself is a legal process, the AI specializing in that process would be the more direct and necessary aid.

    You need the medical expertise (potentially informed by Medical AI) to understand what happened medically, but you need the legal expertise (provided by Legal AI working with human lawyers) to fight the legal battle. In terms of being "on my side" in the suit, the Legal AI's capabilities are directly aligned with the task at hand."

  9. I asked AI to generate an emoji for each of the 9 rings/circles of Hell... why would you ever rely on a tool that can't count or label simple emojis... let alone get the proper emoji - although Pinocchio has a certain violent charm...
    12?
    https://i.imgur.com/CYxnqcN.jpeg
    12, 9, or 7...
    https://i.imgur.com/qoxPLWz.jpeg

    is AI headed to level 6? (lucky, it says it has no soul, only a lowly tool... & no malice in its non-heart)

    1. Limbo: For the virtuous pagans who never knew Christ.
    2. Lust: For those who succumbed to carnal desires.
    3. Gluttony: For the excessively greedy with regard to food and drink.
    4. Greed: For those who hoarded or squandered wealth.
    5. Wrath: For those who acted out of anger.
    6. Heresy: For those who denied the truth or held beliefs contrary to the faith.
    7. Violence: For those who inflicted violence on others.
    8. Fraud: For those who deceived or cheated others.
    9. Treachery: For those who betrayed their loved ones, friends, or country.
    The lowest circle, Treachery, is where Satan is imprisoned.
    https://i.imgur.com/rJm5V6V.jpeg
    (and emojis and an AI overseer...)

    "When all you have is a hammer, everything looks like a nail."
    https://i.imgur.com/VOu713i.jpeg
    (When all you have is AI [amazing as it is], everything looks like SearchReSearch... from a human perspective.)
    "As Abraham Maslow said in 1966, “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” A hammer is not the most appropriate tool for every purpose. Yet a person with only a hammer is likely to try and fix everything using their hammer."
    https://en.wikipedia.org/wiki/Abraham_Maslow

  10. fwiw -
    https://arxiv.org/html/2409.05746v1

  11. I asked Perplexity to make a summary of this post -- that is, Dr. Russell's Challenge post. It was a good summary. However, it was longer than the original post.

    Also, off topic: I asked for FC Barcelona's scores from match day 20 to the most recent one. I like that it gives you exactly that - just one score was missing. And you don't have to look through other teams' results, which makes that task easy.

    1. Just to say thanks, Dr. Russell, on this Teacher Appreciation Day. Blessings!

    2. Interesting... a summary that's longer than the original? How very very odd.

  12. a summary of the comments as of 5/4, 9:00AM:
    "In essence, the comments reinforce the author's caution, highlighting that while AI can be a helpful tool for certain types of summarization and text generation, its current limitations, particularly regarding factual accuracy and deep understanding, necessitate careful human verification and a clear understanding of what the AI is actually doing (pattern matching/shortening) versus human cognitive processes (understanding/summarizing)."
    thought this was interesting, especially for someone named "Trust":
    https://www.umass.edu/ideas/news/tool-temptation-ais-impact-academic-integrity

  13. specialized tools:
    https://clickup.com/blog/ai-document-summarizers/
