
News Junction

Study suggests that even the best AI models hallucinate a bunch

Published August 14, 2024

All generative AI models hallucinate, from Google’s Gemini to Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o. In other words, the models are unreliable narrators, sometimes to hilarious effect, other times problematically so.

But not all models make things up at the same rate. And the kinds of mistruths they spout depend on which sources of info they’ve been exposed to.

A recent study from researchers at Cornell, the universities of Washington and Waterloo and the nonprofit research institute AI2 sought to benchmark hallucinations by fact-checking models like GPT-4o against authoritative sources on topics ranging from law and health to history and geography. They found that no model performed exceptionally well across all topics, and that models that hallucinated the least did so partly because they refused to answer questions they’d otherwise get wrong.

“The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations,” Wenting Zhao, a doctoral student at Cornell and a co-author on the research, told TechCrunch. “At present, even the best models can generate hallucination-free text only about 35% of the time.”

There have been other academic attempts at probing the “factuality” of models, including one by a separate AI2-affiliated team. But Zhao notes that these earlier tests asked models questions with answers easily found on Wikipedia, which is not exactly the toughest ask, considering most models are trained on Wikipedia data.

To make their benchmark more challenging — and to more accurately reflect the types of questions people ask of models — the researchers identified topics around the web that don’t have a Wikipedia reference. Just over half the questions in their test can’t be answered using Wikipedia (they included some Wikipedia-sourced ones for good measure), and touch on topics including culture, geography, astronomy, pop culture, finance, medicine, computer science and celebrities.

For their study, the researchers evaluated over a dozen different popular models, many of which were released in the past year. In addition to GPT-4o, they tested “open” models such as Meta’s Llama 3 70B, Mistral’s Mixtral 8x22B and Cohere’s Command R+, as well as gated-behind-API models like Perplexity’s Sonar Large (which is based on Llama), Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 Opus.

The results suggest that models aren’t hallucinating much less these days, despite claims to the contrary from OpenAI, Anthropic and the other big generative AI players.

GPT-4o and OpenAI’s much older flagship GPT-3.5 performed about the same in terms of the percentage of questions they answered factually correctly on the benchmark. (GPT-4o was marginally better.) OpenAI’s models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R and Perplexity’s Sonar models.

Questions pertaining to celebrities and finance gave the models the hardest time, while questions about geography and computer science were the easiest for them to answer (perhaps because their training data contained more references to these). In cases where the source of an answer wasn’t Wikipedia, every model answered less factually on average (GPT-3.5 and GPT-4o especially), suggesting that they’re all informed heavily by Wikipedia content.

Even models that can search the web for information, like Command R and Perplexity’s Sonar models, struggled with “non-Wiki” questions in the benchmark. Model size didn’t matter much; smaller models (e.g. Anthropic’s Claude 3 Haiku) hallucinated roughly as frequently as larger, ostensibly more capable models (e.g. Claude 3 Opus).

So what does all this mean — and where are the improvements that vendors promised?

Well, we wouldn’t put it past vendors to exaggerate their claims. But a more charitable take is that the benchmarks they’re using aren’t fit for this purpose. As we’ve written about before, many, if not most, AI evaluations are transient and devoid of important context, doomed to fall victim to Goodhart’s law.

Regardless, Zhao says that she expects the issue of hallucinations to “persist for a long time.”

“Empirical results in our paper indicate that, despite the promise of certain methods to reduce or eliminate hallucinations, the actual improvement achievable with these methods is limited,” she said. “Additionally, our analysis reveals that even the knowledge found on the internet can often be conflicting, partly because the training data — authored by humans — can also contain hallucinations.”

An interim solution could be simply programming models to refuse to answer more often, the technical equivalent of telling a know-it-all to knock it off.

In the researchers’ testing, Claude 3 Haiku answered only around 72% of the questions it was asked, choosing to abstain from the rest. When accounting for the abstentions, Claude 3 Haiku was in fact the most factual model of them all — at least in the sense that it lied least often.
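
To see why abstentions can change which model looks most factual, here is a minimal Python sketch. Apart from the roughly 72% response rate reported above, every number in it is made up for illustration, and the function name and metric definitions are assumptions, not the scoring method the researchers actually used.

```python
def factuality_rates(total_questions: int, answered: int, correct: int) -> dict:
    """Compare factuality measured over all questions vs. only answered ones.

    Hypothetical helper for illustration; not the study's scoring code.
    """
    if total_questions <= 0 or not (0 <= correct <= answered <= total_questions):
        raise ValueError("expected 0 <= correct <= answered <= total_questions")
    return {
        # Share of questions the model chose to answer at all.
        "response_rate": answered / total_questions,
        # Correct answers as a share of every question asked.
        "factual_over_all": correct / total_questions,
        # Correct answers as a share of the questions actually answered.
        "factual_over_answered": correct / answered if answered else 0.0,
    }


# Illustrative numbers only: a cautious model that answers 72 of 100
# questions, and an eager model that answers all 100.
cautious = factuality_rates(total_questions=100, answered=72, correct=36)
eager = factuality_rates(total_questions=100, answered=100, correct=40)

print(cautious)
print(eager)
```

With these invented figures, the eager model gets more questions right in absolute terms, but the cautious model is wrong less often when it does answer. That is the sense in which a model that abstains a lot can come out as the most factual.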

But will people use a model that doesn’t answer many questions? Zhao thinks not and says vendors should focus more of their time and efforts on hallucination-reducing research. Eliminating hallucinations entirely may not be possible, but they can be mitigated through human-in-the-loop fact-checking and citation during a model’s development, she asserts.

“Policies and regulations need to be developed to ensure that human experts are always involved in the process to verify and validate the information generated by generative AI models,” Zhao added. “There are still numerous opportunities to make significant impacts in this field, such as developing advanced fact-checking tools for any free text, providing citations for factual content and offering corrections for hallucinated texts.”
