Grok “Hallucinates” Far Less Than ChatGPT and Gemini — but What Exactly Was Measured Matters

4PDA, citing a "reliability study," reports that Grok's share of so-called hallucinations is around 8%, compared to 35% for ChatGPT and 38% for Google Gemini. The numbers sound striking, and that is precisely why they should be read with caution. Without a clear description of the methodology (what questions were asked, how errors were counted, how "I don't know" responses were scored, how many runs were performed, and which model versions were tested), such percentages make for viral headlines but say little about a model's real reliability on a specific task.

The core issue is that "hallucination" is not a single metric but an entire class of errors, and different tests capture different types of falsehoods. Vectara's Hallucination Leaderboard, for example, measures "faithfulness" in summarization tasks: how often a model introduces facts that were not present in the source document. In that setting many models score in the low single digits, because it is a different task regime with a different evaluation criterion.
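To make the distinction concrete, here is a minimal sketch of how a faithfulness-style hallucination rate can be computed over a set of (source, summary) pairs. The judge function `is_faithful` is a placeholder: Vectara's leaderboard uses its own trained evaluation model, and this sketch only assumes that some binary faithfulness judgment is available.

```python
from typing import Callable

def hallucination_rate(
    sources: list[str],
    summaries: list[str],
    is_faithful: Callable[[str, str], bool],
) -> float:
    """Fraction of summaries that introduce facts absent from the source.

    `is_faithful(source, summary)` is a placeholder judge (e.g. an NLI
    model or a human annotator); real leaderboards use trained evaluators.
    """
    assert len(sources) == len(summaries)
    unfaithful = sum(
        not is_faithful(src, summ) for src, summ in zip(sources, summaries)
    )
    return unfaithful / len(summaries)
```

Note that this number depends entirely on the judge and the task: the same model can look near-perfect here and much worse on an open-ended factual-QA test.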


There are also tests of another type, "omniscience" benchmarks, where the model is expected to honestly say "I don't know." Under these conditions a familiar weakness of LLMs becomes apparent: they tend to guess confidently rather than admit uncertainty. On the independent Artificial Analysis benchmark (AA-Omniscience), for instance, TechRadar notes a very high rate of "fabrication" by Gemini 3 Flash in situations where the model should have declined to answer.
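An omniscience-style test scores abstention differently from a wrong answer. The sketch below uses a hypothetical rubric (+1 for a correct answer, 0 for "I don't know," -1 for a confident wrong answer); AA-Omniscience's actual scoring rules differ in detail, but the underlying idea, that guessing is penalized while honest abstention is not, is the same.

```python
from dataclasses import dataclass

@dataclass
class Item:
    answer: str | None  # None means the model abstained ("I don't know")
    gold: str

def score_run(items: list[Item]) -> dict[str, float]:
    """Score a factual-QA run where honest abstention is allowed.

    Hypothetical rubric: +1 correct, 0 abstained, -1 wrong, so a model
    that guesses confidently scores worse than one that declines.
    """
    net, wrong, attempted = 0, 0, 0
    for it in items:
        if it.answer is None:
            continue  # abstention: neither rewarded nor penalized
        attempted += 1
        if it.answer.strip().lower() == it.gold.strip().lower():
            net += 1
        else:
            net -= 1
            wrong += 1
    return {
        "net_score": net / len(items),
        # Share of attempted answers that were fabricated rather than correct.
        "fabrication_rate": wrong / attempted if attempted else 0.0,
    }
```

Under a rubric like this, a cautious model that often abstains can outscore a fluent one that answers everything, which is exactly the behavior a single headline percentage hides.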

The practical conclusion is simple: such comparisons are useful as a signal that the market has started to measure reliability publicly, but any "8% versus 35%" figure should be treated not as absolute truth, but as the outcome of one specific test. If FUTURUM returns to this topic (and it will), the key question will always be the same: which scenario was tested (summarization, factual QA, "I don't know" behavior, RAG, code, agent actions) and which model versions were involved.
