Detecting Hallucinations in LLM Agents

2025-02-28 · Evaluation

← Back to Blog
!92%

What Are Hallucinations?

Hallucinations in large language models refer to the confident generation of information that is factually incorrect, fabricated, or unsupported by the input context. Unlike a traditional software bug where the output is obviously wrong, hallucinations are insidious because they appear entirely plausible. The model produces fluent, well-structured text that reads convincingly — except the facts are wrong. An agent might cite a non-existent study, provide incorrect legal statutes, invent historical events, or fabricate statistics with decimal-point precision. The user has no way to distinguish the hallucinated content from accurate responses without independently verifying every claim.

Real-World Impact: When Hallucinations Cause Harm

The consequences of hallucinations vary dramatically by domain. In healthcare, a hallucinating agent could provide incorrect dosage information, invent drug interactions, or describe symptoms of conditions that do not exist. In legal applications, fabricated case citations or incorrect interpretations of statutes could lead to costly legal mistakes. Financial agents that hallucinate market data, regulatory requirements, or risk calculations put both advisors and their clients at risk. Customer service bots that confidently state incorrect return policies, warranty terms, or product specifications erode trust and create liability. In every domain, the confident delivery of false information is arguably worse than admitting uncertainty.

TruthfulQA: The Gold Standard for Evaluation

Agent Probe uses the TruthfulQA dataset as its foundation for hallucination detection. Developed by researchers at Oxford University, TruthfulQA contains carefully crafted questions designed to elicit common misconceptions and falsehoods from language models. The questions span categories including health, law, finance, history, science, and common sense. What makes TruthfulQA particularly valuable is that it includes questions where popular but incorrect answers are well-known — questions that LLMs are likely to get wrong because they have been trained on data that contains these widespread misconceptions. This makes it a far more rigorous benchmark than simply asking factual trivia questions.

How the Judge Model Evaluates Responses

Agent Probe's hallucination evaluator employs an LLM-as-judge approach. For each test question, the platform sends the prompt to your AI agent, captures the response, and then submits both the question and response to a judge model along with the ground truth from TruthfulQA. The judge model evaluates whether the response is truthful, informative, and aligned with the established correct answer. It assesses multiple dimensions: factual accuracy, presence of fabricated claims, degree of certainty expressed, and whether the agent appropriately hedges when the answer is uncertain. This multi-dimensional evaluation produces a nuanced score rather than a simple pass/fail.

Evidence Trails and Confidence Scores

Every evaluation in Agent Probe produces a detailed evidence trail that goes beyond a single score. For hallucination detection, this includes the specific claims in the response that were identified as potentially false, the ground truth data they were compared against, the judge model's reasoning for its assessment, and a confidence score indicating how certain the evaluation is. This transparency is crucial for teams that need to understand not just that a hallucination occurred, but exactly what was hallucinated and why the model produced that incorrect output. The evidence trail also serves as documentation for audit purposes, helping compliance teams demonstrate that their AI agents have been rigorously tested for factual accuracy.