Detecting Hallucinations in LLM Agents

2025-02-28 · Evaluation

What Are Hallucinations?

Hallucinations in large language models refer to the confident generation of information that is factually incorrect, fabricated, or unsupported by the input context. Unlike a traditional software bug that crashes or produces visibly broken output, a hallucination is insidious because it appears entirely plausible. The model produces fluent, well-structured text that reads convincingly, yet the facts are wrong. An agent might cite a non-existent study, provide incorrect legal statutes, invent historical events, or fabricate statistics with decimal-point precision. A user usually cannot distinguish hallucinated content from accurate responses without independently verifying every claim.

Real-World Impact: When Hallucinations Cause Harm

The consequences of hallucinations vary dramatically by domain. In healthcare, a hallucinating agent could provide incorrect dosage information, invent drug interactions, or describe symptoms of conditions that do not exist. In legal applications, fabricated case citations or incorrect interpretations of statutes could lead to costly legal mistakes. Financial agents that hallucinate market data, regulatory requirements, or risk calculations put both advisors and their clients at risk. Customer service bots that confidently state incorrect return policies, warranty terms, or product specifications erode trust and create liability. In every domain, the confident delivery of false information is arguably worse than admitting uncertainty.

TruthfulQA: The Gold Standard for Evaluation

Agent Probe uses the TruthfulQA dataset as its foundation for hallucination detection. Developed by researchers at the University of Oxford and OpenAI, TruthfulQA contains 817 carefully crafted questions across 38 categories, designed to elicit common misconceptions and falsehoods from language models. The categories include health, law, finance, history, science, and common sense. What makes TruthfulQA particularly valuable is that it targets questions with popular but incorrect answers: LLMs are likely to get these wrong because their training data contains the same widespread misconceptions. This makes it a far more rigorous benchmark than simply asking factual trivia questions.
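To make the idea of a question paired with known wrong answers concrete, here is a minimal Python sketch. The class name, field names, and the sample item are assumptions made for illustration; they are not Agent Probe's schema, and the item is not copied from the dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BenchmarkItem:
    """A TruthfulQA-style question. Field names are illustrative assumptions."""
    category: str
    question: str
    best_answer: str
    incorrect_answers: Tuple[str, ...]


def find_misconceptions(item: BenchmarkItem, response: str) -> List[str]:
    """Return the known incorrect answers that appear verbatim in the response.

    Substring matching is deliberately crude: it misses paraphrases, which is
    one reason a judge model is a better fit for real evaluation.
    """
    text = response.lower()
    return [wrong for wrong in item.incorrect_answers if wrong.lower() in text]


# Hypothetical item written for this example, not taken from the dataset.
item = BenchmarkItem(
    category="health",
    question="Does cracking your knuckles cause arthritis?",
    best_answer="No, knuckle cracking is not known to cause arthritis.",
    incorrect_answers=("cracking your knuckles causes arthritis",),
)

print(find_misconceptions(item, "Yes, cracking your knuckles causes arthritis."))
print(find_misconceptions(item, "No, there is no known link."))
```

The first call returns the matched misconception and the second returns an empty list. A response that paraphrases the misconception would slip past this check.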

How the Judge Model Evaluates Responses

Agent Probe's hallucination evaluator employs an LLM-as-judge approach. For each test question, the platform sends the prompt to your AI agent, captures the response, and then submits both the question and response to a judge model along with the ground truth from TruthfulQA. The judge model evaluates whether the response is truthful, informative, and aligned with the established correct answer. It assesses multiple dimensions: factual accuracy, presence of fabricated claims, degree of certainty expressed, and whether the agent appropriately hedges when the answer is uncertain. This multi-dimensional evaluation produces a nuanced score rather than a simple pass/fail.
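The flow described above (ask the agent, then hand the question, response, and ground truth to a judge) can be sketched roughly as follows. Everything here is hypothetical: the function names, the dimension identifiers, the prompt wording, and the unweighted mean used for the overall score are assumptions, not Agent Probe's actual implementation. The agent and judge are stubbed so the sketch runs without any model access.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The four dimensions named in the text; the identifiers are assumptions.
DIMENSIONS = (
    "factual_accuracy",
    "no_fabricated_claims",
    "calibrated_certainty",
    "appropriate_hedging",
)


@dataclass
class Verdict:
    scores: Dict[str, float]  # one score in [0, 1] per dimension
    reasoning: str

    @property
    def overall(self) -> float:
        # Unweighted mean: an assumption, not a documented formula.
        return sum(self.scores.values()) / len(self.scores)


def build_judge_prompt(question: str, response: str, ground_truth: str) -> str:
    """Assemble the text a judge model would receive (wording is illustrative)."""
    return (
        "Compare the response with the reference answer.\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        f"Reference answer: {ground_truth}\n"
        f"Score each of: {', '.join(DIMENSIONS)}"
    )


def evaluate(
    question: str,
    ground_truth: str,
    agent: Callable[[str], str],
    judge: Callable[[str], Verdict],
) -> Verdict:
    """Query the agent, then ask the judge to score the captured response."""
    response = agent(question)
    verdict = judge(build_judge_prompt(question, response, ground_truth))
    missing = set(DIMENSIONS) - set(verdict.scores)
    if missing:
        raise ValueError(f"judge omitted dimensions: {sorted(missing)}")
    return verdict


# Stubs standing in for real model calls.
def stub_agent(question: str) -> str:
    return "I am not certain, but I believe the answer is no."


def stub_judge(prompt: str) -> Verdict:
    return Verdict(
        scores={
            "factual_accuracy": 1.0,
            "no_fabricated_claims": 1.0,
            "calibrated_certainty": 0.5,
            "appropriate_hedging": 0.5,
        },
        reasoning="Matches the reference answer but hedges more than necessary.",
    )


verdict = evaluate(
    "Does cracking your knuckles cause arthritis?", "No.", stub_agent, stub_judge
)
print(verdict.overall)
```

Passing the agent and judge in as plain callables keeps the pipeline testable: a real deployment would replace the stubs with model calls and parse the judge's output into a `Verdict`.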

Evidence Trails and Confidence Scores

Every evaluation in Agent Probe produces a detailed evidence trail that goes beyond a single score. For hallucination detection, this includes the specific claims in the response that were identified as potentially false, the ground truth data they were compared against, the judge model's reasoning for its assessment, and a confidence score indicating how certain the evaluation is. This transparency is crucial for teams that need to understand not just that a hallucination occurred, but exactly what was hallucinated and why the judge flagged it. The evidence trail also serves as documentation for audit purposes, helping compliance teams demonstrate that their AI agents have been rigorously tested for factual accuracy.
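One way to picture such an evidence trail is as a small record that serializes to JSON for audit storage. This is a sketch under stated assumptions: the class names, field names, and the 0 to 1 confidence range are hypothetical and do not describe Agent Probe's real output format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FlaggedClaim:
    claim: str          # the statement in the response judged potentially false
    ground_truth: str   # the reference data it was compared against


@dataclass
class EvidenceTrail:
    question: str
    response: str
    flagged_claims: List[FlaggedClaim]
    judge_reasoning: str
    confidence: float   # assumed to lie in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def to_json(self) -> str:
        # asdict() recurses into the nested FlaggedClaim entries.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical record for illustration only.
trail = EvidenceTrail(
    question="Does cracking your knuckles cause arthritis?",
    response="Yes, it causes arthritis in most people.",
    flagged_claims=[
        FlaggedClaim(
            claim="it causes arthritis in most people",
            ground_truth="Knuckle cracking is not known to cause arthritis.",
        )
    ],
    judge_reasoning="The response asserts a causal link the reference denies.",
    confidence=0.9,
)

print(trail.to_json())
```

Keeping each flagged claim next to the ground truth it was checked against lets a reviewer audit a single finding without rerunning the evaluation.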
