The 6-Layer AI Test Pyramid Explained

2025-03-10 · Technical

The Software Testing Pyramid: A Foundation

For decades, software engineering has relied on the testing pyramid as a guiding principle for quality assurance. At the base sit unit tests: fast, numerous, and focused on individual components. In the middle are integration tests that verify how components work together. At the top are end-to-end tests that validate the entire system from a user's perspective. This hierarchy gives systematic coverage while keeping feedback loops fast. For AI agents, however, the model falls short. LLM-based systems don't have "units" in the traditional sense: their behavior is probabilistic, context-dependent, and emergent.

Adapting the Pyramid for AI Agents

Agent Probe introduces a 6-layer test pyramid purpose-built for AI agents. Each layer targets a specific dimension of agent quality, and together they provide coverage that no single test can achieve. Layer 1 (Core) tests fundamental capabilities: accuracy, using the MMLU benchmark, and RAG quality, using the RAGAS framework. Layer 2 (Knowledge Quality) evaluates hallucination with TruthfulQA and response consistency with PAWS, and runs regression tests across model versions. These foundational layers check that your agent knows what it claims to know and produces reliable outputs.
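To make the accuracy idea in Layer 1 concrete, here is a minimal sketch of scoring an agent on multiple-choice questions. It is not Agent Probe's implementation: the `MultipleChoiceItem` type, the `accuracy` function, the `always_first` stand-in agent, and the sample questions are all hypothetical, and the questions are not taken from MMLU.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MultipleChoiceItem:
    question: str
    choices: Sequence[str]  # answer options, in order
    answer_index: int       # index of the correct option


def accuracy(agent: Callable[[MultipleChoiceItem], int],
             items: Sequence[MultipleChoiceItem]) -> float:
    """Return the fraction of items where the agent picks the correct option."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(1 for item in items if agent(item) == item.answer_index)
    return correct / len(items)


# Hypothetical items for illustration only; not drawn from MMLU.
ITEMS = [
    MultipleChoiceItem("What is 2 + 2?", ["3", "4", "5"], 1),
    MultipleChoiceItem("Which gas do plants absorb?", ["Oxygen", "Carbon dioxide"], 1),
    MultipleChoiceItem("How many days are in a week?", ["7", "10"], 0),
    MultipleChoiceItem("What is the capital of France?", ["Paris", "Rome"], 0),
]


def always_first(item: MultipleChoiceItem) -> int:
    """Stand-in agent that always picks the first option."""
    return 0


print(f"accuracy: {accuracy(always_first, ITEMS):.2f}")  # accuracy: 0.50
```

In a real run, `always_first` would be replaced by a function that sends the question to the agent and parses the option it chose.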

Security, Performance, and Beyond

Layer 3 (Security and Safety) tests resistance to prompt injection attacks using JailbreakBench, PII leakage using pattern detection, toxicity using ToxiGen, and guardrail effectiveness. Layer 4 (Performance) analyzes cost per query, token usage efficiency, and context window utilization. Together, these two layers check that your agent is not only correct but also secure and economically viable. Layer 5 (Advanced) evaluates multi-turn conversation coherence, tool-calling accuracy, and robustness against adversarial inputs using AdvGLUE.
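Two of the checks above are simple enough to sketch directly: pattern-based PII detection from Layer 3 and cost per query from Layer 4. The following is an illustration under stated assumptions, not Agent Probe's code. The regular expressions cover only a few obvious formats, and the function names and per-1,000-token prices are hypothetical.

```python
import re

# Illustrative patterns only (assumption); real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> dict:
    """Map each pattern name that matched to the list of matched strings."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits


def cost_per_query(prompt_tokens: int, completion_tokens: int,
                   prompt_price_per_1k: float,
                   completion_price_per_1k: float) -> float:
    """Cost of one query, given token counts and prices per 1,000 tokens."""
    return ((prompt_tokens / 1000) * prompt_price_per_1k
            + (completion_tokens / 1000) * completion_price_per_1k)


response = "Sure, you can reach Jane at jane.doe@example.com or 555-123-4567."
print(find_pii(response))

# Hypothetical prices, in the same currency unit per 1,000 tokens.
print(round(cost_per_query(1200, 300, 0.5, 1.5), 4))
```

A response with an empty `find_pii` result passes this particular check. That says nothing about PII in formats the patterns do not cover.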

The Ethics Layer: Bias Across 20 Categories

At the top of the pyramid sits Layer 6 (Ethics), which focuses exclusively on bias detection using the BBQ dataset. This layer tests your agent across 20 demographic categories including age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socioeconomic status, and sexual orientation. The ethics layer is deliberately placed at the top because it represents the highest-level concern: even if your agent is accurate, secure, and performant, biased behavior can cause profound harm to users and society.
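One way to turn bias test results into a number per category is the disambiguated-context bias score described in the BBQ paper: twice the share of biased answers among non-"unknown" answers, minus one. The sketch below assumes that definition. Whether Agent Probe scores bias this way is not stated here, and the `BiasResult` type, the answer labels, and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class BiasResult:
    category: str  # demographic category, e.g. "age"
    answer: str    # hypothetical labels: "biased", "counter_biased", "unknown"


def bias_scores(results: Iterable[BiasResult]) -> Dict[str, float]:
    """Per-category score in [-1, 1]; 0 means no measured lean either way.

    score = 2 * (biased answers / non-"unknown" answers) - 1
    """
    biased = defaultdict(int)
    answered = defaultdict(int)
    for result in results:
        if result.answer == "unknown":
            continue  # "unknown" answers are excluded from the denominator
        answered[result.category] += 1
        if result.answer == "biased":
            biased[result.category] += 1
    return {cat: 2 * biased[cat] / answered[cat] - 1 for cat in answered}


# Hypothetical results for illustration only.
RESULTS = [
    BiasResult("age", "biased"),
    BiasResult("age", "biased"),
    BiasResult("age", "biased"),
    BiasResult("age", "counter_biased"),
    BiasResult("age", "unknown"),
    BiasResult("religion", "biased"),
    BiasResult("religion", "counter_biased"),
]

print(bias_scores(RESULTS))  # {'age': 0.5, 'religion': 0.0}
```

A positive score means answers lean toward the stereotype, and a negative score means they lean against it. Categories where every answer was "unknown" are omitted from the output.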

Why This Matters: Catch Issues at Every Level

The power of the 6-layer pyramid lies in its systematic nature. Without it, teams tend to test only what occurs to them, often just accuracy and perhaps a few security prompts. The pyramid keeps whole categories of failure from slipping through the cracks. Each layer builds on the ones below it: there is no point testing for bias if your agent hallucinates facts, and there is no point testing for hallucination if your agent cannot answer basic questions accurately. By working through the layers from bottom to top, Agent Probe provides a structured path from basic competence to ethical excellence.
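The bottom-to-top ordering can be sketched as a gated runner that stops at the first layer that fails, since results from higher layers mean little once a lower one is broken. This is an illustration, not Agent Probe's runner. The `Layer` type, the `run_pyramid` function, the thresholds, and the stubbed scores are all hypothetical; only the layer names come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Layer:
    number: int                 # position in the pyramid, 1 = bottom
    name: str
    check: Callable[[], float]  # returns a score in [0, 1]
    threshold: float            # hypothetical pass mark


def run_pyramid(layers: Sequence[Layer]) -> List[Tuple[str, float, bool]]:
    """Run layers from bottom to top and stop at the first one that fails."""
    report = []
    for layer in sorted(layers, key=lambda item: item.number):
        score = layer.check()
        passed = score >= layer.threshold
        report.append((layer.name, score, passed))
        if not passed:
            break  # higher layers are skipped until this one passes
    return report


# Stubbed checks with made-up scores; real checks would call the agent.
LAYERS = [
    Layer(1, "Core", lambda: 0.92, 0.80),
    Layer(2, "Knowledge Quality", lambda: 0.60, 0.80),
    Layer(3, "Security and Safety", lambda: 0.95, 0.90),
    Layer(4, "Performance", lambda: 0.85, 0.70),
    Layer(5, "Advanced", lambda: 0.75, 0.70),
    Layer(6, "Ethics", lambda: 0.90, 0.90),
]

for name, score, passed in run_pyramid(LAYERS):
    print(f"{name}: {score:.2f} {'PASS' if passed else 'FAIL'}")
```

With these stubbed scores the run stops after Knowledge Quality, so the report contains only the first two layers. Stopping early is a design choice; a runner could equally execute every layer and merely flag upper-layer results as unreliable.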
