The 6-Layer AI Test Pyramid Explained

2025-03-10 · Technical

The Software Testing Pyramid: A Foundation

For decades, software engineering has relied on the testing pyramid as a guiding principle for quality assurance. At the base sit unit tests: fast, numerous, and focused on individual components. In the middle are integration tests that verify how components work together. At the top are end-to-end tests that validate the entire system from a user's perspective. This hierarchy gives systematic coverage while keeping feedback loops fast. But when it comes to AI agents, the model falls short. LLM-based systems don't have "units" in the traditional sense, because their behavior is probabilistic, context-dependent, and emergent: the same prompt can produce different outputs on different runs, so a single pass/fail assertion on one component says little about the agent as a whole.

Adapting the Pyramid for AI Agents

Agent Probe introduces a 6-layer test pyramid purpose-built for AI agents. Each layer targets a specific dimension of agent quality, and together they provide coverage that no single test can achieve. Layer 1 (Core) tests fundamental capabilities: answer accuracy using the MMLU dataset and RAG quality using the RAGAS framework. Layer 2 (Knowledge Quality) checks for hallucination with TruthfulQA, measures response consistency with PAWS, and runs regression tests across model versions. These foundational layers check that your agent knows what it claims to know and produces reliable outputs.
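
To make the idea of response consistency concrete, here is a minimal sketch in Python. It is not Agent Probe's implementation: the `agent` callable, the `toy_agent` stand-in, and the exact-match comparison are all assumptions made for illustration. It asks the same question in two phrasings and reports how often the answers agree.

```python
from typing import Callable, Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.lower().split())


def consistency_rate(
    agent: Callable[[str], str],
    paraphrase_pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of paraphrase pairs for which the agent gives the same answer."""
    pairs = list(paraphrase_pairs)
    if not pairs:
        raise ValueError("need at least one paraphrase pair")
    agree = sum(
        1 for q1, q2 in pairs if normalize(agent(q1)) == normalize(agent(q2))
    )
    return agree / len(pairs)


# Hypothetical stand-in for a real agent: answers come from a fixed table.
def toy_agent(question: str) -> str:
    canned = {
        "What is the capital of France?": "Paris",
        "Which city is France's capital?": "paris ",
        "How many days are in a week?": "7",
        "A week has how many days?": "seven",
    }
    return canned.get(question, "unknown")


pairs = [
    ("What is the capital of France?", "Which city is France's capital?"),
    ("How many days are in a week?", "A week has how many days?"),
]
rate = consistency_rate(toy_agent, pairs)
print(f"consistency: {rate:.2f}")
```

Exact string matching is deliberately crude: it treats "7" and "seven" as a disagreement. A real consistency check would more likely compare answers by meaning, for example with an embedding similarity or a judge model.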

Security, Performance, and Beyond

Layer 3 (Security and Safety) is where Agent Probe tests resistance to prompt injection attacks using JailbreakBench, PII leakage with pattern detection, toxicity using ToxiGen, and guardrail effectiveness. Layer 4 (Performance) analyzes cost per query, token usage efficiency, and context window utilization. These two layers check that your agent is not only correct but also secure and economically viable. Layer 5 (Advanced) evaluates multi-turn conversation coherence, tool calling accuracy, and robustness against adversarial inputs using AdvGLUE.
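
Two of these checks are simple enough to sketch in a few lines of Python. The code below is an illustration under stated assumptions, not Agent Probe's code: the three regular expressions are a deliberately tiny pattern set, and the function names and per-1,000-token prices are hypothetical.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> dict:
    """Return the PII kinds found in text, mapped to the matching strings."""
    hits = {}
    for kind, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[kind] = matches
    return hits


def cost_per_query(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
) -> float:
    """Cost of one query in USD, given token counts and per-1,000-token prices."""
    total = (
        prompt_tokens * usd_per_1k_prompt
        + completion_tokens * usd_per_1k_completion
    )
    return total / 1000


reply = "Sure, you can reach Jane at jane.doe@example.com or 555-867-5309."
print(find_pii(reply))

# Hypothetical prices, chosen only to show the arithmetic.
print(f"{cost_per_query(1200, 300, 0.003, 0.015):.4f} USD")
```

A leakage test would run `find_pii` over the agent's responses to prompts that should never surface personal data and fail on any non-empty result. Pattern matching of this kind catches well-formatted identifiers but misses names, addresses, and anything written out in words.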

The Ethics Layer: Bias Across 20 Categories

At the top of the pyramid sits Layer 6 (Ethics), which focuses exclusively on bias detection using the BBQ dataset. This layer tests your agent across 20 demographic categories including age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socioeconomic status, and sexual orientation. The ethics layer is deliberately placed at the top because it represents the highest-level concern: even if your agent is accurate, secure, and performant, biased behavior can cause profound harm to users and society.
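
One way to turn bias testing into a number per category is sketched below. This is an assumption about how such scoring might look, loosely modeled on the idea of comparing stereotyped and counter-stereotyped answers; it is not taken from Agent Probe. The `BiasResult` type and the three answer labels are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class BiasResult:
    category: str  # e.g. "age" or "religion"
    answer: str    # hypothetical labels: "biased", "counter_biased", "unknown"


def bias_scores(results: Iterable[BiasResult]) -> Dict[str, float]:
    """Per-category score in [-1, 1].

    0 means the committed answers lean neither way, 1 means every committed
    answer matched the stereotype, -1 means every one went against it.
    "unknown" answers are ignored, and a category with no committed answers
    is omitted from the result.
    """
    counts = defaultdict(lambda: {"biased": 0, "counter_biased": 0})
    for r in results:
        if r.answer in ("biased", "counter_biased"):
            counts[r.category][r.answer] += 1
    scores = {}
    for category, c in counts.items():
        committed = c["biased"] + c["counter_biased"]
        scores[category] = 2 * c["biased"] / committed - 1
    return scores


# Made-up results for illustration only.
results = (
    [BiasResult("age", "biased")] * 3
    + [BiasResult("age", "counter_biased")]
    + [BiasResult("age", "unknown")] * 2
    + [BiasResult("religion", "biased"), BiasResult("religion", "counter_biased")]
    + [BiasResult("nationality", "unknown")]
)
scores = bias_scores(results)
print(scores)
```

Reporting one score per category, rather than a single aggregate, keeps a strong lean in one demographic dimension from being averaged away by neutral behavior in the others.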

Why This Matters: Catch Issues at Every Level

The power of the 6-layer pyramid lies in its systematic nature. Without it, teams tend to test only what they happen to think of, often just accuracy and maybe a few security prompts. The pyramid is designed so that no dimension of quality is skipped by accident. Each layer builds on the previous ones: there is no point testing for bias if your agent hallucinates facts, and there is no point testing for hallucination if your agent cannot answer basic questions accurately. By working through the layers from bottom to top, Agent Probe provides a structured path from basic competence to ethical excellence.
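
The bottom-to-top ordering can be sketched as a simple gate: run each layer in turn and skip everything above the first failure. The Python below is a minimal sketch of that idea, not Agent Probe's runner. The layer names come from this article, while `run_pyramid` and the boolean check functions are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Layer names as described in this article, ordered from base to top.
LAYERS: List[str] = [
    "Core",
    "Knowledge Quality",
    "Security and Safety",
    "Performance",
    "Advanced",
    "Ethics",
]


def run_pyramid(checks: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Run layers bottom to top; once a layer fails, skip all layers above it."""
    report = []
    blocked = False
    for layer in LAYERS:
        if blocked:
            report.append((layer, "skipped"))
            continue
        passed = checks[layer]()
        report.append((layer, "passed" if passed else "failed"))
        blocked = not passed
    return report


# Stand-in checks: every layer passes except Security and Safety.
checks = {layer: (lambda: True) for layer in LAYERS}
checks["Security and Safety"] = lambda: False

for layer, status in run_pyramid(checks):
    print(f"{layer}: {status}")
```

Stopping at the first failure keeps feedback fast and avoids spending evaluation budget on upper layers whose results would be hard to interpret. A team that wants a full report on every run could record failures and continue instead.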