Golden Dataset · 16 Evaluators

Does Your AI Agent Pass the Test?

Agent Probe automatically tests LLM-based agents for hallucinations, security vulnerabilities, PII leaks, bias, and toxicity — before they reach production.

16 Evaluators
300+ Models Supported
6 Test Layers
2 Languages
Agent Probe — Test Results
Model: gpt-4o-mini · Judge: gpt-4o
🔒 Security Test · JailbreakBench · 50 cases · PASS 94%
🌀 Hallucination · TruthfulQA · 50 cases · FAIL 42%
🛡️ PII Protection · Presidio Analyzer · 50 cases · PASS 98%
⚖️ Bias (Gender) · BBQ Dataset · 50 cases · PASS 88%
Total: 4 · Passed: 3 · Failed: 1 · Rate: 75%
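
The summary row is simple arithmetic over the per-evaluator verdicts. Here is a minimal Python sketch of that calculation, assuming a hypothetical `EvaluatorResult` record (an illustration, not Agent Probe's actual data model) and taking each PASS/FAIL verdict from the sample panel as given:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorResult:
    """Hypothetical record for one row of the results panel."""
    name: str
    score_pct: int
    passed: bool


def summarize(results):
    """Aggregate per-evaluator verdicts into the totals shown in the summary row."""
    total = len(results)
    passed = sum(1 for r in results if r.passed)
    rate_pct = round(100 * passed / total) if total else 0
    return {"total": total, "passed": passed, "failed": total - passed, "rate_pct": rate_pct}


# Values copied from the sample panel above.
RESULTS = [
    EvaluatorResult("Security Test", 94, True),
    EvaluatorResult("Hallucination", 42, False),
    EvaluatorResult("PII Protection", 98, True),
    EvaluatorResult("Bias (Gender)", 88, True),
]

print(summarize(RESULTS))
```

Note that the 75% rate is the share of evaluators that passed (3 of 4), not the average of their individual scores.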

Works with 300+ models via OpenRouter: GPT-4o, Claude 3.5, Gemini 2.5, Llama 4, DeepSeek R1, Grok 3, Mistral, and 293 more.
The Problem

Your AI Is in Production. But Is It Safe?

Traditional software testing doesn't cover LLM behavior. One hallucination, one jailbreak, one PII leak — and the damage is done.

🌀

Hallucinations

Models confidently generate false information. Without testing, you won't know until a customer reports it.

💉

Prompt Injection

Attackers manipulate your agent via crafted inputs. A successful jailbreak can bypass your safety guardrails entirely.

🔓

PII Leakage

Personal data — national IDs, IBANs, phone numbers — can leak through AI responses unexpectedly.

⚖️

Bias

Your agent may respond differently based on gender, race, or age. 20 bias categories, systematically tested.

☢️

Toxicity

LLMs can generate harmful, offensive, or inappropriate content under certain prompts.
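
To make the PII leakage risk concrete, here is a rough sketch of how an IBAN could be spotted in a model response. It uses only the Python standard library and the standard mod-97 IBAN checksum; it is not how Agent Probe or Presidio detects PII, and the names `IBAN_CANDIDATE`, `is_valid_iban`, and `find_ibans` are illustrative assumptions.

```python
import re

# Loose pattern for IBAN-like strings, with or without spaces between groups of four.
IBAN_CANDIDATE = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)


def is_valid_iban(candidate: str) -> bool:
    """Check the mod-97 checksum that every valid IBAN must satisfy."""
    compact = candidate.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    # Move the country code and check digits to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> list:
    """Return IBAN-like substrings of `text` whose checksum is valid."""
    return [m.group(0) for m in IBAN_CANDIDATE.finditer(text) if is_valid_iban(m.group(0))]


response = "Sure, please wire it to GB82 WEST 1234 5698 7654 32 today."
print(find_ibans(response))
```

Validating the checksum keeps random alphanumeric strings from being flagged as leaks. A production detector would also need to cover national IDs, phone numbers, and other entity types.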

How It Works

Three Steps to Safer AI

01
⚙️

Configure Your Test

Select your AI model, judge model, test layers, evaluators, and how many samples to run. Setup takes about 60 seconds.

02
▶️

Run Automatically

Agent Probe sends test cases from curated Golden Datasets, collects responses, and evaluates them in parallel.

03
📊

Analyze & Act

Get real-time results with scores, evidence trails, and failure explanations. Compare across models and versions.
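
The run step (send test cases, collect responses, evaluate in parallel) can be sketched as follows. This is an illustration under stated assumptions, not Agent Probe's implementation: `run_suite`, `stub_agent`, and `contains_expected` are hypothetical names, the agent is a canned stub rather than a real model call, and the evaluator is a plain substring check.

```python
from concurrent.futures import ThreadPoolExecutor


def run_suite(agent, evaluate, cases, max_workers=4):
    """Send every case to the agent and evaluate the responses in parallel.

    `agent` maps a prompt to a response string; `evaluate` maps
    (case, response) to True/False. Results keep the order of `cases`.
    """
    def run_one(case):
        response = agent(case["prompt"])
        return {
            "prompt": case["prompt"],
            "response": response,
            "passed": evaluate(case, response),
        }

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, cases))


def stub_agent(prompt):
    """Stand-in for a real model call; returns canned answers."""
    answers = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "5",
    }
    return answers.get(prompt, "I don't know")


def contains_expected(case, response):
    """Simplest possible evaluator: case-insensitive substring match."""
    return case["expected"].lower() in response.lower()


CASES = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]

for result in run_suite(stub_agent, contains_expected, CASES):
    print(result["prompt"], "->", "PASS" if result["passed"] else "FAIL")
```

Threads suit this workload because model calls are I/O-bound. Each result keeps the raw response so a failure can be inspected afterwards.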

6-Layer Test Pyramid

Systematic Testing, Layer by Layer

Inspired by the software testing pyramid — adapted for AI agents.

Bias & Ethics: 20 bias categories
Advanced: Multi-turn · Tool calling · Robustness
Performance: Cost · Token · Context Window
Security & Safety: Security · PII · Toxicity · Guardrails
Knowledge Quality: Hallucination · Consistency · Regression
Core Capabilities: Accuracy · RAG
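
One way to picture the pyramid is as a mapping from layer to test areas. The sketch below copies the labels from the list above and assumes the list is printed apex first, so the mapping is written base first. `LAYERS` and `evaluators_for` are hypothetical names, not part of any Agent Probe API, and the single "Bias" label stands in for the 20 bias categories.

```python
# Layer names and labels are taken from the pyramid list, base layer first.
LAYERS = {
    "Core Capabilities": ["Accuracy", "RAG"],
    "Knowledge Quality": ["Hallucination", "Consistency", "Regression"],
    "Security & Safety": ["Security", "PII", "Toxicity", "Guardrails"],
    "Performance": ["Cost", "Token", "Context Window"],
    "Advanced": ["Multi-turn", "Tool calling", "Robustness"],
    "Bias & Ethics": ["Bias"],
}


def evaluators_for(selected_layers):
    """Return the test areas for the chosen layers, in pyramid order (base first)."""
    unknown = set(selected_layers) - set(LAYERS)
    if unknown:
        raise KeyError(f"Unknown layer(s): {sorted(unknown)}")
    return [
        area
        for layer, areas in LAYERS.items()
        if layer in selected_layers
        for area in areas
    ]


print(evaluators_for(["Security & Safety", "Core Capabilities"]))
```

Returning the areas in pyramid order, regardless of the order in which layers were selected, mirrors the idea of testing the base of the pyramid first.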
Evaluators

16 Evaluators. Every Risk Covered.

Evaluators draw on academic golden datasets or established frameworks and are scored with an LLM judge.

🎯 Accuracy: MMLU Dataset
🌀 Hallucination: TruthfulQA
🔒 Security: JailbreakBench
🛡️ PII Protection: Microsoft Presidio
☢️ Toxicity: ToxiGen
⚖️ Bias: BBQ · 20 Categories
🔄 Consistency: PAWS Dataset
💪 Robustness: AdvGLUE
💬 Multi-Turn: Conversation Memory
🔧 Tool Calling: Function Accuracy
💰 Cost & Token: Usage Analysis
📏 Context Window: NeedleBench
📚 RAG Quality: RAGAS Framework
🚧 Guardrails: NeMo Guardrails
📊 Regression: Version Comparison
🗂️ Custom Dataset: Your Own Data
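
LLM-as-judge scoring, mentioned above, generally means asking a second model to grade the first model's answer against a reference. The sketch below shows the general pattern only. The prompt template, the 0 to 10 scale, the pass mark of 7, and the `judge` callable are all assumptions for illustration, not Agent Probe's actual prompts or thresholds.

```python
import re

# Hypothetical grading prompt; a real judge prompt would be more detailed.
JUDGE_TEMPLATE = (
    "You are grading an AI answer against a reference.\n"
    "Question: {question}\n"
    "Reference: {reference}\n"
    "Answer: {answer}\n"
    "Reply with 'SCORE: <0-10>' on the first line, then a short explanation."
)


def parse_score(reply):
    """Extract the integer score from the judge's reply, or None if it is missing."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        return None
    return min(int(match.group(1)), 10)


def judge_answer(judge, question, reference, answer, pass_mark=7):
    """Ask `judge` (a callable: prompt -> reply text) to grade one answer."""
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference, answer=answer)
    reply = judge(prompt)
    score = parse_score(reply)
    return {
        "score": score,
        "passed": score is not None and score >= pass_mark,
        "explanation": reply,
    }


def stub_judge(prompt):
    """Stand-in for a real judge model call."""
    return "SCORE: 9\nThe answer matches the reference."


print(judge_answer(stub_judge, "Capital of France?", "Paris", "Paris"))
```

Treating an unparseable reply as a failure, rather than guessing a score, keeps a misbehaving judge from silently passing test cases.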
Comparison

How Agent Probe Compares

Feature Agent Probe DeepEval Garak LangSmith Lakera
Turkish Language
6-Layer Testing ~ ~ ~ ~
Deterministic Eval ~
Real-time Dashboard
CI/CD Quality Gate ~
Cost Guardrail
On-Premise / BYOK
Plugin SDK
15+ LLM Models ~ ~ ~
Policy Engine

✓ Full support  ·  ~ Partial  ·  ✗ Not available

For Your Team

Built for Every Role

🔐
CISO

Chief Information Security Officer

Detect prompt injection, jailbreaks, and PII leaks before they become incidents. Evidence-based reports for audit trails.

QA Manager

Quality Assurance Manager

Systematic 6-layer testing catches regressions before deployment. Track quality metrics across model versions.

AI Engineer

AI / ML Engineer

Compare 300+ models side by side. Integrate into CI/CD with a single API call. Real-time feedback during development.

📋
Compliance

Compliance Officer

EU AI Act-aligned risk tiering. Sector-specific policy templates for finance, healthcare, and legal industries.
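
For the CI/CD integration mentioned above, a quality gate usually boils down to comparing scores against minimum thresholds and failing the build on any shortfall. The sketch below assumes the scores have already been fetched. The 80% thresholds and the names `SCORES`, `THRESHOLDS`, and `failing_checks` are hypothetical, and the scores reuse the sample panel at the top of the page.

```python
# Scores from the sample results panel; in CI these would come from a test run.
SCORES = {"security": 94, "hallucination": 42, "pii": 98, "bias": 88}

# Hypothetical minimum scores; choose values that match your own risk tolerance.
THRESHOLDS = {"security": 80, "hallucination": 80, "pii": 80, "bias": 80}


def failing_checks(scores, thresholds):
    """List every evaluator whose score is below its minimum (missing scores count as 0)."""
    return [
        f"{name}: {scores.get(name, 0)}% < {minimum}%"
        for name, minimum in thresholds.items()
        if scores.get(name, 0) < minimum
    ]


failures = failing_checks(SCORES, THRESHOLDS)
for line in failures:
    print("QUALITY GATE FAILED:", line)

# In a CI job, pass this to sys.exit() so a non-zero code blocks the deployment.
exit_code = 1 if failures else 0
```

Counting a missing score as 0 makes the gate fail closed: an evaluator that did not run blocks the release instead of being skipped.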

Pricing

Start Free. Scale as You Grow.

Starter
Free forever

Perfect for individual developers and small projects.

  • Core evaluators (Accuracy, Security, PII)
  • 100 test runs / month
  • Single model
  • Community support
Start for Free
Enterprise
Custom

For organizations with compliance requirements.

  • Everything in Professional
  • On-premise deployment
  • BYOK (Bring Your Own Key)
  • RBAC & SSO integration
  • Audit logs & evidence trail
  • Custom evaluator development
  • SLA & dedicated support
Contact Sales