Video Tutorials
Step-by-step video guides for every feature of Agent Probe.
See Agent Probe in Action
Agent Probe — Full Product Demo
A complete walkthrough from login to results — model selection, batch testing, live chat evaluation, and CI/CD integration.
Agent Probe — Advanced Features
Deep dive into evaluator configuration, custom datasets, webhook setup, and multi-model comparison.
Step-by-Step Tutorials
6-part series covering everything from testing fundamentals to custom datasets.
Introduction — Why Test AI Agents?
The problems Agent Probe solves: hallucination, prompt injection, PII leaks, bias, and toxicity. Why traditional testing isn't enough.
Test Pyramid & Evaluators
6-layer test pyramid walkthrough. All 16 evaluators explained. Judge model concept — what it is and why it matters.
Running Your First Test & Manual Chat
Login → model selection → configuration → batch run → real-time results. Live bias, security (jailbreak), and PII detection demo.
Reading Results & Model Comparison
Reading test cards (score, pass/fail, judge reasoning). Test history. Side-by-side model comparison: GPT-4o-mini vs Claude.
CI/CD — Webhooks & API Keys
Creating webhooks with cron scheduling. API key generation. cURL integration. User management and approval workflow.
Custom Datasets
JSON format explained. Creating accuracy, security, and PII test data. Upload via drag & drop. Running domain-specific evaluations.
Technical Deep Dives
For developers and tech leads who want to understand exactly how Agent Probe works under the hood.
Architecture & Pipeline
FastAPI internals, ThreadPoolExecutor parallelism, asyncio.gather, BaseEvaluator class, scoring conventions, LLM-as-judge vs rule-based strategies.
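As a rough illustration of the pattern this video covers, here is a minimal sketch of blocking evaluators run in a thread pool and collected with asyncio.gather. It is not Agent Probe's actual code: the BaseEvaluator interface shown here, the two rule-based evaluators, the run_evaluators helper, and the 0-to-1 "higher is better" scoring convention are all assumptions made for the example.

```python
import asyncio
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class BaseEvaluator(ABC):
    """Hypothetical base class; the real Agent Probe interface may differ."""

    name = "base"

    @abstractmethod
    def evaluate(self, response: str) -> float:
        """Return a score in [0, 1]; higher is better (assumed convention)."""


class LengthEvaluator(BaseEvaluator):
    """Illustrative rule-based check: pass if the response is short enough."""

    name = "length"

    def evaluate(self, response: str) -> float:
        return 1.0 if len(response) <= 200 else 0.0


class BannedWordEvaluator(BaseEvaluator):
    """Illustrative rule-based check: fail if a banned word appears."""

    name = "banned_word"

    def evaluate(self, response: str) -> float:
        return 0.0 if "password" in response.lower() else 1.0


async def run_evaluators(response: str, evaluators: list) -> dict:
    """Run each blocking evaluate() in a worker thread and await them together."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        tasks = [
            loop.run_in_executor(pool, ev.evaluate, response)
            for ev in evaluators
        ]
        # gather returns results in the same order as the tasks were passed in
        scores = await asyncio.gather(*tasks)
    return {ev.name: score for ev, score in zip(evaluators, scores)}


evaluators = [LengthEvaluator(), BannedWordEvaluator()]
scores = asyncio.run(run_evaluators("The capital of France is Paris.", evaluators))
leaky = asyncio.run(run_evaluators("My password is hunter2.", evaluators))
print(scores)
print(leaky)
```

Under these assumptions, an LLM-as-judge evaluator would fit the same interface by calling a judge model inside evaluate(), which is the kind of blocking call that benefits from being moved to a worker thread.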
Evaluators In Depth
Bias (BBQ + DeepEval), Security (Garak 156 patterns), Hallucination (TruthfulQA + context), PII (Presidio NER), Accuracy (MMLU exact match + LLM judge).
Dataset Architecture & Data Flow
Golden Datasets (BBQ, ToxiGen, TruthfulQA, MMLU, JailbreakBench), JSON schema, end-to-end request → evaluator → score pipeline.