Evaluating AI Agents Like You Ship Software
How I built evaluation pipelines for non-deterministic systems, and why vibes-based testing doesn't scale.
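When you ship deterministic software, testing is straightforward: given input A, you expect output B. AI agents break this assumption: the same input can produce different outputs from run to run, and more than one of those outputs may be acceptable.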
The Problem with Vibes-Based Testing
Most teams building AI agents rely on what I call "vibes-based testing": you run the agent, look at the output, and decide whether it seems right. This works at zero scale and breaks at the first production incident.
Building Real Evaluation Pipelines
The trick is treating AI evaluation like a software testing problem. You need:
- A ground truth dataset: curated examples with known-correct outputs
- Deterministic assertions: checks that don't require human review for every run
- Statistical sampling: you can't evaluate every output, but you can build confidence intervals
- Regression tracking: so a model upgrade doesn't silently break behavior that was working
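To make the first three items concrete, here is a minimal sketch of how they might fit together. It is an illustration under assumptions, not the pipeline described later in this post: the `Example` fields, the `check` rules (required substrings and a word limit), and the `agent` callable are all hypothetical stand-ins. The confidence interval uses the standard Wilson score formula, which behaves better than the naive normal approximation on small samples.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    """One ground-truth case. Field names are illustrative assumptions."""
    prompt: str
    must_contain: list[str]  # substrings a correct answer has to include
    max_words: int           # upper bound on answer length


def check(example: Example, output: str) -> bool:
    """Deterministic assertion: no human review, same verdict on every run."""
    text = output.lower()
    has_terms = all(term.lower() in text for term in example.must_contain)
    return has_terms and len(output.split()) <= example.max_words


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def evaluate(agent: Callable[[str], str], dataset: list[Example],
             sample_size: int, seed: int = 0) -> dict:
    """Score a seeded random sample of the dataset instead of every example."""
    if not dataset or sample_size < 1:
        raise ValueError("need a non-empty dataset and sample_size >= 1")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    passes = sum(check(ex, agent(ex.prompt)) for ex in sample)
    low, high = wilson_interval(passes, len(sample))
    return {"n": len(sample), "pass_rate": passes / len(sample),
            "ci_low": low, "ci_high": high}
```

Reporting the interval rather than just the pass rate matters with small samples: 8 passes out of 10 looks like 80%, but the interval it supports is wide enough that a few more failures on the next run would not be surprising.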
What I Built at Surface Labs
At Surface Labs, we evaluate content generation across four frontier models. The evaluation pipeline runs on every deploy and catches regressions before they ship.
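A deploy-time regression gate can be quite small. The sketch below is one possible shape, not the actual Surface Labs code: it assumes each run writes a JSON file mapping a model name to its pass rate, and the file layout, the `tolerance` default, and the function names are all hypothetical.

```python
import json
import sys


def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """List every model whose pass rate dropped by more than `tolerance`."""
    regressions = []
    for model, old_rate in baseline.items():
        new_rate = current.get(model)
        if new_rate is None:
            # A model that vanished from the run is treated as a failure.
            regressions.append(f"{model}: missing from current run")
        elif new_rate < old_rate - tolerance:
            regressions.append(
                f"{model}: pass rate fell from {old_rate:.2f} to {new_rate:.2f}"
            )
    return regressions


def gate(baseline_path: str, current_path: str, tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 if clean, 1 if any model regressed."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    problems = find_regressions(baseline, current, tolerance)
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

In a deploy script this would be wired up with something like `sys.exit(gate("baseline.json", "current.json"))`, so a non-zero exit blocks the release. The tolerance exists because pass rates on sampled, non-deterministic outputs wobble between runs; without it the gate would fail on noise.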
Key insight: the model call is 10% of the system. The hard part is making evaluation fast enough that developers actually run it.
— Amisha
Filed under: AI Systems