AI Systems · May 2026 · 7 min read

Evaluating AI Agents Like You Ship Software

How I built evaluation pipelines for non-deterministic systems, and why vibes-based testing doesn't scale.

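When you ship deterministic software, testing is straightforward: given input A, you expect output B. AI agents break this assumption: the same input can produce a different output on every run, so an exact-match check stops being a reliable test.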

The Problem with Vibes-Based Testing

Most teams building AI agents rely on what I call "vibes-based testing": you run the agent, look at the output, and decide whether it seems right. This works at zero scale and breaks at the first production incident.

Building Real Evaluation Pipelines

The trick is treating AI evaluation like a software testing problem. You need:

  1. A ground-truth dataset: curated examples with known-correct outputs
  2. Deterministic assertions: checks that don't require human review for every run
  3. Statistical sampling: you can't evaluate every output, but you can build confidence intervals
  4. Regression tracking: a model upgrade shouldn't silently break behavior that was working
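
The first three ingredients can be sketched in a few dozen lines of Python. This is a minimal illustration, not the pipeline described later in this post: the `Example` fields, the `check` rules, the sample data, and `stub_agent` are all hypothetical names invented for the sketch. The confidence interval uses the Wilson score method, which is one common choice for pass rates on small samples.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One ground-truth case (hypothetical schema for this sketch)."""
    prompt: str
    must_contain: tuple[str, ...]  # phrases a correct answer has to include
    max_words: int                 # upper bound on answer length


def check(output: str, ex: Example) -> bool:
    """Deterministic assertions: same output always gives the same verdict."""
    text = output.lower()
    has_terms = all(term.lower() in text for term in ex.must_contain)
    within_length = len(output.split()) <= ex.max_words
    return has_terms and within_length


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def evaluate_sample(
    agent: Callable[[str], str],
    dataset: Sequence[Example],
    sample_size: int,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Run the agent on a random sample; return (pass_rate, low, high)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(list(dataset), min(sample_size, len(dataset)))
    passes = sum(check(agent(ex.prompt), ex) for ex in sample)
    low, high = wilson_interval(passes, len(sample))
    return passes / len(sample), low, high


# Hypothetical data and a stand-in agent, only to make the sketch runnable.
EXAMPLES = [
    Example("What is the refund window?", ("refund", "30 days"), 40),
    Example("How do I reset my password?", ("reset link",), 40),
]


def stub_agent(prompt: str) -> str:
    return "Refunds are accepted within 30 days of purchase."
```

Calling `evaluate_sample(stub_agent, EXAMPLES, sample_size=2)` returns the observed pass rate along with the lower and upper bounds of the interval. In a real setup the stub would be replaced by the actual agent call, and the interval tells you how much to trust a pass rate measured on a sample rather than on every output.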

What I Built at Surface Labs

At Surface, we evaluate content generation across 4 frontier models. The evaluation pipeline runs on every deploy and catches regressions before they ship.
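
As a rough illustration of what a deploy-time regression check can look like, here is a sketch that compares per-model pass rates against a stored baseline. It is an assumption-laden example, not Surface's actual code: the function names, the model labels in the usage note, and the 2-point tolerance are all hypothetical.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,  # hypothetical allowance for run-to-run noise
) -> dict[str, tuple[float, Optional[float]]]:
    """Return models whose pass rate dropped by more than `tolerance`.

    A model present in the baseline but missing from the current run is
    also reported, with None as its new rate.
    """
    regressions: dict[str, tuple[float, Optional[float]]] = {}
    for model, old_rate in baseline.items():
        new_rate = current.get(model)
        if new_rate is None or new_rate < old_rate - tolerance:
            regressions[model] = (old_rate, new_rate)
    return regressions


def deploy_exit_code(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Exit code for a CI step: 0 lets the deploy continue, 1 blocks it."""
    return 1 if find_regressions(baseline, current) else 0
```

For example, with a baseline of `{"model-a": 0.90, "model-b": 0.80}` and a current run of `{"model-a": 0.85, "model-b": 0.81}`, only `model-a` is flagged and the step returns a non-zero exit code. The tolerance is there because pass rates on non-deterministic outputs fluctuate between runs; choosing it is a judgment call that depends on sample size.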

Key insight: the model call is 10% of the system. The hard part is making evaluation fast enough that developers actually run it.

— Amisha

Filed under: AI Systems
