Evaluating AI Agents Like You Ship Software

How I built evaluation pipelines for non-deterministic systems, and why vibes-based testing doesn't scale.

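When you ship deterministic software, testing is straightforward: given input A, you expect output B. AI agents break this assumption: the same input can produce different outputs from one run to the next, so an exact-match assertion is no longer enough.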

The Problem with Vibes-Based Testing

Most teams building AI agents rely on what I call "vibes-based testing": you run the agent, look at the output, and decide whether it seems right. This works for a demo with a handful of prompts and breaks at the first production incident.

Building Real Evaluation Pipelines

The trick is treating AI evaluation like a software testing problem. You need:

A ground truth dataset: curated examples with known-correct outputs
Deterministic assertions: checks that don't require human review for every run
Statistical sampling: you can't evaluate every output, but you can build confidence intervals from a sample
Regression tracking: a model upgrade shouldn't silently break behavior that was working
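
To make the first three pieces concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the pipeline described later in this post: the example records, the `must_include` and `max_words` fields, and the `agent` callable are all hypothetical, and the Wilson score interval is just one reasonable choice for a confidence interval on a pass rate.

```python
import math
import random

# Hypothetical ground-truth examples: each pairs a prompt with properties
# that any acceptable output must satisfy.
GROUND_TRUTH = [
    {"prompt": "Summarize: the cat sat on the mat.", "must_include": ["cat"], "max_words": 20},
    {"prompt": "Summarize: the dog chased the ball.", "must_include": ["dog"], "max_words": 20},
]


def check_output(example, output):
    """Deterministic assertions: pass/fail without human review."""
    if len(output.split()) > example["max_words"]:
        return False
    lowered = output.lower()
    return all(term in lowered for term in example["must_include"])


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def evaluate(agent, examples, sample_size, seed=0):
    """Run the agent on a random sample and summarize the pass rate.

    `agent` is assumed to be any callable that maps a prompt string to an
    output string. The fixed seed keeps the sample reproducible across runs.
    """
    rng = random.Random(seed)
    sample = rng.sample(examples, min(sample_size, len(examples)))
    passes = sum(check_output(ex, agent(ex["prompt"])) for ex in sample)
    return passes, len(sample), wilson_interval(passes, len(sample))
```

Calling `evaluate(my_agent, GROUND_TRUTH, sample_size=100)` returns the number of passes, the sample size, and an interval for the true pass rate. The checks test properties of the output rather than exact strings, which is what lets them stay deterministic even when the agent is not.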

What I Built at Surface Labs

At Surface, we evaluate content generation across 4 frontier models. The evaluation pipeline runs on every deploy and catches regressions before they ship.
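
A regression check that runs on deploy can be sketched in a few lines of Python. This is a hypothetical illustration, not the Surface pipeline itself: the function name, the per-model pass-rate dictionaries, and the `tolerance` value are all assumptions made for the example.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Compare pass rates against a stored baseline.

    `baseline` and `current` are assumed to map a model name to its pass
    rate (0.0 to 1.0). A model regresses if its rate drops by more than
    `tolerance`, or if it is missing from the current run entirely.
    """
    regressions = {}
    for model, old_rate in baseline.items():
        new_rate = current.get(model)
        if new_rate is None or new_rate < old_rate - tolerance:
            regressions[model] = (old_rate, new_rate)
    return regressions


# Hypothetical numbers for illustration only.
baseline = {"model-a": 0.95, "model-b": 0.90}
current = {"model-a": 0.90, "model-b": 0.91}
regressions = find_regressions(baseline, current)
```

A deploy script could exit with a non-zero status whenever `regressions` is non-empty, which blocks the release until someone looks at the drop. The tolerance exists because sampled pass rates fluctuate between runs, so a zero-tolerance gate would fail on noise.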

Key insight: the model call is 10% of the system. The hard part is making evaluation fast enough that developers actually run it.

— Amisha

Filed under: AI Systems
