Harness Engineering for AI Systems

2026-04-22 · AI, testing, evaluation, engineering

TL;DR

A harness in AI engineering is the scaffolding that wraps a model or agent — providing inputs, capturing outputs, enforcing format contracts, and measuring quality at scale. Without a harness, AI systems cannot be tested, evaluated, or deployed reliably. Build a harness before you build anything that depends on model output in production.


What Is a Harness in AI Engineering?

In traditional software, a test harness runs code against known inputs and asserts expected outputs. In AI engineering, the outputs are non-deterministic, so the harness does something different: it scores outputs rather than asserting exact matches.

A minimal AI harness has four components:

  1. Dataset — a set of input prompts with optional reference answers
  2. Runner — calls the model or agent and collects raw outputs
  3. Evaluator — scores each output (exact match, LLM-as-judge, embedding similarity, or custom rubric)
  4. Reporter — aggregates scores into pass/fail metrics and surfaces regressions

Harnesses are used at three stages: during development (prompt iteration), before release (eval gates in CI), and in production (shadow scoring against live traffic).


Harness Types Compared

| Type | When to Use | Evaluator Method | Speed | Cost |
|------|-------------|------------------|-------|------|
| Unit harness | Single prompt or tool test | Exact match / regex | Fast | Low |
| Eval harness | Quality regression across a dataset | LLM-as-judge | Medium | Medium |
| Agent harness | Multi-step task completion | Task success / tool call trace | Slow | High |
| Shadow harness | Production traffic scoring | Async LLM judge | Async | Medium |
| A/B harness | Model or prompt comparison | Pairwise preference | Medium | Medium |

Recommendation: Start with a unit harness for each prompt template you ship. Add an eval harness once you have 20+ representative examples. Graduate to an agent harness only when your agent handles production traffic.


Choosing an Evaluation Method

| Scenario | Recommended Evaluator |
|----------|-----------------------|
| Output must match a fixed schema | JSON schema validation |
| Answer is factual and verifiable | Exact match or keyword check |
| Answer quality is subjective | LLM-as-judge (GPT-4o or Claude) |
| Code output must execute | Sandboxed execution + test suite |
| Agent must complete a multi-step task | Task success binary + step trace diff |
| Comparing two prompts | Pairwise LLM preference vote |
| Latency and cost matter | Embedding cosine similarity (fast, cheap) |
| Hallucination detection | Entailment model or citation check |
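The rule-based rows in this table need no model call at all. A sketch of the first two, using only the standard library (a full JSON Schema validator would be stricter than this structural check):

```python
import json
import re

def schema_check(output: str, required_keys: set[str]) -> bool:
    """Structural check: output is valid JSON containing the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def keyword_check(output: str, keywords: list[str]) -> bool:
    """Factual spot check: every expected keyword appears, case-insensitively."""
    return all(re.search(re.escape(k), output, re.IGNORECASE) for k in keywords)

print(schema_check('{"title": "RAG", "body": "..."}', {"title", "body"}))  # True
print(keyword_check("RAG combines Retrieval with generation", ["retrieval", "generation"]))  # True
```

Because these evaluators are deterministic and free, they make good unit-harness checks to run on every commit, reserving LLM-as-judge for the slower eval harness.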


Building a Minimal Eval Harness

A harness for a content generation pipeline (like this site's article generator) looks like this:

```python
from openai import OpenAI

# Dataset: each case pairs an input prompt with a natural-language pass criterion
DATASET = [
    {"input": "Explain RAG in one paragraph", "criteria": "mentions retrieval and generation"},
    {"input": "List 3 LLM providers", "criteria": "includes at least OpenAI, Anthropic, or Google"},
]

client = OpenAI()

def judge(output: str, criteria: str) -> bool:
    # Evaluator: ask a judge model for a binary verdict on the criterion
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Does this output satisfy: '{criteria}'?\nOutput: {output}\nReply YES or NO only."
        }]
    )
    return result.choices[0].message.content.strip().upper().startswith("YES")

scores = []
for case in DATASET:
    # Runner: generate the output under test
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": case["input"]}]
    )
    output = response.choices[0].message.content
    passed = judge(output, case["criteria"])
    scores.append(passed)
    print(f"{'PASS' if passed else 'FAIL'}: {case['input'][:50]}")

# Reporter: aggregate individual verdicts into a single pass count
print(f"\nScore: {sum(scores)}/{len(scores)}")
```

The evaluator is itself a model call — this is the LLM-as-judge pattern. It generalizes to any quality criterion you can express in natural language.


FAQ

Why not just use unit tests for AI outputs? Unit tests assert exact values. LLM outputs vary even with the same prompt — phrasing changes, word order shifts, equivalent facts are expressed differently. A harness scores the semantic quality of the output rather than its literal form, which is what actually matters.

How many test cases do I need? Start with 20–50 representative examples covering the main use cases and known failure modes. More cases improve statistical confidence but increase cost per eval run. Prioritize diversity over volume — 50 varied cases beat 500 near-duplicates.

What is LLM-as-judge and is it reliable? LLM-as-judge uses a strong model (GPT-4o, Claude Opus) to score the output of a weaker or different model. It correlates well with human judgement at 80–90% agreement on most tasks. It is not reliable for tasks requiring precise numerical verification or up-to-date factual claims — use rule-based checks for those.

How do I integrate a harness into CI? Run the eval harness as a step in your pipeline. Exit with code 1 if the score drops below a threshold (e.g., below 85%). Block the merge if the step fails. Store scores as artifacts to track trends over time.
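The gate itself can be a short script that compares the aggregate score to the threshold and sets the process exit code; the file format and 85% threshold below are illustrative:

```python
import json
import sys

THRESHOLD = 0.85  # block the merge below this pass rate

def gate(scores: list[bool], threshold: float = THRESHOLD) -> int:
    """Return the process exit code: 0 passes the CI step, 1 fails it."""
    rate = sum(scores) / len(scores)
    print(f"eval pass rate: {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1

if __name__ == "__main__":
    # In a real pipeline, scores come from the eval run's artifact file,
    # e.g. a JSON list of booleans: [true, true, false, ...]
    scores = json.load(open(sys.argv[1]))
    sys.exit(gate(scores))
```

Invoked as a pipeline step (`python gate.py scores.json`), a non-zero exit blocks the merge in any CI system that respects exit codes.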

Can a harness test agents, not just single prompts? Yes, but agent harnesses are more complex. You need to simulate tool responses, capture the full action trace, and define task-level success criteria rather than per-output criteria. Frameworks like LangSmith, Braintrust, and PromptFoo have built-in agent harness support.
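To make that concrete, here is a toy sketch of the agent-harness pattern: stubbed tools that record an action trace, plus a task-level success check. The agent and tool names are placeholders, not the API of any real framework:

```python
from typing import Callable

def stub_tools() -> tuple[dict[str, Callable[[str], str]], list[tuple[str, str]]]:
    """Simulated tools that append every call to a shared trace."""
    trace: list[tuple[str, str]] = []
    def make(name: str, fn: Callable[[str], str]) -> Callable[[str], str]:
        def wrapped(arg: str) -> str:
            trace.append((name, arg))  # capture the full action trace
            return fn(arg)
        return wrapped
    tools = {
        "search": make("search", lambda q: f"results for {q}"),
        "save": make("save", lambda text: "saved"),
    }
    return tools, trace

def toy_agent(task: str, tools: dict) -> str:
    """Placeholder agent: search, then save. A real agent would plan via an LLM."""
    found = tools["search"](task)
    return tools["save"](found)

tools, trace = stub_tools()
result = toy_agent("LLM eval pricing", tools)

# Task-level success: final state is correct AND the tool trace is sane
assert result == "saved"
assert [name for name, _ in trace] == ["search", "save"]
print("task success, trace:", trace)
```

The key difference from the eval harness above is that success is judged on the whole trace and final state, not on any single model output.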


Further Reading

To understand the models being tested, see Understanding Large Language Models. For the agent patterns that harnesses are designed to evaluate, see AI Agents Explained. If you are new to building with AI, Building AI-Powered Applications covers the API patterns you will want to wrap with a harness before shipping.