From Unit Tests to LLM-Evals: How We Automated AI Feature Quality Checks
You ship a feature built partially by an LLM. The unit tests pass, CI is green, and the deployment goes through without issues.
A few days later, a user shares a response where the assistant misrepresents the sprint scope, invents a deadline that does not exist in any ticket, and marks a risk as resolved when it is not.
From the system’s perspective, nothing is wrong. The routing logic behaves correctly, the output format is valid, and all assertions pass. Yet the feature fails in production in the way that matters most: it produces a confident but incorrect answer.
This is the gap that unit tests cannot close. They validate deterministic logic, but they do not evaluate meaning, consistency, or whether the model’s output is actually aligned with the intent of the task.
This guide draws on our experience testing AI-generated features in Campo and Enji Fleet. We focus on how to evaluate LLM outputs beyond unit tests: what to measure, how semantic similarity works (and fails), and how to build a CI-ready eval pipeline. We also cover core metrics — pass@1, drift, and grounding — and how adding an LLM judge helps catch failures that static checks miss.
Testing LLMs beyond unit tests: what you should measure instead
Unit tests are deterministic by design. You assert that add(2, 3) returns 5, and it either does or doesn't. The moment an LLM enters the call chain, that contract breaks.
Here's a concrete example. Say you have a function that summarizes a ticket and routes it to the right engineering queue. Your unit test mocks the LLM response and asserts that routing logic correctly sends priority: P1 tickets to the on-call queue. The test passes, and it should — the routing logic is fine.
But the actual summary the model generated looked like this:
Input ticket:
Title: "Login fails intermittently for enterprise accounts"
Description: "Started after the SSO library upgrade. Affects ~30% of sessions.
JWT validation errors in logs. Reproducible with SAML providers."
Unit test: PASS ✅ (P1 routing logic correct)
Model output:
"Intermittent UI issue on the login page affecting some users.
Likely frontend rendering problem. Low urgency."
Semantic eval: FAIL ❌ (similarity: 0.61, threshold: 0.80)
Grounding proxy: FAIL ❌ (grounding ratio: 0.41)

The model ignored the description, misclassified a backend SSO failure as a frontend rendering issue, and downgraded the severity. No unit test could catch any of this – because the unit test never saw what the model actually said.
This is the failure mode LLM evals are designed to surface. What you actually need to measure:
Semantic correctness – Did the output mean what it was supposed to mean, not just match a format? The example above is syntactically valid, correctly formatted, and completely wrong.
Response consistency – Given the same input, does the model produce outputs within an acceptable semantic range across runs? LLMs are stochastic. Some variance is expected. Uncontrolled variance at the feature level is a bug.
Groundedness – Are the claims in the output traceable to the provided context? This is hard to measure precisely with embeddings alone, but even a lightweight proxy catches the worst cases.
Semantic drift over time – When you update a prompt, does the output distribution shift in unexpected ways? A wording change that seems minor can substantially alter model behavior on edge cases.
None of these are measurable with assertEqual. You need a different layer.
LLM-evals 101: how semantic similarity actually works
The foundation of most LLM evaluation pipelines is embedding-based comparison. Both the expected output and the actual output are converted into vector representations, and then you measure their angular distance in that space.
Cosine similarity is the standard metric. A score of 1.0 means the vectors point in the same direction – semantically near-identical. Below around 0.75, you're typically looking at meaningful divergence in meaning.
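To make the metric concrete, here is a minimal sketch of cosine similarity on plain Python lists. The three-dimensional vectors are invented for illustration; real sentence embeddings have hundreds of dimensions, and in practice a library call does this work for you.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6))  # 1.0
print(round(orthogonal, 6))      # 0.0
```

Note that the second vector in the first pair is twice as long as the first and still scores 1.0: the metric looks only at direction, not magnitude.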
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(expected: str, actual: str) -> float:
    embeddings = model.encode([expected, actual], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

score = semantic_similarity(
    expected="Authentication failure in SSO integration, affecting enterprise accounts.",
    actual="Intermittent UI issue on the login page affecting some users."
)
print(score)  # → ~0.58 -- clear semantic failure

The honest limits of this approach. Embedding similarity is a good first filter, but it has real weaknesses:
- A response can score high by being semantically close to the expected output while still adding invented details not in the source.
- Conversely, a precise but narrow claim can score low if the context is long and multi-topic, because the sentence is a weak match to the whole-context embedding.
- Threshold values like 0.78 or 0.82 are not universal – they need to be calibrated per task type, per domain, and per model. What works for ticket summarization will be wrong for code review commentary.
This means embedding similarity alone isn't enough for a mature eval pipeline. It's a necessary starting point, not the full answer.
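One possible mitigation for the second weakness above (a narrow claim scoring low against a long, multi-topic context) is to score the claim against each chunk of the context and keep the best match. The sketch below is an assumption, not part of the pipeline described in this article: `max_chunk_similarity` and `token_overlap` are hypothetical names, and the token-overlap function is only a stand-in so the example runs without an embedding model.

```python
from typing import Callable

def max_chunk_similarity(
    context: str,
    sentence: str,
    similarity: Callable[[str, str], float],
    chunk_sep: str = "\n",
) -> float:
    """Score a sentence against each context chunk and keep the best match."""
    chunks = [c.strip() for c in context.split(chunk_sep) if c.strip()]
    if not chunks:
        return 0.0
    return max(similarity(chunk, sentence) for chunk in chunks)

def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity (Jaccard overlap of lowercase tokens), for the demo only."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

context = "JWT validation errors appear in logs\nThe marketing page was redesigned last week"
claim = "JWT validation errors in logs"
whole = token_overlap(context, claim)                       # claim vs. entire context
best = max_chunk_similarity(context, claim, token_overlap)  # claim vs. best chunk
print(best > whole)  # True
```

With a real embedding function passed in as `similarity`, the same structure applies; how to chunk (by line, paragraph, or fixed window) is a choice to calibrate per task.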
Grounding proxy. For checking whether model outputs are traceable to the provided context, we use a lightweight heuristic: compare each sentence in the response against the source context and flag sentences that fall below a calibrated similarity threshold. This is not true hallucination detection – it's a proxy that catches the worst cases and misses subtle fabrications.
def grounding_proxy(context: str, response: str, threshold: float = 0.78) -> dict:
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    results = []
    for sentence in sentences:
        score = semantic_similarity(context, sentence)
        results.append({
            "sentence": sentence,
            "grounded": score >= threshold,
            "score": round(score, 3)
        })
    grounded_ratio = sum(r["grounded"] for r in results) / len(results) if results else 0
    return {"grounded_ratio": round(grounded_ratio, 3), "details": results}

Call it what it is: a grounding proxy based on embedding similarity, not a reliable hallucination detector. It gives you a signal. It doesn't give you certainty.
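One detail of the proxy worth hedging: splitting on every "." also splits decimals such as 0.78, which turns one sentence into several fragments and skews the grounded ratio. A minimal sketch of a slightly safer splitter follows; `split_sentences` is a hypothetical helper, and it still mishandles abbreviations like "e.g." followed by a space.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split after ., ! or ? only when whitespace follows, so '0.78' stays intact."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

response = "Similarity was 0.78 at 14:30 UTC. No rollback was needed."
naive = [s.strip() for s in response.split(".") if s.strip()]
print(len(naive))                 # 3
print(split_sentences(response))  # two sentences, decimal preserved
```

For anything beyond short, well-formed summaries, a dedicated sentence tokenizer is likely the better choice.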
With that foundation in place, here's how we structured an actual eval pipeline around it – including the parts where this approach runs out of steam and needs a stronger layer on top.
Our production LLM-eval pipeline: code + config
The context: we're running two AI-native systems in parallel. Campo is the autonomous coding agent – it writes code, runs tests, self-recovers, and ships PRs without a human in the loop for each step. Enji Fleet deploys AI agents to live engineering projects – they pick up tickets, generate summaries, classify issues, and open PRs. Both systems are built around LLMs. Both have unit tests that pass. And both had the same problem: the tests covered the scaffolding, not the model output.
The two systems expose that problem from different angles. In Campo, the agent investigates the codebase before acting – building a map of what exists, what's relevant, and what might break. It maintains a causal chain of its own decisions so it can backtrack when something goes wrong. And it has a self-recovery loop: when a test fails or a process crashes, it escalates the failure to a higher-level planner that tries a different approach. This is all useful for the class of failures that produce observable errors. But when the agent builds something syntactically valid and functionally passing that still misunderstands the requirement, there's nothing in the loop to catch it. A semantic error never triggers a recovery. The agent finishes and moves on.
Enji Fleet has the same blind spot from the other direction: routing logic, format validation, and business rules are all testable. What the model actually said about a ticket or a PR is not. Both pipelines needed an external gate.
LLM-evals are that gate.
Here's the eval runner we use – honest about what it is: a demonstration-grade skeleton that shows the structure of a semantic eval pipeline, with clear markers for what a production version would need to add.
import yaml
from dataclasses import dataclass, field
from typing import Optional
from sentence_transformers import SentenceTransformer, util

@dataclass
class EvalCase:
    name: str
    prompt: str
    expected: str
    context: Optional[str] = None
    min_similarity: float = 0.80       # calibrate per task, not universal
    check_grounding: bool = False
    grounding_threshold: float = 0.78  # calibrate per domain
    critical: bool = False             # critical cases get stricter drift tracking
    tags: list = field(default_factory=list)

class LLMEvalRunner:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.embed_model = SentenceTransformer(model_name)
        self.results = []

    def similarity(self, a: str, b: str) -> float:
        emb = self.embed_model.encode([a, b], convert_to_tensor=True)
        return float(util.cos_sim(emb[0], emb[1]))

    def grounding_proxy(self, context: str, response: str, threshold: float) -> float:
        sentences = [s.strip() for s in response.split(".") if s.strip()]
        if not sentences:
            return 0.0
        return sum(
            1 for s in sentences
            if self.similarity(context, s) >= threshold
        ) / len(sentences)

    def run(self, case: EvalCase, actual_output: str) -> dict:
        sim = self.similarity(case.expected, actual_output)
        passed = sim >= case.min_similarity
        result = {
            "name": case.name,
            "similarity": round(sim, 3),
            "threshold": case.min_similarity,
            "passed": passed,
            "critical": case.critical,
        }
        if case.check_grounding and case.context:
            gr = self.grounding_proxy(case.context, actual_output, case.grounding_threshold)
            result["grounding_proxy_ratio"] = round(gr, 3)
            # grounding failure overrides similarity pass
            if gr < 0.85:
                result["passed"] = False
                result["grounding_fail"] = True
        self.results.append(result)
        return result

    def summary(self) -> dict:
        total = len(self.results)
        passed = sum(1 for r in self.results if r["passed"])
        critical_failed = [r for r in self.results if not r["passed"] and r.get("critical")]
        return {
            "total": total,
            "passed": passed,
            "failed": total - passed,
            "pass_at_1_proxy": round(passed / total, 3) if total else 0,
            "critical_failures": len(critical_failed),
            "critical_failure_names": [r["name"] for r in critical_failed],
        }

What this is missing for true production use: error handling and retry policy for the inference API; prompt/dataset/model versioning; raw output storage and judge traces; parallelism and batch execution; cache invalidation strategy; rubric/judge-based scoring (see next section); schema validation on YAML; and separation between offline eval runs and online monitoring. This runner is a structural skeleton – the right shape, not the finished building.
The YAML config that drives it:
eval_suite:
  name: "enji-fleet-feature-evals"
  embedding_model: "all-MiniLM-L6-v2"
  default_threshold: 0.80  # starting point; calibrate per task after baseline runs
  cases:
    - name: "ticket_summary_accuracy"
      prompt: "Summarize this ticket for routing: {ticket_text}"
      expected: "Backend SSO authentication failure affecting enterprise accounts, P1."
      min_similarity: 0.82
      check_grounding: true  # the runner applies this only when a context field is also set
      critical: true
      tags: ["ticket-router"]
    - name: "pr_risk_extraction"
      prompt: "List the key risks in this PR for an engineering lead: {diff}"
      expected: "Auth middleware modified without test coverage update. Potential regression on token refresh."
      min_similarity: 0.80
      check_grounding: true
      critical: true
      tags: ["pr-summary"]
    - name: "deployment_outcome_report"
      prompt: "Describe the outcome of this deployment event: {event_log}"
      expected: "Deployment completed. No incidents. Rollback not required."
      min_similarity: 0.78
      tags: ["deployment"]

The metrics that matter: pass@1, semantic drift, hallucination rate
A mature LLM-eval setup tracks three categories of metrics separately. Mixing them makes it hard to know whether you have a quality problem, a stability problem, or an operational problem.
Quality metrics
pass@1 (single-run proxy) – Did the model produce an acceptable output on the first attempt, without retries? In the runner above, this is a single-run measure on a fixed eval set with controlled model parameters. For stochastic tasks, true pass@1 requires multiple runs and averaging. Our runner approximates it at a fixed temperature – useful for tracking trends, not a statistically rigorous guarantee.
Target for Enji Fleet features before staging: pass@1 ≥ 0.90.
Rubric score – A score from an LLM judge against a task-specific rubric (see next section). More expensive than embedding similarity, but catches failures that embedding comparison misses.
Grounding proxy ratio – Fraction of response sentences traceable to source context via embedding similarity. This is a weak signal, not a hallucination detector. Treat it as a flag for manual review, not a proof of factuality.
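The multi-run averaging mentioned for pass@1 above can be sketched as follows. This is a minimal illustration under stated assumptions: `pass_at_1` is a hypothetical helper, the pass/fail flags are made up, and each case is assumed to be run the same number of times at the same temperature.

```python
def pass_at_1(run_results: dict[str, list[bool]]) -> float:
    """Average per-case pass fraction across repeated runs.

    run_results maps a case name to one pass/fail flag per run.
    """
    if not run_results:
        return 0.0
    per_case = []
    for name, flags in run_results.items():
        if not flags:
            raise ValueError(f"case {name!r} has no runs")
        per_case.append(sum(flags) / len(flags))
    return sum(per_case) / len(per_case)

# Hypothetical results: 4 runs per case at the same temperature.
results = {
    "ticket_summary_accuracy": [True, True, True, False],
    "pr_risk_extraction": [True, True, True, True],
}
score = pass_at_1(results)
print(score)          # 0.875
print(score >= 0.90)  # False
```

Averaging per case first keeps a case with many runs from dominating the suite-level number.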
Stability metrics
Semantic drift – Change in output distribution between two prompt versions, measured per-case rather than as a global average. A global average can mask regressions on specific subsets.
def measure_drift(
    runner: LLMEvalRunner,
    cases: list[EvalCase],
    outputs_v1: list[str],
    outputs_v2: list[str]
) -> dict:
    per_case = []
    for case, o1, o2 in zip(cases, outputs_v1, outputs_v2):
        s1 = runner.similarity(case.expected, o1)
        s2 = runner.similarity(case.expected, o2)
        per_case.append({
            "name": case.name,
            "critical": case.critical,
            "delta": round(s2 - s1, 4),
            "regression": (s2 - s1) < -0.05,  # similarity dropped by more than 0.05
        })
    regressions = [c for c in per_case if c["regression"]]
    critical_regressions = [c for c in regressions if c["critical"]]
    return {
        "per_case": per_case,
        "regression_count": len(regressions),
        "critical_regressions": len(critical_regressions),
        "critical_regression_names": [c["name"] for c in critical_regressions],
    }

A per-case drop of more than 0.05 in similarity triggers a review. Any regression on a critical: true case is a hard block, regardless of average.
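As a sketch of how that policy could be turned into a CI decision, the helper below maps a drift report to an exit code. `drift_gate` is a hypothetical name, and treating non-critical regressions as a warning with exit code 0 is an assumption about how "triggers a review" might be implemented.

```python
def drift_gate(drift_report: dict) -> tuple[int, str]:
    """Map a drift report to (exit_code, message).

    Blocks on critical regressions; flags other regressions for review.
    """
    critical = drift_report.get("critical_regression_names", [])
    if critical:
        return 1, "BLOCK: critical regressions in " + ", ".join(critical)
    count = drift_report.get("regression_count", 0)
    if count:
        return 0, f"REVIEW: {count} non-critical regression(s)"
    return 0, "OK: no regressions"

# Hand-written report in the shape returned by measure_drift.
report = {
    "per_case": [],
    "regression_count": 2,
    "critical_regressions": 1,
    "critical_regression_names": ["ticket_summary_accuracy"],
}
code, message = drift_gate(report)
print(code, message)
```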
Variance across runs – For non-deterministic outputs, how much do scores vary at the same prompt and input? High variance on a feature that should be stable is a signal to add stricter output constraints.
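A minimal sketch of how run-to-run variance might be summarized for one case. `score_spread` is a hypothetical helper and the scores are invented; what counts as "too much" spread is something to calibrate per feature.

```python
from statistics import mean, pstdev

def score_spread(scores: list[float]) -> dict:
    """Summarize similarity scores from repeated runs of one case."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "mean": round(mean(scores), 3),
        "stdev": round(pstdev(scores), 3),
        "range": round(max(scores) - min(scores), 3),
    }

# Hypothetical similarity scores from five runs of the same prompt and input.
stable = score_spread([0.84, 0.85, 0.83, 0.84, 0.85])
unstable = score_spread([0.91, 0.62, 0.88, 0.70, 0.85])
print(stable["range"], unstable["range"])
```

The two series have means that are not far apart, which is why the mean alone is a poor stability signal; the range and standard deviation separate them clearly.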
Operational metrics
Cost per eval run – Track this. On a suite of 80+ cases, uncached eval runs against a hosted model add up quickly. With content-hash caching (prompt + model version + input → cached output), you can reduce API calls substantially on stable features, though exact numbers depend on your eval frequency, model pricing, and cache hit rate.
Cache hit rate – Proportion of eval cases served from cache vs. requiring a live model call. Low cache hit rate on a stable eval suite usually means your cache key is too granular or your inputs are unnecessarily variable.
Eval latency – Full suite runtime. If it's too slow, engineers will skip it or only run it pre-release, which is too late.
Quality metrics bring up the elephant in the room: embedding similarity and the grounding proxy are fast and cheap, but they don't actually understand what the output says. For cases where that matters – where an engineering lead is going to read and act on the output – you need something that does.
LLM-as-a-Judge: adding a rubric layer
Embedding similarity is fast and cheap but blind to reasoning quality, factual accuracy beyond surface semantics, and task-specific criteria. For high-stakes eval cases – especially anything that affects what an engineering lead sees or acts on – you need a second layer: an LLM judge evaluating against an explicit rubric.
import json

import anthropic

def llm_judge(
    task_description: str,
    rubric: str,
    actual_output: str,
    context: str = ""
) -> dict:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = f"""You are an evaluator for AI-generated engineering content.
Task: {task_description}
{"Context provided to the model: " + context if context else ""}
Model output: {actual_output}
Evaluate against this rubric:
{rubric}
Respond ONLY with valid JSON in this exact format:
{{
"score": <integer 1-5>,
"passed": <true if score >= 4, false otherwise>,
"reasoning": "<one sentence>",
"critical_issues": ["<issue if any>"]
}}"""
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    # Raises json.JSONDecodeError if the judge returns anything other than bare JSON.
    return json.loads(message.content[0].text)

# Example: PR risk extraction rubric
rubric = """
Score 5: Identifies all major risks with correct severity. No invented risks.
Score 4: Identifies most risks. Minor omissions. No hallucination.
Score 3: Partially correct. Missing 1-2 significant risks OR minor hallucination.
Score 2: Major omissions or clear hallucination of risk that doesn't exist.
Score 1: Fundamentally wrong or mostly hallucinated.
"""

This is more expensive – one judge call per eval case – so we run rubric evals selectively: always on critical: true cases, and on the full suite only nightly or pre-release. For fast CI feedback, embedding similarity runs on every PR; the judge layer runs on the cases that actually changed.
Here's a concrete example of how the two layers worked together – and what would have shipped without them.
Case study: the ticket router regression we almost shipped
When upgrading the underlying model for Enji Fleet's ticket classification feature, unit tests stayed green. The routing logic was unchanged. Our embedding-based eval suite caught a meaningful drop in semantic similarity on one specific case category: tickets where the title described one thing and the description described something different.
The model was consistently prioritizing title over description, which meant about one in eight ambiguous tickets got misrouted in testing. We don't have instrumentation on what that rate would have been in production – we caught it before it got there.
Three iterations:
Round 1: Added six new eval cases specifically covering title/description conflicts. pass@1 baseline: 0.81.
Round 2: Updated the prompt to explicitly weight description over title when the two conflict. Re-ran embedding evals: pass@1 → 0.89. Still below threshold. Added rubric judge evaluation for the six new critical cases.
Round 3: Refined the conflict-resolution instruction based on judge feedback. Embedding pass@1 → 0.93. Judge scores: 4.6/5 average on the critical set. Grounding proxy ratio: improved from 0.71 to 0.89.
The feature was cleared for staging. The eval suite for the ticket router now runs on every PR that touches src/features/ticket_router/ or prompts/router/ in CI, with a pass@1 ≥ 0.90 gate.
Beyond the ticket router, we also run evals on PR risk extraction and sprint health digest generation – both features where a senior engineer sees and acts on the output directly. For these, the rubric judge runs on every eval, not just nightly, because the cost of a semantic failure is higher than the cost of an extra API call.
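Running the judge on every eval makes its reply format a point of failure: the llm_judge function shown earlier passes the raw reply straight to json.loads. A minimal sketch of a more defensive parser follows. `parse_judge_response` is a hypothetical helper, and recomputing `passed` from the score (rather than trusting the model's own flag) is an assumption about what a stricter setup might do.

```python
import json
import re

def parse_judge_response(raw: str) -> dict:
    """Extract and validate the judge's JSON verdict from raw model text."""
    # Take the outermost {...} span so surrounding prose is ignored.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in judge response")
    verdict = json.loads(match.group(0))
    score = verdict.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    # Recompute 'passed' from the score instead of trusting the model's flag.
    verdict["passed"] = score >= 4
    verdict.setdefault("critical_issues", [])
    return verdict

# Made-up judge reply: extra prose around the JSON, and an inconsistent 'passed' flag.
raw = (
    "Sure, here is the verdict:\n"
    '{"score": 3, "passed": true, "reasoning": "Missed one risk.", '
    '"critical_issues": ["token refresh risk omitted"]}'
)
verdict = parse_judge_response(raw)
print(verdict["score"], verdict["passed"])  # 3 False
```

A reply that fails validation is probably best treated as a failed eval case rather than silently skipped.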
Running evals across multiple features and having them be part of CI raises the practical question of cost and latency. Here's how to keep both under control without gutting coverage.
Scaling LLM-evals: cost optimization and caching strategies
Two strategies that reduce eval costs without sacrificing coverage:
Cache by content hash. If the prompt template, model version, and input haven't changed, recompute nothing – return the cached output. This is especially effective on stable features between releases.
import hashlib, json

def eval_cache_key(model: str, prompt_template: str, input_data: dict) -> str:
    payload = json.dumps(
        {"model": model, "prompt": prompt_template, "input": input_data},
        sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()

Cache invalidation matters: when you update a prompt template or bump the model version, those keys should naturally miss. Make model version and prompt hash part of the key, not just the input.
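To show how such a key might be used, here is a small in-memory cache that wraps a model call and tracks its own hit rate. This is a sketch, not the production cache: `EvalCache` and `fake_llm` are hypothetical, the key function is repeated so the block runs on its own, and a real setup would persist entries between CI runs.

```python
import hashlib
import json
from typing import Callable

def eval_cache_key(model: str, prompt_template: str, input_data: dict) -> str:
    payload = json.dumps(
        {"model": model, "prompt": prompt_template, "input": input_data},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

class EvalCache:
    """In-memory cache of model outputs that also tracks its hit rate."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, model: str, prompt_template: str, input_data: dict,
                    call_llm: Callable[[str], str]) -> str:
        key = eval_cache_key(model, prompt_template, input_data)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        output = call_llm(prompt_template.format(**input_data))
        self._store[key] = output
        return output

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def fake_llm(prompt: str) -> str:
    return "stub summary"  # hypothetical stand-in for a real model call

cache = EvalCache()
template = "Summarize this ticket for routing: {ticket_text}"
data = {"ticket_text": "Login fails intermittently"}
cache.get_or_call("model-v1", template, data, fake_llm)  # miss
cache.get_or_call("model-v1", template, data, fake_llm)  # hit
cache.get_or_call("model-v2", template, data, fake_llm)  # miss: model version changed
print(cache.hit_rate)
```

The third call misses even though the input is identical, because the model name is part of the key.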
Run selectively in CI. Tag each eval case with the source paths it depends on. On a PR, only run the cases whose trigger_paths overlap with the changed files. Full suite runs happen nightly and on release branches.
cases:
  - name: "pr_risk_extraction"
    tags: ["pr-summary", "core"]
    trigger_paths:
      - "src/features/pr_summary/**"
      - "prompts/pr_review/**"

This keeps per-PR eval runtime manageable, and reserves full-suite runs for the moments that actually need them. The right split between fast feedback and thorough coverage depends on your team's release cadence and how many features share eval cases.
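The selection step can be sketched with the standard library alone. `select_cases` is a hypothetical helper, and the list of changed files is assumed to come from something like `git diff --name-only` against the target branch. Note that `fnmatchcase` lets `*` match across `/`, so `**` behaves like a plain wildcard here rather than a true recursive glob.

```python
from fnmatch import fnmatchcase

def select_cases(cases: list[dict], changed_files: list[str]) -> list[str]:
    """Return names of eval cases whose trigger_paths match any changed file."""
    selected = []
    for case in cases:
        patterns = case.get("trigger_paths", [])
        if any(fnmatchcase(path, pat) for path in changed_files for pat in patterns):
            selected.append(case["name"])
    return selected

cases = [
    {"name": "pr_risk_extraction",
     "trigger_paths": ["src/features/pr_summary/**", "prompts/pr_review/**"]},
    {"name": "ticket_summary_accuracy",
     "trigger_paths": ["src/features/ticket_router/**", "prompts/router/**"]},
]
changed = ["prompts/pr_review/system.txt", "README.md"]
print(select_cases(cases, changed))  # ['pr_risk_extraction']
```

Cases without trigger_paths are never selected by this sketch; whether they should always run instead is a policy choice.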
If you've read this far and want to just start, here's everything in one place.
The complete config: copy-paste your first LLM-eval suite
Here's the minimal working setup: an embedding eval runner, config-driven and CI-gated. The rubric judge from the earlier section is not wired in here; add it for your critical cases. Plug in your LLM call and calibrate thresholds against your own baseline before treating any number as a gate.
llm_eval.py:
import sys

import yaml
from sentence_transformers import SentenceTransformer, util

class EvalRunner:
    def __init__(self, config_path: str):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        self.model = SentenceTransformer(
            self.config["eval_suite"].get("embedding_model", "all-MiniLM-L6-v2")
        )
        self.results = []

    def similarity(self, a: str, b: str) -> float:
        emb = self.model.encode([a, b], convert_to_tensor=True)
        return float(util.cos_sim(emb[0], emb[1]))

    def grounding_proxy(self, context: str, response: str, threshold: float = 0.78) -> float:
        sentences = [s.strip() for s in response.split(".") if s.strip()]
        if not sentences:
            return 0.0
        return sum(1 for s in sentences if self.similarity(context, s) >= threshold) / len(sentences)

    def run_case(self, case: dict, actual_output: str) -> dict:
        default_threshold = self.config["eval_suite"].get("default_threshold", 0.80)
        threshold = case.get("min_similarity", default_threshold)
        sim = self.similarity(case["expected"], actual_output)
        passed = sim >= threshold
        result = {
            "name": case["name"],
            "similarity": round(sim, 3),
            "passed": passed,
            "critical": case.get("critical", False),
        }
        # Grounding is checked only when the case supplies a context.
        if case.get("check_grounding") and case.get("context"):
            gr = self.grounding_proxy(
                case["context"], actual_output,
                case.get("grounding_threshold", 0.78)
            )
            result["grounding_proxy_ratio"] = round(gr, 3)
            if gr < 0.85:
                result["passed"] = False
                result["grounding_fail"] = True
        self.results.append(result)
        return result

    def summary(self) -> dict:
        n = len(self.results)
        p = sum(1 for r in self.results if r["passed"])
        critical_failed = [r["name"] for r in self.results if not r["passed"] and r.get("critical")]
        return {
            "total": n, "passed": p, "failed": n - p,
            "pass_at_1_proxy": round(p / n, 3) if n else 0,
            "critical_failures": critical_failed,
        }

if __name__ == "__main__":
    runner = EvalRunner("llm-evals.yaml")

    # Replace this with your actual LLM call
    def call_llm(prompt: str) -> str:
        return "stub -- replace with real model call"

    for case in runner.config["eval_suite"]["cases"]:
        # The {input} placeholder in each prompt is not filled in here;
        # format the prompt with your real input before calling the model.
        output = call_llm(case["prompt"])
        result = runner.run_case(case, output)
        status = "✅" if result["passed"] else "❌"
        extras = []
        if "grounding_proxy_ratio" in result:
            extras.append(f"grounding: {result['grounding_proxy_ratio']}")
        print(f"{status} {result['name']} -- similarity: {result['similarity']}" +
              (f" | {', '.join(extras)}" if extras else ""))

    summary = runner.summary()
    print(f"\npass@1 proxy: {summary['pass_at_1_proxy']} ({summary['passed']}/{summary['total']})")
    if summary["critical_failures"]:
        print(f"CRITICAL FAILURES: {summary['critical_failures']}")
    sys.exit(0 if summary["failed"] == 0 else 1)

llm-evals.yaml:
eval_suite:
  name: "my-first-llm-evals"
  embedding_model: "all-MiniLM-L6-v2"
  default_threshold: 0.80  # starting point -- run baseline first, then calibrate
  cases:
    - name: "summary_accuracy"
      prompt: "Summarize for routing: {input}"
      expected: "Backend authentication failure affecting enterprise logins."
      min_similarity: 0.80
      critical: true
    - name: "grounded_status_report"
      prompt: "Based only on this context, report the outcome: {input}"
      context: "Deployment completed at 14:30 UTC. Zero errors logged. No rollback."
      expected: "Deployment finished at 14:30 UTC without errors."
      min_similarity: 0.82
      check_grounding: true
      grounding_threshold: 0.78

pip install sentence-transformers pyyaml anthropic
python llm_eval.py
# exits 1 if any case fails -- wire directly into CI

Unit tests remain the right tool for deterministic logic. The moment an LLM enters the call chain, you need a second layer that evaluates meaning, consistency, and groundedness – not just structure and format.
The pipeline above is a starting skeleton, not a finished framework. What makes it useful is the structure it demonstrates: the separation between quality, stability, and operational metrics; the grounding proxy as a weak signal you treat honestly; the judge layer for cases where the stakes are high enough to justify it; and the CI gate that makes evals part of the shipping process rather than an afterthought.
Calibrate your thresholds against your own baseline data. What works for ticket classification will be wrong for code review commentary. Run a baseline on known-good outputs first, see where your scores cluster, then set your thresholds just below that cluster – conservative enough to catch real failures, permissive enough not to block on normal model variance.
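One way to turn "just below the cluster" into a number is to place the threshold a few standard deviations under the baseline mean. The sketch below is one possible rule, not the calibration method used in this article: `calibrate_threshold`, the default `k = 2.0`, the `floor`, and the baseline scores are all assumptions for illustration.

```python
from statistics import mean, pstdev

def calibrate_threshold(baseline_scores: list[float], k: float = 2.0, floor: float = 0.5) -> float:
    """Place the threshold k standard deviations below the baseline mean, never under floor."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline scores")
    threshold = mean(baseline_scores) - k * pstdev(baseline_scores)
    return round(max(threshold, floor), 3)

# Hypothetical similarity scores of known-good outputs for one task.
baseline = [0.88, 0.90, 0.86, 0.91, 0.89, 0.87]
threshold = calibrate_threshold(baseline)
print(threshold)
print(all(score >= threshold for score in baseline))  # True
```

A larger k makes the gate more permissive toward normal model variance; a smaller k makes it stricter. A low percentile of the baseline scores is a reasonable alternative when the scores are skewed.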
If you're using Enji Fleet to ship AI-generated features, the eval config maps directly onto Fleet's task output format. Ask us about the integration.
