Why can't unit tests catch LLM output failures?

Unit tests validate deterministic logic against mocked outputs — they assert routing behavior, format validity, and structural rules. They never see what the model actually generates. An LLM can produce a syntactically valid, correctly formatted response that misclassifies severity, ignores critical context, or fabricates details not present in the source. None of these failures trigger a unit test assertion. LLM evals are a separate layer that evaluates meaning, consistency, and groundedness — the dimensions that unit tests cannot reach.

What is semantic similarity in LLM evaluation and what are its limits?

Semantic similarity compares the vector representations of an expected and an actual LLM output using cosine similarity. A score near 1.0 indicates near-identical meaning; below ~0.75 indicates meaningful divergence. Its limits: a response can score high by being semantically close to the expected output while still adding hallucinated details. Conversely, a precise but narrow claim can score low against a long multi-topic context. Thresholds like 0.80 are not universal — they must be calibrated per task type and domain against baseline runs on known-good outputs.

What is a grounding proxy and how is it different from hallucination detection?

A grounding proxy checks whether each sentence in a model response is traceable to the provided source context by comparing sentence embeddings against the context embedding and flagging sentences below a calibrated threshold. It gives a grounded_ratio signal — the fraction of sentences with a similarity score above the threshold. It is not a reliable hallucination detector: it catches obvious cases where the response has no connection to the source, but misses subtle fabrications that are semantically adjacent to real context. Treat it as a flag for manual review, not a proof of factuality.

What is LLM-as-a-Judge and when should you use it?

LLM-as-a-Judge uses a second language model to evaluate the output of the first against an explicit rubric, typically a 1–5 scoring scale that defines what constitutes a correct, partially correct, or hallucinated response for that specific task. It catches reasoning failures, severity misclassifications, and task-specific quality issues that embedding similarity misses. Because it costs one API call per eval case, use it selectively: on every PR for cases marked critical, and on the full eval suite nightly or pre-release.

What is semantic drift in LLM outputs and how do you measure it?

Semantic drift is the change in a model's output distribution between two prompt versions, measured per eval case rather than as a global average. A global average can mask regressions on specific subsets. To measure it, compute the similarity score of each case's output against the expected output for both prompt versions, then calculate the delta. A drop greater than 0.05 on any case triggers a review; any regression on a critical case is a hard block regardless of the average.

Anton Kozlov

System Developer


Created: Apr 27, 2026

From Unit Tests to LLM-Evals: How We Automated AI Feature Quality Checks


You ship a feature built partially by an LLM. The unit tests pass, CI is green, and the deployment goes through without issues.

A few days later, a user shares a response where the assistant misrepresents the sprint scope, invents a deadline that does not exist in any ticket, and marks a risk as resolved when it is not.

From the system’s perspective, nothing is wrong. The routing logic behaves correctly, the output format is valid, and all assertions pass. Yet the feature fails in production in the way that matters most: it produces a confident but incorrect answer.

This is the gap that unit tests cannot close. They validate deterministic logic, but they do not evaluate meaning, consistency, or whether the model’s output is actually aligned with the intent of the task.

This guide draws on our experience testing AI-generated features in Campo and Enji Fleet. We focus on how to evaluate LLM outputs beyond unit tests: what to measure, how semantic similarity works (and fails), and how to build a CI-ready eval pipeline. We also cover core metrics — pass@1, drift, and grounding — and how adding an LLM judge helps catch failures that static checks miss.

Testing LLMs beyond unit tests: what you should measure instead

Unit tests are deterministic by design. You assert that add(2, 3) returns 5, and it either does or doesn't. The moment an LLM enters the call chain, that contract breaks.

Here's a concrete example. Say you have a function that summarizes a ticket and routes it to the right engineering queue. Your unit test mocks the LLM response and asserts that routing logic correctly sends priority: P1 tickets to the on-call queue. The test passes, and it should — the routing logic is fine.

But the actual summary the model generated looked like this:

Input ticket:
  Title: "Login fails intermittently for enterprise accounts"
  Description: "Started after the SSO library upgrade. Affects ~30% of sessions.
                JWT validation errors in logs. Reproducible with SAML providers."

Unit test: PASS ✅  (P1 routing logic correct)

Model output:
  "Intermittent UI issue on the login page affecting some users.
   Likely frontend rendering problem. Low urgency."

Semantic eval: FAIL ❌  (similarity: 0.61, threshold: 0.80)
Grounding proxy: FAIL ❌  (grounding ratio: 0.41)

The model ignored the description, misclassified a backend SSO failure as a frontend rendering issue, and downgraded the severity. No unit test could catch any of this – because the unit test never saw what the model actually said.

This is the failure mode LLM evals are designed to surface. What you actually need to measure:

Semantic correctness – Did the output mean what it was supposed to mean, not just match a format? The example above is syntactically valid, correctly formatted, and completely wrong.

Response consistency – Given the same input, does the model produce outputs within an acceptable semantic range across runs? LLMs are stochastic. Some variance is expected. Uncontrolled variance at the feature level is a bug.

Groundedness – Are the claims in the output traceable to the provided context? This is hard to measure precisely with embeddings alone, but even a lightweight proxy catches the worst cases.

Semantic drift over time – When you update a prompt, does the output distribution shift in unexpected ways? A wording change that seems minor can substantially alter model behavior on edge cases.

None of these are measurable with assertEqual. You need a different layer.

LLM-evals 101: how semantic similarity actually works

The foundation of most LLM evaluation pipelines is embedding-based comparison. Both the expected output and the actual output are converted into vector representations, and then you measure their angular distance in that space.

Cosine similarity is the standard metric. A score of 1.0 means the vectors point in the same direction – semantically near-identical. Below around 0.75, you're typically looking at meaningful divergence in meaning.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(expected: str, actual: str) -> float:
    embeddings = model.encode([expected, actual], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

score = semantic_similarity(
    expected="Authentication failure in SSO integration, affecting enterprise accounts.",
    actual="Intermittent UI issue on the login page affecting some users."
)
print(score)  # → ~0.58  -- clear semantic failure
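
For intuition about what util.cos_sim computes, here is a minimal sketch of cosine similarity worked by hand with the standard library only. The three-dimensional vectors are made-up values for illustration, not real embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors), and the zero-vector convention is an assumption of this sketch.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch; zero vectors have no direction
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, for illustration only)
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 3), round(orthogonal, 3))  # 1.0 0.0
```

The first pair differs only in length, so the score is 1.0: cosine similarity ignores magnitude and compares direction only. That is why a long and a short text can still score as near-identical.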

The honest limits of this approach. Embedding similarity is a good first filter, but it has real weaknesses:

A response can score high by being semantically close to the expected output while still adding invented details not in the source.
Conversely, a precise but narrow claim can score low if the context is long and multi-topic, because the sentence is a weak match to the whole-context embedding.
Threshold values like 0.78 or 0.82 are not universal – they need to be calibrated per task type, per domain, and per model. What works for ticket summarization will be wrong for code review commentary.

This means embedding similarity alone isn't enough for a mature eval pipeline. It's a necessary starting point, not the full answer.

Grounding proxy. For checking whether model outputs are traceable to the provided context, we use a lightweight heuristic: compare each sentence in the response against the source context and flag sentences that fall below a calibrated similarity threshold. This is not true hallucination detection – it's a proxy that catches the worst cases and misses subtle fabrications.

def grounding_proxy(context: str, response: str, threshold: float = 0.78) -> dict:
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    results = []
    for sentence in sentences:
        score = semantic_similarity(context, sentence)
        results.append({
            "sentence": sentence,
            "grounded": score >= threshold,
            "score": round(score, 3)
        })
    grounded_ratio = sum(r["grounded"] for r in results) / len(results) if results else 0
    return {"grounded_ratio": round(grounded_ratio, 3), "details": results}

Call it what it is: a grounding proxy based on embedding similarity, not a reliable hallucination detector. It gives you a signal. It doesn't give you certainty.
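
One practical caveat with the proxy as written: splitting on "." also splits decimals and version numbers, which produces fragments that then score poorly against the context. A possible alternative, sketched below under the assumption that sentences end with ".", "!" or "?" followed by whitespace, is a small regex splitter. The helper name split_sentences and the sample response are illustrative, not part of the original pipeline.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' only when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Hypothetical model response containing a decimal and a version number
response = "Error rate rose to 0.41 after v2.1 shipped. Rollback started at 14:30 UTC."

naive = [s.strip() for s in response.split(".") if s.strip()]
print(len(naive))                  # 4 fragments
print(split_sentences(response))   # 2 sentences
```

This still splits after abbreviations such as "e.g. " and is only a heuristic; a proper sentence tokenizer would be the more robust choice if fragmenting becomes a source of false grounding failures.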

With that foundation in place, here's how we structured an actual eval pipeline around it – including the parts where this approach runs out of steam and needs a stronger layer on top.

Our production LLM-eval pipeline: code + config

The context: we're running two AI-native systems in parallel. Campo is the autonomous coding agent – it writes code, runs tests, self-recovers, and ships PRs without a human in the loop for each step. Enji Fleet deploys AI agents to live engineering projects – they pick up tickets, generate summaries, classify issues, and open PRs. Both systems are built around LLMs. Both have unit tests that pass. And both had the same problem: the tests covered the scaffolding, not the model output.

Both Campo and Enji Fleet gave us the same problem from two angles. In Campo, the agent investigates the codebase before acting – building a map of what exists, what's relevant, and what might break. It maintains a causal chain of its own decisions so it can backtrack when something goes wrong. And it has a self-recovery loop: when a test fails or a process crashes, it escalates the failure to a higher-level planner that tries a different approach. This is all useful for the class of failures that produce observable errors. But when the agent builds something syntactically valid and functionally passing that still misunderstands the requirement, there's nothing in the loop to catch it. A semantic error never triggers a recovery. The agent finishes and moves on.

Enji Fleet has the same blind spot from the other direction: routing logic, format validation, and business rules are all testable. What the model actually said about a ticket or a PR is not. Both pipelines needed an external gate.

LLM-evals are that gate.

Here's the eval runner we use. To be honest about what it is: a demonstration-grade skeleton that shows the structure of a semantic eval pipeline, with clear markers for what a production version would need to add.

import yaml
from dataclasses import dataclass, field
from typing import Optional
from sentence_transformers import SentenceTransformer, util

@dataclass
class EvalCase:
    name: str
    prompt: str
    expected: str
    context: Optional[str] = None
    min_similarity: float = 0.80       # calibrate per task, not universal
    check_grounding: bool = False
    grounding_threshold: float = 0.78  # calibrate per domain
    critical: bool = False             # critical cases get stricter drift tracking
    tags: list = field(default_factory=list)

class LLMEvalRunner:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.embed_model = SentenceTransformer(model_name)
        self.results = []

    def similarity(self, a: str, b: str) -> float:
        emb = self.embed_model.encode([a, b], convert_to_tensor=True)
        return float(util.cos_sim(emb[0], emb[1]))

    def grounding_proxy(self, context: str, response: str, threshold: float) -> float:
        sentences = [s.strip() for s in response.split(".") if s.strip()]
        if not sentences:
            return 0.0
        return sum(
            1 for s in sentences
            if self.similarity(context, s) >= threshold
        ) / len(sentences)

    def run(self, case: EvalCase, actual_output: str) -> dict:
        sim = self.similarity(case.expected, actual_output)
        passed = sim >= case.min_similarity

        result = {
            "name": case.name,
            "similarity": round(sim, 3),
            "threshold": case.min_similarity,
            "passed": passed,
            "critical": case.critical,
        }

        if case.check_grounding and case.context:
            gr = self.grounding_proxy(case.context, actual_output, case.grounding_threshold)
            result["grounding_proxy_ratio"] = round(gr, 3)
            # grounding failure overrides similarity pass; 0.85 is a hardcoded
            # minimum grounded ratio and should be calibrated like the other thresholds
            if gr < 0.85:
                result["passed"] = False
                result["grounding_fail"] = True

        self.results.append(result)
        return result

    def summary(self) -> dict:
        total = len(self.results)
        passed = sum(1 for r in self.results if r["passed"])
        critical_failed = [r for r in self.results if not r["passed"] and r.get("critical")]
        return {
            "total": total,
            "passed": passed,
            "failed": total - passed,
            "pass_at_1_proxy": round(passed / total, 3) if total else 0,
            "critical_failures": len(critical_failed),
            "critical_failure_names": [r["name"] for r in critical_failed],
        }

What this is missing for true production use: error handling and retry policy for the inference API; prompt/dataset/model versioning; raw output storage and judge traces; parallelism and batch execution; cache invalidation strategy; rubric/judge-based scoring (see next section); schema validation on YAML; and separation between offline eval runs and online monitoring. This runner is a structural skeleton – the right shape, not the finished building.

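The YAML config that drives it is below. Note that the runner applies the grounding check only when a case has both check_grounding: true and a context value, so the source text for these cases must be supplied alongside the config (in the file or when the cases are loaded); otherwise the check is silently skipped: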

eval_suite:
  name: "enji-fleet-feature-evals"
  embedding_model: "all-MiniLM-L6-v2"
  default_threshold: 0.80  # starting point; calibrate per task after baseline runs

  cases:
    - name: "ticket_summary_accuracy"
      prompt: "Summarize this ticket for routing: {ticket_text}"
      expected: "Backend SSO authentication failure affecting enterprise accounts, P1."
      min_similarity: 0.82
      check_grounding: true
      critical: true
      tags: ["ticket-router"]

    - name: "pr_risk_extraction"
      prompt: "List the key risks in this PR for an engineering lead: {diff}"
      expected: "Auth middleware modified without test coverage update. Potential regression on token refresh."
      min_similarity: 0.80
      check_grounding: true
      critical: true
      tags: ["pr-summary"]

    - name: "deployment_outcome_report"
      prompt: "Describe the outcome of this deployment event: {event_log}"
      expected: "Deployment completed. No incidents. Rollback not required."
      min_similarity: 0.78
      tags: ["deployment"]
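
The runner's list of production gaps includes schema validation on the YAML. A minimal sketch of that step, operating on the already-parsed list of case dicts, could look like the following. CaseConfig is a hypothetical stand-in that mirrors a subset of the EvalCase fields so the example is self-contained, and validate_cases and its warning text are assumptions, not part of the original runner.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CaseConfig:  # hypothetical stand-in mirroring a subset of EvalCase
    name: str
    prompt: str
    expected: str
    context: Optional[str] = None
    min_similarity: float = 0.80
    check_grounding: bool = False
    critical: bool = False
    tags: list = field(default_factory=list)

def validate_cases(raw_cases: list[dict], default_threshold: float = 0.80):
    """Return (cases, warnings) built from parsed YAML case dicts."""
    cases, warnings = [], []
    for i, raw in enumerate(raw_cases):
        missing = [k for k in ("name", "prompt", "expected") if not raw.get(k)]
        if missing:
            raise ValueError(f"case #{i} is missing required keys: {missing}")
        case = CaseConfig(
            name=raw["name"],
            prompt=raw["prompt"],
            expected=raw["expected"],
            context=raw.get("context"),
            min_similarity=float(raw.get("min_similarity", default_threshold)),
            check_grounding=bool(raw.get("check_grounding", False)),
            critical=bool(raw.get("critical", False)),
            tags=list(raw.get("tags", [])),
        )
        if not 0.0 <= case.min_similarity <= 1.0:
            raise ValueError(f"{case.name}: min_similarity must be in [0, 1]")
        if case.check_grounding and not case.context:
            # The runner would skip grounding silently; surface it instead.
            warnings.append(f"{case.name}: check_grounding is set but no context given")
        cases.append(case)
    return cases, warnings

raw_cases = [
    {"name": "ticket_summary_accuracy",
     "prompt": "Summarize this ticket for routing: {ticket_text}",
     "expected": "Backend SSO authentication failure affecting enterprise accounts, P1.",
     "min_similarity": 0.82, "check_grounding": True, "critical": True},
    {"name": "deployment_outcome_report",
     "prompt": "Describe the outcome of this deployment event: {event_log}",
     "expected": "Deployment completed. No incidents. Rollback not required."},
]
cases, warnings = validate_cases(raw_cases)
print(warnings)
```

Failing fast on missing keys and warning on a grounding check that cannot run keeps a misconfigured case from passing quietly.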

The metrics that matter: pass@1, semantic drift, hallucination rate

A mature LLM-eval setup tracks three categories of metrics separately. Mixing them makes it hard to know whether you have a quality problem, a stability problem, or an operational problem.

Quality metrics

pass@1 (single-run proxy) – Did the model produce an acceptable output on the first attempt, without retries? In the runner above, this is a single-run measure on a fixed eval set with controlled model parameters. For stochastic tasks, true pass@1 requires multiple runs and averaging. Our runner approximates it at a fixed temperature – useful for tracking trends, not a statistically rigorous guarantee.

Target for Enji Fleet features before staging: pass@1 ≥ 0.90.
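
To move from the single-run proxy toward a multi-run estimate, one option is to score several sampled outputs per case and average the per-case pass fractions. The sketch below assumes the similarity scores have already been computed; the function name pass_rate_over_runs and all scores shown are hypothetical.

```python
def pass_rate_over_runs(
    scores_by_case: dict[str, list[float]],
    thresholds: dict[str, float],
) -> dict:
    """Estimate pass@1 as the mean, over cases, of the fraction of runs that pass."""
    per_case = {}
    for name, scores in scores_by_case.items():
        if not scores:
            continue  # no runs recorded for this case
        passed = sum(1 for s in scores if s >= thresholds[name])
        per_case[name] = passed / len(scores)
    overall = sum(per_case.values()) / len(per_case) if per_case else 0.0
    return {"per_case": per_case, "overall": round(overall, 3)}

# Hypothetical similarity scores from four runs per case
scores_by_case = {
    "ticket_summary_accuracy": [0.86, 0.84, 0.79, 0.88],
    "deployment_outcome_report": [0.81, 0.80, 0.83, 0.79],
}
thresholds = {"ticket_summary_accuracy": 0.82, "deployment_outcome_report": 0.78}

estimate = pass_rate_over_runs(scores_by_case, thresholds)
print(estimate)  # per-case 0.75 and 1.0, overall 0.875
```

Keeping the per-case fractions next to the overall number matters for the same reason as with drift: a healthy average can hide one case that fails a quarter of the time.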

Rubric score – A score from an LLM judge against a task-specific rubric (see next section). More expensive than embedding similarity, but catches failures that embedding comparison misses.

Grounding proxy ratio – Fraction of response sentences traceable to source context via embedding similarity. This is a weak signal, not a hallucination detector. Treat it as a flag for manual review, not a proof of factuality.

Stability metrics

Semantic drift – Change in output distribution between two prompt versions, measured per-case rather than as a global average. A global average can mask regressions on specific subsets.

def measure_drift(
    runner: LLMEvalRunner,
    cases: list[EvalCase],
    outputs_v1: list[str],
    outputs_v2: list[str]
) -> dict:
    per_case = []
    for case, o1, o2 in zip(cases, outputs_v1, outputs_v2):
        s1 = runner.similarity(case.expected, o1)
        s2 = runner.similarity(case.expected, o2)
        per_case.append({
            "name": case.name,
            "critical": case.critical,
            "delta": round(s2 - s1, 4),
            "regression": (s2 - s1) < -0.05,  # drop of more than 0.05 in absolute similarity (not 5%)
        })

    regressions = [c for c in per_case if c["regression"]]
    critical_regressions = [c for c in regressions if c["critical"]]
    return {
        "per_case": per_case,
        "regression_count": len(regressions),
        "critical_regressions": len(critical_regressions),
        "critical_regression_names": [c["name"] for c in critical_regressions],
    }

A drop of more than 0.05 on any single case triggers a review. Any regression on a critical: true case is a hard block, regardless of the average.
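
The review-versus-block rule can be expressed as a small function over the report that measure_drift returns. This is a sketch: the name drift_gate and the three return labels are assumptions, and the sample reports are invented.

```python
def drift_gate(report: dict) -> str:
    """Map a measure_drift-style report to 'block', 'review', or 'pass'."""
    if report.get("critical_regressions", 0) > 0:
        return "block"   # any regression on a critical case stops the change
    if report.get("regression_count", 0) > 0:
        return "review"  # non-critical regressions need a human look
    return "pass"

# Hypothetical reports shaped like the measure_drift return value
print(drift_gate({"regression_count": 2, "critical_regressions": 0}))  # review
print(drift_gate({"regression_count": 1, "critical_regressions": 1}))  # block
print(drift_gate({"regression_count": 0, "critical_regressions": 0}))  # pass
```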

Variance across runs – For non-deterministic outputs, how much do scores vary at the same prompt and input? High variance on a feature that should be stable is a signal to add stricter output constraints.
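
A minimal way to quantify that variance is the standard deviation of similarity scores across repeated runs of the same case. In the sketch below, the helper name unstable_cases, the 0.03 cutoff, and the scores are all hypothetical; the cutoff would need the same per-task calibration as the similarity thresholds.

```python
import statistics

def unstable_cases(scores_by_case: dict[str, list[float]], max_stdev: float = 0.03) -> list[str]:
    """Return case names whose scores spread more than max_stdev across runs."""
    return [
        name
        for name, scores in scores_by_case.items()
        if len(scores) > 1 and statistics.pstdev(scores) > max_stdev
    ]

# Hypothetical scores from three runs at the same prompt and input
runs = {
    "deployment_outcome_report": [0.85, 0.86, 0.84],
    "pr_risk_extraction": [0.90, 0.70, 0.82],
}
print(unstable_cases(runs))  # ['pr_risk_extraction']
```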

Operational metrics

Cost per eval run – Track this. On a suite of 80+ cases, uncached eval runs against a hosted model add up quickly. With content-hash caching (prompt + model version + input → cached output), you can reduce API calls substantially on stable features, though exact numbers depend on your eval frequency, model pricing, and cache hit rate.

Cache hit rate – Proportion of eval cases served from cache vs. requiring a live model call. Low cache hit rate on a stable eval suite usually means your cache key is too granular or your inputs are unnecessarily variable.

Eval latency – Full suite runtime. If it's too slow, engineers will skip it or only run it pre-release, which is too late.

Quality metrics bring up the elephant in the room: embedding similarity and the grounding proxy are fast and cheap, but they don't actually understand what the output says. For cases where that matters – where an engineering lead is going to read and act on the output – you need something that does.

LLM-as-a-Judge: adding a rubric layer

Embedding similarity is fast and cheap but blind to reasoning quality, factual accuracy beyond surface semantics, and task-specific criteria. For high-stakes eval cases – especially anything that affects what an engineering lead sees or acts on – you need a second layer: an LLM judge evaluating against an explicit rubric.

import json

import anthropic

def llm_judge(
    task_description: str,
    rubric: str,
    actual_output: str,
    context: str = ""
) -> dict:
    # The client reads ANTHROPIC_API_KEY from the environment
    client = anthropic.Anthropic()
    prompt = f"""You are an evaluator for AI-generated engineering content.

Task: {task_description}
{"Context provided to the model: " + context if context else ""}
Model output: {actual_output}

Evaluate against this rubric:
{rubric}

Respond ONLY with valid JSON in this exact format:
{{
  "score": <integer 1-5>,
  "passed": <true if score >= 4, false otherwise>,
  "reasoning": "<one sentence>",
  "critical_issues": ["<issue if any>"]
}}"""

    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    # Raises json.JSONDecodeError if the judge returns anything but bare JSON
    return json.loads(message.content[0].text)

# Example: PR risk extraction rubric
rubric = """
Score 5: Identifies all major risks with correct severity. No invented risks.
Score 4: Identifies most risks. Minor omissions. No hallucination.
Score 3: Partially correct. Missing 1-2 significant risks OR minor hallucination.
Score 2: Major omissions or clear hallucination of risk that doesn't exist.
Score 1: Fundamentally wrong or mostly hallucinated.
"""

This is more expensive – one judge call per eval case – so we run rubric evals selectively: always on critical: true cases, and on the full suite only nightly or pre-release. For fast CI feedback, embedding similarity runs on every PR; the judge layer runs on the cases that actually changed.
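
Because the judge's reply is itself model output, it is worth parsing defensively before it feeds a CI decision. The sketch below assumes the JSON format requested in the judge prompt; parse_judge_reply is a hypothetical helper, and the sample reply is invented.

```python
import json
import re

def parse_judge_reply(raw: str, pass_score: int = 4) -> dict:
    """Extract and validate the judge verdict from a raw text reply."""
    # Models sometimes add prose or Markdown fences around the JSON object,
    # so take everything from the first '{' to the last '}'.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("judge reply contains no JSON object")
    verdict = json.loads(match.group(0))

    score = verdict.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"judge score missing or out of range: {score!r}")

    # Recompute 'passed' locally rather than trusting the judge's own flag.
    verdict["passed"] = score >= pass_score
    verdict.setdefault("reasoning", "")
    verdict.setdefault("critical_issues", [])
    return verdict

# Hypothetical reply where the judge's 'passed' flag contradicts its score
raw_reply = (
    'Here is my evaluation:\n'
    '{"score": 3, "passed": true, "reasoning": "Misses the token refresh risk.", '
    '"critical_issues": ["token refresh regression not mentioned"]}'
)
verdict = parse_judge_reply(raw_reply)
print(verdict["score"], verdict["passed"])  # 3 False
```

Deriving passed from the score keeps the pass rule in your code, where it can be versioned and reviewed, instead of inside the judge's response.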

Here's a concrete example of how the two layers worked together – and what would have shipped without them.

Case study: a ticket router regression we almost shipped

When upgrading the underlying model for Enji Fleet's ticket classification feature, unit tests stayed green. The routing logic was unchanged. Our embedding-based eval suite caught a meaningful drop in semantic similarity on one specific case category: tickets where the title described one thing and the description described something different.

The model was consistently prioritizing title over description, which meant about one in eight ambiguous tickets got misrouted in testing. We don't have instrumentation on what that rate would have been in production – we caught it before it got there.

Three iterations:

Round 1: Added six new eval cases specifically covering title/description conflicts. pass@1 baseline: 0.81.

Round 2: Updated the prompt to explicitly weight description over title when the two conflict. Re-ran embedding evals: pass@1 → 0.89. Still below threshold. Added rubric judge evaluation for the six new critical cases.

Round 3: Refined the conflict-resolution instruction based on judge feedback. Embedding pass@1 → 0.93. Judge scores: 4.6/5 average on the critical set. Grounding proxy ratio: improved from 0.71 to 0.89.

The feature was cleared for staging. The eval suite for the ticket router now runs on every PR that touches src/features/ticket_router/ or prompts/router/ in CI, with a pass@1 ≥ 0.90 gate.
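
A pass-rate gate of this kind can be sketched as a function that turns the runner's summary() dict into a process exit code. The function name ci_exit_code is hypothetical, and failing on any critical failure in addition to the 0.90 pass-rate gate is an assumption of this sketch rather than something stated above.

```python
def ci_exit_code(summary: dict, min_pass_rate: float = 0.90) -> int:
    """Return 0 if the suite clears the gate, 1 otherwise."""
    # 'critical_failures' may be a count or a list of names; both are falsy when empty.
    if summary.get("critical_failures"):
        return 1
    if summary.get("pass_at_1_proxy", 0.0) < min_pass_rate:
        return 1
    return 0

# Hypothetical summaries shaped like the runner's summary() output
print(ci_exit_code({"pass_at_1_proxy": 0.93, "critical_failures": 0}))  # 0
print(ci_exit_code({"pass_at_1_proxy": 0.89, "critical_failures": 0}))  # 1
print(ci_exit_code({"pass_at_1_proxy": 0.95, "critical_failures": ["ticket_summary_accuracy"]}))  # 1
```

In a CI script the result would be passed to sys.exit so the job status reflects the gate.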

Beyond the ticket router, we also run evals on PR risk extraction and sprint health digest generation – both features where a senior engineer sees and acts on the output directly. For these, the rubric judge runs on every eval, not just nightly, because the cost of a semantic failure is higher than the cost of an extra API call.

Running evals across multiple features and having them be part of CI raises the practical question of cost and latency. Here's how to keep both under control without gutting coverage.

Scaling LLM-evals: cost optimization and caching strategies

Two strategies that reduce eval costs without sacrificing coverage:

Cache by content hash. If the prompt template, model version, and input haven't changed, recompute nothing – return the cached output. This is especially effective on stable features between releases.

import hashlib
import json

def eval_cache_key(model: str, prompt_template: str, input_data: dict) -> str:
    payload = json.dumps(
        {"model": model, "prompt": prompt_template, "input": input_data},
        sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()

Cache invalidation matters: when you update a prompt template or bump the model version, those keys should naturally miss. Make model version and prompt hash part of the key, not just the input.
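
To show how the key is used, here is a minimal in-memory sketch that wraps a model call and tracks the cache hit rate. OutputCache, fake_llm, and the model version strings are hypothetical; a real setup would persist the store between CI runs. The key function is repeated so the example is self-contained.

```python
import hashlib
import json
from typing import Callable

def eval_cache_key(model: str, prompt_template: str, input_data: dict) -> str:
    payload = json.dumps(
        {"model": model, "prompt": prompt_template, "input": input_data},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

class OutputCache:
    """In-memory cache of model outputs keyed by content hash (illustrative only)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, model: str, prompt_template: str, input_data: dict,
                    call_llm: Callable[[str], str]) -> str:
        key = eval_cache_key(model, prompt_template, input_data)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        output = call_llm(prompt_template.format(**input_data))
        self._store[key] = output
        return output

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def fake_llm(prompt: str) -> str:  # stand-in for a real model call
    return f"summary of: {prompt}"

cache = OutputCache()
template = "Summarize this ticket for routing: {ticket_text}"
ticket = {"ticket_text": "Login fails intermittently for enterprise accounts"}

cache.get_or_call("model-v1", template, ticket, fake_llm)  # miss
cache.get_or_call("model-v1", template, ticket, fake_llm)  # hit
cache.get_or_call("model-v2", template, ticket, fake_llm)  # miss: model version changed
print(cache.hits, cache.misses)  # 1 2
```

The third call misses even though the input is identical, which is the invalidation behavior described above: a model bump changes the key, so stale outputs are never served.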

Run selectively in CI. Tag each eval case with the source paths it depends on. On a PR, only run the cases whose trigger_paths overlap with the changed files. Full suite runs happen nightly and on release branches.

cases:
  - name: "pr_risk_extraction"
    tags: ["pr-summary", "core"]
    trigger_paths:
      - "src/features/pr_summary/**"
      - "prompts/pr_review/**"

This keeps per-PR eval runtime manageable, and reserves full-suite runs for the moments that actually need them. The right split between fast feedback and thorough coverage depends on your team's release cadence and how many features share eval cases.
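
Selecting cases by trigger_paths can be sketched with the standard library's fnmatch module. The function name select_cases and the changed-file list are hypothetical; in CI the list would typically come from something like git diff --name-only. Note that fnmatch's "*" matches across "/" as well, so a "**" suffix behaves as a recursive wildcard here.

```python
from fnmatch import fnmatchcase

def select_cases(cases: list[dict], changed_files: list[str]) -> list[str]:
    """Return names of cases whose trigger_paths match any changed file."""
    selected = []
    for case in cases:
        patterns = case.get("trigger_paths", [])
        if any(fnmatchcase(path, pattern)
               for path in changed_files
               for pattern in patterns):
            selected.append(case["name"])
    return selected

cases = [
    {"name": "pr_risk_extraction",
     "trigger_paths": ["src/features/pr_summary/**", "prompts/pr_review/**"]},
    {"name": "ticket_summary_accuracy",
     "trigger_paths": ["src/features/ticket_router/**", "prompts/router/**"]},
]
# Hypothetical list of files changed in a PR
changed_files = ["prompts/router/system.txt", "README.md"]
print(select_cases(cases, changed_files))  # ['ticket_summary_accuracy']
```

In this sketch, cases without trigger_paths are never selected on a PR and only run in the full nightly suite; whether that default is right depends on how much you trust the path tagging.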

If you've read this far and want to just start, here's everything in one place.

The complete config: copy-paste your first LLM-eval suite

Here's the minimal working setup: a config-driven, CI-gated embedding eval runner. The rubric judge from the earlier section is not wired into this script; add it for your critical cases. Wire in your LLM call and calibrate thresholds against your own baseline before treating any number as a gate.

llm_eval.py:

import sys

import yaml
from sentence_transformers import SentenceTransformer, util

class EvalRunner:
    def __init__(self, config_path: str):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        self.model = SentenceTransformer(
            self.config["eval_suite"].get("embedding_model", "all-MiniLM-L6-v2")
        )
        self.results = []

    def similarity(self, a: str, b: str) -> float:
        emb = self.model.encode([a, b], convert_to_tensor=True)
        return float(util.cos_sim(emb[0], emb[1]))

    def grounding_proxy(self, context: str, response: str, threshold: float = 0.78) -> float:
        sentences = [s.strip() for s in response.split(".") if s.strip()]
        if not sentences:
            return 0.0
        return sum(1 for s in sentences if self.similarity(context, s) >= threshold) / len(sentences)

    def run_case(self, case: dict, actual_output: str) -> dict:
        default_threshold = self.config["eval_suite"].get("default_threshold", 0.80)
        threshold = case.get("min_similarity", default_threshold)
        sim = self.similarity(case["expected"], actual_output)
        passed = sim >= threshold

        result = {
            "name": case["name"],
            "similarity": round(sim, 3),
            "passed": passed,
            "critical": case.get("critical", False),
        }

        if case.get("check_grounding") and case.get("context"):
            gr = self.grounding_proxy(
                case["context"], actual_output,
                case.get("grounding_threshold", 0.78)
            )
            result["grounding_proxy_ratio"] = round(gr, 3)
            if gr < 0.85:
                result["passed"] = False
                result["grounding_fail"] = True

        self.results.append(result)
        return result

    def summary(self) -> dict:
        n = len(self.results)
        p = sum(1 for r in self.results if r["passed"])
        critical_failed = [r["name"] for r in self.results if not r["passed"] and r.get("critical")]
        return {
            "total": n, "passed": p, "failed": n - p,
            "pass_at_1_proxy": round(p / n, 3) if n else 0,
            "critical_failures": critical_failed,
        }

if __name__ == "__main__":
    runner = EvalRunner("llm-evals.yaml")

    # Replace this with your actual LLM call, and fill the {input} placeholder
    # in each prompt with real data before sending it
    def call_llm(prompt: str) -> str:
        return "stub -- replace with real model call"

    for case in runner.config["eval_suite"]["cases"]:
        output = call_llm(case["prompt"])
        result = runner.run_case(case, output)
        status = "✅" if result["passed"] else "❌"
        extras = []
        if "grounding_proxy_ratio" in result:
            extras.append(f"grounding: {result['grounding_proxy_ratio']}")
        print(f"{status} {result['name']} -- similarity: {result['similarity']}" +
              (f" | {', '.join(extras)}" if extras else ""))

    summary = runner.summary()
    print(f"\npass@1 proxy: {summary['pass_at_1_proxy']} ({summary['passed']}/{summary['total']})")
    if summary["critical_failures"]:
        print(f"CRITICAL FAILURES: {summary['critical_failures']}")
    sys.exit(0 if summary["failed"] == 0 else 1)

llm-evals.yaml:

eval_suite:
  name: "my-first-llm-evals"
  embedding_model: "all-MiniLM-L6-v2"
  default_threshold: 0.80  # starting point -- run baseline first, then calibrate

  cases:
    - name: "summary_accuracy"
      prompt: "Summarize for routing: {input}"
      expected: "Backend authentication failure affecting enterprise logins."
      min_similarity: 0.80
      critical: true

    - name: "grounded_status_report"
      prompt: "Based only on this context, report the outcome: {input}"
      context: "Deployment completed at 14:30 UTC. Zero errors logged. No rollback."
      expected: "Deployment finished at 14:30 UTC without errors."
      min_similarity: 0.82
      check_grounding: true
      grounding_threshold: 0.78

pip install sentence-transformers pyyaml anthropic
python llm_eval.py
# exits 1 if any case fails -- wire directly into CI

Unit tests remain the right tool for deterministic logic. The moment an LLM enters the call chain, you need a second layer that evaluates meaning, consistency, and groundedness – not just structure and format.

The pipeline above is a starting skeleton, not a finished framework. What makes it useful is the structure it demonstrates: the separation between quality, stability, and operational metrics; the grounding proxy as a weak signal you treat honestly; the judge layer for cases where the stakes are high enough to justify it; and the CI gate that makes evals part of the shipping process rather than an afterthought.

Calibrate your thresholds against your own baseline data. What works for ticket classification will be wrong for code review commentary. Run a baseline on known-good outputs first, see where your scores cluster, then set your thresholds just below that cluster – conservative enough to catch real failures, permissive enough not to block on normal model variance.
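
As a starting point for that calibration, the sketch below suggests a threshold from baseline similarity scores by taking a low percentile and subtracting a small margin. The function name calibrate_threshold, the 10th percentile, the 0.02 margin, and the baseline scores are all assumptions for illustration, not values from our pipeline.

```python
import math

def calibrate_threshold(baseline_scores: list[float],
                        percentile: int = 10,
                        margin: float = 0.02) -> float:
    """Suggest a threshold just below the low end of known-good scores."""
    if not baseline_scores:
        raise ValueError("need at least one baseline score")
    ordered = sorted(baseline_scores)
    # Nearest-rank percentile: with enough baseline runs, a single outlier
    # at the bottom no longer sets the threshold on its own.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    low_end = ordered[rank - 1]
    return round(low_end - margin, 3)

# Hypothetical similarity scores from known-good outputs for one task
baseline = [0.86, 0.88, 0.87, 0.91, 0.85, 0.89, 0.90, 0.88, 0.87, 0.86]
print(calibrate_threshold(baseline))  # 0.83
```

With only ten scores the 10th percentile is simply the minimum, so the suggestion here is 0.85 minus the margin. Treat the output as a proposal to review per task, not a number to apply automatically.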

If you're using Enji Fleet to ship AI-generated features, the eval config maps directly onto Fleet's task output format. Ask us about the integration.