LLM-as-a-Judge
Intro
LLM-as-a-judge is an evaluation pattern where one model grades another model's output against an explicit rubric. It's useful for scalable, semantics-aware regression testing when human labels are expensive or slow. The judge reads the question, the candidate answer, and optionally a reference context, then returns a structured verdict.
Two judging modes cover most use cases. Absolute scoring (rubric scorecards) assigns a numeric score per dimension, like correctness 0-2 or groundedness 0-5. Relative preference (pairwise comparisons) shows the judge two candidate answers side-by-side and asks which is better. Absolute scoring works when you need hard pass/fail thresholds. Pairwise works when quality is subjective or you're iterating quickly and care about "better than baseline" more than a specific number.
The core workflow: define a rubric, write a judge prompt that enforces it, run the judge at scale, and periodically spot-check its verdicts against human labels to catch drift.
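The workflow above can be sketched as a minimal harness. This is an illustrative sketch, not a specific API: `call_judge` is a stubbed placeholder for whatever LLM client you actually use, and the rubric string is a toy example.

```python
import json

# Toy rubric; a real one would define anchors for every dimension.
RUBRIC = "Score correctness 0-2: 0 wrong, 1 partially correct, 2 correct."

def call_judge(prompt: str) -> str:
    # Stub: a real implementation calls an LLM API here and returns its text.
    return '{"score": 2, "rationale": "Matches the reference policy."}'

def judge_one(question: str, answer: str, reference: str) -> dict:
    prompt = (
        f"{RUBRIC}\n"
        f"QUESTION: {question}\n"
        f"REFERENCE: {reference}\n"
        f"ANSWER: {answer}\n"
        'Output JSON only with keys: score, rationale.'
    )
    verdict = json.loads(call_judge(prompt))
    # Fail fast on malformed verdicts so bad judge output never pollutes results.
    assert verdict["score"] in (0, 1, 2)
    return verdict

verdicts = [judge_one("Refund window?", "30 days.", "Refunds: 30 days.")]
```

Running this at scale is just mapping `judge_one` over a test set; the spot-checking step compares a sample of `verdicts` against human labels.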
Rubric Scorecards
Rubric scorecards measure multiple dimensions of an LLM output using a small, consistent scale with clear scoring anchors. Each dimension gets its own score so you can see exactly where a response fails.
Good rubrics:
- Are explicit and testable (define what a 0, 1, and 2 each mean in concrete terms).
- Separate concerns (don't mix correctness and tone in one score).
- Are calibrated through periodic human spot checks and judge agreement tracking.
- Include required evidence when needed (citations, quotes, tool outputs).
Common dimensions:
- Correctness (factual and task correctness)
- Groundedness (claims supported by provided sources)
- Safety/policy compliance
- Actionability (clear next steps)
- Format compliance (schema, required fields)
Scorecard example (0-2 scale) for a support assistant:
Correctness:
- 0: wrong policy / wrong action
- 1: partially correct
- 2: correct
Groundedness:
- 0: unsupported claims
- 1: mixed or unclear
- 2: all key claims supported by sources
Safety:
- 0: unsafe or policy violation
- 1: questionable
- 2: safe
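A scorecard like this maps naturally onto a small record type with a pass/fail gate. This is a sketch under one assumption not stated in the text: the gate requires a perfect safety score and a minimum score on the other dimensions, which is one common way to turn per-dimension scores into a hard threshold.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    correctness: int   # 0-2 per the rubric above
    groundedness: int  # 0-2
    safety: int        # 0-2

    def passes(self, min_per_dim: int = 2, safety_floor: int = 2) -> bool:
        # Assumed gating policy: safety must be perfect; the other
        # dimensions must each meet the minimum threshold.
        return (self.safety >= safety_floor
                and self.correctness >= min_per_dim
                and self.groundedness >= min_per_dim)

assert Scorecard(2, 2, 2).passes()
assert not Scorecard(2, 1, 2).passes()  # groundedness below threshold
assert not Scorecard(2, 2, 1).passes()  # any safety issue fails the gate
```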
Pairwise Comparisons
Pairwise comparisons evaluate two candidate outputs side-by-side and pick the better one. Humans and judge models are generally better at relative preference than absolute scores, which makes pairwise more reliable when quality is subjective or multi-dimensional.
Pairwise results aggregate naturally into rankings: win-rate percentages or Elo-style ratings across a test set. This makes it easy to compare prompt versions or model checkpoints without needing a fixed numeric threshold.
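Win-rate aggregation can be sketched in a few lines. One assumption here that the text does not specify: ties count as half a win for each side, which is a common convention when comparing a candidate against a baseline.

```python
from collections import Counter

def win_rate(verdicts: list[str]) -> float:
    """Fraction of pairs won by the candidate; each verdict is
    "candidate", "baseline", or "tie". Ties count as half a win."""
    counts = Counter(verdicts)
    return (counts["candidate"] + 0.5 * counts["tie"]) / len(verdicts)

verdicts = ["candidate", "candidate", "tie", "baseline"]
assert win_rate(verdicts) == 0.625  # (2 + 0.5) / 4
```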
To make pairwise reliable:
- Use a clear rubric for what "better" means (correctness first, then groundedness, then style).
- Randomize which answer appears as A vs B to control for position bias.
- Include "tie" as a valid output when both answers are acceptable.
Pairwise judge prompt (rubric-first):
You are evaluating two answers to the same question.
Choose the better answer.
Priority order: correctness > groundedness > safety > clarity.
Output JSON only: {"winner": "A", "rationale": "..."} (winner must be "A", "B", or "tie")
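The A/B randomization from the checklist above can be sketched as a wrapper that shuffles positions before judging and maps the verdict back to the underlying systems. `judge_fn` is a placeholder for a real judge call; the length-biased stub is for illustration only.

```python
import random

def judged_pair(ans_x: str, ans_y: str, judge_fn, rng: random.Random) -> str:
    """Randomize which answer is shown as A, then translate the judge's
    "A"/"B"/"tie" verdict back to "x", "y", or "tie"."""
    if rng.random() < 0.5:
        a, b, mapping = ans_x, ans_y, {"A": "x", "B": "y"}
    else:
        a, b, mapping = ans_y, ans_x, {"A": "y", "B": "x"}
    verdict = judge_fn(a, b)  # expected to return "A", "B", or "tie"
    return mapping.get(verdict, "tie")

# Stub judge that always prefers the longer answer (illustration only).
longer = lambda a, b: "A" if len(a) > len(b) else "B"
rng = random.Random(0)
result = judged_pair("short", "a much longer answer", longer, rng)
assert result == "y"  # the longer answer wins whichever position it lands in
```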
Judge Prompt Design
The judge prompt is the most important lever. A vague prompt produces noisy, unreliable scores. A well-structured prompt locks in the rubric, specifies the output format, and gives the judge the reference context it needs to evaluate groundedness.
Groundedness-focused judge prompt template:
System: You are a strict evaluator. Score from 0 to 5.
Rules:
- Only use the provided REFERENCE to judge factual correctness.
- If the ANSWER claims facts not supported by REFERENCE, penalize heavily.
- Output JSON only. Required keys: score (0-5 integer), rationale (string), unsupported_claims (array of strings).
User:
QUESTION:
<question>
REFERENCE:
<snippets or retrieved passages>
ANSWER:
<candidate answer>
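Rendering the template and validating the judge's JSON can be sketched as follows. The validation schema mirrors the rules in the system prompt above (integer score 0-5, `rationale` string, `unsupported_claims` array); the function names are illustrative, not a specific library's API.

```python
import json

def build_user_prompt(question: str, reference: str, answer: str) -> str:
    # Mirrors the QUESTION / REFERENCE / ANSWER layout of the template.
    return (f"QUESTION:\n{question}\n\n"
            f"REFERENCE:\n{reference}\n\n"
            f"ANSWER:\n{answer}")

def parse_verdict(raw: str) -> dict:
    """Parse and validate the judge's JSON output against the required keys."""
    v = json.loads(raw)
    if not (isinstance(v.get("score"), int) and 0 <= v["score"] <= 5):
        raise ValueError("score must be an integer in 0-5")
    if not isinstance(v.get("rationale"), str):
        raise ValueError("rationale must be a string")
    if not isinstance(v.get("unsupported_claims"), list):
        raise ValueError("unsupported_claims must be an array")
    return v

v = parse_verdict('{"score": 4, "rationale": "ok", "unsupported_claims": []}')
assert v["score"] == 4
```

Rejecting malformed verdicts at parse time keeps one bad judge response from silently skewing aggregate scores.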
Calibration tips:
- Treat the judge as a test harness: define rubric, scale, and required evidence before writing the prompt.
- Spot-check judge outputs with humans, track agreement, and update the rubric or prompt when drift appears.
- Reduce noise by running multiple judgments (different seeds or models) and aggregating with median or majority vote.
- Defend against gaming by keeping rubrics specific and including reference context for groundedness checks.
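The noise-reduction tip above can be sketched with median aggregation over repeated judgments. The score lists here stand in for repeated judge calls with different seeds or models.

```python
from statistics import median

def aggregate(scores: list[int]) -> float:
    """Median over repeated judgments; robust to a single outlier verdict."""
    return median(scores)

assert aggregate([4, 5, 2]) == 4  # the outlier 2 does not drag the result down
assert aggregate([3, 3, 5]) == 3
```

For binary pass/fail verdicts, majority vote plays the same role as the median does for numeric scores.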
Pitfalls
Verbosity bias — judge models prefer longer, more detailed answers even when a shorter answer is correct and sufficient. In one production eval, a 3-sentence correct answer scored 3/5 while a 12-sentence answer with minor inaccuracies scored 4/5. Mitigation: add a conciseness dimension to the rubric, include calibration examples where short answers score full marks, and cap acceptable length in the judge prompt.
Position bias in pairwise — when the same answer appears as A in one run and B in another, judges prefer whichever position they see first. In a 100-pair experiment, answer A won 62% of the time regardless of content. Mitigation: always randomize A/B order and verify that win-rates are symmetric (within 5% tolerance). If bias persists, run each pair twice (swapped) and take the consistent verdict.
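The swap-and-rerun mitigation can be sketched directly: judge each pair twice with positions swapped and keep the verdict only when the two runs agree, otherwise record a tie. `judge_fn` is again a placeholder for a real judge call.

```python
def debiased_verdict(judge_fn, ans_x: str, ans_y: str) -> str:
    """Judge both orderings; return the verdict only if they are consistent."""
    run1 = judge_fn(ans_x, ans_y)   # "A", "B", or "tie"
    run2 = judge_fn(ans_y, ans_x)   # positions swapped
    flipped = {"A": "B", "B": "A", "tie": "tie"}[run2]
    return run1 if run1 == flipped else "tie"  # inconsistent -> tie

# A position-biased stub judge that always picks whatever sits in slot A:
always_a = lambda a, b: "A"
assert debiased_verdict(always_a, "x", "y") == "tie"  # bias neutralized
```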
Prompt sensitivity — small wording changes in the judge prompt can shift scores by 0.5-1.0 points on a 5-point scale. Changing "evaluate correctness" to "grade factual accuracy" produced a 0.7-point average shift in one eval pipeline. Mitigation: lock the judge prompt in version control, run regression checks when you change it, and treat prompt changes like code changes with tests and review.
Hidden coupling (self-preference) — if the judge model is the same model or fine-tune as the candidate, it rewards its own style and penalizes outputs from other models. A Claude judge gave Claude answers 0.8 points higher on average than GPT-4 answers of equivalent human-rated quality. Mitigation: use a different model family for judging, or validate judge scores against human labels on a diverse multi-model sample.
Calibration drift — judge behavior shifts when the underlying model receives updates. A model update that improved reasoning also made the judge stricter on formatting, causing 15% more failures on an unchanged golden set. Mitigation: maintain a fixed gold dataset with known human labels and re-run calibration after every model update. Alert if agreement with human labels drops below 80% on binary pass/fail.
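The calibration-drift mitigation reduces to a small agreement check over the fixed gold set. The 80% threshold below comes from the alerting rule described above; the function names are illustrative.

```python
def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of gold-set items where the judge's pass/fail verdict
    matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

def drifted(judge_labels, human_labels, threshold: float = 0.80) -> bool:
    # Alert when agreement with human labels drops below the threshold.
    return agreement(judge_labels, human_labels) < threshold

judge = [True, True, False, True, False]
human = [True, True, True, True, False]
assert agreement(judge, human) == 0.8
assert not drifted(judge, human)  # exactly at threshold: no alert yet
```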
Questions
Q: When should you use an LLM judge instead of deterministic metrics, and how do you validate the judge itself?
Expected answer:
- Use judges for open-ended generation where semantics matter and deterministic metrics (exact match, BLEU) can't capture quality.
- Use classic metrics for deterministic outputs or when you need hard guarantees.
- The key signal: if a human would need to read the answer to evaluate it, a judge model probably should too.
- Measure judge trustworthiness by checking agreement with a small human-labeled set.
- Track drift over time by re-running a fixed gold dataset after model updates.
- Why: judge reliability is not assumed — it must be validated and maintained like any other test harness.
Q: When is pairwise comparison preferable to rubric scorecards, and why?
Expected answer:
- Pairwise works best when iterating rapidly and the goal is "better than baseline" rather than a specific threshold.
- Scorecards work better when you need hard pass/fail criteria, want to track specific dimensions over time, or need to gate a release on a minimum score.
- Pairwise results aggregate into win-rates or Elo ratings, which are useful for comparing prompt versions or model checkpoints.
- Why: relative preference is cognitively easier for both humans and models than assigning an absolute score, so pairwise tends to produce more consistent verdicts on subjective quality.
Q: What systematic biases affect judge models, and how do you mitigate each?
Expected answer:
- Verbosity bias: judges prefer longer answers even when shorter ones are correct. Mitigate with a conciseness dimension and length caps.
- Position bias: in pairwise, judges favor whichever answer appears first. Always randomize A/B order.
- Hidden coupling: using the same model as judge and candidate inflates scores. Use a different judge model.
- Calibration drift: judge behavior shifts as the underlying model is updated. Maintain a gold dataset and re-run calibration periodically.
- Why: these biases are systematic, not random — they silently corrupt your eval signal and can cause you to ship regressions.