LLM-as-a-Judge

Intro

LLM-as-a-judge is an evaluation pattern where one model grades another model's output against an explicit rubric. It's useful for scalable, semantics-aware regression testing when human labels are expensive or slow. The judge reads the question, the candidate answer, and optionally a reference context, then returns a structured verdict.

Two judging modes cover most use cases. Absolute scoring (rubric scorecards) assigns a numeric score per dimension, like correctness 0-2 or groundedness 0-5. Relative preference (pairwise comparisons) shows the judge two candidate answers side-by-side and asks which is better. Absolute scoring works when you need hard pass/fail thresholds. Pairwise works when quality is subjective or you're iterating quickly and care about "better than baseline" more than a specific number.

The core workflow: define a rubric, write a judge prompt that enforces it, run the judge at scale, and periodically spot-check its verdicts against human labels to catch drift.
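The run-at-scale step can be sketched in Python. Here `call_judge` is a hypothetical stand-in for a real LLM API call, stubbed with a fixed verdict so the loop structure is visible:

```python
import json

def call_judge(question: str, answer: str, reference: str) -> str:
    # Hypothetical judge call; in practice this wraps an LLM API request
    # built from the judge prompt. Stubbed verdict for illustration.
    return json.dumps({"score": 2, "rationale": "matches reference"})

def evaluate(cases):
    """Run the judge over a test set and collect structured verdicts."""
    verdicts = []
    for case in cases:
        raw = call_judge(case["question"], case["answer"], case["reference"])
        verdicts.append(json.loads(raw))
    return verdicts

cases = [{"question": "Refund window?",
          "answer": "30 days.",
          "reference": "Refunds are accepted within 30 days."}]
print(evaluate(cases))
```

Keeping the verdicts as structured records makes the later spot-check against human labels straightforward.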

Rubric Scorecards

Rubric scorecards measure multiple dimensions of an LLM output using a small, consistent scale with clear scoring anchors. Each dimension gets its own score so you can see exactly where a response fails.

Good rubrics:
- use a small, consistent scale (0-2 or 0-5) with a written anchor for every score
- score one dimension per axis, so a failure is attributable to a specific dimension
- define pass/fail in terms of the anchors, not the judge's overall impression

Common dimensions:
- correctness: is the answer factually and procedurally right?
- groundedness: are the answer's claims supported by the reference?
- safety: does the answer avoid unsafe content and policy violations?
- clarity and conciseness: is the answer readable and no longer than it needs to be?

Scorecard example (0-2 scale) for a support assistant:

Correctness:
0: wrong policy / wrong action
1: partially correct
2: correct

Groundedness:
0: unsupported claims
1: mixed or unclear
2: all key claims supported by sources

Safety:
0: unsafe or policy violation
1: questionable
2: safe
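A scorecard like this can gate releases with per-dimension minimums. A minimal sketch, assuming the judge's verdict arrives as a dict keyed by dimension; the thresholds below are illustrative, not prescribed:

```python
def passes(scorecard, thresholds=None):
    """Pass only if every dimension meets its minimum score.

    Default thresholds are an illustrative choice: perfect marks
    required on all three dimensions of the 0-2 scale.
    """
    if thresholds is None:
        thresholds = {"correctness": 2, "groundedness": 2, "safety": 2}
    # A missing dimension counts as 0, i.e. an automatic failure.
    return all(scorecard.get(dim, 0) >= minimum
               for dim, minimum in thresholds.items())

print(passes({"correctness": 2, "groundedness": 2, "safety": 2}))
print(passes({"correctness": 1, "groundedness": 2, "safety": 2}))
```

Because each dimension is scored separately, a failing response tells you which dimension to fix rather than just that it failed.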

Pairwise Comparisons

Pairwise comparisons evaluate two candidate outputs side-by-side and pick the better one. Humans and judge models are generally better at relative preference than absolute scores, which makes pairwise more reliable when quality is subjective or multi-dimensional.

Pairwise results aggregate naturally into rankings: win-rate percentages or Elo-style ratings across a test set. This makes it easy to compare prompt versions or model checkpoints without needing a fixed numeric threshold.
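Both aggregations are small to implement. A sketch: `win_rate` counts decided (non-tie) pairs, and `elo_update` is the standard Elo formula (the K-factor of 32 is a conventional but arbitrary choice):

```python
def win_rate(results, side="A"):
    """Fraction of non-tie comparisons won by `side`."""
    decided = [r for r in results if r != "tie"]
    if not decided:
        return 0.0
    return sum(r == side for r in decided) / len(decided)

def elo_update(rating_a, rating_b, winner, k=32):
    """One standard Elo update; winner is 'A', 'B', or 'tie'."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = {"A": 1.0, "tie": 0.5, "B": 0.0}[winner]
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

print(win_rate(["A", "A", "B", "tie"]))   # ties excluded from the denominator
print(elo_update(1000, 1000, "A"))
```

Win rates are easier to explain; Elo handles intransitive preferences across many checkpoints more gracefully.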

To make pairwise reliable:
- randomize A/B position on every pair to counter position bias
- allow an explicit tie verdict so the judge isn't forced to pick
- fix a priority order across dimensions so verdicts stay consistent
- require JSON-only output so results can be aggregated automatically

Pairwise judge prompt (rubric-first):

You are evaluating two answers to the same question.
Choose the better answer.
Priority order: correctness > groundedness > safety > clarity.

Output JSON only: {"winner": "A", "rationale": "..."}  (winner must be "A", "B", or "tie")
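One way to harden these verdicts against position bias is to judge each pair twice with positions swapped and keep only consistent results. A sketch, with a toy length-preferring judge standing in for a real model call:

```python
def consistent_verdict(question, ans_a, ans_b, judge):
    """Judge the pair twice with positions swapped; disagreement becomes a tie."""
    first = judge(question, ans_a, ans_b)    # ans_a shown in position A
    second = judge(question, ans_b, ans_a)   # swapped order
    # Map the swapped verdict back into the original A/B frame.
    second_mapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_mapped else "tie"

def toy_judge(question, first, second):
    # Toy stand-in for an LLM judge: prefers the longer answer.
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"

def biased_judge(question, first, second):
    # Pathological judge that always prefers position A.
    return "A"

print(consistent_verdict("q", "a long detailed answer", "short", toy_judge))
print(consistent_verdict("q", "answer one", "answer two", biased_judge))
```

With the purely position-biased judge, the swapped runs contradict each other and the pair collapses to a tie, which is exactly the behavior you want.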

Judge Prompt Design

The judge prompt is the most important lever. A vague prompt produces noisy, unreliable scores. A well-structured prompt locks in the rubric, specifies the output format, and gives the judge the reference context it needs to evaluate groundedness.

Groundedness-focused judge prompt template:

System: You are a strict evaluator. Score from 0 to 5.
Rules:
- Only use the provided REFERENCE to judge factual correctness.
- If the ANSWER claims facts not supported by REFERENCE, penalize heavily.
- Output JSON only. Required keys: score (0-5 integer), rationale (string), unsupported_claims (array of strings).

User:
QUESTION:
<question>

REFERENCE:
<snippets or retrieved passages>

ANSWER:
<candidate answer>
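In code, the template above might be assembled and the judge's output validated like this; `build_messages` and `parse_verdict` are hypothetical helper names, and the strict validation rejects malformed verdicts instead of silently accepting them:

```python
import json

def build_messages(question, reference, answer):
    """Assemble the groundedness judge prompt from the template."""
    system = (
        "You are a strict evaluator. Score from 0 to 5.\n"
        "Rules:\n"
        "- Only use the provided REFERENCE to judge factual correctness.\n"
        "- If the ANSWER claims facts not supported by REFERENCE, penalize heavily.\n"
        "- Output JSON only. Required keys: score (0-5 integer), "
        "rationale (string), unsupported_claims (array of strings)."
    )
    user = f"QUESTION:\n{question}\n\nREFERENCE:\n{reference}\n\nANSWER:\n{answer}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def parse_verdict(raw):
    """Validate the judge's JSON verdict; raise ValueError if malformed."""
    v = json.loads(raw)
    if not (isinstance(v.get("score"), int) and 0 <= v["score"] <= 5):
        raise ValueError("score must be an integer in [0, 5]")
    if not isinstance(v.get("rationale"), str):
        raise ValueError("rationale must be a string")
    if not isinstance(v.get("unsupported_claims"), list):
        raise ValueError("unsupported_claims must be an array")
    return v
```

Failing loudly on malformed output matters: a judge that occasionally emits prose instead of JSON will otherwise silently corrupt your aggregate scores.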

Calibration tips:
- keep a fixed gold set with human labels and re-run it after any judge prompt or model change
- include calibration examples in the prompt, including short answers that score full marks
- spot-check judge verdicts against human labels on a regular cadence and track agreement over time

Pitfalls

Verbosity bias — judge models prefer longer, more detailed answers even when a shorter answer is correct and sufficient. In one production eval, a 3-sentence correct answer scored 3/5 while a 12-sentence answer with minor inaccuracies scored 4/5. Mitigation: add a conciseness dimension to the rubric, include calibration examples where short answers score full marks, and cap acceptable length in the judge prompt.

Position bias in pairwise — when the same answer appears as A in one run and B in another, judges prefer whichever position they see first. In a 100-pair experiment, answer A won 62% of the time regardless of content. Mitigation: always randomize A/B order and verify that win-rates are symmetric (within 5% tolerance). If bias persists, run each pair twice (swapped) and take the consistent verdict.

Prompt sensitivity — small wording changes in the judge prompt can shift scores by 0.5-1.0 points on a 5-point scale. Changing "evaluate correctness" to "grade factual accuracy" produced a 0.7-point average shift in one eval pipeline. Mitigation: lock the judge prompt in version control, run regression checks when you change it, and treat prompt changes like code changes with tests and review.

Hidden coupling (self-preference) — if the judge model is the same model or fine-tune as the candidate, it rewards its own style and penalizes outputs from other models. A Claude judge gave Claude answers 0.8 points higher on average than GPT-4 answers of equivalent human-rated quality. Mitigation: use a different model family for judging, or validate judge scores against human labels on a diverse multi-model sample.

Calibration drift — judge behavior shifts when the underlying model receives updates. A model update that improved reasoning also made the judge stricter on formatting, causing 15% more failures on an unchanged golden set. Mitigation: maintain a fixed gold dataset with known human labels and re-run calibration after every model update. Alert if agreement with human labels drops below 80% on binary pass/fail.
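The agreement check from that mitigation might look like this, assuming binary pass/fail labels from both the judge and human raters; the 80% threshold mirrors the alert rule above:

```python
def agreement(judge_labels, human_labels):
    """Fraction of cases where the judge's pass/fail matches the human label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def needs_recalibration(judge_labels, human_labels, threshold=0.8):
    """Alert when agreement with human labels drops below the threshold."""
    return agreement(judge_labels, human_labels) < threshold

judge = [True, True, False, True]
human = [True, False, False, True]
print(agreement(judge, human))
print(needs_recalibration(judge, human))
```

Running this after every model update on the fixed golden set turns calibration drift from a silent failure into a visible alert.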

Questions

References


What's next