LLM-as-a-Judge

Intro

LLM-as-a-judge is an evaluation pattern where one model grades another model's output against an explicit rubric. It's useful for scalable, semantics-aware regression testing when human labels are expensive or slow. The judge reads the question, the candidate answer, and optionally a reference context, then returns a structured verdict.

Two judging modes cover most use cases. Absolute scoring (rubric scorecards) assigns a numeric score per dimension, like correctness 0-2 or groundedness 0-5. Relative preference (pairwise comparisons) shows the judge two candidate answers side-by-side and asks which is better. Absolute scoring works when you need hard pass/fail thresholds. Pairwise works when quality is subjective or you're iterating quickly and care about "better than baseline" more than a specific number.

The core workflow: define a rubric, write a judge prompt that enforces it, run the judge at scale, and periodically spot-check its verdicts against human labels to catch drift.
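The run-at-scale step can be sketched in Python. Here `call_judge` is a hypothetical stand-in for a real LLM API call, stubbed with a fixed verdict so the loop structure is visible:

```python
import json

def call_judge(question: str, answer: str, reference: str) -> str:
    # Hypothetical judge call; in practice this wraps an LLM API request
    # built from the judge prompt. Stubbed verdict for illustration.
    return json.dumps({"score": 2, "rationale": "matches reference"})

def evaluate(cases):
    """Run the judge over a test set and collect structured verdicts."""
    verdicts = []
    for case in cases:
        raw = call_judge(case["question"], case["answer"], case["reference"])
        verdicts.append(json.loads(raw))
    return verdicts

cases = [{"question": "Refund window?",
          "answer": "30 days.",
          "reference": "Refunds are accepted within 30 days."}]
print(evaluate(cases))
```

Keeping the verdicts as structured records makes the later spot-check against human labels straightforward.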

Rubric Scorecards

Rubric scorecards measure multiple dimensions of an LLM output using a small, consistent scale with clear scoring anchors. Each dimension gets its own score so you can see exactly where a response fails.

Good rubrics:
- use a small, consistent scale (0-2 or 0-5) with a written anchor for every score
- score one dimension per axis, so a failure is attributable to a specific dimension
- define pass/fail in terms of the anchors, not the judge's overall impression

Common dimensions:
- correctness: is the answer factually and procedurally right?
- groundedness: are the answer's claims supported by the reference?
- safety: does the answer avoid unsafe content and policy violations?
- clarity and conciseness: is the answer readable and no longer than it needs to be?

Scorecard example (0-2 scale) for a support assistant:

Correctness:
0: wrong policy / wrong action
1: partially correct
2: correct

Groundedness:
0: unsupported claims
1: mixed or unclear
2: all key claims supported by sources

Safety:
0: unsafe or policy violation
1: questionable
2: safe
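A scorecard like this can gate releases with per-dimension minimums. A minimal sketch, assuming the judge's verdict arrives as a dict keyed by dimension; the thresholds below are illustrative, not prescribed:

```python
def passes(scorecard, thresholds=None):
    """Pass only if every dimension meets its minimum score.

    Default thresholds are an illustrative choice: perfect marks
    required on all three dimensions of the 0-2 scale.
    """
    if thresholds is None:
        thresholds = {"correctness": 2, "groundedness": 2, "safety": 2}
    # A missing dimension counts as 0, i.e. an automatic failure.
    return all(scorecard.get(dim, 0) >= minimum
               for dim, minimum in thresholds.items())

print(passes({"correctness": 2, "groundedness": 2, "safety": 2}))
print(passes({"correctness": 1, "groundedness": 2, "safety": 2}))
```

Because each dimension is scored separately, a failing response tells you which dimension to fix rather than just that it failed.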

Pairwise Comparisons

Pairwise comparisons evaluate two candidate outputs side-by-side and pick the better one. Humans and judge models are generally better at relative preference than absolute scores, which makes pairwise more reliable when quality is subjective or multi-dimensional.

Pairwise results aggregate naturally into rankings: win-rate percentages or Elo-style ratings across a test set. This makes it easy to compare prompt versions or model checkpoints without needing a fixed numeric threshold.
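Both aggregations are small to implement. A sketch: `win_rate` counts decided (non-tie) pairs, and `elo_update` is the standard Elo formula (the K-factor of 32 is a conventional but arbitrary choice):

```python
def win_rate(results, side="A"):
    """Fraction of non-tie comparisons won by `side`."""
    decided = [r for r in results if r != "tie"]
    if not decided:
        return 0.0
    return sum(r == side for r in decided) / len(decided)

def elo_update(rating_a, rating_b, winner, k=32):
    """One standard Elo update; winner is 'A', 'B', or 'tie'."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = {"A": 1.0, "tie": 0.5, "B": 0.0}[winner]
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

print(win_rate(["A", "A", "B", "tie"]))   # ties excluded from the denominator
print(elo_update(1000, 1000, "A"))
```

Win rates are easier to explain; Elo handles intransitive preferences across many checkpoints more gracefully.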

To make pairwise reliable:
- randomize A/B position on every pair to counter position bias
- allow an explicit tie verdict so the judge isn't forced to pick
- fix a priority order across dimensions so verdicts stay consistent
- require JSON-only output so results can be aggregated automatically

Pairwise judge prompt (rubric-first):

You are evaluating two answers to the same question.
Choose the better answer.
Priority order: correctness > groundedness > safety > clarity.

Output JSON only: {"winner": "A", "rationale": "..."}  (winner must be "A", "B", or "tie")
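One way to harden these verdicts against position bias is to judge each pair twice with positions swapped and keep only consistent results. A sketch, with a toy length-preferring judge standing in for a real model call:

```python
def consistent_verdict(question, ans_a, ans_b, judge):
    """Judge the pair twice with positions swapped; disagreement becomes a tie."""
    first = judge(question, ans_a, ans_b)    # ans_a shown in position A
    second = judge(question, ans_b, ans_a)   # swapped order
    # Map the swapped verdict back into the original A/B frame.
    second_mapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_mapped else "tie"

def toy_judge(question, first, second):
    # Toy stand-in for an LLM judge: prefers the longer answer.
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"

def biased_judge(question, first, second):
    # Pathological judge that always prefers position A.
    return "A"

print(consistent_verdict("q", "a long detailed answer", "short", toy_judge))
print(consistent_verdict("q", "answer one", "answer two", biased_judge))
```

With the purely position-biased judge, the swapped runs contradict each other and the pair collapses to a tie, which is exactly the behavior you want.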

Judge Prompt Design

The judge prompt is the most important lever. A vague prompt produces noisy, unreliable scores. A well-structured prompt locks in the rubric, specifies the output format, and gives the judge the reference context it needs to evaluate groundedness.

Groundedness-focused judge prompt template:

System: You are a strict evaluator. Score from 0 to 5.
Rules:
- Only use the provided REFERENCE to judge factual correctness.
- If the ANSWER claims facts not supported by REFERENCE, penalize heavily.
- Output JSON only. Required keys: score (0-5 integer), rationale (string), unsupported_claims (array of strings).

User:
QUESTION:
<question>

REFERENCE:
<snippets or retrieved passages>

ANSWER:
<candidate answer>
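In code, the template above might be assembled and the judge's output validated like this; `build_messages` and `parse_verdict` are hypothetical helper names, and the strict validation rejects malformed verdicts instead of silently accepting them:

```python
import json

def build_messages(question, reference, answer):
    """Assemble the groundedness judge prompt from the template."""
    system = (
        "You are a strict evaluator. Score from 0 to 5.\n"
        "Rules:\n"
        "- Only use the provided REFERENCE to judge factual correctness.\n"
        "- If the ANSWER claims facts not supported by REFERENCE, penalize heavily.\n"
        "- Output JSON only. Required keys: score (0-5 integer), "
        "rationale (string), unsupported_claims (array of strings)."
    )
    user = f"QUESTION:\n{question}\n\nREFERENCE:\n{reference}\n\nANSWER:\n{answer}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def parse_verdict(raw):
    """Validate the judge's JSON verdict; raise ValueError if malformed."""
    v = json.loads(raw)
    if not (isinstance(v.get("score"), int) and 0 <= v["score"] <= 5):
        raise ValueError("score must be an integer in [0, 5]")
    if not isinstance(v.get("rationale"), str):
        raise ValueError("rationale must be a string")
    if not isinstance(v.get("unsupported_claims"), list):
        raise ValueError("unsupported_claims must be an array")
    return v
```

Failing loudly on malformed output matters: a judge that occasionally emits prose instead of JSON will otherwise silently corrupt your aggregate scores.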

Calibration tips:
- keep a fixed gold set with human labels and re-run it after any judge prompt or model change
- include calibration examples in the prompt, including short answers that score full marks
- spot-check judge verdicts against human labels on a regular cadence and track agreement over time

Pitfalls

Verbosity bias — judge models prefer longer, more detailed answers even when a shorter answer is correct and sufficient. In one production eval, a 3-sentence correct answer scored 3/5 while a 12-sentence answer with minor inaccuracies scored 4/5. Mitigation: add a conciseness dimension to the rubric, include calibration examples where short answers score full marks, and cap acceptable length in the judge prompt.

Position bias in pairwise — when the same answer appears as A in one run and B in another, judges prefer whichever position they see first. In a 100-pair experiment, answer A won 62% of the time regardless of content. Mitigation: always randomize A/B order and verify that win-rates are symmetric (within 5% tolerance). If bias persists, run each pair twice (swapped) and take the consistent verdict.

Prompt sensitivity — small wording changes in the judge prompt can shift scores by 0.5-1.0 points on a 5-point scale. Changing "evaluate correctness" to "grade factual accuracy" produced a 0.7-point average shift in one eval pipeline. Mitigation: lock the judge prompt in version control, run regression checks when you change it, and treat prompt changes like code changes with tests and review.

Hidden coupling (self-preference) — if the judge model is the same model or fine-tune as the candidate, it rewards its own style and penalizes outputs from other models. A Claude judge gave Claude answers 0.8 points higher on average than GPT-4 answers of equivalent human-rated quality. Mitigation: use a different model family for judging, or validate judge scores against human labels on a diverse multi-model sample.

Calibration drift — judge behavior shifts when the underlying model receives updates. A model update that improved reasoning also made the judge stricter on formatting, causing 15% more failures on an unchanged golden set. Mitigation: maintain a fixed gold dataset with known human labels and re-run calibration after every model update. Alert if agreement with human labels drops below 80% on binary pass/fail.
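The agreement check from that mitigation might look like this, assuming binary pass/fail labels from both the judge and human raters; the 80% threshold mirrors the alert rule above:

```python
def agreement(judge_labels, human_labels):
    """Fraction of cases where the judge's pass/fail matches the human label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def needs_recalibration(judge_labels, human_labels, threshold=0.8):
    """Alert when agreement with human labels drops below the threshold."""
    return agreement(judge_labels, human_labels) < threshold

judge = [True, True, False, True]
human = [True, False, False, True]
print(agreement(judge, human))
print(needs_recalibration(judge, human))
```

Running this after every model update on the fixed golden set turns calibration drift from a silent failure into a visible alert.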

Questions

References


What's next