# Deterministic Checks
## Intro

Deterministic checks are non-LLM tests that validate LLM outputs against strict rules: schema validity, required fields, safety rules, and tool/policy constraints. They are cheap (microseconds), deterministic (the same input always gives the same result), and should run on every evaluation before any LLM judge. They catch obvious failures early, reserving expensive LLM-as-judge calls for semantic quality.
## Types of Deterministic Checks
| Check type | What it validates | Example |
|---|---|---|
| Schema validation | Output is parseable and matches the expected structure | JSON Schema, required fields, no extra fields |
| Allowlist enforcement | Only permitted actions/tools are invoked | `action` must be one of `["search", "escalate"]` |
| Citation rules | Factual answers must cite sources | Response contains at least one `[source]` reference |
| PII scanning | No personal data in output | No email addresses, SSNs, or phone numbers |
| Injection-resistant formatting | Output is safe to render | No `<script>` tags, no SQL injection patterns |
| Length constraints | Output is within expected bounds | Response is 10–500 characters |
| Language/encoding | Output is in the expected language and encoding | UTF-8, English only |
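Several of these check types reduce to a few lines each. The sketch below implements three of them; the regexes and thresholds are illustrative assumptions, not a production-grade PII or injection scanner:

```python
import re

# Illustrative patterns only; a real PII/injection scanner needs broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SCRIPT_TAG_RE = re.compile(r"<\s*script", re.IGNORECASE)

def check_pii(text: str) -> bool:
    """PII scanning: fail if an email address or SSN appears in the output."""
    return not (EMAIL_RE.search(text) or SSN_RE.search(text))

def check_safe_markup(text: str) -> bool:
    """Injection-resistant formatting: fail on <script> tags."""
    return not SCRIPT_TAG_RE.search(text)

def check_length(text: str, lo: int = 10, hi: int = 500) -> bool:
    """Length constraint: output must fall within expected bounds."""
    return lo <= len(text) <= hi
```

Each function returns `True` on pass, so checks compose with `all(...)` and a failure can be attributed to a specific rule.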
## Example — JSON Schema Contract

```json
{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "action": {"type": "string", "enum": ["search", "escalate"]},
    "reason": {"type": "string", "minLength": 1},
    "citations": {"type": "array", "items": {"type": "string"}}
  },
  "required": ["action", "reason"]
}
```
Any output that fails this schema is rejected immediately — no LLM judge needed.
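In practice a JSON Schema library would enforce this contract; for illustration, here is a minimal hand-rolled validator mirroring the same rules with only the standard library (a sketch, not a general schema engine):

```python
import json

ALLOWED_ACTIONS = {"search", "escalate"}

def validate_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["not a JSON object"]
    errors = []
    for key in ("action", "reason"):          # "required"
        if key not in data:
            errors.append(f"missing required field: {key}")
    extra = set(data) - {"action", "reason", "citations"}
    if extra:                                  # "additionalProperties": false
        errors.append(f"unexpected fields: {sorted(extra)}")
    if data.get("action") not in ALLOWED_ACTIONS:  # "enum"
        errors.append("action not in allowlist")
    if not isinstance(data.get("reason"), str) or not data.get("reason"):
        errors.append("reason must be a non-empty string")  # "minLength": 1
    citations = data.get("citations", [])
    if not isinstance(citations, list) or not all(isinstance(c, str) for c in citations):
        errors.append("citations must be an array of strings")
    return errors
```

Because the validator returns every violation rather than just the first, evaluation logs can show exactly which rules an output broke.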
## Where Deterministic Checks Fit in the Evaluation Pipeline

```
LLM Output
    │
    ▼
[1] Deterministic checks   ← fast, cheap, run first
    │  FAIL → reject immediately
    │  PASS
    ▼
[2] LLM-as-judge           ← slow, expensive, run only on valid outputs
    │  FAIL → flag for review
    │  PASS
    ▼
[3] Human review           ← for high-stakes or ambiguous cases
```
Run deterministic checks first: malformed JSON or a disallowed action is a hard failure and does not need a judge.
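The gating above can be sketched as a single function. The check list, result shape, and stub judge are assumptions for illustration; the point is that the judge is only called on outputs that pass every hard rule:

```python
def evaluate(output, deterministic_checks, llm_judge):
    """Run hard checks first; only outputs passing all of them reach the judge."""
    for name, check in deterministic_checks:
        if not check(output):
            # Hard failure: reject immediately, never invoke the judge.
            return {"verdict": "reject", "stage": "deterministic", "failed": name}
    return {"verdict": llm_judge(output), "stage": "judge", "failed": None}

# Usage with two toy checks and a stub judge:
checks = [("non_empty", lambda s: bool(s.strip())),
          ("length", lambda s: len(s) <= 500)]
result = evaluate("", checks, llm_judge=lambda s: "pass")
# result["stage"] == "deterministic": the judge was never called
```

Recording which check failed (the `failed` field) makes rejection rates per rule easy to track across an evaluation run.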
## Deterministic Checks vs LLM-as-Judge
| Aspect | Deterministic checks | LLM-as-judge |
|---|---|---|
| Speed | Microseconds | Seconds |
| Cost | Near zero | LLM API cost per call |
| Determinism | Same input, same result | Non-deterministic across runs |
| What it measures | Format, structure, hard rules | Semantic quality, relevance, tone |
| False positives | Only from mis-specified rules, and then reproducibly | LLM can misjudge, and differently each run |
| Coverage | Only what you explicitly define | Open-ended quality dimensions |
Use both. Deterministic checks enforce hard constraints; LLM judges evaluate soft quality. Neither replaces the other.
## Pitfalls
### Over-Relying on Schema Validation Alone

**What goes wrong:** the team adds JSON schema validation and considers deterministic checks done. The output is structurally valid but semantically wrong: the `action` field is `"search"` when it should be `"escalate"`, and the schema allows both.

**Mitigation:** schema validation is necessary but not sufficient. Add allowlist checks (only permitted `action` values), citation rules (factual answers must cite sources), and PII scanning. Schema catches structure; business rules catch semantic violations.
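As one example of a business rule beyond schema validity, a citation check can require that any answer flagged as factual carries at least one bracketed source reference. The `[source]` tagging convention and the `is_factual` flag are assumptions for this sketch:

```python
import re

# Assumed convention: citations appear as bracketed references, e.g. [source: ...].
CITATION_RE = re.compile(r"\[[^\]]+\]")

def check_citations(answer: str, is_factual: bool) -> bool:
    """Citation rule: factual answers must contain at least one [source] reference."""
    if not is_factual:
        return True  # Rule only applies to factual claims.
    return bool(CITATION_RE.search(answer))
```

A schema-valid output with a well-formed `reason` field can still fail this rule, which is exactly the gap the mitigation describes.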
### Treating Deterministic Failures as Soft Warnings

**What goes wrong:** a deterministic check fails (PII detected in the output, a disallowed action invoked), but the team logs it as a warning and continues. The LLM judge then evaluates the output and may pass it.

**Mitigation:** deterministic check failures are hard failures. Reject the output immediately and do not pass it to the LLM judge. The pipeline order matters: deterministic checks first, LLM judge only on outputs that pass every hard rule.
### Forgetting to Check Tool Inputs, Not Just Outputs

**What goes wrong:** the team validates the final LLM response but not the tool calls the agent makes. The agent calls a `delete_record` tool that is not on the allowlist, and the check never fires because it only runs on the text response.

**Mitigation:** apply allowlist checks to every tool invocation, not just the final response. In an agentic system, each tool call is an action that needs validation.
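A sketch of allowlist enforcement over an agent's tool-call trace rather than its final text. The trace format (a list of dicts with a `"tool"` key) is an assumption; adapt it to whatever trace schema your agent framework actually emits:

```python
# Assumed allowlist, matching the schema example earlier in this section.
TOOL_ALLOWLIST = {"search", "escalate"}

def check_tool_calls(trace: list[dict]) -> list[str]:
    """Return the names of any tools invoked outside the allowlist.

    An empty list means every tool call in the trace was permitted.
    """
    return [call["tool"] for call in trace if call["tool"] not in TOOL_ALLOWLIST]
```

Running this over the full trace catches a disallowed `delete_record` call even when the final text response looks perfectly valid.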
## Questions

### Can deterministic checks replace an LLM judge?

No. They enforce format and hard rules, but they do not measure semantic correctness, relevance, or tone. A response can pass every schema check and still be factually wrong or unhelpful. Use deterministic checks as a fast pre-filter; use LLM judges for semantic quality.

### Which checks should a team implement first?

(1) Allowlisted tools/actions only, (2) strict output schema validation, (3) PII scanning on outputs, (4) injection-resistant formatting, (5) length constraints. These catch the most common failure modes cheaply, before any expensive evaluation.
## References

- OWASP Top 10 for LLM Applications: the canonical list of LLM security risks, including prompt injection, insecure output handling, and data leakage; each maps to a deterministic check category.
- OWASP LLM Prompt Injection Prevention Cheat Sheet: specific mitigations for prompt injection, including input validation and output sanitization.
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023): the paper that established LLM-as-judge as an evaluation method; useful for understanding what deterministic checks cannot cover.