Classification Evaluation
Intro
Classification evaluation is how you measure whether a model assigns the right label (or set of labels) to an input. In software terms: you want to quantify the failure modes (false alarms vs. misses), pick an operating point (a threshold), and prevent regressions when the data or model changes.
Precision, Recall, and F1 in one page
Confusion matrix first
Everything starts from four counts:
| | Actual positive | Actual negative |
|---|---|---|
| Predicted positive | TP | FP |
| Predicted negative | FN | TN |
- TP: you flagged positive and it really was positive.
- FP: false alarm.
- FN: miss.
- TN: correctly ignored.
The three formulas to remember
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * (precision * recall) / (precision + recall)
- Precision: from predicted positives, how many are truly positive.
- Recall: from real positives, how many you found.
- F1: one score that is high only when both precision and recall are high.
Memory hook:
- Precision is hurt by FP (false alarms).
- Recall is hurt by FN (misses).
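The three formulas are small enough to sanity-check in code. A minimal Python sketch (the zero-division guards are one common convention, not a standard):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note that TN never appears: precision, recall, and F1 all ignore true negatives, which is why they survive class imbalance better than accuracy (see Pitfalls below).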
Threshold tradeoff
```mermaid
flowchart LR
  L[Low threshold] --> M[More predicted positives]
  M --> R1[Recall usually up]
  M --> P1[Precision usually down]
  H[High threshold] --> F[Fewer predicted positives]
  F --> R2[Recall usually down]
  F --> P2[Precision usually up]
```
Real-world examples
Content moderation:
- Low threshold catches more unsafe posts, but blocks more safe posts.
- High threshold blocks fewer safe posts, but lets more unsafe posts pass.
Fraud detection:
- High recall means fewer fraud cases slip through.
- High precision means fewer legit users get flagged.
Worked example
Binary classifier on 100 cases:
TP = 32
FP = 8
TN = 50
FN = 10
precision = 32 / (32 + 8) = 0.80
recall = 32 / (32 + 10) = 0.76
F1 = 2 * (0.80 * 0.76) / (0.80 + 0.76) = 0.78
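You can reproduce these numbers with scikit-learn by expanding the counts back into label arrays (a sketch; with a real model you would pass your actual y_true/y_pred):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Expand the worked example's counts (TP=32, FP=8, TN=50, FN=10) into arrays.
y_true = [1] * 32 + [0] * 8 + [0] * 50 + [1] * 10
y_pred = [1] * 32 + [1] * 8 + [0] * 50 + [0] * 10

print(precision_score(y_true, y_pred))  # 0.8
print(recall_score(y_true, y_pred))     # 0.7619...
print(f1_score(y_true, y_pred))         # 0.7804...
```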
The same model at two thresholds:
| Threshold | TP | FP | FN | Precision | Recall |
|---|---|---|---|---|---|
| 0.30 | 90 | 60 | 10 | 0.60 | 0.90 |
| 0.80 | 55 | 10 | 45 | 0.85 | 0.55 |
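To build this kind of table for your own model, sweep thresholds over held-out scores. A sketch with synthetic stand-in data (y_val, scores, and the two thresholds are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Stand-in validation labels and positive-class scores.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=1000)
scores = np.clip(y_val * 0.6 + rng.normal(0.2, 0.25, size=1000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_val, scores)
for t in (0.30, 0.80):
    # Index of the first threshold >= t (clamped to stay in bounds).
    i = min(np.searchsorted(thresholds, t), len(thresholds) - 1)
    print(f"threshold={t:.2f}  precision={precision[i]:.2f}  recall={recall[i]:.2f}")
```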
Pitfalls
F1 hides asymmetric failures — an F1 of 0.78 could be precision 0.95 / recall 0.66 or precision 0.66 / recall 0.95. These have completely different operational impact. A fraud detection model with recall 0.66 misses a third of fraud cases — a $2M/year loss at a mid-size payment processor. Always report precision and recall separately alongside F1.
Comparing models at different thresholds — if Model A runs at threshold 0.3 and Model B at 0.7, comparing their precision/recall is meaningless. Fix the threshold policy first (e.g., "recall ≥ 0.95"), then compare precision at that fixed operating point. Better yet, compare PR-AUC or ROC-AUC for threshold-invariant comparison.
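For the threshold-invariant comparison, scikit-learn's average_precision_score (PR-AUC) and roc_auc_score evaluate the full curve at once. A sketch with placeholder scores for two hypothetical models:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(1)
y_val = rng.integers(0, 2, size=1000)
# Placeholder outputs for two hypothetical models on the same validation set.
scores_a = np.clip(y_val * 0.5 + rng.normal(0.25, 0.30, size=1000), 0, 1)
scores_b = np.clip(y_val * 0.4 + rng.normal(0.30, 0.30, size=1000), 0, 1)

for name, s in (("Model A", scores_a), ("Model B", scores_b)):
    print(name, "PR-AUC:", round(average_precision_score(y_val, s), 3),
          "ROC-AUC:", round(roc_auc_score(y_val, s), 3))
```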
Optimizing a single metric — a spam filter optimized purely for precision (blocking only obvious spam) lets 40% of spam through. The same filter optimized purely for recall blocks 15% of legitimate emails. Neither is deployable. Always optimize under a constraint: "maximize precision subject to recall ≥ X" (or vice versa).
Class imbalance distortion — accuracy is misleading on imbalanced datasets. A model that always predicts "not fraud" on a dataset with 0.1% fraud rate achieves 99.9% accuracy but catches zero fraud. Use precision/recall/F1 (which ignore TN) or balanced accuracy. In production, track per-class metrics separately.
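The accuracy trap is easy to reproduce. A sketch with a synthetic 0.1% fraud rate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 100,000 transactions, 0.1% fraud (100 positive cases).
y_true = np.zeros(100_000, dtype=int)
y_true[:100] = 1

# A degenerate "model" that always predicts not-fraud.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.999 -- looks great
print(recall_score(y_true, y_pred))    # 0.0   -- catches zero fraud
```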
Multi-class Averaging
When extending precision/recall to multi-class problems, the averaging method changes the number you see:
| Method | How it works | When to use |
|---|---|---|
| Macro | Average metric across classes equally | All classes are equally important regardless of size (e.g., rare disease detection) |
| Micro | Aggregate TP/FP/FN globally, then compute | Overall correctness matters most, classes are roughly balanced |
| Weighted | Average weighted by class support (count) | Classes have different sizes and you want proportional representation |
Decision rule: use macro-average when minority classes matter (medical, safety). Use micro-average for balanced datasets or when you want a single aggregate number. Use weighted-average for reporting to stakeholders who think in terms of "percentage of all predictions correct."
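scikit-learn exposes all three through the average parameter. A toy three-class sketch where the averages visibly disagree:

```python
from sklearn.metrics import f1_score

# Toy three-class data with an imbalanced minority class (class 2).
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2]

for avg in ("macro", "micro", "weighted"):
    print(avg, round(f1_score(y_true, y_pred, average=avg), 3))
# macro 0.73, micro 0.75, weighted 0.762 -- same predictions, three numbers
```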
Questions
When should you optimize for precision, and when for recall?
Optimize precision when false positives are expensive: blocking legitimate users (payment fraud → customer churn at $50/incident), creating costly manual review queues (content moderation team costs $35/hour per reviewer), or triggering expensive downstream actions (automated account lockouts). Optimize recall when misses are dangerous: fraud slipping through ($500 average loss per undetected case), safety violations in content moderation (regulatory fines), or medical screening (missed diagnoses). Operationalize by setting a hard constraint on the priority metric (e.g., "recall ≥ 0.95"), then maximizing the other. Freeze the threshold and monitor both metrics daily, with alerting on ≥5% drift.
How do you choose and maintain a production threshold?
1. Define the business constraint: "recall must be ≥ 0.95" or "FP rate must be ≤ 0.02."
2. Plot the precision-recall curve on a held-out validation set.
3. Find the threshold that satisfies your constraint while maximizing the complementary metric.
4. Validate on a separate golden test set (not the validation set used for selection).
5. Freeze the threshold in production config (not hardcoded).
6. Set up monitoring: if the metric drops ≥5% on weekly golden-set runs, trigger re-evaluation.
Tradeoff: a static threshold degrades as the data distribution shifts, so schedule quarterly threshold reviews or implement dynamic thresholding with confidence calibration.
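Steps 2 and 3 reduce to a few lines with precision_recall_curve. A sketch (the 0.95 recall floor, variable names, and stand-in data are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_val, scores, min_recall=0.95):
    """Highest-precision threshold whose recall still meets the floor."""
    precision, recall, thresholds = precision_recall_curve(y_val, scores)
    # precision/recall carry one extra trailing point with no threshold.
    feasible = recall[:-1] >= min_recall
    if not feasible.any():
        raise ValueError("no threshold satisfies the recall constraint")
    best = np.argmax(np.where(feasible, precision[:-1], -np.inf))
    return thresholds[best], precision[best], recall[best]

# Illustrative usage with stand-in validation data.
rng = np.random.default_rng(2)
y_val = rng.integers(0, 2, size=2000)
scores = np.clip(y_val * 0.5 + rng.normal(0.25, 0.2, size=2000), 0, 1)
t, p, r = pick_threshold(y_val, scores)
print(f"threshold={t:.3f}  precision={p:.3f}  recall={r:.3f}")
```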
Links
- Scikit-learn: Classification metrics — comprehensive reference for precision, recall, F1, confusion matrix, ROC, and multi-class variants with code examples.
- Google ML Crash Course: Accuracy, precision, recall — interactive tutorial with threshold visualization and worked examples.
- Precision-recall tradeoff in production ML (Eugene Yan) — practitioner perspective on choosing operating points, monitoring metric drift, and when simpler models with better-understood failure modes beat complex ones.
- Beyond Accuracy: Behavioral Testing of NLP Models (Ribeiro et al., ACL 2020) — introduces CheckList methodology for testing classification models beyond aggregate metrics, with per-capability precision/recall analysis.