Classification Evaluation

Intro

Classification evaluation is how you measure whether a model assigns the right label (or set of labels) to an input. In software terms: you want to quantify the failure modes (false alarms vs. misses), pick an operating point (a threshold), and prevent regressions when the data or model changes.

Precision, Recall, and F1 in one page

Confusion matrix first

Everything starts from four counts:

                     Actual positive   Actual negative
Predicted positive   TP                FP
Predicted negative   FN                TN

The three formulas to remember

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
F1        = 2 * (precision * recall) / (precision + recall)
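The three formulas translate directly into code. A minimal sketch (the `prf1` helper name is ours, not from any library), with guards for the zero-denominator edge cases the bare formulas don't handle:

```python
# Minimal sketch: precision, recall, and F1 from raw confusion counts.
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1); zero denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

Note that TN appears nowhere: precision, recall, and F1 all ignore true negatives, which is exactly why they stay informative on imbalanced data.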

Memory hook: precision asks "of everything we flagged, how much was right?"; recall asks "of everything that was actually there, how much did we find?"

Threshold tradeoff

flowchart LR
  L[Low threshold] --> M[More predicted positives]
  M --> R1[Recall usually up]
  M --> P1[Precision usually down]
  H[High threshold] --> F[Fewer predicted positives]
  F --> R2[Recall usually down]
  F --> P2[Precision usually up]
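The tradeoff in the diagram can be sketched numerically. The scores and labels below are made up for illustration; sweeping the threshold over them shows recall rising and precision falling as the threshold drops:

```python
# Sketch of the threshold tradeoff on hypothetical (score, label) pairs.
data = [(0.95, 1), (0.90, 1), (0.85, 0), (0.70, 1), (0.60, 0),
        (0.55, 1), (0.40, 0), (0.35, 1), (0.20, 0), (0.10, 0)]

def prf_at(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in data if s >= threshold and y == 1)
    fp = sum(1 for s, y in data if s >= threshold and y == 0)
    fn = sum(1 for s, y in data if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.3, 0.8):
    p, r = prf_at(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

On this toy data the low threshold catches every positive (recall 1.00) at lower precision, while the high threshold flips the tradeoff, matching the flowchart.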

Real world examples

Content moderation: a false positive takes down a legitimate post and frustrates the user; a false negative leaves harmful content up. Teams typically prioritize recall on the most harmful categories and accept some over-blocking.

Fraud detection: a false positive blocks a legitimate transaction; a false negative is money lost to fraud. The right balance depends on the relative cost of each, so the threshold is a business decision, not just a modeling one.

Worked example

Binary classifier on 100 cases:

TP = 32
FP = 8
TN = 50
FN = 10
precision = 32 / (32 + 8)  = 0.80
recall    = 32 / (32 + 10) = 0.76
F1        = 2 * (0.80 * 0.76) / (0.80 + 0.76) = 0.78
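The arithmetic above can be reproduced in a few lines, a quick way to sanity-check the worked example:

```python
# Reproduces the worked example's numbers from its confusion counts.
tp, fp, tn, fn = 32, 8, 50, 10
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.76 f1=0.78
```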

Same model family at two thresholds:

Threshold   TP   FP   FN   Precision   Recall
0.30        90   60   10   0.60        0.90
0.80        55   10   45   0.85        0.55

Pitfalls

F1 hides asymmetric failures — an F1 of 0.78 could be precision 0.95 / recall 0.66 or precision 0.66 / recall 0.95. These have completely different operational impact. A fraud detection model with recall 0.66 misses a third of fraud cases — a $2M/year loss at a mid-size payment processor. Always report precision and recall separately alongside F1.
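Because the harmonic mean is symmetric in its arguments, the two operating points above are literally indistinguishable by F1 alone, which a two-line check makes concrete:

```python
# F1 is symmetric: swapping precision and recall gives the same score.
def f1(p, r):
    return 2 * p * r / (p + r)

print(f"{f1(0.95, 0.66):.2f}")  # 0.78
print(f"{f1(0.66, 0.95):.2f}")  # 0.78
```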

Comparing models at different thresholds — if Model A runs at threshold 0.3 and Model B at 0.7, comparing their precision/recall is meaningless. Fix the threshold policy first (e.g., "recall ≥ 0.95"), then compare precision at that fixed operating point. Better yet, compare PR-AUC or ROC-AUC for threshold-invariant comparison.
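A minimal sketch of the fixed-operating-point comparison: sort predictions by score, sweep the implied thresholds, and among those meeting the recall floor report the best precision. The `precision_at_recall` helper and its inputs are illustrative, not from any library:

```python
# Sketch: fix a recall floor, then report precision at that operating point.
def precision_at_recall(scores, labels, min_recall):
    """Best precision over all thresholds whose recall >= min_recall."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    total_pos = sum(labels)
    best = 0.0
    tp = fp = 0
    # Each prefix of the sorted list corresponds to one candidate threshold.
    for score, y in pairs:
        tp += y
        fp += 1 - y
        if tp / total_pos >= min_recall:
            best = max(best, tp / (tp + fp))
    return best
```

Comparing two models via `precision_at_recall(..., min_recall=0.95)` answers "which model is more precise once both are forced to catch 95% of positives", which is a fair question; comparing them at their arbitrary default thresholds is not.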

Optimizing a single metric — a spam filter optimized purely for precision (blocking only obvious spam) lets 40% of spam through. The same filter optimized purely for recall blocks 15% of legitimate emails. Neither is deployable. Always optimize under a constraint: "maximize precision subject to recall ≥ X" (or vice versa).

Class imbalance distortion — accuracy is misleading on imbalanced datasets. A model that always predicts "not fraud" on a dataset with 0.1% fraud rate achieves 99.9% accuracy but catches zero fraud. Use precision/recall/F1 (which ignore TN) or balanced accuracy. In production, track per-class metrics separately.
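The always-negative pathology is easy to demonstrate. A sketch with a hypothetical 0.1% fraud rate:

```python
# Sketch: accuracy looks great on imbalanced data while recall is zero.
labels = [1] + [0] * 999    # 1000 transactions, 1 fraudulent (0.1% rate)
preds = [0] * 1000          # a "model" that always predicts "not fraud"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.3f} recall={recall:.1f}")  # accuracy=0.999 recall=0.0
```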

Multi-class Averaging

When extending precision/recall to multi-class problems, the averaging method changes the number you see:

Method     Formula                                      When to use
Macro      Average the metric across classes equally    All classes are equally important regardless of size (e.g., rare disease detection)
Micro      Aggregate TP/FP/FN globally, then compute    Overall correctness matters most; classes are roughly balanced
Weighted   Average weighted by class support (count)    Classes have different sizes and you want proportional representation

Decision rule: use macro-average when minority classes matter (medical, safety). Use micro-average for balanced datasets or when you want a single aggregate number. Use weighted-average for reporting to stakeholders who think in terms of "percentage of all predictions correct."
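The three averages can differ noticeably on the same predictions. A sketch on a hypothetical 3-class example, computing per-class precision and then each average by hand:

```python
# Sketch: macro vs micro vs weighted precision on a 3-class toy example.
from collections import Counter

y_true = ["a", "a", "a", "a", "b", "b", "c", "c", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]

classes = sorted(set(y_true))
support = Counter(y_true)  # class sizes, used by the weighted average

def per_class_precision(cls):
    tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)
    return tp / predicted if predicted else 0.0

per_cls = {c: per_class_precision(c) for c in classes}
macro = sum(per_cls.values()) / len(classes)
# In single-label multi-class, micro-averaged precision equals accuracy.
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weighted = sum(per_cls[c] * support[c] for c in classes) / len(y_true)
print(f"macro={macro:.2f} micro={micro:.2f} weighted={weighted:.2f}")
```

The small class "b" has precision 0.5; macro-averaging lets it drag the score down, while micro- and weighted-averaging mostly reflect the larger classes, which is the decision rule above in miniature.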

Questions


What's next