ROC-AUC and PR-AUC

Intro

ROC-AUC stands for Receiver Operating Characteristic Area Under the Curve. PR-AUC stands for Precision-Recall Area Under the Curve. Both are threshold-free metrics for binary classifiers: they summarize performance over every possible decision threshold instead of just one.

Use ROC-AUC for general ranking quality when classes are fairly balanced. Use PR-AUC for imbalanced data where positives are rare and false positives are expensive.

This note fits the evaluation stage of a machine learning workflow and is most relevant for tasks like binary classification and rare-event detection.

Deeper Explanation

Mental Model

Both curves come from sweeping a score threshold from strict to loose.

---
config:
  themeVariables:
    xyChart:
      plotColorPalette: "#9CA3AF, #EF4444, #22C55E"
---
xychart-beta
  title ROC curve intuition
  x-axis False positive rate 0 --> 1
  y-axis True positive rate 0 --> 1
  line [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
  line [0.0, 0.28, 0.46, 0.63, 0.79, 1.0]
  line [0.0, 0.65, 0.82, 0.90, 0.96, 1.0]

What this ROC diagram shows:

- Gray diagonal: a random classifier, which gains true positives and false positives at the same rate.
- Red curve: a weak model, only slightly above the diagonal.
- Green curve: a strong model that reaches a high true positive rate at a low false positive rate.
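The sweep itself is easy to make concrete. A minimal pure-Python sketch (toy labels and scores, not from any real dataset) that traces one (FPR, TPR) point per threshold, strict to loose:

```python
# Toy data: 4 positives, 4 negatives, scored by some classifier.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

def roc_points(labels, scores):
    """One (FPR, TPR) point per distinct threshold, strictest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    # float("inf") gives the (0, 0) corner where nothing is predicted positive.
    for t in [float("inf")] + sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The loosest threshold always lands at (1, 1): predict everything positive and you catch all positives and all negatives alike.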

---
config:
  themeVariables:
    xyChart:
      plotColorPalette: "#9CA3AF, #EF4444, #22C55E"
---
xychart-beta
  title PR curve intuition
  x-axis Recall 0 --> 1
  y-axis Precision 0 --> 1
  line [0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
  line [0.22, 0.14, 0.09, 0.06, 0.04, 0.03]
  line [1.0, 0.86, 0.68, 0.52, 0.33, 0.14]

What this PR diagram shows:

- Gray flat line at 0.02: a random classifier on 2% prevalence data; its precision equals the prevalence at every recall.
- Red curve: a weak model whose precision collapses quickly as recall grows.
- Green curve: a strong model that keeps precision high until recall gets large.
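The same threshold sweep produces PR points, and it also shows why the random baseline sits at the prevalence rather than at 0.5. A pure-Python sketch with toy data at 20% prevalence (the chart above assumes 2%):

```python
# Toy data: 2 positives in 10 rows, so prevalence = 0.2.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

def pr_points(labels, scores):
    """One (recall, precision) point per distinct threshold, strictest first."""
    pos = sum(labels)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((tp / pos, tp / (tp + fp)))
    return points

# A random classifier's precision at any recall is roughly the prevalence:
print("random-baseline precision =", sum(labels) / len(labels))
for recall, precision in pr_points(labels, scores):
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```

At the loosest threshold everything is flagged, so precision falls to exactly the prevalence while recall reaches 1.0.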

Area under the curve summarizes performance across all thresholds into a single number, so you can compare models without committing to one operating point.
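A standard equivalence makes ROC-AUC concrete: it equals the probability that a randomly chosen positive scores above a randomly chosen negative (ties counted as half). A pure-Python sketch with toy scores:

```python
def roc_auc(labels, scores):
    """ROC-AUC as the pairwise ranking probability (Mann-Whitney statistic)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ranked correctly -> 0.75
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

This is why ROC-AUC is a pure ranking metric: rescaling the scores monotonically changes nothing.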

In ML.NET, BinaryClassificationMetrics exposes both AreaUnderRocCurve and AreaUnderPrecisionRecallCurve directly after calling mlContext.BinaryClassification.Evaluate.

When to Use Which

Use this quick rule:

- Classes roughly balanced, or both classes matter: ROC-AUC.
- Positives rare, and the positive class is what you act on: PR-AUC.
- Threshold already fixed by the product: report point metrics like precision, recall, and F1 at that threshold instead.

Why this matters in production: with 1% positives, a model can post a ROC-AUC above 0.95 and still bury reviewers in false positives, because the false positive rate divides by the huge negative count. PR-AUC divides by predicted positives instead, so it reflects the alert quality reviewers actually experience.
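A small deterministic demonstration of that gap, in pure Python with toy data: 3 positives among 97 negatives, ranked 1st, 4th, and 10th by score. ROC-AUC looks excellent while PR-AUC (computed here as average precision, one common estimator) stays modest:

```python
# Toy ranking: positives at ranks 1, 4, and 10 out of 100 rows.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1] + [0] * 90
scores = [1.0 - 0.01 * i for i in range(len(labels))]  # strictly decreasing

def roc_auc(labels, scores):
    """Probability a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

def average_precision(labels, scores):
    """PR-AUC as average precision: mean precision at each positive's rank."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)

print(f"ROC-AUC:     {roc_auc(labels, scores):.3f}")          # 0.969
print(f"PR-AUC (AP): {average_precision(labels, scores):.3f}")  # 0.600
```

Only seven negatives outrank any positive, so ROC-AUC barely notices them; precision at the third positive is already down to 3/10, so average precision does.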

Example

ML.NET example that prints ROC-AUC and PR-AUC side by side:

using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 42); // fixed seed for a reproducible split

// Each row: a bool label plus a 20-feature vector (see ModelInput below)
var data = mlContext.Data.LoadFromTextFile<ModelInput>(
    "transactions.csv", hasHeader: true, separatorChar: ',');

// Hold out 30% of the rows for evaluation
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.3);

// Scale features to [0, 1], then train a linear logistic regression classifier
var pipeline = mlContext.Transforms
    .NormalizeMinMax("Features")
    .Append(mlContext.BinaryClassification.Trainers
        .SdcaLogisticRegression(labelColumnName: "Label", featureColumnName: "Features"));

var model = pipeline.Fit(split.TrainSet);
var predictions = model.Transform(split.TestSet);

// Evaluate exposes both AUCs plus threshold-dependent point metrics
var metrics = mlContext.BinaryClassification.Evaluate(predictions, labelColumnName: "Label");

Console.WriteLine($"ROC-AUC:  {metrics.AreaUnderRocCurve:F3}");
Console.WriteLine($"PR-AUC:   {metrics.AreaUnderPrecisionRecallCurve:F3}");
Console.WriteLine($"F1:       {metrics.F1Score:F3}");
Console.WriteLine($"Accuracy: {metrics.Accuracy:F3}");

// Input schema
public class ModelInput
{
    [LoadColumn(0)]
    public bool Label { get; set; }

    [LoadColumn(1, 20), VectorType(20)]
    public float[] Features { get; set; } = default!;
}

How to read the output:

- ROC-AUC: 0.5 is random ranking, 1.0 is perfect; compare models against each other, not just against 0.5.
- PR-AUC: compare it to the positive prevalence, which is the random baseline, not 0.5.
- F1 and Accuracy: point metrics at the default 0.5 probability threshold, so they move if you change the threshold while the AUCs do not.

Reading the Curves

Anchor points:

- ROC starts at (0, 0) under the strictest threshold and ends at (1, 1) under the loosest; the diagonal is a random ranker.
- PR typically starts at high precision and low recall, and ends at precision equal to the prevalence once recall reaches 1.

For threshold selection, look for the knee: the point where pushing recall further starts costing disproportionate precision (PR curve) or false positive rate (ROC curve).

Practical threshold tuning pattern:

- Compute precision and recall at each candidate threshold on validation data.
- Encode the business constraint, for example a minimum precision your reviewers can tolerate.
- Pick the loosest threshold that still satisfies the constraint, then confirm it on a held-out set.
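That pattern can be sketched in pure Python (toy labels and scores; the 0.80 precision floor is an assumed business constraint, not from the source):

```python
# Toy validation data: 5 positives, 5 negatives.
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

def pick_threshold(labels, scores, precision_floor=0.80):
    """Loosest threshold whose precision still meets the floor."""
    best = None
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        precision = tp / (tp + fp)
        recall = tp / sum(labels)
        if precision >= precision_floor:
            best = (t, precision, recall)  # keep loosening while it qualifies
    return best

t, p, r = pick_threshold(labels, scores)
print(f"threshold={t}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data the loop settles on 0.9: one step looser (0.85) admits a false positive and drops precision below the floor.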

Pitfalls

- ROC-AUC can stay high under extreme imbalance even when every realistic operating point floods you with false positives.
- PR-AUC depends on prevalence, so scores are not comparable across datasets with different positive rates.
- Both AUCs measure ranking only, not calibration; a model with a great AUC can still output badly miscalibrated probabilities.
- A single averaged number can hide a poor curve shape in the threshold region you actually operate in.

Tradeoffs

| Metric | Measures | Fits when | Misleads when |
| --- | --- | --- | --- |
| ROC-AUC | Ranking positives above negatives across all thresholds | Balanced-ish classes, you want a general ranking metric, you compare rankers | Extreme imbalance, you care about precision at a specific operating point |
| PR-AUC | Precision vs recall tradeoff for positives across thresholds | Rare positives, alerting and review pipelines, positive class is what matters | Prevalence changes between datasets, you need a globally comparable score |
| F1 | Single-point tradeoff of precision and recall at one threshold | You have a chosen threshold and want a simple alert quality number | Threshold is not fixed, costs are asymmetric, you care about probability quality |
| Log loss | Quality of predicted probabilities with heavy penalty for confident mistakes | You optimize calibrated probabilities, you compare probabilistic models | Labels are noisy, you only care about ranking not probability magnitude |
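The last row deserves a concrete example: two toy models with identical ranking (so identical ROC-AUC) can have very different log loss. A pure-Python sketch with made-up probabilities:

```python
import math

# One positive, one negative. Both models rank them correctly,
# but the second is wildly overconfident on the negative.
labels = [1, 0]
calibrated = [0.8, 0.2]
overconfident = [0.99, 0.9]

def log_loss(labels, probs):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(p) if y == 1 else math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

print(f"calibrated:    {log_loss(labels, calibrated):.3f}")     # 0.223
print(f"overconfident: {log_loss(labels, overconfident):.3f}")  # 1.156
```

Ranking metrics cannot see this difference; log loss punishes the confident mistake on the negative heavily.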

Questions


What's next