The fallacy of predict_proba
The first line of half the classification code shipped this year is some variant of this:
probs = model.predict_proba(X)
And then those numbers get treated as probabilities. They get fed into thresholds, into cost functions, into expected-value calculations, into A/B test reports.
They are not probabilities. They are numbers between 0 and 1.
The two are not the same thing, and the gap between them is where most “the model performed worse in production” stories come from.
What predict_proba actually returns
predict_proba is a method. Different models compute it different ways. For random forest classification it is the fraction of trees voting for the positive class. For logistic regression it is the sigmoid of the linear score. For a gradient boosting classifier it is the sigmoid of the sum of leaf values. For a neural network it is the softmax of the final layer.
None of those four numbers is, by construction, a probability in the sense most users want. They are transformations of model scores that happen to live in [0, 1] and happen to be monotone in the score.
That is not nothing. A higher predict_proba does mean the model is more confident, in the sense that more trees voted, or the linear score is more positive, or the logit is larger. Ranking by predict_proba is a sensible thing to do.
But “monotone in the score, lives in [0, 1]” is not the same as “matches the long-run frequency of being correct.” That stronger property is calibration, and it has to be measured, not assumed.
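A minimal sketch of that distinction, using made-up scores rather than any real model: two different squashes of the same raw scores give an identical ranking and very different numbers. The score values and the factor of 3 are assumptions picked purely for illustration.

```python
import math


def sigmoid(z):
    """Logistic squash: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def ranking(values):
    """Indices of `values` ordered from lowest to highest."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical raw model scores (not taken from any real classifier).
raw_scores = [-2.0, -0.5, 0.3, 1.2, 3.0]

# Two monotone transforms of the same scores. Both live in (0, 1).
squashed_a = [sigmoid(z) for z in raw_scores]
squashed_b = [sigmoid(3.0 * z) for z in raw_scores]

# The ordering is preserved, the numeric values are not.
same_ranking = ranking(squashed_a) == ranking(squashed_b)

for a, b in zip(squashed_a, squashed_b):
    print(f"{a:.3f}  {b:.3f}")
print("same ranking:", same_ranking)
```

Both columns are valid "numbers between 0 and 1" that rank the rows identically. At most one of them can match the observed frequencies, and nothing in the construction says which.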
What calibration actually means
Take a classifier that outputs a probability for each row. Group the test set by the predicted probability — say, all the rows where the model said 0.70. Look at how many of those rows were actually positive.
For a calibrated classifier, that fraction is 70%. Not approximately 70% across folds, not in expectation — 70% in the data you have, give or take sampling noise.
A miscalibrated classifier hits 70% predicted but, say, 55% actual in that bucket. It is over-confident in that region. The bucket at 0.95 might actually be positive 80% of the time. The score still ranks correctly — the 0.95 rows are more positive than the 0.70 rows — but the numerical interpretation is wrong.
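The bucketing described above can be sketched in a few lines. This is a bare-bones reliability table with equal-width bins; the function name, the bin count and the toy data are assumptions for illustration, not a reference implementation.

```python
def reliability_table(probs, labels, n_bins=10):
    """Bucket predictions into equal-width bins and compare the mean
    predicted probability with the observed positive rate in each bin."""
    sums = [0.0] * n_bins      # sum of predicted probabilities per bin
    positives = [0] * n_bins   # number of positive labels per bin
    counts = [0] * n_bins      # number of rows per bin
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[idx] += p
        positives[idx] += y
        counts[idx] += 1
    rows = []
    for idx in range(n_bins):
        if counts[idx] == 0:
            continue  # skip empty bins
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "mean_predicted": sums[idx] / counts[idx],
            "observed_rate": positives[idx] / counts[idx],
            "count": counts[idx],
        })
    return rows


# Toy data echoing the over-confident bucket described above:
# 20 rows scored 0.70, of which 11 are actually positive.
probs = [0.70] * 20
labels = [1] * 11 + [0] * 9
for row in reliability_table(probs, labels):
    print(row)
```

On real data you would pass held-out predictions and labels, and read the gap between `mean_predicted` and `observed_rate` bin by bin, keeping an eye on `count` because sparse bins are mostly noise.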
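The standard test for this is the Spiegelhalter Z statistic (Spiegelhalter, 1986). It compares the observed Brier score on a held-out set to the Brier score expected under perfect calibration. With predicted probabilities p_i and binary outcomes y_i:
Z = Σ_i (y_i − p_i)(1 − 2p_i) / sqrt( Σ_i (1 − 2p_i)² p_i (1 − p_i) )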
Under perfect calibration, Z ~ N(0, 1). |Z| > 1.96 rejects calibration at 5%; |Z| > 2.58 at 1%. On real-world classifiers, |Z| is often well above that — sometimes in the double digits — and predict_proba is the culprit.
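For readers who want to run the test themselves, here is a short sketch of the statistic in its usual form: the numerator is the observed Brier sum minus its expectation under perfect calibration, and the denominator is the standard deviation of that difference under the same assumption. The function name and the toy inputs are mine, chosen for illustration.

```python
import math


def spiegelhalter_z(probs, labels):
    """Spiegelhalter Z for binary outcomes.

    probs  : predicted probabilities of the positive class
    labels : observed outcomes, 0 or 1
    """
    # Observed Brier sum minus its expected value under perfect calibration.
    numerator = sum((y - p) * (1.0 - 2.0 * p) for p, y in zip(probs, labels))
    # Variance of that difference if each y were Bernoulli(p).
    variance = sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p) for p in probs)
    if variance == 0.0:
        raise ValueError("Z is undefined when every prediction is 0, 0.5 or 1")
    return numerator / math.sqrt(variance)


# Toy case 1: predictions of 0.8 with an observed rate of exactly 80%.
z_matched = spiegelhalter_z([0.8] * 5, [1, 1, 1, 1, 0])

# Toy case 2: predictions of 0.9 with an observed rate of only 50%.
z_overconfident = spiegelhalter_z([0.9] * 100, [1] * 50 + [0] * 50)

print(f"matched: {z_matched:.2f}, over-confident: {z_overconfident:.2f}")
```

The second toy case lands in the double digits, which is the kind of value the paragraph above describes.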
Why this matters specifically
Three places where the gap between “score in [0, 1]” and “calibrated probability” stops being academic:
Threshold decisions. “Flag fraud when P(fraud) > 0.05.” If predict_proba = 0.05 actually corresponds to a 12% empirical rate of fraud, your effective cutoff is 12%, not 5%: you are flagging far less than you think and missing the events you most need to catch.
Cost-sensitive scoring. “Expected revenue per impression = predicted CTR × price.” If predicted CTR is over-confident at the high end, the bid optimiser systematically over-pays for traffic that under-delivers.
Risk reporting. Any number that gets quoted to a non-technical stakeholder as “the probability of X” carries a calibration claim. If the underlying model isn’t calibrated, the number is fictional, and the fact that it lives in [0, 1] is what makes it sound credible.
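The cost-sensitive case is plain arithmetic, and a worked example makes the leak visible. All three numbers below (predicted CTR, actual CTR, price per click) are hypothetical values chosen for illustration.

```python
def expected_revenue_per_impression(ctr, price_per_click):
    """Expected revenue per impression = click-through rate x price."""
    return ctr * price_per_click


# Hypothetical numbers: the model says 10% CTR, the traffic delivers 6%.
predicted_ctr = 0.10
actual_ctr = 0.06
price_per_click = 2.00

planned = expected_revenue_per_impression(predicted_ctr, price_per_click)
realized = expected_revenue_per_impression(actual_ctr, price_per_click)
shortfall = planned - realized

print(f"planned {planned:.2f}, realized {realized:.2f}, shortfall {shortfall:.2f}")
```

Ranking is untouched in this scenario: the high-CTR impressions are still the best ones. What breaks is every decision that multiplies the score by money.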
What to do instead — what the empirical record says
The default advice in most ML tutorials is Platt scaling (a logistic squash on raw scores) or isotonic regression (a non-parametric monotone fit). Both are easy to apply on top of any classifier with a held-out calibration set.
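To make "non-parametric monotone fit" concrete, here is a small sketch of the pool-adjacent-violators idea behind isotonic regression. It is a teaching version under simplifying assumptions: it does not treat tied scores specially and it returns fitted values only for the calibration rows, with no interpolation to new scores. A production calibrator handles both.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators on (score, label) pairs.

    Returns the scores in ascending order and a non-decreasing sequence
    of fitted values, one per row, that best matches the labels in the
    least-squares sense.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = [float(labels[i]) for i in order]

    blocks = []  # each block is [sum_of_labels, row_count]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count

    fitted = []
    for label_sum, count in blocks:
        fitted.extend([label_sum / count] * count)
    return xs, fitted


# Toy calibration set: raw scores and the outcomes observed for them.
xs, fitted = isotonic_fit([0.4, 0.1, 0.3, 0.2], [1, 0, 0, 1])
print(list(zip(xs, fitted)))
```

The fit is a step function, which is why isotonic regression can overfit small calibration sets: each step is just the mean of a handful of labels.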
The empirical record on whether they actually help is more uncomfortable than the tutorials let on. In our recent benchmark — Classifier Calibration at Scale, Manokhin & Grønhaug, arXiv:2601.19944 — we evaluated five calibrators (Isotonic, Platt, Beta, Venn–Abers, Pearsonify) on 21 classifiers across the TabArena-v0.1 binary tasks with stratified 5-fold cross-validation, scoring with log-loss, Brier, Spiegelhalter Z and ECE. Two findings from that paper directly contradict the textbook framing:
Platt scaling exhibits weaker and less consistent effects than the more flexible alternatives.
Commonly used calibration procedures, most notably Platt scaling and isotonic regression, can systematically degrade proper scoring performance for strong modern tabular models. Reaching for Platt/isotonic on a well-tuned modern classifier can leave you worse off than the un-calibrated predict_proba.
The two methods that did the work in the benchmark:
Venn–Abers predictors achieved the largest average reductions in log-loss across the suite. Venn–Abers also outputs not a single calibrated probability but a pair — a lower and an upper bound — that bracket the true calibrated probability under exchangeability, with a frequentist validity guarantee that Platt and isotonic do not have. This is the principled choice when you need a coverage statement on the probability, not just an empirical improvement.
Beta calibration improved log-loss most frequently across tasks — i.e. it was the most-likely-to-help calibrator on any individual problem. Beta is a 3-parameter generalisation of Platt that fits an asymmetric S-curve; it is roughly as cheap as Platt to apply.
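For intuition on what the three Beta parameters do, here is a sketch of the calibration map in its usual published form: a sigmoid applied to a·ln(p) − b·ln(1 − p) + c. Only the map is shown; fitting a, b and c on a calibration set is left out, and the parameter values below are arbitrary illustrations rather than fitted ones.

```python
import math


def beta_calibration_map(p, a, b, c, eps=1e-12):
    """Map a raw score p in [0, 1] to a calibrated probability.

    a and b control the two tails separately (a != b gives an
    asymmetric curve); c shifts the whole curve.
    """
    p = min(max(p, eps), 1.0 - eps)  # keep the logs finite at 0 and 1
    z = a * math.log(p) - b * math.log(1.0 - p) + c
    return 1.0 / (1.0 + math.exp(-z))


# With a = b = 1 and c = 0 the map is the identity: scores pass through.
identity = beta_calibration_map(0.3, a=1.0, b=1.0, c=0.0)

# With a != b the curve is asymmetric: here a raw 0.5 is pulled down.
asymmetric = beta_calibration_map(0.5, a=2.0, b=1.0, c=0.0)

print(f"identity: {identity:.3f}, asymmetric: {asymmetric:.3f}")
```

Because the identity is inside the family, a Beta fit can in principle leave an already-calibrated model alone, which is one plausible reading of why it so rarely hurts.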
Two practical reads from the paper:
No method dominates uniformly. The calibration effect varies substantially across datasets and architectures. The benchmark exists precisely because picking a calibrator a priori is the wrong move.
The improvements in discrimination (AUC) from any of these calibrators are marginal — under a percentage point on average. The point of post-hoc calibration is the proper-scoring side, not the classification side.
Beyond post-hoc calibration there is a different option entirely:
Conformal classification. Don’t ask the model for a single calibrated probability at all. Ask it for a prediction set — the set of classes that the conformal procedure guarantees contains the true label with probability ≥ 1 − α under exchangeability. The score behind the set can still be a raw predict_proba; the conformal wrapper turns it into a decision-grade object. Applied Conformal Prediction spends most of chapter 3 on this construction.
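A minimal sketch of one common way to build such a set, split conformal prediction with the nonconformity score 1 − p(true class). This is an assumption about the construction, chosen because it is the shortest to write down; it is not necessarily the one used in the book. The function names and toy numbers are hypothetical.

```python
import math


def conformal_quantile(cal_probs, cal_labels, alpha):
    """Threshold on the score 1 - p(true class), from a calibration set.

    cal_probs  : list of per-class probability lists, one per row
    cal_labels : list of true class indices
    alpha      : target miscoverage rate, e.g. 0.1 for 90% coverage
    """
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration rows: include every class
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All classes whose score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


# Toy binary calibration set: nine rows of [P(class 0), P(class 1)].
cal_probs = [
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.5, 0.5],
    [0.6, 0.4], [0.05, 0.95], [0.15, 0.85], [0.7, 0.3],
]
cal_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

qhat = conformal_quantile(cal_probs, cal_labels, alpha=0.2)
print(prediction_set([0.9, 0.1], qhat))    # a confident row
print(prediction_set([0.55, 0.45], qhat))  # an ambiguous row
```

The confident row gets a single class and the ambiguous row gets both. The size of the set is the uncertainty statement, and it comes from held-out data rather than from trusting the raw score.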
These are not alternatives that compete one-for-one. The honest summary, on the evidence of Classifier Calibration at Scale and the conformal literature:
If you need a calibrated probability and you care about the proper-scoring guarantee, Venn–Abers is the default. Beta calibration is a reasonable cheaper alternative when you need a single number rather than a bracket.
Platt and isotonic are not safe defaults on modern tabular classifiers — they can degrade calibration as often as they help.
If you need a coverage statement on the outcome rather than the probability, conformal classification is the cleaner construction.
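To show where the Venn–Abers bracket comes from, here is a deliberately naive sketch of the inductive variant: fit an isotonic regression on the calibration set twice, once with the test row added as a negative and once as a positive, and read off the fitted value at the test score each time. It refits from scratch for every test row, which is fine for intuition and far too slow for production. It is not the implementation used in the benchmark, and all names and numbers are hypothetical.

```python
from collections import defaultdict


def isotonic_value_at(scores, labels, x):
    """Isotonic (pool-adjacent-violators) fit of labels on scores,
    returning the fitted value at score x. x must be one of the scores."""
    # Pool tied scores first so equal scores share one fitted value.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, y in zip(scores, labels):
        sums[s] += y
        counts[s] += 1

    blocks = []  # each block: [label_sum, row_count, member_scores]
    for s in sorted(sums):
        blocks.append([sums[s], counts[s], [s]])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            top = blocks.pop()
            blocks[-1][0] += top[0]
            blocks[-1][1] += top[1]
            blocks[-1][2] += top[2]

    for label_sum, count, members in blocks:
        if x in members:
            return label_sum / count
    raise ValueError("x is not among the scores")


def venn_abers_interval(cal_scores, cal_labels, test_score):
    """Return (p0, p1): the bracket for P(y = 1) at test_score."""
    scores = list(cal_scores) + [test_score]
    p0 = isotonic_value_at(scores, list(cal_labels) + [0], test_score)
    p1 = isotonic_value_at(scores, list(cal_labels) + [1], test_score)
    return p0, p1


# Toy calibration set of raw scores and observed outcomes.
cal_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 1, 0, 1, 1]
p0, p1 = venn_abers_interval(cal_scores, cal_labels, test_score=0.75)
print(f"p0 = {p0:.3f}, p1 = {p1:.3f}")
```

With six calibration rows the bracket is very wide, which is the honest answer: the width shrinks as the calibration set grows, so it doubles as a readout of how much evidence sits behind the number.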
The honest sentence
If you are using predict_proba and you have not measured calibration on held-out data, the numbers you are passing downstream are model-internal scores, not probabilities. They may behave like probabilities in some regions of the input space. They almost certainly do not behave like probabilities everywhere. Whether this matters for your specific application depends on whether the downstream decision needs the score to be a probability or only needs it to rank.
For most production systems, it needs to be a probability, and the gap is what’s quietly costing you accuracy points in deployment.
Where this goes
This is the headline post the Applied CP series was building toward. The deeper treatment — the empirical evidence across model families, the formal coverage proofs for Venn–Abers, and the conformal-classification construction — is in Applied Conformal Prediction:
https://valeman.gumroad.com/l/applied_conformal_prediction
And for forecasting practitioners specifically, Modern Forecasting Mastery on Maven covers calibration in the time-series setting:
https://maven.com/valeriy-manokhin/modern-forecasting-mastery
Thursday I write the paid counterpart: Bayesian credible intervals are not coverage intervals. Same family of confusion, different camp.
— Valeriy


building a simulation using a set of multinomial lightgbm boosters [IT thinks a sinister man in a telnyashka will hack us if we use catboost :D].
Doing CP calibration means selling the bosses on it and burning credibility capital, possible but not a first choice. I've been getting good results by picking HPs with an eval score that evenly weights 1. rank ordering performance and 2. calibration (replication of rates), both out-of-time. It is a panel/longitudinal data problem and I think I got lucky that the powerful age effect anchors calibration of rates on certain HP values.