Where CatBoost beats XGBoost and LightGBM — and what the book is honest about
This is the closing post of the two-week arc. I have spent six posts on what CatBoost does and why it does it. The right way to close is with the head-to-head — under what conditions does CatBoost win, where does it lose, and what does the book actually claim with evidence.
I want to do this carefully because the book is careful. Specifically: §9.6 of the book is not a populated head-to-head benchmark table. It is a “how to set up your own fair comparison” section with code. The book’s stance, quoting directly from §9.6:
“The honest answer is that no single library dominates across all regression tasks. Performance depends on dataset characteristics.”
The third-party benchmark numbers I will cite are from §1.6, which references published independent studies.
The architectural comparison the book makes (§3.9.1, Table 3.1)
That is the architectural comparison the book stakes out. Note it is qualitative, not numerical — and that is the right level of precision for a structural comparison.
What the independent benchmarks report (§1.6)
These are the numbers the book actually cites for “how do they compare in practice.”
Shmuel et al., 111 datasets [ref 57]:
- CatBoost: 19 best-score wins (17.1% of datasets), average rank 4.9, median rank 4
- LightGBM: 15 best-score wins (13.5%)
- XGBoost: 5 best-score wins (4.5%)
- In regression specifically: CatBoost best on 15 datasets, average rank 3.7
- Overall winner across the benchmark: AutoGluon (an ensemble that includes CatBoost) — 39 of 111 best scores (35.1%)
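For readers who want to see how "best-score wins" and "average rank" are typically derived from a per-dataset score table, here is a minimal sketch. The score table is made-up toy data with hypothetical dataset names, not results from any cited study, and the helper names are my own. Only the final win-share arithmetic uses the counts quoted above.

```python
from collections import Counter

# Hypothetical toy errors (lower is better). NOT data from any cited study.
toy_errors = {
    "dataset_a": {"catboost": 0.21, "lightgbm": 0.23, "xgboost": 0.25},
    "dataset_b": {"catboost": 0.40, "lightgbm": 0.38, "xgboost": 0.41},
    "dataset_c": {"catboost": 0.10, "lightgbm": 0.12, "xgboost": 0.11},
}


def wins_and_mean_rank(errors):
    """Count best-score wins and average the per-dataset ranks (ties not handled)."""
    wins = Counter()
    rank_sums = Counter()
    for per_model in errors.values():
        ordered = sorted(per_model, key=per_model.get)  # best model first
        wins[ordered[0]] += 1
        for rank, model in enumerate(ordered, start=1):
            rank_sums[model] += rank
    n_datasets = len(errors)
    mean_rank = {model: total / n_datasets for model, total in rank_sums.items()}
    return dict(wins), mean_rank


def win_share(n_wins, n_datasets):
    """Share of datasets won, in percent, rounded to one decimal."""
    return round(100 * n_wins / n_datasets, 1)


wins, mean_rank = wins_and_mean_rank(toy_errors)

# Sanity check of the percentages quoted above (counts out of 111 datasets).
cited_counts = {"CatBoost": 19, "LightGBM": 15, "XGBoost": 5, "AutoGluon": 39}
cited_shares = {name: win_share(k, 111) for name, k in cited_counts.items()}
print(wins, mean_rank, cited_shares)
```

A model can have few outright wins and still a good average rank, which is why the studies report both.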
McElfresh et al., NeurIPS’23, 176 datasets [ref 41]:
- Mean rank: CatBoost 5.06, XGBoost 6.38, LightGBM 7.80
- The paper compared 19 algorithms including MLPs, ResNets, and TabNet — GBDTs (including CatBoost) consistently led on datasets with irregular, heavy-tailed, or skewed feature distributions; neural networks occasionally won on large, dense, fully numeric datasets.
TabArena 2025 [ref 16]:
- A continuously-updated benchmark with controlled hardware and hyperparameter budgets.
- CatBoost was the top GBDT in the conventional tuning regime.
- Particularly strong on: high-cardinality categoricals, mixed-type features, moderate-to-high noise.
- LightGBM remained competitive in raw training speed; rankings tighten after post-hoc ensembling.
Ye et al. 2024, 300 datasets [ref 67]:
- Tree-based ensembles (including CatBoost) remained strong, reliable baselines.
- Recent tabular foundation models (TabPFN v2, TabICL) now match or surpass GBDTs on many tasks.
- CatBoost’s native categorical handling specifically noted as a methodological advantage that reduces preprocessing requirements.
Grinsztajn et al., NeurIPS’22, 45 datasets [ref 27]:
- Tree ensembles decisively outperformed deep learning on this benchmark.
- CatBoost was not in this specific study; the result validates the GBDT paradigm CatBoost is part of.
Kaggle Playground Series S4E10 [ref 56]:
- A single CatBoost model, trained with minimal feature engineering, won outright in this 2024 Loan Approval Prediction competition, outperforming ensembles of XGBoost. This is one cited example, not a generalisation about all Kaggle competitions.
Where each library is the right pick
CatBoost is the right default when:
- Your data has categorical features (which is most real-world tabular data). The native ordered-target-statistics encoding closes a leak that XGBoost and LightGBM users typically have to address externally.
- You want strong defaults: §9.6 of the book lists “minimal tuning required” as a CatBoost advantage.
- You need ordered boosting’s bias correction on small-to-medium datasets.
- Robust quantile regression is part of the workload (CatBoost’s multi-quantile loss).
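To make the first bullet concrete, here is a simplified sketch of the ordered target-statistic idea: each row's category is encoded using only the targets of rows that come before it in a random permutation, so a row's own label never feeds its own feature. The function name, the smoothing weight `a`, and the toy data are illustrative assumptions. CatBoost's actual implementation differs in detail (for example, it uses several permutations).

```python
import random


def ordered_target_stats(categories, targets, prior, a=1.0, seed=0):
    """Encode each row from the targets of earlier rows (in a random order) only."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # the artificial "time" order

    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + a * prior) / (n + a)
        # Only now does row i's own target become visible to later rows.
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded


# Hypothetical toy data.
cities = ["paris", "rome", "paris", "paris", "rome"]
y = [1.0, 0.0, 1.0, 0.0, 1.0]
prior = sum(y) / len(y)

enc = ordered_target_stats(cities, y, prior)

# Changing a row's own target leaves that row's encoding untouched.
y_changed = list(y)
y_changed[0] = 100.0
enc_changed = ordered_target_stats(cities, y_changed, prior)
print(enc[0] == enc_changed[0])
```

A plain (non-ordered) target mean would fail that last check, because row 0's label would be part of its own encoding.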
LightGBM is the right default when (per book §9.6):
- Training speed on large datasets with many features dominates the workflow.
- Leaf-wise growth can capture complex patterns that depth-bounded symmetric trees miss.
- Low memory footprint is important.
- Reference: Ke et al. introduced gradient-based one-side sampling and exclusive feature bundling specifically for this regime.
XGBoost is the right default when (per book §9.6):
- Community adoption and ecosystem maturity matter.
- You need flexible tree-construction methods (histogram, exact, approximate).
- Strong GPU implementation is required.
- Reference: Chen and Guestrin introduced the scalable tree-boosting system.
The book is unusually balanced here. It does not claim CatBoost wins everywhere — and the §1.6 third-party studies bear that out. CatBoost leads the GBDT ranking in three of the cited benchmarks (Shmuel, McElfresh, TabArena), but it is not always the absolute winner, and ensembles can beat any single library.
What you can and cannot extrapolate
Three things the cited evidence supports:
- On a population of ~100+ diverse tabular datasets, CatBoost has the strongest mean rank among GBDTs.
- On datasets with high-cardinality categorical features, CatBoost has a particular advantage from the ordered-target-statistics encoding.
- The architectural choices that drive these results are documented qualitatively in §3.9.1 Table 3.1.
Three things the cited evidence does not support, and that I would be wrong to claim:
- “CatBoost is X percentage points more accurate than XGBoost on your specific dataset.” Per-dataset variation is large. Measure it.
- “CatBoost inference is X times faster than LightGBM.” §1.6 and §3.7 of the book explicitly say the speedup factor depends on tree depth, hardware, implementation, and workload. Benchmark on your deployment.
- “CatBoost will win your Kaggle competition.” The book cites one specific 2024 Kaggle win (Playground Series S4E10). Generalising that to all competitions is not supported by anything in the book or the cited studies.
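Benchmarking inference on your own deployment can be sketched as below. The helper name `median_latency_ms` and the stand-in predictor are hypothetical. In practice you would pass each fitted model's `predict` method and a batch shaped like your production traffic, and run it on the production hardware.

```python
import statistics
import time


def median_latency_ms(predict, batch, warmup=3, repeats=20):
    """Median wall-clock time of predict(batch) in milliseconds."""
    for _ in range(warmup):  # warm caches before timing
        predict(batch)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(batch)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def dummy_predict(rows):
    """Stand-in for a real model's predict; replace with model.predict."""
    return [sum(row) for row in rows]


batch = [[float(i), float(i + 1)] for i in range(1000)]
latency = median_latency_ms(dummy_predict, batch)
print(f"median latency: {latency:.3f} ms")
```

The median is used rather than the mean so that a single slow run does not dominate the figure. Batch size matters as much as the library, so it is worth timing both single-row and batched calls.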
The honest closing
CatBoost is the result of three engineering decisions documented in §1.5:
- Solve the categorical-encoding leak inside the algorithm via ordered target statistics, not outside via preprocessing.
- Solve the residual-reuse leak inside the algorithm via ordered boosting, not outside via leave-one-out.
- Use a symmetric tree architecture for built-in regularisation and a branch-regular inference path.
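The “branch-regular inference path” of the symmetric-tree decision can be illustrated with a toy oblivious tree: every node at a given depth tests the same (feature, threshold) pair, so the leaf index is simply the bits of those comparisons. This is a didactic sketch with made-up splits and leaf values, not CatBoost's actual evaluator.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate a symmetric tree: one comparison per level, no data-dependent branching."""
    index = 0
    for feature, threshold in splits:
        # Each level contributes one bit to the leaf index.
        index = (index << 1) | int(x[feature] > threshold)
    return leaf_values[index]


# Hypothetical depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [1.0, 2.0, 3.0, 4.0]  # 2**depth leaves

pred_low = oblivious_tree_predict([0.2, 5.0], splits, leaf_values)    # bits 0,0
pred_mixed = oblivious_tree_predict([0.9, 5.0], splits, leaf_values)  # bits 1,0
pred_high = oblivious_tree_predict([0.9, 12.0], splits, leaf_values)  # bits 1,1
print(pred_low, pred_mixed, pred_high)
```

Because every sample performs the same sequence of comparisons, the loop is easy to vectorise. The same constraint limits how freely the tree can fit the data, which is the regularisation side of the trade.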
None of these is a silver bullet in isolation. The combination, as documented in the §1.6 third-party studies, produces a library that leads the GBDT ranking on aggregate tabular benchmarks and is particularly strong on data with categorical features.
For your problem specifically, the right move is the protocol from §9.6: train CatBoost, XGBoost, and LightGBM with identical preprocessing budgets, comparable hyperparameter searches, the same evaluation protocol, and measure the result. The book provides the code for the comparison. The conclusion is yours.
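A minimal sketch of that protocol's skeleton, under stated assumptions: the helper names (`kfold_indices`, `compare_models`) are hypothetical, and this is not the book's §9.6 code. It relies only on each model exposing `fit(X, y)` and `predict(X)`, and a trivial mean-predicting baseline stands in for the real libraries so the block runs on its own.

```python
import math
import random


def kfold_indices(n_rows, k=5, seed=0):
    """Return k (train_idx, test_idx) pairs; reused for every model."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def compare_models(factories, X, y, k=5, seed=0):
    """Fit each model on identical folds; report train RMSE, test RMSE, and their gap."""
    splits = kfold_indices(len(X), k, seed)
    report = {}
    for name, make_model in factories.items():
        train_scores, test_scores = [], []
        for train_idx, test_idx in splits:
            X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
            X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
            model = make_model()  # fresh model per fold
            model.fit(X_tr, y_tr)
            train_scores.append(rmse(y_tr, model.predict(X_tr)))
            test_scores.append(rmse(y_te, model.predict(X_te)))
        train = sum(train_scores) / k
        test = sum(test_scores) / k
        report[name] = {"train_rmse": train, "test_rmse": test, "gap": test - train}
    return report


class MeanBaseline:
    """Stand-in model (an assumption for this sketch): always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


# Hypothetical toy regression data.
X = [[float(i)] for i in range(50)]
y = [2.0 * row[0] + 1.0 for row in X]
report = compare_models({"mean_baseline": MeanBaseline}, X, y)
print(report)
```

To run the real comparison you would add factories for the three libraries' regressors, which all offer scikit-learn-style `fit` and `predict`, and most likely pass NumPy arrays or DataFrames rather than plain lists. Giving each library the same hyperparameter search budget is the other half of the protocol and is not shown here.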
Where the series leaves the audience
Six posts of detailed mechanism (ordered boosting, ordered target statistics, symmetric trees, tuning hierarchy, CatBoost + Conformal) plus one head-to-head architectural comparison. The next moves are practical:
- If you have an XGBoost or LightGBM model in production with categorical features, run a CatBoost equivalent and compare train-test gaps (not test metrics).
- If you build forecasting models, Modern Forecasting Mastery on Maven applies these ideas to time-series:
https://maven.com/valeriy-manokhin/modern-forecasting-mastery
- If you want the long-form reference, Mastering CatBoost: The Hidden Gem of Tabular AI is the source for this series. The companion book on uncertainty (the chapter 5 reference) is Applied Conformal Prediction:
https://valeman.gumroad.com/l/applied_conformal_prediction
Next week the series pivots back to conformal prediction proper, picking up where the Applied CP arc left off — starting with the “predict_proba fallacy” headline that the May 11 data refresh flagged as the highest-share-potential post in the queue.
— Valeriy


