Why CatBoost is the hidden gem of tabular AI (and what the benchmarks actually say)
If you train tabular models in 2026, the three libraries on your shortlist are XGBoost, LightGBM, and CatBoost. The first two get most of the airtime. CatBoost gets less, and not because it loses on the benchmarks. It does not.
This is the first post in a two-week series drawing from Mastering CatBoost — The Hidden Gem of Tabular AI. The series starts with the case the book makes in chapter 1: what CatBoost actually does differently, and what the independent benchmarks actually show.
Two problems gradient boosting did not solve
CatBoost, XGBoost, and LightGBM share a family name (gradient boosting) and a loss-minimisation framework. They diverge on two specific engineering decisions, and those decisions are where the difference shows up:
Categorical features. Real tabular data is full of them. Standard gradient boosting cannot consume them directly; the library has to encode them somehow.
Prediction shift. Gradient boosting on a finite sample reuses the same data points to compute residuals, gradients, and target statistics. Prokhorenkova et al. (the CatBoost paper, ref [48] in the book) showed this reuse produces a self-influence bias that standard cross-validation tends to under-detect.
XGBoost and LightGBM leave both problems to the user or to external preprocessing. CatBoost addresses them inside the algorithm. The rest of the design — symmetric trees, ordered target statistics, ordered boosting — follows from that one choice.
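To make the second problem concrete, here is a toy sketch (illustrative Python, not CatBoost's code) of how a global mean target encoding leaks the label. For a category that appears once, the encoding is the label:

```python
import numpy as np
import pandas as pd

# Toy illustration of the self-encoding leak, not CatBoost code.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "merchant": ["A"] * 50 + ["B"] * 50 + ["rare"],
    "y": rng.integers(0, 2, 101),
})
df.loc[df["merchant"] == "rare", "y"] = 1  # the lone "rare" row is positive

# Global mean target encoding: every row's own label participates in
# its own encoding. For a singleton category the encoding IS the label.
global_enc = df.groupby("merchant")["y"].transform("mean")
print(global_enc[df["merchant"] == "rare"].iloc[0])  # 1.0, pure leakage
```

The same mechanism biases frequent categories too, just more mildly: each row's label contributes 1/n to its own encoding.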
What the book’s chapter 1 argues, in three lines
From §1.5:
Ordered Boosting — gradient estimation that excludes the current example’s own label from its own approximation.
Ordered Target Statistics — categorical encoding computed using only data that precedes the current row under a permutation, eliminating the self-encoding leak.
Symmetric (Oblivious) Trees — at each depth, the same feature-and-threshold is applied to every leaf, producing a balanced lattice rather than a lopsided structure.
Each of those is a separate post in this series. The book’s case is that they were co-designed, and they do not deliver the same benefits in isolation.
What the independent benchmarks actually report
The book is careful here, and I want to be too. From §1.6:
Shmuel et al., 111 datasets [57]. CatBoost won the most best-score datasets among tree-based models: 19 of 111 (17.1%) versus LightGBM’s 15 (13.5%) and XGBoost’s 5 (4.5%). Average rank 4.9, median rank 4 across all datasets. In regression specifically, CatBoost was the best on 15 datasets with average rank 3.7. Note that AutoGluon (an ensemble that includes CatBoost) was the overall winner with 39 best-score datasets — single-library comparisons have their limits.
McElfresh et al., NeurIPS’23, 176 datasets [41]. CatBoost achieved the best mean rank among GBDTs: 5.06 versus XGBoost’s 6.38 and LightGBM’s 7.80. The paper compared 19 algorithms in total, including neural networks (MLPs, ResNets, TabNets).
TabArena 2025 [16]. A continuously updated living benchmark with controlled hardware and hyperparameter budgets. CatBoost is the top GBDT in the conventional tuning regime, with particular strength on high-cardinality categoricals, mixed-type features, and moderate-to-high-noise datasets.
Ye et al. 2024, 300 datasets [67]. Evaluated 40+ algorithms. GBDTs (including CatBoost) remain strong baselines, with the qualifier that recent tabular foundation models like TabPFN v2 and TabICL now match or surpass them on many tasks. CatBoost’s native categorical handling is called out as a methodological advantage.
One Kaggle case the book cites specifically [56]. Kaggle Playground Series S4E10 (Loan Approval Prediction, 2024): a single CatBoost model, trained with minimal feature engineering, won outright — outperforming ensembles of XGBoost. This is one cited example; the book does not claim it generalises to “all Kaggle competitions.”
The honest pattern from these studies: CatBoost is consistently among the strongest single GBDTs, especially when categorical features and mixed-type data are involved. It is not always the absolute winner; ensembles like AutoGluon and tabular foundation models can sometimes beat any single GBDT. But as a default for tabular ML, the numbers point to CatBoost as the right baseline to include.
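Acting on that default takes only a few lines. A minimal sketch with synthetic data; the column names and hyperparameters here are placeholder choices, not recommendations from the book. The point is that raw string categories go straight in via cat_features:

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool

# Tiny synthetic frame; swap in your own data. CatBoost consumes raw
# string categories directly, with no manual encoding step.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "merchant": rng.choice(list("ABCDEFGH"), n),
    "amount": rng.lognormal(3.0, 1.0, n),
})
y = (X["merchant"].isin(["A", "B"]) & (X["amount"] > 20)).astype(int)

train = Pool(X[:1500], y[:1500], cat_features=["merchant"])
valid = Pool(X[1500:], y[1500:], cat_features=["merchant"])

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,                # a single capacity dial (symmetric trees)
    eval_metric="AUC",
    verbose=False,
)
model.fit(train, eval_set=valid, early_stopping_rounds=50)
print(model.get_best_score())
```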
The three innovations in one paragraph each
Ordered Target Statistics. When CatBoost needs to encode a categorical feature (a postcode, a merchant ID), it does not compute a global mean over the full training set. It walks the data in a permuted order and uses only the rows that came before the current one. The current row never sees its own label. Manual target encoding, the alternative most XGBoost users bolt on externally, is one of the most common causes of silent leakage in tabular ML.
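A minimal sketch of the idea, assuming a single permutation and a simple smoothing prior (the shipped implementation uses multiple permutations and more machinery):

```python
import numpy as np

def ordered_target_stats(cats, y, prior=0.5, weight=1.0, seed=0):
    """Encode each row's category from the labels of earlier rows only."""
    perm = np.random.default_rng(seed).permutation(len(y))
    sums, counts = {}, {}
    enc = np.empty(len(y), dtype=float)
    for i in perm:                      # walk rows in a random order
        c = cats[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # encode row i using ONLY earlier rows of the same category,
        # plus a smoothing prior; row i's own label is excluded
        enc[i] = (s + weight * prior) / (n + weight)
        sums[c] = s + y[i]              # only now does row i's label enter
        counts[c] = n + 1
    return enc

cats = np.array(["A", "A", "B", "A", "B"])
y = np.array([1, 0, 1, 1, 0])
print(ordered_target_stats(cats, y))
```

Note the order of operations: row i is encoded first, then its label is folded into the running statistics; that single swap is what removes the leak demonstrated earlier.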
Ordered Boosting. Standard gradient boosting computes residuals using a model trained on the full dataset, including the row whose residual is being computed. That self-influence is what Prokhorenkova et al. characterised as prediction shift. CatBoost uses a permutation prefix for residual computation too — the residual for row i is computed from a model trained only on rows that came before i. No self-influence, no prediction-shift bias.
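Here is the same prefix discipline applied to residuals, in a deliberately tiny sketch where the "model" is just a running mean (real ordered boosting maintains per-prefix tree models, which is far more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=8)
perm = rng.permutation(len(y))

standard_resid = y - y.mean()          # every row influences its own prediction
ordered_resid = np.empty(len(y))
s, n = 0.0, 0
for i in perm:
    pred = s / n if n else 0.0          # "model" trained on the prefix only
    ordered_resid[i] = y[i] - pred      # row i never sees its own label
    s, n = s + y[i], n + 1

print(standard_resid)
print(ordered_resid)
```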
Symmetric (Oblivious) Trees. At each depth level of a CatBoost tree, every leaf is split on the same feature and the same threshold. The result is a balanced binary lattice with depth-many distinct splits (six splits at depth 6, irrespective of how many leaves the tree has). The book’s §1.5.3 lists the consequences: built-in regularisation, branch-regular inference paths that compile to bitwise operations on CPU, and easier capacity control through a single dial (depth). The book is explicit that the realised inference speedup “depends on tree depth, dataset characteristics, implementation, and hardware” — practitioners should measure their own deployment, not trust generic multipliers.
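A sketch of why the lattice is cheap to evaluate: with one (feature, threshold) pair per depth, prediction is d comparisons packed into a d-bit leaf index. The function and names here are illustrative, not CatBoost's internals:

```python
import numpy as np

def oblivious_predict(x, splits, leaf_values):
    """splits: one (feature_index, threshold) pair per depth level."""
    idx = 0
    for d, (f, t) in enumerate(splits):
        idx |= int(x[f] > t) << d       # set bit d if the split fires
    return leaf_values[idx]             # no per-node branching at all

splits = [(0, 0.5), (1, 1.0), (0, 2.0)]     # depth 3: 3 splits, 8 leaves
leaf_values = np.arange(8, dtype=float)     # toy leaf values
print(oblivious_predict(np.array([1.2, 0.3]), splits, leaf_values))
```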
When CatBoost is the wrong choice
The book has a whole section on this (§1.9, “But What About the No Free Lunch Theorem?”). Three cases where CatBoost is not the right default:
Very large, fully numeric datasets where training speed dominates. LightGBM’s leaf-wise growth and engineering optimisations were built for this regime.
Workflows tied to XGBoost-specific tooling. A team with deep XGBoost integrations (custom losses, distributed training, MLOps tooling) may not benefit from switching even if CatBoost edges out on accuracy.
When ensembles are an option. Per the Shmuel benchmark, the overall winner across 111 datasets was AutoGluon (an ensemble of ML and DL methods, 39/111 best scores), not any single GBDT. If your problem warrants the engineering cost of ensembling, you may not need to pick a single library.
The reference book is Mastering CatBoost — The Hidden Gem of Tabular AI — 12 chapters and an appendix that builds a MiniCatBoost from scratch.
For uncertainty quantification, the companion is Applied Conformal Prediction (CatBoost chapter 5 builds on it directly): https://valeman.gumroad.com/l/applied_conformal_prediction
For forecasting practitioners, Modern Forecasting Mastery on Maven applies the same ideas to time series: https://maven.com/valeriy-manokhin/modern-forecasting-mastery
— Valeriy

