Valeriy’s Substack

Byte-Pair Encoding Is Compression in Disguise

Valeriy Manokhin's avatar
Valeriy Manokhin
Dec 16, 2025
∙ Paid

Tokenization is usually introduced as a preprocessing trick.
In reality, it is one of the strongest inductive biases we impose on language models.

Byte-Pair Encoding (BPE) is a particularly revealing case.
What looks like a simple greedy algorithm turns out to sit at the intersection of information theory, Kolmogorov complexity, and Minimum Description Length.
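One way to make the MDL connection concrete: choosing a BPE vocabulary is a two-part coding problem, trading the cost of describing the merge table against the cost of the text once tokenized. A sketch of that decomposition, with notation assumed here rather than taken from the post:

```latex
% Two-part MDL cost of applying a tokenizer vocabulary V to a corpus D
% (symbols L, V, D are illustrative, not the post's own notation):
L(D) \;=\; \underbrace{L(V)}_{\text{cost of the merge rules}} \;+\; \underbrace{L(D \mid V)}_{\text{cost of the token sequence}}
```

On this reading, each greedy merge adds a small, roughly constant amount to the vocabulary term while shrinking the data term in proportion to how often the merged pair occurs; the familiar "number of merges" hyperparameter is where that trade-off is cut.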

This post explains why BPE works—not operationally, but conceptually.
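Before the conceptual argument, it helps to have the operational picture in front of us. The following is a minimal sketch of the textbook greedy BPE training step (count adjacent pairs, merge the most frequent, repeat); real tokenizer libraries differ in details like pre-tokenization and byte-level alphabets, which this sketch omits:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one (or None)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Greedy BPE: start from characters, repeatedly merge the best pair."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

text = "low low low lower lowest"
# The sequence gets shorter with each merge: compression in disguise.
print(len(list(text)), "->", len(bpe(text, 10)))
```

Note that merging never changes the underlying string, only its description: concatenating the tokens always reproduces the input exactly, which is what licenses reading the shrinking token count as a shrinking description length.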

User's avatar

Continue reading this post for free, courtesy of Valeriy Manokhin.

Or purchase a paid subscription.
© 2026 Valery Manokhin · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture