Byte-Pair Encoding Is Compression in Disguise
Tokenization is usually introduced as a preprocessing trick.
In reality, it is one of the strongest inductive biases we impose on language models.
Byte-Pair Encoding (BPE) is a particularly revealing case.
What looks like a simple greedy algorithm turns out to sit at the intersection of information theory, Kolmogorov complexity, and Minimum Description Length.
This post explains why BPE works: not operationally, but conceptually.
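Before the conceptual argument, it helps to have the greedy procedure concretely in mind. The following is a minimal sketch of the classic BPE training loop (the toy corpus, function names, and merge count are illustrative, not from any particular library):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across all words; return the commonest."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = pair[0] + pair[1]
    new_corpus = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus.append(out)
    return new_corpus

def bpe(corpus, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        corpus = merge_pair(corpus, pair)
        merges.append(pair)
    return corpus, merges

# Toy corpus: each word is a list of single-character symbols.
corpus = [list("low"), list("lower"), list("lowest")]
corpus, merges = bpe(corpus, 2)
```

Each iteration replaces the most frequent adjacent pair with a new symbol, which is exactly the substitution step of a dictionary compressor; the rest of the post unpacks that connection.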
