Transformers vs. Time — Deep Learning’s New Battle with Non-Stationarity
Subtitle: How attention-based models are redefining what it means to forecast in a changing world
Series: Forecasting Reality — Machine Learning in a Non-Stationary World (Part 5)
In the last post, we explored LSTMs — deep learning models designed for sequence data. They sidestep the i.i.d. assumption by design, and they’re great at learning dependencies over time.
But now, the forecasting landscape is shifting again.
Over the past few years, Transformer models — originally developed for natural language processing — have become the new stars of time series modeling. Their secret weapon? Self-attention, which can capture long-range dependencies in a way RNNs never could.
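To make "self-attention" concrete, here is a minimal numpy sketch of (unprojected) scaled dot-product attention applied to a toy multivariate series. It is a simplification for illustration — real Transformers add learned query/key/value projections, multiple heads, and positional encodings — but it shows the key idea: every time step attends directly to every other step, so step 1 and step 1000 are one matrix multiply apart, with no recurrence in between.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x of shape (T, d).

    Simplified sketch: queries, keys, and values are all x itself
    (no learned projections, no multi-head split).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (T, T) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ x                               # each step is a mix of all steps

# Toy series: 8 time steps, 4 features
rng = np.random.default_rng(0)
series = rng.normal(size=(8, 4))
out = self_attention(series)
print(out.shape)  # (8, 4): same shape, but every step now "sees" the whole series
```

Contrast this with an RNN, where information from step 1 must survive 999 sequential state updates to influence step 1000.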
But here’s the catch:
Transformers weren’t built for time series.
They’re powerful, but vanilla Transformers struggle with non-stationarity — shifting means, variances, and seasonal patterns — unless we make some key architectural changes.
Let’s explore how this battle is unfolding — and what it teaches us about forecasting in the real world.


