Transformers: What if time-series forecasting was as easy as predicting my next …
What will happen tomorrow? This is a central question of human existence. As children, we slowly learn that we are beings evolving in time. Philosophers like Heidegger argue that we are time; time is the very stuff we are made of.
Moving back to practical reality, some of the earliest recorded uses of writing include lunar observations, probably an effort to find patterns in nature and, maybe, to predict its future movements.
Now that we have calendars, we use predictions to solve other types of problems:
What will the weather be like tomorrow? What should I wear?
How many customers will come to my bar? How many kilograms of salads should I buy?
All of these questions can be solved by accurate time-series forecasting. The fact that we can use intelligent algorithms to answer some of humanity’s deepest questions is one of the reasons I love my job.
The formal definition of the time-series problem is the following:
Given a sequence of observations \(\{x_t\}_{t=1}^T\), where \(x_t \in \mathbb{R}\) represents the value at time \(t\), the goal is to predict future values \(\{x_{T+1}, x_{T+2}, \dots, x_{T+H}\}\) over a forecasting horizon \(H\), based on the historical data \(\{x_t\}_{t=1}^T\) and potentially additional covariates.
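To make this setup concrete, here is a minimal baseline forecaster in Python, a toy illustration rather than any of the models discussed below: it simply repeats the last observed value \(x_T\) across the whole horizon \(H\).

```python
import numpy as np

def naive_forecast(history: np.ndarray, horizon: int) -> np.ndarray:
    """Simplest possible forecaster: repeat the last observed value x_T
    for every one of the H future steps."""
    return np.full(horizon, history[-1])

# A toy series {x_t} with T = 8 observations.
x = np.array([3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 7.0, 8.0])
print(naive_forecast(x, horizon=3))  # → [8. 8. 8.]
```

Simple as it is, this "naive" forecast is the standard baseline that any more sophisticated model has to beat.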
Recent Developments
In the first seven years of their existence, Transformer models[1] have taken the world by storm. This article will try to go beyond and around the hype, into their impact on time-series forecasting. In parallel to Transformers, this article will look into the concept of meta-learning (learning about learning) and its applications to time-series problems (see this article about leveraging meta-information, i.e., information about a problem, to solve it).
In the 2010s, which now feel like an eternity away, the field of Automated Machine Learning (AutoML) became an independent area of research. The idea was to build algorithms that would search over the space of ML models and hyperparameters to find the model or ensemble best suited to a given prediction task (i.e., a dataset). Some of these AutoML algorithms leveraged ML models to search these spaces[2]. They were, in a sense, learning to learn. As a Computer Scientist obsessed with Artificial Intelligence, I find this kind of recursion and self-reference fascinating.
These approaches have already had a strong impact on time-series prediction, with AutoML and auto-time-series solutions bubbling up all over the industry (e.g., DataRobot: Time Series Modeling[3]).
In parallel, in 2017, the seminal Attention Is All You Need[4] paper showed the power of Transformers in understanding and generating natural language. Before going any further, it is important to remember that text is a sequence of words; this should remind you of the time-series problem definition. A Transformer's learning objective is also to predict the sequence of words over the \(T+H\) horizon (\(\{x_{T+1}, x_{T+2}, \dots, x_{T+H}\}\)), given the sequence \(\{x_t\}_{t=1}^T\). But instead of dealing with time, we are dealing with position in a sentence.
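The parallel can be made concrete with a small sketch: the same sliding-window routine that builds (context, next token) training pairs for language modelling also builds (history, next value) pairs for forecasting. The function below is a hypothetical illustration of that shared objective, not code from any of the cited papers.

```python
def sliding_windows(seq, context_len):
    """Turn one long sequence into (context, next element) training pairs,
    the shared supervised objective behind both language models and
    autoregressive time-series models."""
    pairs = []
    for t in range(context_len, len(seq)):
        pairs.append((seq[t - context_len:t], seq[t]))
    return pairs

# Works identically for token ids ...
print(sliding_windows([101, 7, 42, 9], context_len=2))
# → [([101, 7], 42), ([7, 42], 9)]

# ... and for a univariate time series.
print(sliding_windows([1.5, 1.7, 1.6, 1.9], context_len=2))
# → [([1.5, 1.7], 1.6), ([1.7, 1.6], 1.9)]
```

Only the meaning of an index changes: position in a sentence for text, time step for a series.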
Papers like An Image is Worth 16x16 Words[5] soon showed that Transformers could perform many other tasks, as their learnable attention mechanism could successfully identify complex patterns in any input represented as a sequence, even the pixels of an image. It is as if Transformers were able to learn the inductive biases (such as translation invariance) built into state-of-the-art Convolutional Neural Networks (CNNs).
Meta-Learning meets Transformers
Early time-series forecasting research focused on both auto-regressive models and the use of additional covariates. Many modern time-series approaches, like ES-RNN[6], started leveraging hierarchical data structures (product category > group > SKU) to learn across different sequences. As models grew more powerful and expressive, it made sense to move from local models (one model per product or location) to global models (one model for all products or locations).
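The local-versus-global distinction can be sketched numerically. Assuming three related series that share a common trend (a made-up example, not the ES-RNN setup itself), a local approach fits one model per series, while a global approach pools all observations into a single fit:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(24, dtype=float)

# Three related series (e.g. SKUs in one category) sharing a common slope of 0.8.
series = [5.0 + 0.8 * t + rng.normal(0, 0.3, 24) for _ in range(3)]

# Local: one slope estimate per series, each fit on its own 24 points.
local_slopes = [np.polyfit(t, s, 1)[0] for s in series]

# Global: one slope estimate pooled over all 72 points.
global_slope = np.polyfit(np.tile(t, 3), np.concatenate(series), 1)[0]

print(local_slopes, global_slope)  # all estimates land near the true slope 0.8
```

With related series, the global fit uses three times as much data for the shared structure, which is exactly the advantage that motivated global models.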
Yet, few would have anticipated just how expressive (read “large”) ML models were about to get. With the development of Large Language Models (LLMs) with hundreds of billions of parameters, it became possible to think big. If models could use all human-generated text stored on the internet as training data, what would prevent smaller models (but still quite large) from being trained on all publicly available time-series data? This is meta-learning at its finest, learning from a large amount of time-series data to predict more time-series data. Learning to learn…
Note: You may want to pause there. If you started Data Science in the 2010s or before, this is absolutely wild. It is good to remember to look back and be grateful for humanity’s capacity for innovation beyond what many would have thought possible.
Back to training over all publicly available time-series data: this is exactly what happened. Models now use information contained in all available time series to predict a single sequence. I like to think of it as a global-global model. Models like Amazon Chronos[7], published a few weeks ago, or TimeGPT-1[8], published in May this year, are leading the way to a new paradigm in time-series forecasting.
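The "global-global" idea can be caricatured in a few lines of numpy. The snippet below "pretrains" a single AR(1) coefficient on many synthetic series, then applies it zero-shot to a series it has never seen. Real foundation models like Chronos replace this one-parameter regression with a large Transformer, but the learning pattern is the same; everything here (the series generator, the coefficient estimate) is an invented toy.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_series(phi=0.9, n=200):
    """Toy AR(1) process: x_t = phi * x_{t-1} + noise."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0, 0.1)
    return x

# "Pretraining": pool lagged pairs from many series into one regression.
train = [make_series() for _ in range(50)]
X = np.concatenate([s[:-1] for s in train])
y = np.concatenate([s[1:] for s in train])
phi_hat = (X @ y) / (X @ X)  # least-squares AR(1) coefficient

# "Zero-shot forecasting": apply the shared dynamics to an unseen series.
new_series = make_series()
one_step_ahead = phi_hat * new_series[-1]
print(round(phi_hat, 2))  # close to the true coefficient 0.9
```

Because all series share the same dynamics, what the model learns from thousands of pooled observations transfers directly to sequences it was never trained on.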
Next <MASK>
For many Machine Learning practitioners, these models raise more questions than answers:
Is this it? Is time-series forecasting solved?
I doubt it; this is just the beginning. Many time-series problems are also very simple and could be approached by much simpler, explainable models. More on explainability further down.

What will state-of-the-art time-series modelling look like?

A lot of relevant time-series data is still private (e.g., a company's own operational or transactional data), and this data contains information that can be critical for a time-series prediction task. We may see a development similar to what happened in the field of Computer Vision: these days, very few ML practitioners train a CNN from scratch, and transfer learning has become the norm. The same may happen in time-series forecasting, with large pre-trained models fine-tuned for a given prediction task. But it is still too early to say.
Ok, but why?
Explainability was already a challenge in Machine Learning predictions, one only moderately alleviated by methods like SHAP[9] and LIME[10]. We are only scratching the surface of Transformer interpretability today. With the current pace of change in fields like Mechanistic Interpretability, this is an exciting research area.
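To make the explainability discussion concrete, here is a deliberately simple stand-in for those attribution methods: permutation importance, which scores a feature by how much the prediction error grows when that feature's column is shuffled. It is far cruder than SHAP or LIME, and the model and data below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: predictions depend strongly on feature 0, weakly on feature 1.
def model(X):
    return 3.0 * X[:, 0] + 0.1 * X[:, 1]

X = rng.normal(size=(500, 2))
y = model(X)

def permutation_importance(model, X, y):
    """Importance of feature j = increase in mean squared error after
    shuffling column j, breaking its relationship with the target."""
    base = np.mean((model(X) - y) ** 2)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        scores.append(np.mean((model(Xp) - y) ** 2) - base)
    return scores

print(permutation_importance(model, X, y))  # feature 0 dominates
```

Even this crude score recovers the right ranking here; what SHAP and LIME add is a principled, per-prediction attribution rather than a single global number.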
Final Thoughts
That was a lot, and all good things come to an end. Time-series forecasting is a fascinating field. We thought we had solved it by building algorithms to search the space of ML models and hyperparameters. We then discovered the power of Transformer models in sequence understanding and generation. Who knows what happens next?
Speaking of predictions, who could have forecasted that just five years ago?