What is it about?

Many real-world applications require precise and fast time-series forecasting. Recent trends in time-series forecasting models are shifting from LSTM-based models to Transformer-based models. However, Transformer-based models have a limited ability to represent sequential relationships in time-series data. In addition, they suffer from slow training and inference due to the bottleneck incurred by a deep encoder and step-by-step decoder inference. To address these problems, we propose a Transformer model optimized for time-series forecasting, called TS-Fastformer. TS-Fastformer introduces three new optimizations. First, we propose a Sub Window Tokenizer for compressing input in a simple manner. The Sub Window Tokenizer reduces the length of input sequences to mitigate the complexity of self-attention and enables both single and multi-sequence learning. Second, we propose a Time-series Pre-trained Encoder to extract effective representations through pre-training. This optimization enables TS-Fastformer to capture both seasonal and trend representations and to mitigate the bottlenecks of conventional Transformer models. Third, we propose the Past Attention Decoder to forecast the target by incorporating past long- and short-term dependency patterns. Furthermore, the Past Attention Decoder improves performance by removing a trend distribution that changes over a long period. We evaluate the efficiency of our model with extensive experiments using seven real-world datasets and compare our model to six representative time-series forecasting approaches. The results show that the proposed TS-Fastformer reduces MSE by 10.1% compared to the state-of-the-art model and trains 21.6% faster than the fastest existing Transformer.

Why is it important?

We identify three weaknesses of the Transformer architecture for time-series forecasting.
(i) The Transformer architecture was originally proposed for natural language processing (NLP) tasks and, unlike models used for time-series tasks such as RNN and LSTM, lacks specific operations to capture relationships among consecutive sequences. To address this issue, the Transformer applies positional encoding, which is effective for capturing sequential information in NLP. However, most time-series data consist of scalars or low-dimensional vectors, so positional encoding that adds deterministic vectors using sinusoidal functions has limitations in representing the sequential information of time-series input. This means that the Transformer model does not capture all of the sophisticated sequential information between single sequences. Therefore, we propose a way for Transformers to learn the relationships between multiple sequences that are composites of single sequences.
(ii) The Transformer can be effectively pre-trained with large natural language datasets. The Bidirectional Encoder Representations from Transformers (BERT) model is a successful example of extracting natural language representations through unsupervised learning on a large dataset using the Transformer encoder structure, and BERT has become the standard for NLP tasks. The Transformer encoder consists of self-attention-based modules that are repeated n times to learn the representation of the input. A bottleneck occurs because the Transformer decoder can only start its tasks after the Transformer encoder has completed the training process. Moreover, the original structure of the Transformer encoder is aimed at NLP tasks. For time-series data, we argue that an alternative and more efficient structure is needed.
Many studies on learning representations of time-series data have focused on learning instance-level representations, describing whole segments of the input time series with models that are lighter than Transformer encoder structures. Instance-level representations may not be suitable for tasks that require granular representations, e.g., time-series forecasting and anomaly detection. Furthermore, representation learning models that vectorize time-series data have limitations in forecasting tasks with complex patterns, as they apply ridge regression.
(iii) The Transformer infers using a dynamic decoder structure. In time-series forecasting, the dynamic decoder infers step by step, starting from the last value of the input data and repeating once for each element of the output sequence. Although the dynamic decoder performs well for short sequences, for long sequences its errors can be amplified and inference becomes slow. Furthermore, the complexity of self-attention is quadratic in the length of the input, which significantly increases training and inference time in long-sequence time-series forecasting (LTSF). Recent studies have proposed several approaches to address these issues, including generative decoders and various techniques that accelerate the self-attention mechanism to achieve sub-quadratic processing time. However, there have been concerns regarding some studies that claim to demonstrate sub-quadratic processing time, and our experimental results in real-world settings support these doubts.
We propose a novel Transformer-based model, called TS-Fastformer. The paper makes three original contributions. First, we propose a Sub Window Tokenizer (SWT) that compresses time-series data of length l to l/wl, reducing the complexity of self-attention from O(l^2) to O((l/wl)^2). SWT transforms single-sequence information learning into multiple-sequence information learning.
While sacrificing some sequential information, SWT maintains performance and enables faster training. Second, we introduce a Time-series Pre-trained Encoder (TPE) that removes the bottleneck in the Transformer encoder and is better suited to extracting time-series representations. TPE can extract representations based on two crucial aspects of time-series prediction tasks: seasonality and trend. Third, we propose a Past Attention Decoder that effectively captures both short-term and long-term dependency patterns, achieving high performance. Furthermore, the M0 method used in the Past Attention Decoder removes the distribution of long-term trends, resulting in robust performance even after a significant period of time.
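To make the effect of the Sub Window Tokenizer on sequence length concrete, the following is a minimal sketch, not the paper's implementation. It assumes that wl denotes the sub-window length, that windows are non-overlapping, and that the series length is a multiple of the window length. The names `sub_window_tokenize` and `attention_pairs` are hypothetical helpers introduced only for this illustration.

```python
def sub_window_tokenize(series, window_len):
    """Group a 1-D series into consecutive, non-overlapping sub-windows.

    Assumption for this sketch: each sub-window becomes one token, so a
    series of length l yields l / window_len tokens.
    """
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    if len(series) % window_len != 0:
        raise ValueError("series length must be a multiple of window_len")
    return [
        list(series[start:start + window_len])
        for start in range(0, len(series), window_len)
    ]


def attention_pairs(num_tokens):
    """Number of query-key pairs full self-attention scores for num_tokens tokens."""
    return num_tokens ** 2


# Illustrative values only: 96 time steps grouped into sub-windows of 4.
series = [float(t) for t in range(96)]
tokens = sub_window_tokenize(series, window_len=4)

print(len(series), "time steps ->", len(tokens), "tokens")
print("attention pairs before:", attention_pairs(len(series)))
print("attention pairs after: ", attention_pairs(len(tokens)))
```

With these illustrative numbers, the token count drops from 96 to 24, so the number of attention pairs drops by a factor of 4^2 = 16. Each token now holds several consecutive values rather than one, which is one way to read the statement that SWT trades some sequential information for faster training.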


We hope this paper will have a significant impact on the field of univariate time-series forecasting. We believe that the convergence of time series representation learning and transformers can be further developed. TS-Fastformer can be used in a variety of fields, including engineering, medicine, and statistics.

Inha University

Read the Original

This page is a summary of: TS-Fastformer: Fast Transformer for Time-Series Forecasting, ACM Transactions on Intelligent Systems and Technology, October 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3630637.