Introduction
Transformer models, introduced in 2017, and their many variants have achieved great success in Natural Language Processing (NLP), Computer Vision, and even domains such as speech processing in recent years. Owing to their ability to effectively model long-range dependencies in sequential data, Transformers are increasingly used for time-series tasks such as forecasting future time steps and anomaly detection. However, computing attention via Scaled Dot-Product Attention in the vanilla Transformer is computationally expensive: it scales quadratically with the input length, which becomes a bottleneck for very long sequences.
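As a rough illustration of this quadratic cost, the following minimal NumPy sketch of single-head scaled dot-product attention (with arbitrary toy dimensions, not any particular implementation) materializes the N × N score matrix explicitly:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Vanilla scaled dot-product attention (single head, no masking).

    Q, K, V: arrays of shape (N, d). The intermediate score matrix has
    shape (N, N), so time and memory grow quadratically with N.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (N, N) -- the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # (N, d)

# Toy usage with an arbitrary sequence length N and feature size d.
N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = softmax_attention(Q, K, V)                       # out.shape == (1024, 64)
```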
“Transformers are RNNs”, proposed by Katharopoulos et al. (2020), reduces the computational complexity of self-attention from quadratic to linear in the sequence length while largely retaining the accuracy of the original softmax-based self-attention. This makes linearized Transformers fast for processing very long sequences.
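A minimal sketch of the linear attention idea, assuming the positive feature map φ(x) = elu(x) + 1 used by Katharopoulos et al. (2020) and written in plain NumPy rather than the authors' implementation:

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: x + 1 for x > 0, exp(x) otherwise.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Linearized attention without forming the (N, N) score matrix.

    Output_i = phi(Q_i) (sum_j phi(K_j) V_j^T) / (phi(Q_i) . sum_j phi(K_j))
    """
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)    # (N, d)
    KV = Kp.T @ V                                      # (d, d) summary of keys and values
    Z = Qp @ Kp.sum(axis=0) + eps                      # (N,) normalizer
    return (Qp @ KV) / Z[:, None]                      # (N, d)
```

Because the small (d × d) summary `Kp.T @ V` replaces the (N × N) score matrix, the cost grows linearly with the sequence length N.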
On the PDEBench benchmark (Takamoto et al., 2022), the data are almost invariably sub-sampled along the spatial and temporal dimensions. The original Fourier Neural Operator (FNO) implementation (Li et al., 2020) also performs spatial and temporal sub-sampling. Sub-sampled data are not always desirable, since crucial dynamics at the sampling locations omitted during sub-sampling may be missed. Ideally, we would like to use the full resolution of the temporal dimension, but doing so is computationally expensive. The reduced computational complexity of linearized Transformers may help to reduce the time and memory consumption during training and inference for a model trained on the full temporal and spatial resolution.
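As a schematic example (with hypothetical resolutions and reduction factors, not the actual PDEBench or FNO loader code), sub-sampling a full-resolution 1D solution along time and space typically amounts to strided slicing:

```python
import numpy as np

# Hypothetical full-resolution 1D solution: 201 time steps x 1024 spatial points.
u_full = np.random.rand(201, 1024)

# Strided sub-sampling, e.g. keep every 5th time step and every 8th grid point.
t_stride, x_stride = 5, 8
u_sub = u_full[::t_stride, ::x_stride]     # shape (41, 128)

# Training on u_full instead of u_sub keeps all temporal dynamics, but the
# sequences become much longer, which motivates linearized attention.
print(u_full.shape, u_sub.shape)
```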
Parametric PDEs contain a parameter that governs the evolution of the PDE. Preferably, a single neural PDE surrogate model should generalize across a range of PDE parameters. For instance, the model can be trained on one set of PDE parameters and evaluated on a disjoint set of PDE parameters to assess its generalization capabilities.
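As a schematic sketch (the parameter values below are placeholders, not the experimental setup of this thesis), such an evaluation keeps the training and evaluation parameter sets disjoint:

```python
# Hypothetical diffusion/viscosity coefficients for a parametric 1D PDE.
all_params = [0.001, 0.002, 0.004, 0.01, 0.02, 0.04, 0.1, 0.2]

# Disjoint split: train on one subset of parameters, evaluate on the rest.
train_params = [0.001, 0.004, 0.02, 0.1]
eval_params = [p for p in all_params if p not in train_params]
assert set(train_params).isdisjoint(eval_params)
```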
This thesis investigates different ways of using linearized Transformers for solving parametric 1D PDEs. The task comprises developing a Transformer-based model that generalizes to different PDE parameters and testing it on the PDEBench benchmark datasets. The work also explores the importance of full temporal resolution data for the linearized Transformer models and compares the obtained results to the original FNO model in terms of error and speed.