The Transformer architecture was introduced by Vaswani et al. [9] and has since become the standard architecture for NLP tasks. The model uses a self-attention mechanism to process the input sequence, allowing it to capture long-range dependencies without the need for recurrent layers. This has resulted in improved performance and faster training compared to earlier NLP models. Originally, the model was trained for machine translation. Since its inception, however, numerous successors of the Transformer have been developed, such as BERT [7] or GPT [28], which showed that a properly pretrained Transformer can obtain state-of-the-art results on a wide range of NLP tasks.
Pretraining coupled with the efficient Transformer architecture [9] has also unlocked state-of-the-art performance in molecular property prediction [12, 14–16, 29, 30]. Early applications of deep learning did not offer large improvements over more standard methods such as random forests [31–33]. Consistent improvements were enabled in particular by more efficient architectures adapted to this domain [17, 34, 35]. In this spirit, our goal is to further advance modeling for any chemical task by redesigning self-attention for molecular data.
Efficiently encoding the relation between tokens in self-attention has been shown to substantially boost the performance of Transformers in vision, language, music, and biology [19–25]. Vanilla self-attention relies on absolute positional encoding, which can hinder learning when the absolute position in the sentence is not informative. Relative positional encoding instead featurizes the relative distance between each pair of tokens, which has led to substantial gains in the language and music domains [22, 36].
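To make the mechanism concrete, the sketch below shows self-attention augmented with a learned relative-position bias added to the attention logits. This is a minimal illustrative variant (in PyTorch); the module name, the scalar-bias parameterization, and the clipping distance are our own choices for exposition, not the exact formulation of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with a learned bias per clipped relative distance."""

    def __init__(self, dim, max_rel_dist=8):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.max_rel_dist = max_rel_dist
        # One learnable scalar bias for each relative distance in [-max, max].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_rel_dist + 1))

    def forward(self, x):                      # x: (batch, seq_len, dim)
        n = x.size(1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = torch.einsum("bid,bjd->bij", q, k) * self.scale
        # Relative offset j - i, clipped so that distant pairs share one bias.
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        scores = scores + self.rel_bias[rel + self.max_rel_dist]  # broadcast over batch
        attn = F.softmax(scores, dim=-1)
        return torch.einsum("bij,bjd->bid", attn, v)
```

Because only the relative offset enters the bias term, shifting the whole sequence leaves the added term unchanged, which is the property that makes this encoding attractive when absolute position carries little signal.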
On the other hand, a Transformer can be viewed as a fully-connected (every vertex is connected to every other vertex) Graph Neural Network with trainable edge weights given by self-attention [37]. From this practical perspective, the empirical success of the Transformer stems from its ability to learn highly complex and useful relational patterns between tokens.
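The correspondence can be written down directly: the softmax-normalized score matrix plays the role of a dense, input-dependent adjacency, and the value aggregation is one round of message passing. The short sketch below illustrates this view; the function and variable names are ours and purely illustrative.

```python
import torch

def attention_as_message_passing(h, Wq, Wk, Wv):
    """One self-attention step phrased as message passing on a fully-connected graph.

    h: (n_nodes, dim) node features; Wq, Wk, Wv: (dim, dim) projection matrices.
    """
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    # Dense "adjacency": a trainable, input-dependent edge weight for every node pair.
    edge_weights = torch.softmax(q @ k.T / q.size(-1) ** 0.5, dim=-1)  # (n, n)
    # Each node aggregates messages (values) from all nodes, weighted by the edges.
    return edge_weights @ v
```

Under this reading, redesigning self-attention for molecules amounts to shaping these edge weights with chemistry-aware information rather than leaving them unconstrained.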