The Transformer is a neural network architecture first introduced in 2017 in the Google paper ‘Attention Is All You Need’ [@vaswaniAttentionAllYou2023]. It uses an attention mechanism to parallelise the training process and, as a result, drastically speeds up AI model development [@alammarIllustratedTransformer2018].

A few distinctive features set transformers apart from the previously dominant architecture, recurrent neural networks (RNNs).

Parallelisation

Unlike earlier architectures that process a sentence word by word, transformers process all the words of a sequence in parallel, which is made possible by positional encoding and attention.
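Because the words are no longer fed in one at a time, the model needs another way to know their order. The original paper does this by adding a sinusoidal positional encoding to each word’s embedding. The minimal NumPy sketch below illustrates the idea; the function name and dimensions are illustrative, not taken from any particular library.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as described in 'Attention Is All You Need'.

    Each position gets a unique pattern of sine/cosine values, so the model
    can distinguish word order even though all words are processed in parallel.
    """
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                     # shape (seq_len, d_model)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions: cosine
    return encoding

# Example: positions for a 10-word sentence with an embedding width of 16.
print(positional_encoding(10, 16).shape)  # (10, 16)
```

The encoding is simply added to the word embeddings, so no extra parameters are learned for word order.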

Attention

The attention mechanism gives transformers an efficient way to learn the context in which words are used. Each word attends only to the other words it needs in order to complete a given language task successfully. For example, if the model is asked to translate ‘The chicken crossed the road because it thought it was fun.’ into French, it must understand that the first ‘it’ refers to the chicken, not the road. The first ‘it’ must therefore be ‘attending’ to the word ‘chicken’ so that the model respects grammar rules such as gender and number agreement. On the other hand, ‘it’ has little to do with the word ‘because’, so it can pay little to no attention to that word.
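The core computation is the scaled dot-product attention defined in the original paper. The sketch below is a bare-bones NumPy illustration; in a real transformer the queries, keys and values are learned linear projections of the word embeddings, which are omitted here for brevity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention from 'Attention Is All You Need'.

    Each query (word) is compared with every key; the softmax weights say
    how much attention each word pays to every other word.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights                            # weighted sum of values, attention map

# Toy example: 4 "words", each represented by a 3-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
output, attn = scaled_dot_product_attention(x, x, x)       # self-attention: Q = K = V
print(attn.round(2))   # row i shows how much word i attends to each other word
```

In the translation example above, a well-trained model would place a large weight in the row for ‘it’ under the column for ‘chicken’, and a near-zero weight under ‘because’.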

References