Transformer Neural Networks (TNN) - Transformers
  • is a deep learning model introduced in 2017, used primarily in the field of natural language processing
  • like Recurrent Neural Networks (RNNs) and gated RNNs (e.g. LSTM and GRU), Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. Unlike RNNs, however, Transformers do not have to process the sequence one element at a time, in order: given a sentence as input, the Transformer can process all of its words at once instead of working through the beginning before the end. This allows far more parallelization than RNNs and therefore reduces training time
  • is the first transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution

How it Works - TL;DR

Transformer architecture outline:

First self-attention block architecture:

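  1. word embeddings - map each word (token) to a vector of numbers
  2. positional encoding - encodes the position of each word in the sequence, since self-attention by itself ignores word order
  3. self-attention - encodes the relationships among words by letting every word weigh every other word
  4. residual connections - add the block's input back to its output, which mainly helps mitigate the vanishing gradient problem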
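
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a single attention head, sinusoidal positional encoding (one common choice), random untrained weights, and plain Python lists instead of a tensor library. Layer normalization, which real Transformers also apply, is left out because it is not part of the list above. All function names (attention_block, random_matrix, etc.) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; returns (output, weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    # Score of every query against every key, scaled by sqrt(d_k).
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def attention_block(token_ids, embedding, w_q, w_k, w_v):
    d_model = len(embedding[0])
    # 1. word embeddings: look up one vector per token id
    x = [embedding[t] for t in token_ids]
    # 2. positional encoding: add position information to each vector
    pe = positional_encoding(len(token_ids), d_model)
    x = [[a + b for a, b in zip(xr, pr)] for xr, pr in zip(x, pe)]
    # 3. self-attention: mix information across all positions
    attn, weights = self_attention(x, w_q, w_k, w_v)
    # 4. residual connection: add the block input back to the attention output
    out = [[a + b for a, b in zip(xr, ar)] for xr, ar in zip(x, attn)]
    return out, weights


def random_matrix(rows, cols, rng):
    """Untrained stand-in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
vocab_size, d_model = 10, 8  # toy sizes chosen for illustration
embedding = random_matrix(vocab_size, d_model, rng)
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
out, weights = attention_block([3, 1, 4], embedding, w_q, w_k, w_v)
```

Here out holds one d_model-sized vector per input token, and each row of weights sums to 1: row i says how much token i attends to every token in the sequence. Nothing in the computation loops over positions in order, which is why all tokens can be processed in parallel.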

MLP architecture:

Each token that has gone through the previous attention block will go through the following steps:

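  • linear up-projection (to a larger hidden dimension)
  • non-linear transformation (e.g. ReLU)
  • linear down-projection (back to the original dimension)
  • the result is added to the original token (residual connection)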
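
As a rough sketch of these steps, the MLP can be written as a function applied to each token independently. The assumptions here: ReLU as the non-linearity, a hidden size four times the token size (the ratio used in the original Transformer, but other ratios are possible), random untrained weights, and hypothetical names such as mlp_block and rand.

```python
import random


def linear(x, weight, bias):
    """Affine map of one token vector; weight has shape (in_dim, out_dim)."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weight), bias)]


def relu(x):
    """Element-wise ReLU: negative values become zero."""
    return [max(0.0, v) for v in x]


def mlp_block(tokens, w_up, b_up, w_down, b_down):
    """Apply the same MLP to every token, then add the original token back."""
    out = []
    for x in tokens:
        h = linear(x, w_up, b_up)        # linear up-projection
        h = relu(h)                      # non-linear transformation
        y = linear(h, w_down, b_down)    # linear down-projection
        out.append([xi + yi for xi, yi in zip(x, y)])  # residual addition
    return out


def rand(rows, cols, rng):
    """Untrained stand-in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden = 4, 16  # toy sizes; hidden layer is 4x wider
tokens = rand(3, d_model, rng)  # three tokens coming out of the attention block
result = mlp_block(
    tokens,
    rand(d_model, d_hidden, rng), [0.0] * d_hidden,
    rand(d_hidden, d_model, rng), [0.0] * d_model,
)
```

The output has the same shape as the input (one d_model-sized vector per token), so attention blocks and MLP blocks can be stacked. Unlike self-attention, the MLP never mixes information between tokens.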

Transformer - Timeline of Models
