／var／log marcus chiu

Transformer Neural Networks (TNN) - Transformers

Created on Aug 03, 2020 · Last Modified on Oct 10, 2025

The Transformer is a deep learning model introduced in 2017 and used primarily in natural language processing (NLP).
Like Recurrent Neural Networks (RNNs) and gated RNNs (e.g. LSTM and GRU), Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. Unlike RNNs, however, Transformers do not need to process the sequence in order: given a natural language sentence as input, the Transformer does not have to process its beginning before its end. This allows much more parallelization than RNNs and therefore reduces training times.
It is the first transduction model to rely entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.

How it Works - TL;DR

3Blue1Brown - Attention in Transformers
3Blue1Brown - Storing Facts in Multi-Layer Perceptron
StatQuest - Transformer Neural Networks
StatQuest - Decoder-Only Transformers

Transformer architecture outline:

input → word embeddings + positional encodings → [multi-head self-attention block → multi-layer perceptron (MLP)] × N (e.g. N = 96 in GPT-3) → word un-embedding
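
The data flow in the outline above can be sketched roughly as follows. This is only an illustrative skeleton under stated assumptions: the names `forward` and `identity`, the toy weights, and the reuse of the embedding matrix for un-embedding are hypothetical and not taken from any particular model. The attention and MLP stages are passed in as placeholder callables so that only the overall structure is shown.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# A layer maps a sequence of token vectors to a new sequence of the same shape.
Layer = Callable[[List[Vector]], List[Vector]]


def forward(
    token_ids: Sequence[int],
    embedding: List[Vector],                # one row per vocabulary entry
    positional: List[Vector],               # one row per position
    blocks: Sequence[Tuple[Layer, Layer]],  # (attention, mlp) pairs, one per layer
    unembedding: List[Vector],              # one row per vocabulary entry
) -> List[Vector]:
    # input -> word embeddings + positional encodings
    x = [
        [e + p for e, p in zip(embedding[tok], positional[pos])]
        for pos, tok in enumerate(token_ids)
    ]
    # [self-attention block -> MLP], repeated once per layer
    for attention, mlp in blocks:
        x = attention(x)
        x = mlp(x)
    # word un-embedding: one score (logit) per vocabulary entry, per position
    return [
        [sum(w * v for w, v in zip(row, vec)) for row in unembedding]
        for vec in x
    ]


def identity(x: List[Vector]) -> List[Vector]:
    """Hypothetical placeholder layer: returns its input unchanged."""
    return x


embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vocabulary of 3 tokens
positional = [[0.0, 0.0], [0.5, 0.5]]             # toy encodings for 2 positions
blocks = [(identity, identity)] * 2               # 2 placeholder layers
# Assumption: the embedding matrix is reused as the un-embedding matrix.
logits = forward([0, 2], embedding, positional, blocks, unembedding=embedding)
print(logits)
```

In a real model each `(attention, mlp)` pair would have its own learned weights, and the logits would usually be passed through a softmax to obtain next-token probabilities.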

First self-attention block architecture:

word embeddings - encodes words into numbers
positional encoding - encodes positions of words
self-attention - encodes the relationships among words
residual connections - mainly help mitigate the vanishing gradient problem
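
The four pieces listed above can be sketched together in plain Python. This is a minimal sketch under several assumptions: it uses the sinusoidal positional encoding from the 2017 paper (one of several possible schemes), a single attention head without masking, and identity matrices in place of learned query/key/value weights. All function names and the toy embedding values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def positional_encoding(position: int, d_model: int) -> Vector:
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    enc = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X: Matrix, Wq: Matrix, Wk: Matrix, Wv: Matrix) -> Matrix:
    """Single-head scaled dot-product self-attention (no masking)."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d_k = len(K[0])
    out = []
    for q in Q:
        # How strongly this token attends to every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


d_model = 4
eye = [[1.0 if i == j else 0.0 for j in range(d_model)] for i in range(d_model)]
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.9, 0.1]]

# word embeddings + positional encoding
X = [[e + p for e, p in zip(emb, positional_encoding(pos, d_model))]
     for pos, emb in enumerate(embeddings)]
# self-attention, then the residual connection adds the input back
attended = self_attention(X, eye, eye, eye)
output = [[x + a for x, a in zip(x_row, a_row)] for x_row, a_row in zip(X, attended)]
```

Each output row is computed from all input rows at once, with no dependence on the output of a previous position, which is what makes the computation parallelizable across positions.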

MLP architecture:

Each token that has gone through the previous attention block will go through the following steps:

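linear up-scaling (projection to a wider hidden dimension)
non-linear transformation (e.g. ReLU)
linear down-scaling (projection back to the original dimension)
the result is then added back to the original token vector (residual connection)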
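
The four MLP steps above can be sketched for a single token vector as follows. This is a minimal sketch, assuming ReLU as the non-linearity and hand-picked toy weights; the name `mlp_block` and all weight values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def mlp_block(x: Vector, W_up: Matrix, b_up: Vector,
              W_down: Matrix, b_down: Vector) -> Vector:
    # 1. linear up-scaling: d_model -> d_hidden
    hidden = [h + b for h, b in zip(matvec(W_up, x), b_up)]
    # 2. non-linear transformation
    hidden = relu(hidden)
    # 3. linear down-scaling: d_hidden -> d_model
    out = [o + b for o, b in zip(matvec(W_down, hidden), b_down)]
    # 4. add the result back to the original token vector
    return [xi + oi for xi, oi in zip(x, out)]


# Toy weights: d_model = 2, d_hidden = 4.
W_up = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_up = [0.0, 0.0, 0.0, 0.0]
W_down = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_down = [0.0, 0.0]

tokens = [[1.0, -2.0], [0.5, 0.0]]
# The same weights are applied to every token independently.
updated = [mlp_block(tok, W_up, b_up, W_down, b_down) for tok in tokens]
print(updated)
```

Because each token is processed on its own, the MLP step involves no interaction between positions; all mixing of information across tokens happens in the attention block.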

Transformer - Timeline of Models

Subpages

Generative Pre-trained Transformer (GPT)
Positional Encoding Theory

Resources

Attention Is All You Need.pdf - 2017 paper (Vaswani et al.) that introduced the Transformer