Architecture 2 min read

Transformer

Also known as: Transformer Architecture, Transformer Model

A neural network architecture based on self-attention mechanisms that processes input sequences in parallel, forming the foundation of virtually all modern large language models.

Overview

The Transformer is arguably the most influential neural network architecture of the modern AI era. Introduced in the landmark 2017 paper "Attention Is All You Need" by researchers at Google, it replaced recurrent and convolutional architectures with a pure attention-based mechanism that could process sequence data in parallel, leading to dramatically faster training and superior performance.

Key Components

Self-Attention

The core innovation of the Transformer is the self-attention mechanism, which allows every element in an input sequence to attend to every other element. This enables the model to capture long-range dependencies that recurrent networks struggled with, and to process all positions simultaneously rather than sequentially.
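The mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration of scaled dot-product self-attention, not a production implementation; the weight matrices `Wq`, `Wk`, `Wv` stand in for learned projections.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                   # each output is a weighted sum of all values
```

Because the `(n, n)` score matrix is computed in one matrix multiply, all positions are processed in parallel, unlike the step-by-step recurrence of an RNN.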

Multi-Head Attention

Rather than performing a single attention computation, Transformers use multiple attention "heads" that learn to focus on different types of relationships in the data — syntactic, semantic, positional, and more.
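A rough sketch of the head-splitting idea, under simplifying assumptions: real Transformers apply learned per-head projections and an output projection, while this toy version just slices the model dimension into heads and attends within each slice.

```python
import numpy as np

def multi_head_attention(X, num_heads):
    """Toy multi-head self-attention: split d_model into independent heads,
    attend within each, then concatenate. (Identity projections for brevity;
    real models learn W_q, W_k, W_v, and W_o per head.)"""
    n, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        Xh = X[:, h * d_head:(h + 1) * d_head]   # this head's slice of the features
        scores = Xh @ Xh.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)       # each head has its own attention pattern
        heads.append(w @ Xh)
    return np.concatenate(heads, axis=-1)        # back to (n, d_model)
```

The point of the split is that each head computes its own attention pattern, so different heads are free to specialize in different relationships.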

Positional Encoding

Since Transformers process all tokens in parallel (unlike sequential models like RNNs), positional encodings are added to the input to provide information about the order of tokens in the sequence.
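The original paper's sinusoidal encodings are one concrete way to do this (many modern models use learned or rotary embeddings instead). A short sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even feature indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # sines on even dimensions
    pe[:, 1::2] = np.cos(angles)                 # cosines on odd dimensions
    return pe
```

These vectors are simply added to the token embeddings, giving each position a distinct signature the attention layers can exploit.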

Feed-Forward Networks

Each Transformer layer includes a position-wise feed-forward network that processes each token independently, adding representational capacity beyond what attention alone provides.
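"Position-wise" means the same two-layer network is applied to every token separately, which a single matrix expression captures. A minimal sketch (ReLU activation, as in the original paper; the weight shapes follow the usual d_model → d_ff → d_model expansion):

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: expand, apply ReLU, project back.
    The same weights are applied independently at every position in X (n, d_model)."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2
```

Because no position looks at any other here, all cross-token interaction in a Transformer layer happens in the attention sublayer.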

Variants

  • Encoder-only: BERT, used for classification and understanding tasks
  • Decoder-only: GPT, Claude, Llama — used for text generation
  • Encoder-Decoder: T5, BART — used for translation and summarization

Impact on Context Management

The Transformer's self-attention mechanism is directly responsible for modern context management capabilities. The attention mechanism determines how much "attention" each part of the context receives, effectively implementing a form of context prioritization. Innovations like sparse attention, Flash Attention, and sliding window attention continue to improve how Transformers handle context.
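Of the innovations mentioned, sliding window attention is easy to show concretely: each token attends only to a fixed-size window of recent tokens instead of the whole context. A minimal mask-building sketch (the window size and causal restriction here are illustrative choices):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal sliding-window attention mask: token i may attend to tokens j
    with i - window < j <= i. True = attention allowed. This reduces the
    attended positions per token from O(n) to O(window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)
```

Applying such a mask to the attention scores (setting disallowed positions to negative infinity before the softmax) is how sparse and windowed variants trim the quadratic cost of full self-attention.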