Architecture 2 min read

Attention Mechanism

Also known as: Self-Attention, Scaled Dot-Product Attention, Multi-Head Attention

Definition

A neural network component that allows models to selectively focus on the most relevant parts of their input, dynamically weighting the importance of different elements in a sequence.

Overview

The attention mechanism is the core innovation that makes modern large language models possible. It allows a model to dynamically determine which parts of its input are most relevant to generating each part of its output. Unlike earlier architectures that processed text sequentially, attention mechanisms can relate any part of the input to any other part, regardless of distance.

How Attention Works

In the standard attention formulation, three vectors are derived from each element of the input sequence via learned linear projections:

  • Query (Q): What information am I looking for?
  • Key (K): What information do I contain?
  • Value (V): What information do I provide if selected?
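These three roles combine in a single computation: each query is compared against all keys via dot products, the resulting scores are scaled and normalized with a softmax, and the output is the corresponding weighted sum of values. A minimal sketch in plain NumPy (no learned projection matrices or batching; the function name is ours):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how well each query matches each key
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V, weights         # output: weighted sum of values
```

The scaling by the square root of the key dimension keeps the dot products from growing large as dimensionality increases, which would otherwise push the softmax into regions with vanishing gradients.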

Types of Attention

Self-Attention

Each element in a sequence attends to all other elements in the same sequence. This is the primary mechanism in Transformer encoders and decoders.

Cross-Attention

Elements in one sequence (e.g., a decoder) attend to elements in another sequence (e.g., an encoder). Used in encoder-decoder models for translation and summarization.
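The key difference from self-attention is that queries come from one sequence while keys and values come from another, so the two sequences may have different lengths. A toy illustration under those assumptions (random vectors standing in for encoder and decoder states):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
dec = rng.standard_normal((3, d))   # decoder states: 3 target positions (queries)
enc = rng.standard_normal((5, d))   # encoder states: 5 source positions (keys/values)

scores = dec @ enc.T / np.sqrt(d)   # (3, 5): each target position vs. each source position
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w = w / w.sum(axis=-1, keepdims=True)
context = w @ enc                   # (3, d): source information gathered per target position
```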

Causal (Masked) Attention

Each element can only attend to previous elements, preventing information from "leaking" from the future. Used in autoregressive language models like GPT.
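In practice this is implemented by masking: attention scores for future positions are set to negative infinity before the softmax, so they receive zero weight. A minimal sketch with a lower-triangular mask:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 4
scores = rng.standard_normal((seq_len, seq_len))

# Lower-triangular mask: position i may attend only to positions <= i
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked = np.where(mask, scores, -np.inf)  # block future positions

# Softmax: exp(-inf) = 0, so masked positions get zero attention weight
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
# Row 0 places all its weight on position 0; row i is zero beyond column i
```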

Attention as Context Management

The attention mechanism is, in essence, a context management mechanism built into the model architecture. At every layer, it determines which parts of the context are most relevant to each decision, prioritizing information on the fly rather than treating all input equally.