Attention Mechanism
Also known as: Self-Attention, Scaled Dot-Product Attention, Multi-Head Attention
A neural network component that allows models to selectively focus on the most relevant parts of their input, dynamically weighting the importance of different elements in a sequence.
Overview
The attention mechanism is the core innovation that makes modern large language models possible. It allows a model to dynamically determine which parts of its input are most relevant to generating each part of its output. Unlike earlier architectures that processed text sequentially, attention mechanisms can relate any part of the input to any other part, regardless of distance.
How Attention Works
In the standard attention formulation, three vectors are computed for each element in the input sequence:
- Query (Q): What information am I looking for?
- Key (K): What information do I contain?
- Value (V): What information do I provide if selected?
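These three vectors are combined by scoring each query against every key, normalizing the scores with a softmax, and using the resulting weights to mix the values. A minimal sketch in NumPy, assuming the standard scaled dot-product form softmax(QK^T / sqrt(d_k))V and omitting the learned projection matrices for brevity:

```python
# Minimal sketch of scaled dot-product attention (no learned projections).
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mixture of value vectors

# Toy example: a sequence of 3 tokens, each a 4-dimensional vector.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query
```

The division by sqrt(d_k) keeps the dot products from growing with vector size, which would otherwise push the softmax into a near-one-hot regime with vanishing gradients.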
Types of Attention
Self-Attention
Each element in a sequence attends to all other elements in the same sequence. This is the primary mechanism in Transformer encoders and decoders.
Cross-Attention
Elements in one sequence (e.g., a decoder) attend to elements in another sequence (e.g., an encoder). Used in encoder-decoder models for translation and summarization.
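The only structural difference from self-attention is where the vectors come from: queries are derived from the decoder sequence, while keys and values are derived from the encoder sequence. A hedged sketch, again omitting the learned projections and using the encoder states directly as keys and values:

```python
# Sketch of cross-attention: queries from one sequence, keys/values from another.
import numpy as np

def cross_attention(decoder_states, encoder_states):
    d_k = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over encoder positions
    return weights @ encoder_states  # one output per decoder position

rng = np.random.default_rng(1)
dec = rng.normal(size=(2, 4))  # 2 decoder positions
enc = rng.normal(size=(5, 4))  # 5 encoder positions
print(cross_attention(dec, enc).shape)  # (2, 4)
```

Note that the two sequences may have different lengths; the output always has one vector per query (decoder) position.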
Causal (Masked) Attention
Each element can only attend to previous elements, preventing information from "leaking" from the future. Used in autoregressive language models like GPT.
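The standard way to implement this, sketched below, is to set the scores for all future positions to negative infinity before the softmax, so they receive exactly zero weight:

```python
# Sketch of causal (masked) self-attention: each token attends only to
# itself and earlier positions.
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 8))
_, w = causal_attention(x, x, x)  # self-attention: Q = K = V
print(np.allclose(w, np.tril(w)))  # True: zero weight on future tokens
```

Because exp(-inf) is 0, the masked positions drop out of the softmax entirely, and the attention weight matrix is lower-triangular.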
Attention as Context Management
The attention mechanism is, in essence, a context management mechanism built into the model architecture. It automatically determines which context is most relevant for each decision, performing real-time context prioritization at every layer of the network.
Related Terms
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction, determining how much information the model can consider when generating a response.
Large Language Model
A type of AI model trained on vast amounts of text data that can understand, generate, and manipulate human language, typically based on the transformer architecture with billions of parameters.
Neural Network
A computing system inspired by biological neural networks, consisting of interconnected nodes (neurons) organized in layers that process information using learnable weights and activation functions.
Transformer
A neural network architecture based on self-attention mechanisms that processes input sequences in parallel, forming the foundation of virtually all modern large language models.