Tokens
Also known as: Token, Subword Token, BPE Token
The basic units of text that language models process, typically representing words, subwords, or characters. Token counts determine context window usage and API costs.
Overview
Tokens are the fundamental building blocks of how language models process text. A token can represent a whole word, part of a word, a punctuation mark, or even a single character, depending on the tokenization scheme used. Understanding tokens is essential for effective context management because the context window is measured in tokens, not words or characters.
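As a concrete illustration, the snippet below uses OpenAI's tiktoken library to show where token boundaries fall in a short sentence. The cl100k_base encoding is one of tiktoken's built-in BPE encodings; the exact splits vary by encoding.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of tiktoken's built-in BPE encodings.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into subword units."
token_ids = enc.encode(text)

# Decoding each id individually reveals the token boundaries.
pieces = [enc.decode([tid]) for tid in token_ids]
print(len(token_ids), "tokens")
print(pieces)
```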
How Tokenization Works
Most modern language models use subword tokenization, with Byte Pair Encoding (BPE) being the most common algorithm. BPE starts from a small base vocabulary of bytes or characters and repeatedly merges the most frequent adjacent pair of symbols in the training corpus, so frequent words end up as single tokens while rare words are split into familiar subword pieces.
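To make the merge procedure concrete, here is a minimal toy sketch of the BPE training loop. It assumes a whitespace-split corpus and character-level start symbols; production tokenizers operate on bytes and handle word boundaries with special markers.

```python
from collections import Counter

def learn_bpe_merges(corpus: str, num_merges: int):
    # Each word is a tuple of symbols; start from single characters.
    vocab = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe_merges("low low low lower lowest", 4))
```

On this toy corpus the first merges learned are ('l', 'o') and then ('lo', 'w'), because those pairs occur in every word; the shared stem "low" quickly becomes a single symbol.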
Word-Level vs. Subword Tokenization
Simple word-level tokenization splits on whitespace and punctuation, but it produces enormous vocabularies and maps any unseen word, including misspellings and rare terms, to a single unknown token. Subword tokenization avoids this by learning the most common character sequences and breaking rare words into known subcomponents.
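A quick way to see this in practice: with a subword tokenizer, a common word usually maps to a single token, while a misspelled or rare word is broken into several known pieces rather than failing outright. The comparison below again uses tiktoken; the exact counts depend on the encoding's learned vocabulary.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A common word, a misspelling of it, and a rare dictionary word.
for word in ["understanding", "understnading", "floccinaucinihilipilification"]:
    ids = enc.encode(word)
    print(f"{word!r} -> {len(ids)} token(s): {[enc.decode([i]) for i in ids]}")
```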
Token Counts in Practice
- 1 token is approximately 4 characters in English
- 1 token is approximately 0.75 words
- 100 tokens is approximately 75 words
- 1 page of text is approximately 300-400 tokens
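These ratios are rough rules of thumb for English prose, not guarantees. When an exact count is unnecessary, a character-based estimate like the sketch below is often good enough; for billing or hard limits, use the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rule-of-thumb estimate: roughly 4 characters per token in English.
    # For exact counts, use the model's own tokenizer (e.g. tiktoken).
    return max(1, round(len(text) / 4))

print(estimate_tokens("A page of prose runs roughly 300 to 400 tokens."))
```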
Context Management Implications
- API costs are calculated per token (both input and output)
- Context windows have hard token limits
- Different languages tokenize differently — Chinese text typically uses more tokens per concept than English
- Code and technical content often tokenize less efficiently than prose
- Efficient context management requires balancing information density against token consumption
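In practice, these constraints come together in a pre-flight check before an API call: count the prompt's tokens, reserve room for the response, and estimate cost. The sketch below uses tiktoken for counting; the context limit and per-token price are placeholder values, not real rates for any specific model.

```python
import tiktoken

CONTEXT_LIMIT = 8192          # placeholder; the real limit varies by model
PRICE_PER_1K_INPUT = 0.0005   # placeholder rate, not a real price

enc = tiktoken.get_encoding("cl100k_base")

def check_budget(prompt: str, max_output_tokens: int) -> None:
    input_tokens = len(enc.encode(prompt))
    total = input_tokens + max_output_tokens
    print(f"input: {input_tokens} tokens, reserved for output: {max_output_tokens}")
    print(f"estimated input cost: ${input_tokens / 1000 * PRICE_PER_1K_INPUT:.6f}")
    if total > CONTEXT_LIMIT:
        print(f"over budget by {total - CONTEXT_LIMIT} tokens; trim the prompt")

check_budget("Summarize the following document: ...", max_output_tokens=1024)
```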
Related Terms
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction, determining how much information the model can consider when generating a response.
Large Language Model
A type of AI model trained on vast amounts of text data that can understand, generate, and manipulate human language, typically based on the transformer architecture with billions of parameters.