Core Concepts 2 min read

Tokens

Also known as: Token, Subword Token, BPE Token

The basic units of text that language models process, typically representing words, subwords, or characters. Token counts determine context window usage and API costs.

Overview

Tokens are the fundamental building blocks of how language models process text. A token can represent a whole word, part of a word, a punctuation mark, or even a single character, depending on the tokenization scheme used. Understanding tokens is essential for effective context management because the context window is measured in tokens, not words or characters.

How Tokenization Works

Most modern language models use subword tokenization algorithms, with Byte Pair Encoding (BPE) among the most common. BPE starts from individual characters or bytes and repeatedly merges the most frequent adjacent pair in the training corpus. As a result, frequent words end up as single tokens, while rare words are split into several smaller pieces.
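
The merge step can be sketched in a few lines of Python. This is a toy illustration of BPE training, not the implementation used by any particular model: the function name learn_bpe_merges is hypothetical, it works on characters rather than bytes, and it ignores details such as pre-tokenization and end-of-word markers.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words (toy version)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

Calling learn_bpe_merges on a small word list returns the merge rules in the order they were learned. A real tokenizer replays those rules, in the same order, to split new text.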

Word-Level vs. Subword Tokenization

Simple word-level tokenization splits on whitespace and punctuation, but it requires an enormous vocabulary and cannot represent misspellings or rare words that fall outside it. Subword tokenization solves this by learning the most common character sequences and breaking rare words into known subcomponents.
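
The difference can be illustrated with a small sketch. The vocabularies here are made up for the example, and the subword function uses greedy longest-match segmentation as a simple stand-in for a trained subword tokenizer; it does not replay BPE merge rules.

```python
def word_level_tokenize(text, vocab):
    # Whole words only: anything outside the vocabulary collapses to <unk>.
    return [w if w in vocab else "<unk>" for w in text.split()]

def subword_tokenize(word, pieces):
    # Greedy longest-match: repeatedly take the longest known piece and
    # fall back to a single character so every word can be encoded.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# Hypothetical vocabularies, chosen only for illustration.
print(word_level_tokenize("the tokenization works", {"the", "works"}))
# ['the', '<unk>', 'works']
print(subword_tokenize("tokenization", {"token", "ization", "ation"}))
# ['token', 'ization']
print(subword_tokenize("tokenizzation", {"token", "ization", "ation"}))
# ['token', 'i', 'z', 'z', 'ation']
```

The word-level tokenizer loses the unknown word entirely, while the subword version still encodes the misspelling, at the cost of more tokens.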

Token Counts in Practice

  • 1 token is approximately 4 characters in English
  • 1 token is approximately 0.75 words
  • 100 tokens is approximately 75 words
  • 1 double-spaced page of text (about 250-300 words) is approximately 300-400 tokens
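
These rules of thumb translate directly into a rough estimator. The sketch below is only a heuristic for English text, and the function names are hypothetical. For exact counts, use the tokenizer that belongs to the model in question.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text

def tokens_from_chars(text):
    # Round up so the estimate errs on the conservative side.
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def tokens_from_words(text):
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)

sample = " ".join(["word"] * 75)   # 75 words
print(tokens_from_words(sample))   # 100
print(tokens_from_chars("x" * 400))  # 100
```

The two estimates usually disagree somewhat on real text. Taking the larger of the two gives a safer budget.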

Context Management Implications

  • API costs are calculated per token (both input and output)
  • Context windows have hard token limits
  • Different languages tokenize differently: with tokenizers trained mostly on English text, Chinese typically uses more tokens per concept than English
  • Code and technical content often tokenize less efficiently than prose
  • Efficient context management requires balancing information density against token consumption
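
The first two points can be turned into simple budget checks. The sketch below makes two assumptions: input and output tokens are billed at separate per-million rates, and the reserved output tokens count against the same context window as the prompt. Both vary by provider and model. The function names and the prices in the example are hypothetical.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    # Prices are passed in by the caller; none are built in here.
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000

def fits_context(prompt_tokens, max_output_tokens, context_window):
    # Assumes prompt and reserved output share one token limit.
    return prompt_tokens + max_output_tokens <= context_window

# Hypothetical rates: 2.0 per million input tokens, 4.0 per million output tokens.
print(estimate_cost(1_000_000, 500_000, 2.0, 4.0))  # 4.0
print(fits_context(7_000, 1_000, 8_192))            # True
print(fits_context(7_500, 1_000, 8_192))            # False
```

Running a check like fits_context before sending a request makes it possible to trim or summarize the prompt in advance, rather than hitting the hard limit at call time.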