Training Data
Also known as: Training Dataset, Training Corpus, Training Set
The curated dataset used to train machine learning models, whose quality, diversity, size, and representativeness directly determine the model's capabilities and limitations.
Overview
Training data is the foundation upon which all machine learning models are built. The adage "garbage in, garbage out" is especially true in AI — a model can only be as good as the data it was trained on. For large language models, training data consists of vast corpora of text from books, websites, research papers, code repositories, and other sources.
Key Characteristics
Scale
Modern LLMs are trained on trillions of tokens. GPT-3 was trained on roughly 300 billion tokens drawn from about 570GB of filtered text, while more recent models use far larger datasets. The scale of training data significantly impacts model capability.
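To get a feel for these numbers, a back-of-the-envelope conversion between raw text size and token count can be sketched as below. The ~4 bytes-per-token figure is a rough heuristic for English text, not an exact tokenizer property.

```python
# Rough sketch: estimate token count from raw corpus size.
# BYTES_PER_TOKEN is an assumed heuristic average for English text;
# real tokenizers vary by language and vocabulary.
BYTES_PER_TOKEN = 4

def estimate_tokens(size_bytes: int) -> int:
    """Back-of-the-envelope token estimate for a text corpus."""
    return size_bytes // BYTES_PER_TOKEN

# ~570 GB of text works out to on the order of 10^11 tokens
print(estimate_tokens(570 * 10**9))  # → 142500000000
```

This is only an order-of-magnitude tool; actual token counts depend on the tokenizer and the mix of languages and code in the corpus.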
Quality
Not all data is equally valuable. High-quality, well-written, factually accurate data produces better models than noisy, erroneous, or biased data. Data curation and filtering are critical preprocessing steps.
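The curation step described above is often implemented as a cascade of cheap heuristic filters. A minimal sketch, with illustrative thresholds that are assumptions rather than a production recipe:

```python
# Minimal sketch of heuristic quality filtering for training documents.
# The rules and thresholds below are illustrative assumptions, not a
# production pipeline (real systems add deduplication, language ID,
# toxicity filters, and model-based quality scoring).
def passes_quality_filter(doc: str) -> bool:
    words = doc.split()
    if len(words) < 10:                      # too short to be useful
        return False
    alpha = sum(c.isalpha() for c in doc)
    if alpha / max(len(doc), 1) < 0.6:       # mostly non-text (markup, noise)
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive boilerplate
        return False
    return True

docs = [
    "buy now " * 50,  # spammy repetition
    "Training data quality matters because models learn patterns from it.",
]
print([passes_quality_filter(d) for d in docs])  # → [False, True]
```

Filters like these are typically tuned empirically: too strict and the corpus loses diversity, too loose and noise degrades the model.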
Diversity
Training data should be representative of the range of tasks and domains the model will encounter. Lack of diversity leads to poor performance on underrepresented topics or languages.
Recency
Training data has a knowledge cutoff — the model doesn't know about events or information that occurred after its training data was collected. This is a key motivation for RAG and other context management techniques.
Data Governance
Training data raises important legal and ethical questions around copyright, consent, privacy, and bias. Regulatory frameworks increasingly require transparency about training data composition and provenance.
Related Terms
Bias in AI
Systematic errors in AI system outputs that create unfair outcomes for certain groups, typically arising from biased training data, flawed model design, or biased evaluation metrics.
Fine-Tuning
The process of further training a pre-trained AI model on a specialized dataset to adapt its behavior, knowledge, or output style for a specific domain or task.
Machine Learning
A subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed, using algorithms that identify patterns in data.
Supervised Learning
A machine learning paradigm where models are trained on labeled datasets containing input-output pairs, learning to map inputs to correct outputs for prediction and classification tasks.