Model Training 2 min read

Training Data

Also known as: Training Dataset, Training Corpus, Training Set

Definition

The curated dataset used to train machine learning models, whose quality, diversity, size, and representativeness directly determine the model's capabilities and limitations.

Overview

Training data is the foundation upon which all machine learning models are built. The adage "garbage in, garbage out" is especially true in AI — a model can only be as good as the data it was trained on. For large language models, training data consists of vast corpora of text from books, websites, research papers, code repositories, and other sources.

Key Characteristics

Scale

Modern LLMs are trained on trillions of tokens. GPT-3 was trained on roughly 300 billion tokens (its filtered Common Crawl component alone was about 570 GB of text), while more recent models use substantially larger datasets. The scale of training data significantly impacts model capability.
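To get an intuition for how dataset size in bytes relates to token counts, here is a minimal sketch using the common rule-of-thumb assumption of roughly 4 bytes of English text per token (the exact ratio depends on the tokenizer and language):

```python
# Assumption: ~4 bytes of English text per token (rough heuristic; varies by tokenizer)
BYTES_PER_TOKEN = 4

def estimate_tokens(corpus_bytes: int) -> int:
    """Estimate token count from raw corpus size in bytes."""
    return corpus_bytes // BYTES_PER_TOKEN

# ~570 GB of text is on the order of 10^11 tokens
print(estimate_tokens(570 * 10**9))  # 142500000000
```

This is only a back-of-the-envelope estimate; real pipelines count tokens exactly after tokenization.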

Quality

Not all data is equally valuable. High-quality, well-written, factually accurate data produces better models than noisy, erroneous, or biased data. Data curation and filtering are critical preprocessing steps.

Diversity

Training data should be representative of the range of tasks and domains the model will encounter. Lack of diversity leads to poor performance on underrepresented topics or languages.
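One simple way to audit diversity is to tally where documents come from and look for over-represented sources. A minimal sketch, assuming each document carries a source URL:

```python
from collections import Counter
from urllib.parse import urlparse

def domain_mix(urls):
    """Return each source domain's share of the corpus, to spot skew."""
    counts = Counter(urlparse(u).netloc for u in urls)
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}

urls = ["https://a.com/x", "https://a.com/y", "https://b.org/z"]
print(domain_mix(urls))  # a.com dominates at 2/3 of the sample
```

Similar tallies over language, topic, or document length reveal other axes of imbalance.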

Recency

Training data has a knowledge cutoff — the model doesn't know about events or information that occurred after its training data was collected. This is a key motivation for RAG and other context management techniques.
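The core RAG idea is to work around the cutoff by injecting retrieved, up-to-date passages into the prompt at inference time. A minimal sketch (the function name and prompt template are illustrative, not a specific framework's API):

```python
def augment_prompt(question: str, retrieved: list[str]) -> str:
    """Prepend retrieved passages so the model can answer with post-cutoff facts."""
    context = "\n".join(f"- {passage}" for passage in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = augment_prompt(
    "Who won the tournament?",
    ["Team A won the 2025 tournament final."],
)
print(prompt)
```

The retrieval step itself (vector search, keyword search, etc.) is where most of the engineering effort lies; this only shows the prompt-assembly half.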

Data Governance

Training data raises important legal and ethical questions around copyright, consent, privacy, and bias. Regulatory frameworks increasingly require transparency about training data composition and provenance.