Scaling Context Stores to Billions of Records

The Scale Challenge

Enterprise AI generates context at a relentless pace. Every user interaction, system event, API call, and data integration adds context records. A mid-size SaaS application with 100,000 active users generating 50 context events per day accumulates 1.8 billion records per year. Without careful architecture, systems that performed well at millions of records will grind to a halt at billions — queries that took 5ms now take 5 seconds, writes that were instant now queue for minutes, and operational costs spiral out of control.

Scaling a context store is not a single decision; it is a series of architectural trade-offs across sharding, federation, write and read optimization, data lifecycle management, and cost engineering. This guide covers the patterns and specific technologies that make billion-record context stores viable.

The most expensive scaling mistake is designing for your current load instead of your projected load. By the time you realize you need to shard, migrating a monolithic database under production traffic is orders of magnitude harder than starting with a sharding-ready architecture.

Horizontal Scaling Patterns

Sharding Strategies

Sharding distributes data across multiple database instances so no single instance bears the full load. The choice of sharding key determines how evenly data distributes, how efficiently queries resolve, and how painful future resharding will be.

Tenant-Based Sharding: In multi-tenant systems, sharding by tenant ID is the most natural approach. Each tenant's context lives on a specific shard, queries never span shards for single-tenant operations, and tenant isolation is inherent. The risk is data skew — one large tenant can overload a shard. For a deeper exploration of multi-tenant design, see our article on multi-tenant context architecture.
Hash-Based Sharding: Applying a consistent hash function to the primary key distributes data uniformly across shards. This eliminates skew but makes range queries expensive since related records may live on different shards. Use consistent hashing (e.g., jump hash or ring-based) to minimize data movement when adding or removing shards.
Time-Based Sharding: For temporal context data (event logs, interaction histories), sharding by time period keeps recent data on fast, hot shards while older data resides on slower, cheaper shards. This aligns naturally with tiered storage and data lifecycle policies.
Geographic Sharding: For global applications, sharding by region keeps data close to users and simplifies compliance with data residency regulations (GDPR, data sovereignty laws).

Sharding Strategy	Data Distribution	Cross-Shard Queries	Resharding Difficulty	Best For
Tenant-Based	Can be skewed	Rare (per-tenant)	Moderate	Multi-tenant SaaS
Hash-Based	Uniform	Common for ranges	Hard (data movement)	High-volume uniform access
Time-Based	Temporal skew (hot recent)	Rare within time range	Easy (add new shards)	Event logs, time-series
Geographic	Region-dependent	Rare within region	Moderate	Global apps, data residency

Federation: Polyglot Persistence

Not all context data has the same access patterns. Rather than forcing all data into a single store, federation splits different context types across specialized databases:

User profiles and configuration: Document stores (MongoDB, DynamoDB) for flexible schemas and key-value access patterns.
Temporal event context: Time-series databases (TimescaleDB, InfluxDB) optimized for time-range queries and downsampling.
Relationship context: Graph databases (Neo4j, Amazon Neptune) for traversal-heavy queries on entity relationships.
Embedding vectors: Vector databases (Pinecone, Weaviate, Milvus) or pgvector for similarity search.
Full-text context: Search engines (Elasticsearch, OpenSearch) for complex text queries and aggregations.

The cost of federation is operational complexity — you are running multiple database systems, each with its own scaling, backup, and monitoring requirements. A unified context access layer (API gateway or context service) abstracts this complexity from consuming applications. For foundational patterns, see our guide on building a scalable context store.

Write Optimization at Scale

Write Buffering and Batching

At billions of records, individual writes become a bottleneck. Buffering writes through a message queue (Kafka, Amazon SQS, or RabbitMQ) decouples write throughput from database capacity. Batch insert operations — writing 1,000 records in a single transaction instead of 1,000 individual inserts — reduces per-record overhead by 10-50x. For a deep dive into building these pipelines, see our article on context pipelines with Kafka.

Write-Ahead Logging and Async Persistence

For context that does not require immediate consistency, accept the write into a local write-ahead log (WAL) and acknowledge the client immediately. A background process flushes the WAL to the primary store. This pattern delivers sub-millisecond write acknowledgment while maintaining durability. The trade-off is a window during which recently written context is only available on the local instance.

Conflict-Free Writes

When multiple services write context concurrently, conflicts are inevitable at scale. Conflict-free replicated data types (CRDTs) and append-only designs eliminate write conflicts entirely. Instead of updating a context record in place, append a new version and resolve the current state at read time. This approach pairs well with context versioning strategies.

Read Optimization at Billion-Record Scale

Materialized Views and Precomputed Aggregates

At billion-record scale, even indexed queries on the primary store can be slow for complex joins or aggregations. Materialized views precompute common query results and store them as denormalized tables that can be queried in constant time. Update materialized views asynchronously via change data capture (CDC) to keep them current without impacting write performance.

Caching Layers

Aggressive caching is non-negotiable at this scale. A multi-tier caching architecture — in-process caches for per-instance hot data, Redis for shared hot data, and database read replicas for warm data — ensures that the vast majority of reads never touch the primary store. Target a combined cache hit rate above 95%. See our deep dive on sub-millisecond context retrieval for implementation details.

Search Engine Offloading

For complex query patterns (full-text search, faceted filtering, fuzzy matching), offload to Elasticsearch or OpenSearch. These systems are purpose-built for read-heavy workloads and can handle billions of documents with sub-second query times. Use CDC or event streams to keep the search index synchronized with the primary store.

Data Lifecycle and Tiered Storage

Not all context records are equally valuable. A 3-year-old interaction log is rarely accessed but may need to be retained for compliance. Implement tiered storage to manage costs:

Hot Tier (0-30 days): Primary database with full indexing and caching. Fast SSD storage. This is where active context lives.
Warm Tier (30-365 days): Read replicas or cheaper storage (e.g., Amazon S3 with Athena for queries). Reduced indexing. Acceptable latency for occasional access.
Cold Tier (1+ years): Archive storage (S3 Glacier, Azure Blob Archive). Accessed only for compliance or forensic purposes. Retrieval latency measured in hours, cost measured in pennies per GB per month.

Automated lifecycle policies move data between tiers based on age, access frequency, or business rules. Most cloud databases (DynamoDB, Cosmos DB) support automatic tiering. For self-managed systems, implement lifecycle workers that run on a schedule.

Cost Engineering at Scale

At billions of records, infrastructure costs become a primary engineering concern. Key cost levers include:

Compression: Compressing context data at rest reduces storage costs by 60-80%. LZ4 provides fast compression/decompression with moderate ratios; Zstandard provides higher ratios with slightly more CPU. See our article on context compression and tokenization for detailed techniques.
Right-Sizing Instances: Over-provisioned database instances waste money. Use performance monitoring to identify actual resource needs and right-size accordingly. Reserved instances or committed use discounts reduce costs by 30-60% for stable workloads.
Query Optimization: A single unoptimized query running millions of times per day can cost more in compute than the entire rest of the workload. Profile and optimize your top 10 queries by frequency and resource consumption.

Monitoring and Operational Excellence

At billion-record scale, operational visibility is critical. Monitor shard balance (no shard should hold more than 2x the average), replication lag (stale reads indicate falling behind), query latency by percentile (p99 matters more than average), and storage growth rate (to forecast when you need additional capacity). Automate alerting on all of these metrics, and conduct regular capacity planning reviews.

What is the practical limit for a single PostgreSQL instance?

A well-tuned PostgreSQL instance can handle hundreds of millions to low billions of rows in a single table, depending on row size, index complexity, and query patterns. Beyond approximately 1-2 billion rows, query planning overhead and vacuum operations become problematic. At that point, partitioning (native range or hash partitioning) or external sharding (via Citus) becomes necessary.

How do I migrate from a monolithic store to a sharded architecture?

The safest approach is dual-write migration: continue writing to the monolith while simultaneously writing to the new sharded store. Backfill historical data in the background. Validate consistency by comparing reads from both stores. Once the sharded store is verified, cut over reads, then stop writes to the monolith. This process typically takes weeks to months for billion-record stores.

Should I use a managed database service or self-manage at this scale?

For most organizations, managed services (Amazon RDS, Cloud SQL, Cosmos DB, DynamoDB) are the right choice at billion-record scale. The operational overhead of managing sharding, replication, backups, patching, and failover for a self-managed cluster is substantial. Self-management makes sense only when you have specific performance requirements that managed services cannot meet or when cost optimization at extreme scale justifies the engineering investment.

How do I handle schema migrations at billion-record scale?

Traditional ALTER TABLE operations lock the table and are impractical at billion-record scale. Use online schema migration tools like gh-ost (GitHub) or pt-online-schema-change (Percona) that create a shadow table, copy data incrementally, and swap tables atomically. For NoSQL stores, schema changes are typically applied lazily at read time using versioned schemas.

MCP Tutorials

RAG Cookbook

Library Integrations

Context Window Engineering

Embeddings & Retrieval

Tool Use & Function Calling