Vector Search for Context Retrieval: Implementation Guide

Why Vector Search Matters for Context Retrieval

Traditional keyword search relies on exact or fuzzy string matching, which fails when users phrase queries differently than the stored context. A user asking about "cancellation policy" will not match context stored under "refund procedures" even though the intent is nearly identical. Vector search solves this by converting text into high-dimensional numerical representations (embeddings) that capture semantic meaning. Similar concepts end up close together in vector space, enabling retrieval based on meaning rather than exact wording.

Switching from keyword-based to vector-based context retrieval typically improves retrieval relevance, by an estimated 25-45% depending on the corpus and on how relevance is measured. That gain translates directly into higher-quality AI responses grounded in the most pertinent context.

Vector search is the core retrieval mechanism in Retrieval-Augmented Generation (RAG) architectures, which have become the standard approach for grounding LLM responses in organizational knowledge. If you are building any system that serves context to AI models, vector search should be on your roadmap.

Core Components

A vector search system for context retrieval consists of four interconnected components:

Embedding Model: Converts text into fixed-dimension numerical vectors that capture semantic meaning
Vector Store: Stores vectors and performs efficient similarity search across potentially millions of entries
Indexing Pipeline: Processes new and updated context into embeddings and writes them to the vector store
Query Pipeline: Converts user queries into embeddings, searches for similar vectors, and returns ranked results

Step 1: Choose Your Embedding Model

Your embedding model determines the quality of your search results. The choice involves tradeoffs between accuracy, cost, latency, and operational complexity.

Model	Dimensions	Strengths	Latency	Cost	Self-Hosted
OpenAI text-embedding-3-small	1536	Strong general purpose, good cost-performance ratio	~50ms	$0.02/1M tokens	No
OpenAI text-embedding-3-large	3072	Highest quality from OpenAI, dimension reduction support	~80ms	$0.13/1M tokens	No
Cohere embed-v3	1024	Multilingual, search-optimized variants	~60ms	$0.10/1M tokens	No
sentence-transformers (all-MiniLM-L6-v2)	384	Free, fast, good for English	~10ms (local)	Free (compute cost)	Yes
BGE-large-en-v1.5	1024	Top open-source quality, MTEB benchmark leader	~30ms (local)	Free (compute cost)	Yes

Selection Criteria

Domain relevance: Test embedding models on your actual data. A model trained on scientific papers may underperform on conversational context.
Dimension count: Higher dimensions capture more nuance but increase storage and search costs. For most context management use cases, 768-1536 dimensions are sufficient.
Latency requirements: Self-hosted models eliminate network round-trips. If your context retrieval budget is under 20ms, self-hosting is likely necessary (see our sub-millisecond retrieval guide).
Data privacy: If your context contains sensitive information, self-hosted models keep data within your infrastructure.
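
To make the dimension-count tradeoff concrete, here is a rough storage estimate. It relies on pgvector's documented size of 4 * dimensions + 8 bytes per `vector` value; the helper name and the one-million-row figure are illustrative assumptions, and index overhead is not included.

```python
def vector_storage_bytes(dimensions: int, rows: int) -> int:
    """Approximate raw storage for a pgvector `vector` column (index excluded)."""
    bytes_per_vector = 4 * dimensions + 8  # 4-byte floats plus 8 bytes of overhead
    return bytes_per_vector * rows


# Compare the models from the table above at an assumed 1M rows.
for dims in (384, 1024, 1536, 3072):
    gib = vector_storage_bytes(dims, 1_000_000) / 1024**3
    print(f"{dims:>5} dims x 1M rows = {gib:.2f} GiB")
```

Doubling the dimension count roughly doubles both storage and the work done per distance computation, which is why testing a smaller model on your own data first is usually worthwhile.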

Step 2: Set Up Vector Storage with pgvector

If you followed our getting started guide, you already have PostgreSQL running. Adding pgvector lets you store and search vectors without introducing a new database into your stack.

Install and Configure pgvector

-- Install the extension (requires pgvector to be installed on the server)
CREATE EXTENSION IF NOT EXISTS vector;

-- Add an embedding column to your existing contexts table
ALTER TABLE contexts
ADD COLUMN embedding vector(1536);

-- Create an HNSW index for fast approximate nearest neighbor search
CREATE INDEX idx_contexts_embedding
ON contexts
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

Understanding Index Types

pgvector supports two index types, each suited to different workloads:

IVFFlat: Partitions vectors into clusters. Faster to build, uses less memory, but requires periodic reindexing as data changes. Best for datasets under 1 million vectors where you can tolerate occasional reindex operations.
HNSW (Hierarchical Navigable Small World): Builds a graph-based index. Slower to build and uses more memory, but provides better recall and does not need reindexing. Best for datasets that change frequently or where recall quality is critical.

For most context management systems, start with HNSW. The slightly higher memory cost is worth the better recall and zero-maintenance indexing. Switch to IVFFlat only if memory becomes a constraint.
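
If you do move to IVFFlat, the index needs a `lists` value at build time and a `probes` value at query time. The sketch below encodes the starting points suggested in the pgvector README (rows / 1000 up to 1M rows, sqrt(rows) beyond that, and probes of about sqrt(lists)). The function name is hypothetical and the values are starting points to tune against your own recall measurements, not guarantees.

```python
import math


def ivfflat_starting_params(row_count: int) -> dict:
    """Suggest initial IVFFlat `lists` and `probes` values for a table size."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = int(math.sqrt(row_count))
    probes = max(1, int(math.sqrt(lists)))
    return {"lists": lists, "probes": probes}


params = ivfflat_starting_params(500_000)
print(f"CREATE INDEX idx_contexts_embedding_ivf ON contexts "
      f"USING ivfflat (embedding vector_cosine_ops) WITH (lists = {params['lists']});")
print(f"SET ivfflat.probes = {params['probes']};")
```

Higher `probes` values improve recall at the cost of speed, so measure both before settling on a number.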

Step 3: Build the Embedding Pipeline

Every time context is created or updated, you need to generate and store its embedding. Implement this as an asynchronous pipeline to avoid blocking your API responses.

import openai
import asyncpg
from typing import List, Dict

class EmbeddingPipeline:
    def __init__(self, pool: asyncpg.Pool, model: str = "text-embedding-3-small"):
        self.pool = pool
        self.model = model
        self.client = openai.AsyncOpenAI()

    async def embed_text(self, text: str) -> List[float]:
        """Generate embedding for a single text."""
        response = await self.client.embeddings.create(
            input=text,
            model=self.model
        )
        return response.data[0].embedding

    async def embed_batch(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for a batch of texts (up to 2048)."""
        response = await self.client.embeddings.create(
            input=texts,
            model=self.model
        )
        return [item.embedding for item in response.data]

    async def index_context(self, context_id: str, content: Dict):
        """Generate and store embedding for a context entry."""
        # Prepare text for embedding: combine key fields
        text = self._prepare_text(content)
        embedding = await self.embed_text(text)

        await self.pool.execute(
            "UPDATE contexts SET embedding = $1 WHERE id = $2",
            str(embedding), context_id
        )

    def _prepare_text(self, content: Dict) -> str:
        """Convert structured content to text for embedding."""
        parts = []
        if "title" in content:
            parts.append(content["title"])
        if "body" in content:
            parts.append(content["body"])
        if "tags" in content:
            parts.append(" ".join(content["tags"]))
        return " ".join(parts)
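
One way to keep embedding work off the request path is a queue drained by a background worker. The following is a minimal sketch using an in-process `asyncio.Queue`. The `fake_index` coroutine is a hypothetical stand-in for `EmbeddingPipeline.index_context`, and the in-process queue is an assumption: a durable queue would be needed if jobs must survive a restart.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

IndexFn = Callable[[str, Dict], Awaitable[None]]


async def embedding_worker(queue: asyncio.Queue, index_fn: IndexFn) -> None:
    """Drain (context_id, content) jobs until a None sentinel arrives."""
    while True:
        job = await queue.get()
        try:
            if job is None:
                return
            context_id, content = job
            await index_fn(context_id, content)
        except Exception as exc:
            # Keep the worker alive when a single job fails.
            print(f"indexing failed for job {job!r}: {exc}")
        finally:
            queue.task_done()


async def demo() -> List[str]:
    indexed: List[str] = []

    async def fake_index(context_id: str, content: Dict) -> None:
        # Hypothetical stand-in for pipeline.index_context(context_id, content).
        await asyncio.sleep(0)
        indexed.append(context_id)

    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(embedding_worker(queue, fake_index))

    # An API handler would enqueue the job and return immediately.
    queue.put_nowait(("ctx-1", {"title": "Refund procedures"}))
    queue.put_nowait(("ctx-2", {"title": "Cancellation policy"}))
    queue.put_nowait(None)  # sentinel: stop the worker

    await queue.join()
    await worker
    return indexed


print(asyncio.run(demo()))
```

The handler only pays the cost of `put_nowait`; the embedding call and the database write happen later in the worker.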

Batch Processing for Existing Data

When you first add vector search to an existing system, you need to backfill embeddings for all existing context. Process in batches to manage API rate limits and memory usage:

async def backfill_embeddings(pipeline: EmbeddingPipeline,
                              batch_size: int = 100):
    """Backfill embeddings for all contexts missing them."""
    processed = 0
    while True:
        # No OFFSET here: rows updated below stop matching
        # "embedding IS NULL", so each pass picks up the next
        # unprocessed batch. Paging with OFFSET would skip rows.
        rows = await pipeline.pool.fetch(
            """SELECT id, content FROM contexts
               WHERE embedding IS NULL AND is_active = true
               ORDER BY id LIMIT $1""",
            batch_size
        )
        if not rows:
            break

        texts = [pipeline._prepare_text(dict(r["content"]))
                 for r in rows]
        embeddings = await pipeline.embed_batch(texts)

        for row, emb in zip(rows, embeddings):
            await pipeline.pool.execute(
                "UPDATE contexts SET embedding = $1 WHERE id = $2",
                str(emb), row["id"]
            )
        processed += len(rows)
        print(f"Processed {processed} contexts")
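
A backfill against a rate-limited API usually needs retries as well as batching. Here is a minimal sketch of exponential backoff with jitter. The `with_backoff` helper and the `flaky` coroutine are hypothetical; in a real pipeline you would wrap the `embed_batch` call and catch only the client's rate-limit error rather than every exception.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_backoff(call: Callable[[], Awaitable[T]],
                       max_attempts: int = 5,
                       base_delay: float = 0.5) -> T:
    """Retry an async call, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from concurrent workers.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")


async def demo() -> str:
    calls = {"n": 0}

    async def flaky() -> str:
        # Simulates two rate-limit failures before succeeding.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated rate limit")
        return f"succeeded on attempt {calls['n']}"

    return await with_backoff(flaky, base_delay=0.01)


print(asyncio.run(demo()))
```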

Step 4: Implement Semantic Search

With embeddings stored and indexed, you can now query by semantic similarity. The search process involves embedding the query, performing a vector similarity search, and optionally combining with metadata filters.

class ContextSearch:
    def __init__(self, pool: asyncpg.Pool, pipeline: EmbeddingPipeline):
        self.pool = pool
        self.pipeline = pipeline

    async def search(self, query: str, user_id: str = None,
                     context_type: str = None,
                     limit: int = 10,
                     similarity_threshold: float = 0.7) -> List[Dict]:
        """Search contexts by semantic similarity."""
        query_embedding = await self.pipeline.embed_text(query)

        # Build dynamic query with optional filters
        conditions = ["is_active = true",
                      "embedding IS NOT NULL"]
        params = [str(query_embedding), limit]
        param_idx = 3

        if user_id:
            conditions.append(f"user_id = ${param_idx}")
            params.append(user_id)
            param_idx += 1

        if context_type:
            conditions.append(f"context_type = ${param_idx}")
            params.append(context_type)
            param_idx += 1

        where_clause = " AND ".join(conditions)

        query_sql = f"""
            SELECT id, user_id, context_type, content, metadata,
                   1 - (embedding <=> $1) AS similarity
            FROM contexts
            WHERE {where_clause}
            ORDER BY embedding <=> $1
            LIMIT $2
        """

        rows = await self.pool.fetch(query_sql, *params)
        results = [dict(r) for r in rows
                   if r["similarity"] >= similarity_threshold]
        return results

Hybrid Search: Combining Vector and Keyword

Pure vector search sometimes misses results that contain exact important terms (like product codes or technical identifiers). Hybrid search combines vector similarity with keyword matching for the best of both approaches:

Run both searches in parallel against the same dataset
Normalize scores from each approach to a 0-1 range
Combine with weighted scoring: typically 0.7 * vector_score + 0.3 * keyword_score works well as a starting point
Re-rank the merged results and return the top N
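
The scoring steps above can be sketched as follows. This assumes each search has already returned a mapping of document ID to raw score; the function names and sample scores are illustrative, and min-max scaling is one of several reasonable normalization choices.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale raw scores into the 0-1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(vector_scores: Dict[str, float],
                keyword_scores: Dict[str, float],
                vector_weight: float = 0.7,
                top_n: int = 10) -> List[Tuple[str, float]]:
    """Merge two result sets with weighted scoring and return the top N."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    keyword_weight = 1.0 - vector_weight
    combined = {
        # A document missing from one result set contributes 0 for that side.
        doc_id: vector_weight * v.get(doc_id, 0.0)
                + keyword_weight * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:top_n]


vector_scores = {"a": 0.92, "b": 0.80, "c": 0.75}   # cosine similarities
keyword_scores = {"c": 12.0, "d": 4.0}              # e.g. full-text rank scores
for doc_id, score in hybrid_rank(vector_scores, keyword_scores):
    print(doc_id, round(score, 3))
```

Here document "c" ranks last on vector similarity alone but moves ahead of "b" once its strong keyword match is counted.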

This approach is especially valuable for context systems that store both natural language content and structured technical data. For more on data integration approaches, see our guide on unifying disparate data sources.

Performance Optimization

Vector search performance degrades predictably with dataset size and query volume. Here are the most impactful optimizations:

Pre-filtering

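Apply metadata filters (user_id, context_type, date ranges) to reduce the candidate set. Be aware that with an approximate index such as HNSW, pgvector applies WHERE conditions after the index scan, so a highly selective filter can return fewer rows than the LIMIT requests. Index the filter columns (B-tree for scalar columns, GIN for JSONB metadata) so the planner can choose an exact filtered scan when the filter is selective, and consider partial indexes or partitioning for values you filter on constantly. pgvector 0.8.0 and later also offer iterative index scans to address this case.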

Dimension Reduction

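If using a high-dimension model (3072 dimensions), consider reducing to 1024 or 768 using PCA or the model's built-in dimension reduction. This cuts storage and search costs significantly with minimal quality impact. Reduction is also a practical requirement in pgvector: its HNSW and IVFFlat indexes support at most 2,000 dimensions for the vector type, so 3072-dimension embeddings must be shortened (or stored as halfvec) before they can be indexed.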
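
For models trained to support shortening (OpenAI documents this for the text-embedding-3 family, where the API's `dimensions` parameter does it server-side), the same effect can be had client-side by truncating and re-normalizing. A minimal sketch, with a hypothetical helper name; applying it to a model that was not trained for truncation is an assumption that may hurt quality.

```python
import math
from typing import List


def truncate_and_normalize(embedding: List[float], target_dims: int) -> List[float]:
    """Keep the first target_dims values and rescale to unit length."""
    if not 0 < target_dims <= len(embedding):
        raise ValueError("target_dims must be between 1 and len(embedding)")
    truncated = embedding[:target_dims]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return truncated  # zero vector: nothing to rescale
    return [x / norm for x in truncated]


short = truncate_and_normalize([3.0, 4.0, 12.0], 2)
print(short)
```

Re-normalizing matters because cosine and inner-product comparisons assume comparable vector lengths, and truncation alone leaves vectors shorter than unit length.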

Query Caching

Cache embeddings for repeated queries and cache search results for identical queries within a short TTL. See our Redis caching guide for implementation details.
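
As an illustration of the idea, here is a small in-process TTL cache for query embeddings. The class name, the key normalization, and the use of a plain dictionary are assumptions made for the sketch; a shared cache such as Redis would replace the dictionary in a multi-process deployment.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLEmbeddingCache:
    """Cache query embeddings for a fixed time-to-live."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, List[float]]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Treat queries that differ only in case or spacing as identical.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[List[float]]:
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, embedding = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return embedding

    def put(self, query: str, embedding: List[float]) -> None:
        self._entries[self._key(query)] = (self.clock() + self.ttl, embedding)


cache = TTLEmbeddingCache(ttl_seconds=60)
cache.put("Cancellation  Policy", [0.1, 0.2])
print(cache.get("cancellation policy"))
```

On a cache hit the search path skips the embedding call entirely, which is usually the slowest step when using a hosted model.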

Connection Pooling

Vector search queries are more resource-intensive than standard queries. Ensure your connection pool is sized appropriately and consider dedicating a read replica for search workloads.

Monitoring Vector Search Quality

Deploy monitoring to track search quality over time:

Recall@K: What percentage of relevant context appears in the top K results
Mean Reciprocal Rank: How high the first relevant result ranks on average
Latency distribution: P50, P95, and P99 search latencies
Embedding drift: Monitor whether your embedding model's quality degrades as your data distribution shifts
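
The first two metrics are straightforward to compute once you have a labeled set of queries with known relevant context IDs. A minimal sketch, assuming results are ordered lists of IDs and relevance labels are sets; the function names and sample data are illustrative.

```python
from typing import List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(results: List[Sequence[str]],
                         relevant_sets: List[Set[str]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    if not results:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(results, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)


print(recall_at_k(["a", "b", "c", "d"], {"b", "d", "x"}, k=3))
print(mean_reciprocal_rank([["a", "b"], ["c", "d"]], [{"b"}, {"c"}]))
```

Tracking these on a fixed evaluation set over time also gives you a practical signal for embedding drift: a falling score on unchanged queries suggests the data has shifted away from what the model handles well.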

Frequently Asked Questions

How many vectors can pgvector handle before I need a dedicated vector database?

pgvector handles up to 5-10 million vectors effectively on properly provisioned hardware with HNSW indexing. Beyond that, or if you need sub-10ms P99 latency at scale, consider dedicated solutions like Pinecone, Weaviate, or Qdrant. Our guide on scaling context stores to billions covers this transition in detail.

Should I embed the entire context document or just key fields?

Embed a curated combination of the most semantically meaningful fields. For conversational context, embed the user message and assistant response. For knowledge base articles, embed the title, summary, and key paragraphs. Embedding excessively long documents dilutes the semantic signal.

How do I handle context updates when embeddings change?

Re-embed context whenever the content fields that feed into the embedding change. Use a background job queue (Celery, RQ, or a message broker) to process embedding updates asynchronously. Store the embedding model version alongside each embedding so you can identify stale embeddings when you upgrade models.
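
One way to detect both kinds of staleness (changed content and changed model) is to store a fingerprint next to each embedding. The sketch below assumes a hypothetical `embedding_fingerprint` column and hashes the prepared text together with the model version; the function names are illustrative.

```python
import hashlib
from typing import Optional


def embedding_fingerprint(text: str, model_version: str) -> str:
    """Hash the embedded text together with the model that embedded it."""
    payload = f"{model_version}\n{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_reembedding(stored_fingerprint: Optional[str],
                      text: str, model_version: str) -> bool:
    """True when the content or the model differs from what was stored."""
    return stored_fingerprint != embedding_fingerprint(text, model_version)


stored = embedding_fingerprint("Refund procedures", "text-embedding-3-small")
print(needs_reembedding(stored, "Refund procedures", "text-embedding-3-small"))
print(needs_reembedding(stored, "Refund procedures (updated)", "text-embedding-3-small"))
print(needs_reembedding(stored, "Refund procedures", "text-embedding-3-large"))
```

A background job can then compare fingerprints and enqueue only the rows that actually changed, instead of re-embedding on every write.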

What distance metric should I use: cosine, L2, or inner product?

Use cosine distance (the <=> operator in pgvector) for most text embedding use cases; cosine similarity is 1 minus that distance. Cosine normalizes for vector magnitude, making it robust across different text lengths. L2 distance (the <-> operator) and inner product (the <#> operator, which returns the negative inner product) are better for specialized cases where magnitude carries meaning, such as recommendation scores.
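
A small worked example shows the difference in practice. The two-dimensional vectors are made up for illustration: `long_doc` points in the same direction as `short_doc` but has three times the magnitude.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [2.0, 1.0]
short_doc = [1.0, 2.0]
long_doc = [3.0, 6.0]  # same direction as short_doc, 3x the magnitude

for name, doc in (("short_doc", short_doc), ("long_doc", long_doc)):
    print(name,
          "cosine:", round(cosine_distance(query, doc), 4),
          "l2:", round(l2_distance(query, doc), 4),
          "inner product:", dot(query, doc))
```

Cosine distance treats the two documents as equally close to the query, while L2 distance favors the shorter vector and inner product favors the longer one. When embeddings are already normalized to unit length, all three metrics produce the same ranking.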
