12 min read · January 2025 · Aleksandr, NLP Lead

A guide to deploying semantic search

From embeddings to production: a practical guide to building a semantic search system for your business.

Introduction

Semantic search changes the game when working with enterprise data. Unlike traditional keyword search, it understands the meaning of a query and surfaces relevant documents even without exact word matches.

This guide walks through the full deployment cycle: from choosing an embedding model to scaling to millions of documents.

System architecture

A typical semantic search system has three components:

1. Embedding model

Turns text into a fixed-size vector (typically 384–1536 dimensions). Model quality drives search quality.

2. Vector database

Stores the vectors and serves fast similarity search using approximate nearest neighbor (ANN) indexes, which trade a small amount of recall for much lower latency.

3. Reranker (optional)

Refines results by running a more accurate model over the top-N candidates.
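
As a rough illustration of how the three components fit together, the sketch below uses toy three-dimensional vectors in place of a real embedding model, an exhaustive cosine-similarity scan in place of a vector database (an ANN index approximates this scan), and a caller-supplied scoring function in place of a reranker. The names `cosine`, `retrieve`, and `rerank` are illustrative and do not come from any particular library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_n=10):
    """Exact nearest-neighbor scan; index maps doc_id -> vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def rerank(candidates, score_fn, top_k=3):
    """Re-score the candidates with a slower, more accurate function."""
    rescored = [(doc_id, score_fn(doc_id)) for doc_id, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]

# Toy vectors standing in for embedding-model output
index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
candidates = retrieve([1.0, 0.1, 0.0], index, top_n=2)
```

In a real system `score_fn` would typically wrap a cross-encoder that reads the query and the document together, which is why it is only run over the short candidate list.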

Choosing an embedding model

Models worth considering (as of 2025):

Model	Dimensions	Max context (tokens)	Best for
intfloat/multilingual-e5-large	1024	512	General-purpose
sentence-transformers/paraphrase-multilingual-mpnet-base-v2	768	128	Balance of speed and quality
cointegrated/rubert-tiny2	312	2048	Resource-constrained environments

Important: Don't use raw LLMs (GPT, LLaMA) as embedders — they're not optimized for that. Use purpose-built embedding models.
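
One practical detail when using the E5 family: the model card asks for every input to be prefixed with "query: " or "passage: ". The sketch below assumes the sentence-transformers package; `add_e5_prefix`, `embed`, and `load_model` are hypothetical helper names introduced here for illustration.

```python
def add_e5_prefix(texts, kind):
    """Prefix texts the way the E5 model card asks: 'query: ' or 'passage: '."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]

def embed(texts, kind, model):
    """Encode texts with any object exposing a sentence-transformers style encode()."""
    return model.encode(add_e5_prefix(texts, kind), normalize_embeddings=True)

def load_model(name="intfloat/multilingual-e5-large"):
    """Load the model lazily; requires the sentence-transformers package."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)
```

Typical usage would be `model = load_model()` once at startup, then `embed(chunks, "passage", model)` at indexing time and `embed([user_query], "query", model)` at search time. Other models in the table do not use these prefixes, so check each model card.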

Chunking: splitting documents

Long documents need to be split into chunks that fit the model's context window. Chunk size is a key parameter:

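Small (128–256 tokens): high precision, but surrounding context is lost
Medium (512–1024 tokens): a reasonable balance for most workloads
Large (2048+ tokens): preserves context, but reduces granularity

Whatever size you pick must not exceed the embedding model's maximum context. Text beyond that limit is typically truncated at encoding time, so the tail of an oversized chunk never reaches the vector.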

Recommended approach: overlapping chunks with 10–20% overlap.

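# Example: chunking with overlap
from transformers import AutoTokenizer

# Assumption: use the tokenizer that matches your embedding model
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")

def create_chunks(text, chunk_size=512, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    # Skip special tokens so they don't leak into the decoded chunks
    tokens = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    # Step by chunk_size - overlap so neighboring chunks share `overlap` tokens
    for i in range(0, len(tokens), chunk_size - overlap):
        chunks.append(tokenizer.decode(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks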
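
To see what a 10–20% overlap means in token offsets, the sketch below computes chunk boundaries from a ratio instead of an absolute token count. It works on a token count only, so no tokenizer is needed; `chunk_spans` and `overlap_ratio` are names made up for this example.

```python
def chunk_spans(n_tokens, chunk_size=512, overlap_ratio=0.1):
    """Return (start, end) token offsets for overlapping chunks."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # Distance between chunk starts; at least 1 so the loop always advances
    step = max(1, chunk_size - int(chunk_size * overlap_ratio))
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last chunk already reaches the end of the document
        start += step
    return spans

spans = chunk_spans(1200, chunk_size=512, overlap_ratio=0.1)
# spans == [(0, 512), (461, 973), (922, 1200)]
```

With a 512-token chunk and a 10% ratio, neighboring chunks share 51 tokens, which is close to the `overlap=50` default used above.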

Choosing a vector database

For production systems, consider:

Database	Max vectors	Notes
pgvector (PostgreSQL)	10M+	If you already run PostgreSQL
Pinecone	Unlimited	Managed, easy to start
Weaviate	Unlimited	Hybrid search out of the box
Milvus / Zilliz	Unlimited	High performance
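
For the pgvector option, a minimal setup might look like the sketch below. The SQL is held in Python strings so it can be passed to a PostgreSQL driver; the table name `chunks`, its columns, and the index name are placeholders, the dimension 1024 assumes multilingual-e5-large, and the HNSW index assumes pgvector 0.5.0 or later. The `%(name)s` placeholders follow psycopg's named-parameter style.

```python
# Hypothetical schema: table, column, and index names are placeholders
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(1024)  -- must match the embedding model's dimension
);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator (smaller means more similar)
SEARCH_SQL = """
SELECT id, content, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s;
"""

def to_pgvector_literal(vector):
    """Format a list of numbers as pgvector's text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"

params = {"query": to_pgvector_literal([0.25, 0.5, 1]), "limit": 5}
```

With a driver such as psycopg, the search would then run as `cursor.execute(SEARCH_SQL, params)`.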

Quality evaluation

Before going to production, measure quality on your own data:

Build a test dataset: 50–100 (query, relevant document) pairs
Metrics:
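- Recall@K: the share of a query's relevant documents that appear in the top K results
- MRR (Mean Reciprocal Rank): the mean of 1/rank of the first relevant result across queries (1.0 means it is always ranked first)
- NDCG: rewards rankings that place relevant documents closer to the top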
Targets: Recall@5 ≥ 0.8, MRR ≥ 0.7
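
The three metrics can be computed in a few lines. The sketch below assumes binary relevance (a document is either relevant or not) and takes, for each test query, the ranked list of returned document IDs plus the set of relevant IDs. The function names and the toy data are illustrative.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Share of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

def evaluate(results, k=5):
    """results: list of (ranked_doc_ids, set_of_relevant_ids), one per query."""
    n = len(results)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in results) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in results) / n,
        "ndcg@k": sum(ndcg_at_k(r, rel, k) for r, rel in results) / n,
    }

results = [
    (["d1", "d2", "d3"], {"d1"}),  # relevant document ranked first
    (["d4", "d5", "d6"], {"d5"}),  # relevant document ranked second
]
metrics = evaluate(results, k=5)
# metrics["recall@k"] == 1.0 and metrics["mrr"] == 0.75
```

On this toy set the MRR of 0.75 is the mean of 1/1 and 1/2, which shows why MRR is not an average position.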

Production optimizations

Caching

Queries repeat often. Cache results at the application layer (Redis or Memcached) with a TTL of 1–24 hours.
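
The caching pattern can be sketched with an in-memory dictionary standing in for Redis or Memcached. The `TTLCache` class and `cached_search` function are hypothetical, and the key normalization (lowercasing and collapsing whitespace) is one possible choice, not a requirement.

```python
import hashlib
import time

class TTLCache:
    """In-memory stand-in for Redis/Memcached with per-entry expiry."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    @staticmethod
    def key(query):
        # Normalize case and whitespace so trivially different queries share an entry
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        cache_key = self.key(query)
        entry = self._store.get(cache_key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[cache_key]  # entry expired
            return None
        return value

    def set(self, query, value):
        self._store[self.key(query)] = (self.clock() + self.ttl, value)

def cached_search(query, cache, search_fn):
    """Return cached results when present, otherwise search and store them."""
    results = cache.get(query)
    if results is None:
        results = search_fn(query)
        cache.set(query, results)
    return results
```

With a real Redis client the same flow maps onto a `GET` followed by a `SET` with an expiry, with the results serialized to JSON or similar.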

Hybrid search

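Combine semantic and keyword search for better results: blend scores with weights (e.g., 0.7 × semantic + 0.3 × BM25). BM25 scores are unbounded while cosine similarities fall in [-1, 1], so normalize both to a common range before blending.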
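
A minimal sketch of weighted score blending, assuming min-max normalization within each result list and a score of 0 for documents missing from one of the lists. The helper names and the sample scores are made up for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_scores(semantic, bm25, w_semantic=0.7, w_bm25=0.3):
    """Blend normalized semantic and BM25 scores; returns (doc_id, score) pairs, best first."""
    sem = min_max(semantic)
    kw = min_max(bm25)
    docs = set(sem) | set(kw)
    blended = {
        d: w_semantic * sem.get(d, 0.0) + w_bm25 * kw.get(d, 0.0)
        for d in docs
    }
    return sorted(blended.items(), key=lambda pair: pair[1], reverse=True)

# Made-up scores: cosine similarities on one side, raw BM25 on the other
semantic = {"d1": 0.82, "d2": 0.75, "d3": 0.40}
bm25 = {"d2": 12.4, "d3": 3.1, "d4": 7.9}
ranking = hybrid_scores(semantic, bm25)
# Order of doc IDs: d2, d1, d4, d3
```

Here d2 overtakes d1 because it scores well in both lists. Rank-based fusion is a common alternative when the score distributions are hard to normalize.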

Metadata filtering

Pre-filter by metadata (date, category, author) before vector search — it speeds up queries and improves relevance.
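
The idea can be shown in memory: drop documents that fail the metadata conditions first, then rank only what remains. In a vector database the same conditions would usually be expressed as a filter argument or a SQL WHERE clause. The field names (`category`, `published`) and the `filtered_search` function are assumptions for this example.

```python
from datetime import date

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vec, docs, category=None, published_after=None, top_k=5):
    """Filter on metadata first, then rank only the surviving documents."""
    candidates = [
        d for d in docs
        if (category is None or d["category"] == category)
        and (published_after is None or d["published"] > published_after)
    ]
    # Vectors are assumed unit-normalized, so the dot product equals cosine similarity
    candidates.sort(key=lambda d: dot(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in candidates[:top_k]]

docs = [
    {"id": "a", "category": "hr", "published": date(2024, 5, 1), "vector": [1.0, 0.0]},
    {"id": "b", "category": "legal", "published": date(2024, 6, 1), "vector": [1.0, 0.0]},
    {"id": "c", "category": "hr", "published": date(2023, 1, 1), "vector": [0.0, 1.0]},
    {"id": "d", "category": "hr", "published": date(2024, 9, 1), "vector": [0.6, 0.8]},
]
hits = filtered_search([1.0, 0.0], docs, category="hr", published_after=date(2024, 1, 1))
# hits == ["a", "d"]
```

Document b matches the query vector exactly but is excluded by the category filter, which is the relevance gain the filter provides.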

Deployment checklist

- Test dataset with relevant pairs collected
- Embedding model chosen and benchmarked
- Optimal chunk size determined
- Vector database deployed and tuned
- Target quality metrics reached
- Monitoring and alerting configured
- Caching for frequent queries implemented

Conclusion

Semantic search is a powerful tool, but it rewards attention to detail. Start with a simple architecture, measure quality on your own data, and iterate.

If you need help with deployment — get in touch. We specialize in production-ready NLP solutions.