A guide to deploying semantic search
From embeddings to production: a practical guide to building a semantic search system for your business.
Introduction
Semantic search changes the game when working with enterprise data. Unlike traditional keyword search, it understands the meaning of a query and surfaces relevant documents even without exact word matches.
This guide walks through the full deployment cycle: from choosing an embedding model to scaling to millions of documents.
System architecture
A typical semantic search system has three components:
1. Embedding model
Turns text into a fixed-size vector (typically 384–1536 dimensions). Model quality drives search quality.
2. Vector database
Stores the vectors and serves fast similarity search, usually through an approximate nearest neighbor (ANN) index.
3. Reranker (optional)
Refines results by running a more accurate model over the top-N candidates.
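As a rough illustration of how the three components fit together, the sketch below wires a toy embedding function, an in-memory store with exact cosine search, and an optional reranking step. All names (`embed`, `cosine`, `search`, `rerank`) are hypothetical. The letter-count "embedding" is only a placeholder for a real model, and exact search stands in for the ANN index a vector database would provide.

```python
import math
from collections import Counter

def embed(text):
    """Placeholder embedding: counts of the letters a-z, L2-normalized.
    A real system would call an embedding model here."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    vec = [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Vectors are already normalized, so the dot product equals the cosine
    return sum(x * y for x, y in zip(a, b))

def search(query, index, top_n=3, rerank=None):
    """index: list of (doc_text, vector) pairs. Exact search, not ANN."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    candidates = [doc for doc, _ in scored[:top_n]]
    if rerank is not None:
        # Optional stage: reorder the top-N with a more accurate scorer
        candidates = sorted(candidates, key=lambda d: rerank(query, d), reverse=True)
    return candidates

docs = ["refund policy for orders", "office opening hours", "how to reset a password"]
index = [(d, embed(d)) for d in docs]
results = search("password reset", index, top_n=2)
```

The `rerank` argument accepts any callable that scores a (query, document) pair, so a more accurate model can be added later without touching the retrieval stage.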
Choosing an embedding model
Models worth considering (as of 2025):
| Model | Dimensions | Context (tokens) | Best for |
|---|---|---|---|
| intfloat/multilingual-e5-large | 1024 | 512 | General-purpose |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 768 | 128 | Speed + quality |
| cointegrated/rubert-tiny2 | 312 | 2048 | Resource-constrained environments |
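The dimension column translates directly into storage cost. As back-of-the-envelope arithmetic, assuming vectors are stored as 32-bit floats (4 bytes per dimension) and ignoring any index overhead, the raw size is number of vectors × dimensions × 4 bytes. The function name below is illustrative.

```python
def raw_vector_gib(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors alone, in GiB (no index overhead)."""
    return num_vectors * dim * bytes_per_value / 2**30

# Dimensions taken from the table above
for name, dim in [("multilingual-e5-large", 1024),
                  ("paraphrase-multilingual-mpnet-base-v2", 768),
                  ("rubert-tiny2", 312)]:
    print(f"{name}: {raw_vector_gib(1_000_000, dim):.2f} GiB per 1M vectors")
```

Under these assumptions, one million 1024-dimensional vectors take roughly 3.8 GiB before indexing, about three times the footprint of the 312-dimensional model.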
Chunking: splitting documents
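Long documents need to be split into chunks that fit the model's context window; text beyond that limit is typically truncated and never reaches the embedding, so the ranges below apply only to models whose context is large enough. Chunk size is a key parameter: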
- Small (128–256 tokens): high precision, but context is lost
- Medium (512–1024 tokens): the right balance for most workloads
- Large (2048+ tokens): context is preserved, but granularity drops
Recommended approach: overlapping chunks with 10–20% overlap.
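```python
# Example: chunking with overlap
from transformers import AutoTokenizer

# Assumption: use the tokenizer of whichever embedding model you deploy
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")

def create_chunks(text, chunk_size=512, overlap=50):
    tokens = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunks.append(tokenizer.decode(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```
Choosing a vector database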
For production systems, consider:
| Database | Max vectors | Notes |
|---|---|---|
| pgvector (PostgreSQL) | 10M+ | If you already run PostgreSQL |
| Pinecone | Unlimited | Managed, easy to start |
| Weaviate | Unlimited | Hybrid search out of the box |
| Milvus / Zilliz | Unlimited | High performance |
Quality evaluation
Before going to production, measure quality on your own data:
- Build a test dataset: 50–100 (query, relevant document) pairs
- Metrics:
  - Recall@K: share of all relevant documents that appear in the top-K
  - MRR (Mean Reciprocal Rank): mean of 1/rank of the first relevant result, averaged over queries
  - NDCG: accounts for the order of relevant documents
- Targets: Recall@5 ≥ 0.8, MRR ≥ 0.7
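The metrics above can be computed with a few lines of code. This is a minimal sketch assuming binary relevance (a document is either relevant or not) and at least one relevant document per query. The function names and the `test_set` layout are illustrative choices, not a fixed API.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

def evaluate(test_set, k=5):
    """test_set: list of (ranked_ids, relevant_ids), one entry per query."""
    n = len(test_set)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in test_set) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in test_set) / n,
        "ndcg@k": sum(ndcg_at_k(r, rel, k) for r, rel in test_set) / n,
    }
```

Here `ranked_ids` is the list of document IDs your search returned for a query, in order, and `relevant_ids` is the ground truth from the test dataset.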
Production optimizations
Caching
Queries repeat often. Cache results at the application layer (Redis/Memcached) with a TTL of 1–24 hours.
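The caching pattern can be sketched as follows. To stay self-contained, this example uses a small in-process dictionary as a stand-in for Redis or Memcached. `TTLCache`, `normalize_query`, and `cached_search` are hypothetical names, and the query normalization (lowercasing, collapsing whitespace) is an assumption about what counts as "the same query".

```python
import time

class TTLCache:
    """Minimal in-process cache with per-entry expiry (stand-in for Redis/Memcached)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

def normalize_query(query):
    # Collapse case and whitespace so trivially different queries share a key
    return " ".join(query.lower().split())

def cached_search(query, cache, search_fn):
    key = normalize_query(query)
    results = cache.get(key)
    if results is None:
        results = search_fn(query)  # cache miss: run the real search
        cache.set(key, results)
    return results
```

With a shared cache such as Redis, the same get-or-compute flow applies; the expiry is then handled by the cache server instead of the application.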
Hybrid search
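Combine semantic and keyword search for better results: bring both score sets onto a common scale, then blend them with weights (e.g., 0.7 × semantic + 0.3 × BM25). Without that step the weights are not comparable, because BM25 scores are unbounded while cosine similarities are not.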
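A minimal sketch of the weighted blend, assuming each retriever returns a `{doc_id: score}` dictionary. Min-max normalization is one possible way to make the two score sets comparable, not the only one, and the names `min_max` and `hybrid_rank` are illustrative.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_rank(semantic, bm25, w_semantic=0.7, w_bm25=0.3):
    """Blend normalized scores and return (doc_id, score) pairs, best first."""
    sem, kw = min_max(semantic), min_max(bm25)
    doc_ids = set(sem) | set(kw)
    # A document missing from one result list contributes 0 for that signal
    blended = {d: w_semantic * sem.get(d, 0.0) + w_bm25 * kw.get(d, 0.0)
               for d in doc_ids}
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)

ranked = hybrid_rank(
    semantic={"a": 0.9, "b": 0.5, "c": 0.1},
    bm25={"b": 12.0, "c": 3.0, "d": 6.0},
)
```

The weights are worth tuning on your own test dataset; 0.7 and 0.3 are only the starting point given above.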
Metadata filtering
Pre-filter by metadata (date, category, author) before vector search — it speeds up queries and improves relevance.
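To illustrate the idea, the sketch below filters an in-memory list of records by exact metadata match and only then ranks the survivors by dot product. The record layout (`id`, `vector`, plus metadata keys) and the function names are assumptions made for this example; a vector database would apply the filter inside its own query engine.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vec, records, top_k=5, **filters):
    """records: list of dicts with 'id', 'vector' and metadata keys.
    Keeps only records whose metadata equals every filter, then ranks them."""
    candidates = [r for r in records
                  if all(r.get(field) == value for field, value in filters.items())]
    candidates.sort(key=lambda r: dot(query_vec, r["vector"]), reverse=True)
    return [r["id"] for r in candidates[:top_k]]

records = [
    {"id": 1, "vector": [1.0, 0.0], "category": "hr"},
    {"id": 2, "vector": [0.9, 0.1], "category": "legal"},
    {"id": 3, "vector": [0.2, 0.8], "category": "hr"},
]
hr_hits = filtered_search([1.0, 0.0], records, top_k=2, category="hr")
```

Because the filter runs first, similarity is computed for two records instead of three here; on a large collection with selective filters the saving is proportionally larger.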
Deployment checklist
Conclusion
Semantic search is a powerful tool, but it rewards attention to detail. Start with a simple architecture, measure quality on your own data, and iterate.
If you need help with deployment, get in touch. We specialize in production-ready NLP solutions.