12 min read · January 2025 · Aleksandr, NLP Lead

A guide to deploying semantic search

From embeddings to production: a practical guide to building a semantic search system for your business.

Introduction

Semantic search makes enterprise data far easier to query. Unlike traditional keyword search, it matches on the meaning of a query and surfaces relevant documents even when they share no exact words with it.

This guide walks through the full deployment cycle: from choosing an embedding model to scaling to millions of documents.

System architecture

A typical semantic search system has three components:

1. Embedding model

Turns text into a fixed-size vector (typically 384–1536 dimensions). Model quality drives search quality.

2. Vector database

Stores the vectors and serves fast similarity search through approximate nearest neighbor (ANN) indexes.

3. Reranker (optional)

Refines results by running a more accurate model over the top-N candidates.
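
To make the roles of these components concrete, here is a minimal sketch of the retrieval step in plain Python. The three-dimensional vectors and document ids are made up for illustration. In a real system the vectors would come from the embedding model, and the brute-force loop would be replaced by the vector database's ANN index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, top_k=3):
    """Exact (brute-force) nearest-neighbor search; ANN indexes approximate this."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "embeddings" (hypothetical values, not produced by a real model).
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "return-form": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded user query

results = search(query_vec, index, top_k=2)
print([doc_id for doc_id, _ in results])  # refund-policy first, then return-form
```

An optional reranker would take `results` and rescore only those few candidates with a slower, more accurate model.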

Choosing an embedding model

Models worth considering (as of 2025):

Model | Dim | Context (tokens) | Best for
intfloat/multilingual-e5-large | 1024 | 512 | General-purpose
sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 768 | 128 | Speed + quality
cointegrated/rubert-tiny2 | 312 | 2048 | Resource-constrained workloads

Important: Don't use raw generative LLMs (GPT, LLaMA) as embedders. They are trained for next-token prediction, not for producing vectors whose distances reflect semantic similarity. Use purpose-built embedding models.

Chunking: splitting documents

Long documents need to be split into chunks that fit the model's context window; text beyond that limit is typically truncated and never reaches the embedding. Chunk size is a key parameter:

  • Small (128–256 tokens): high precision, but surrounding context is lost
  • Medium (512–1024 tokens): the right balance for most workloads
  • Large (2048+ tokens): preserves context, but reduces granularity

Recommended approach: overlap adjacent chunks by 10–20% so that a sentence cut at one chunk boundary still appears intact in the neighboring chunk.

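# Example: chunking with overlap
from transformers import AutoTokenizer

# Assumption: load the tokenizer that matches your embedding model
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")

def create_chunks(text, chunk_size=512, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    # Encode without special tokens so they are not repeated inside every chunk
    tokens = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunks.append(tokenizer.decode(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks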

Choosing a vector database

For production systems, consider:

Database | Max vectors | Notes
pgvector (PostgreSQL) | 10M+ | If you already run PostgreSQL
Pinecone | Unlimited | Managed, easy to start
Weaviate | Unlimited | Hybrid search out of the box
Milvus / Zilliz | Unlimited | High performance

Quality evaluation

Before going to production, measure quality on your own data:

  1. Build a test dataset: 50–100 (query, relevant document) pairs
  2. Metrics:
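    • Recall@K: share of a query's relevant documents that appear in the top-K results
    • MRR (Mean Reciprocal Rank): the mean over all queries of 1/rank of the first relevant result (1.0 means it is always ranked first)
    • NDCG: accounts for the order of relevant documents, rewarding relevant results placed higher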
  3. Targets: Recall@5 ≥ 0.8, MRR ≥ 0.7
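
A minimal sketch of how Recall@K and MRR could be computed over such a test set. The function names and the three toy runs are illustrative assumptions, not part of any library.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=5):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per test query."""
    n = len(runs)
    recall = sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / n
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n
    return {"recall@k": recall, "mrr": mrr}

# Hypothetical search output for three test queries.
runs = [
    (["d3", "d1", "d7"], ["d1"]),  # first relevant hit at rank 2
    (["d2", "d9", "d4"], ["d2"]),  # first relevant hit at rank 1
    (["d5", "d6", "d8"], ["d0"]),  # relevant document not retrieved
]
metrics = evaluate(runs, k=3)
print(metrics)  # recall@k is 2/3, mrr is 0.5
```

With only 50–100 pairs, a single query moves each metric by one or two percentage points, so treat small differences between configurations with caution.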

Production optimizations

Caching

Queries repeat often. Cache results at the application layer (Redis/Memcached) with a TTL of 1–24 hours, keeping it short when the index changes frequently.
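
The caching pattern can be sketched as follows. An in-process dictionary stands in for Redis/Memcached here, and `TTLCache`, `normalize`, and `cached_search` are hypothetical names chosen for this example. Normalizing the query before using it as a key is an added assumption that lets trivially different spellings share one entry.

```python
import time

class TTLCache:
    """Tiny in-process stand-in for Redis/Memcached with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

def normalize(query):
    """Lowercase and collapse whitespace so near-identical queries share a key."""
    return " ".join(query.lower().split())

def cached_search(query, cache, search_fn):
    """Return cached results when present; otherwise search and store them."""
    key = normalize(query)
    hit = cache.get(key)
    if hit is not None:
        return hit
    results = search_fn(query)
    cache.set(key, results)
    return results

cache = TTLCache(ttl_seconds=3600)  # 1 hour, the low end of the suggested range
print(cached_search("Refund  policy", cache, lambda q: ["doc-1"]))
```

With a shared cache such as Redis, the same get/set logic applies, and the TTL is set on the key when it is written.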

Hybrid search

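Combine semantic and keyword search for better results: blend the two scores with weights (e.g., 0.7 × semantic + 0.3 × BM25). Normalize both scores to a common range first, because raw BM25 scores are unbounded while cosine similarity lies in [-1, 1].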
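
A sketch of that weighted blend. Min-max normalization is an assumption made here so the two score scales are comparable before weighting; other schemes (for example rank-based fusion) are also common. The function names and the score values are illustrative.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_rank(semantic, bm25, w_semantic=0.7, w_bm25=0.3):
    """Blend normalized scores; a document missing from one list scores 0 there."""
    sem, kw = min_max(semantic), min_max(bm25)
    blended = {
        doc_id: w_semantic * sem.get(doc_id, 0.0) + w_bm25 * kw.get(doc_id, 0.0)
        for doc_id in set(sem) | set(kw)
    }
    return sorted(blended.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical scores for one query from the two retrievers.
semantic = {"a": 0.82, "b": 0.75, "c": 0.40}  # cosine similarities
bm25 = {"b": 12.5, "c": 9.0, "d": 3.0}        # raw BM25 scores

ranking = hybrid_rank(semantic, bm25)
print([doc_id for doc_id, _ in ranking])  # ['b', 'a', 'c', 'd']
```

Document "b" overtakes "a" because it scores well in both retrievers, which is the behavior hybrid search is meant to produce. The weights are worth tuning against your own test set.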

Metadata filtering

Filter by metadata (date, category, author) before running vector search: a smaller candidate set speeds up queries, and out-of-scope documents never reach the results.
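
A minimal in-memory sketch of pre-filtering, assuming each document carries hypothetical `category` and `published` fields next to its vector. In a vector database the same idea is usually expressed as a filter or WHERE clause attached to the similarity query.

```python
import math
from datetime import date

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(query_vec, docs, top_k=3, category=None, published_after=None):
    """Apply metadata filters first, then rank only the surviving documents."""
    candidates = [
        d for d in docs
        if (category is None or d["category"] == category)
        and (published_after is None or d["published"] > published_after)
    ]
    candidates.sort(key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in candidates[:top_k]]

# Toy corpus with made-up two-dimensional vectors.
docs = [
    {"id": "hr-2023", "category": "hr", "published": date(2023, 5, 1), "vector": [0.9, 0.1]},
    {"id": "hr-2024", "category": "hr", "published": date(2024, 6, 1), "vector": [0.7, 0.3]},
    {"id": "it-2024", "category": "it", "published": date(2024, 7, 1), "vector": [1.0, 0.0]},
]

hits = filtered_search([1.0, 0.0], docs, category="hr", published_after=date(2024, 1, 1))
print(hits)  # ['hr-2024']
```

Without the filters, "it-2024" would rank first on similarity alone even though it belongs to the wrong category.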

Deployment checklist

Conclusion

Semantic search is a powerful tool, but it rewards attention to detail. Start with a simple architecture, measure quality on your own data, and iterate.

If you need help with deployment, get in touch. We specialize in production-ready NLP solutions.