Embeddings

Intro

Embeddings map text into a dense vector space where semantic similarity becomes geometric proximity — similar meanings land near each other, unrelated meanings land far apart. This is what lets retrieval match a query like "throttle partner API traffic" against a document about "rate limiting for partner plan" even though no keywords overlap.

The mechanism: an encoder model (a transformer trained with contrastive objectives) reads a text span and produces a fixed-length vector. During training, the model learns to push semantically similar pairs closer together and dissimilar pairs apart in vector space. At query time, the system embeds the query with the same model, then finds the nearest stored vectors using cosine similarity or dot product. The closest vectors correspond to the most semantically relevant chunks.

sequenceDiagram
    participant Q as Query Text
    participant E as Embedding Model
    participant VS as Vector Space
    participant I as ANN Index
    participant R as Top-k Results
    Q->>E: Encode query
    E->>VS: Query vector 1x768
    Note over VS: Cosine similarity\nagainst stored chunk vectors
    VS->>I: ANN search -- HNSW or IVF
    I->>R: Return k nearest chunks
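The flow above can be sketched in a few lines of numpy. This is a toy illustration, not a production implementation: the 4-dim vectors stand in for real model output, and the brute-force scan stands in for an ANN index (HNSW, IVF), which avoids scoring every stored vector but preserves the same geometry.

```python
import numpy as np

def cosine_top_k(query_vec, chunk_vecs, k=3):
    """Brute-force nearest-neighbor search by cosine similarity."""
    # Normalize so that dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                      # one cosine score per chunk
    top = np.argsort(-scores)[:k]       # indices of the k best chunks
    return top, scores[top]

# Toy 4-dim "embeddings" standing in for real 768-dim model output.
chunks = np.array([
    [0.9, 0.1, 0.0, 0.0],   # "rate limiting for partner plan"
    [0.0, 0.0, 1.0, 0.1],   # unrelated billing doc
    [0.8, 0.2, 0.1, 0.0],   # another throttling doc
])
query = np.array([1.0, 0.1, 0.0, 0.0])  # "throttle partner API traffic"

idx, scores = cosine_top_k(query, chunks, k=2)
```

Note that the query with no keyword overlap still ranks both throttling chunks above the unrelated one, because proximity is computed in the shared vector space, not over surface tokens.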

The engineering decision is not just model quality — it is a tradeoff across recall, latency, cost, and domain fit under your corpus and SLA. A model that tops the MTEB leaderboard may still miss your internal terminology.

Embedding Model Selection

Model Families

Three categories dominate production RAG systems:

Proprietary API models — OpenAI text-embedding-3-small (1536-dim) and text-embedding-3-large (3072-dim) are the most widely deployed. They support a dimensions parameter that truncates vectors at inference time using Matryoshka training. Cohere embed-v3 adds an input_type parameter (search_document vs search_query) for asymmetric embedding, and covers 100+ languages natively. Check the OpenAI pricing page for current per-token rates — these change frequently.

Open-source bi-encoders — Sentence Transformers (SBERT) provides models like all-MiniLM-L6-v2 (384-dim, 22M params, fast) and all-mpnet-base-v2 (768-dim, 109M params, higher quality). These run locally with no per-token cost, which matters at scale. The tradeoff is infrastructure: you host inference, manage GPU allocation, and handle model updates.

Domain-finetuned models — When general-purpose models underperform on internal terminology, finetuning a base model on your corpus can close the gap. Databricks showed that finetuning gte-large-en-v1.5 (0.4B params) on synthetic domain data beat text-embedding-3-large on FinanceBench retrieval (Recall@10: 0.552 vs 0.44). Two common approaches: continued pre-training with MLM for vocabulary adaptation, and contrastive finetuning with synthetic query-document pairs.

Dimensionality

Higher dimensions give the model more room to represent fine-grained distinctions. But higher dimensions also mean more storage (4 bytes × dimensions per vector), more compute for similarity search, and higher ANN index memory.
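The storage cost of that choice is easy to estimate. A back-of-envelope calculation for a hypothetical 10M-chunk index at common dimensionalities, counting only raw float32 vectors (ANN index overhead comes on top):

```python
# Raw vector storage for a float32 index: 4 bytes per dimension per vector.
chunks = 10_000_000
bytes_per_vector = {dim: 4 * dim for dim in (384, 768, 1536, 3072)}
gib = {dim: chunks * b / 2**30 for dim, b in bytes_per_vector.items()}

# 384-dim  -> ~14.3 GiB
# 1536-dim -> ~57.2 GiB
# 3072-dim -> ~114.4 GiB
```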

Matryoshka Representation Learning (MRL) trains models so that any prefix of the full vector is independently meaningful. OpenAI's text-embedding-3-large at 256 dimensions outperforms ada-002 at 1536 dimensions on MTEB retrieval tasks. This means you can store shorter vectors, search faster, and only use full dimensions when precision demands it.

A practical pattern: index at reduced dimensions for the initial ANN search (fast, cheap), then re-rank the top candidates using full-dimensional vectors or a cross-encoder.
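A minimal sketch of that coarse-to-fine pattern, assuming an MRL-trained model so that a re-normalized prefix of each vector is itself a valid embedding. Random vectors stand in for real embeddings here; the dimensions and candidate counts are illustrative:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
full_dim, short_dim = 1024, 128

# Stand-ins for full-dimensional MRL embeddings of 10k chunks.
chunk_full = normalize(rng.standard_normal((10_000, full_dim)))
query_full = normalize(rng.standard_normal(full_dim))

# Stage 1: coarse search over truncated, re-normalized prefixes.
# Valid only because MRL training makes prefixes independently meaningful.
chunk_short = normalize(chunk_full[:, :short_dim])
query_short = normalize(query_full[:short_dim])
candidates = np.argsort(-(chunk_short @ query_short))[:100]

# Stage 2: re-rank the 100 candidates with full-dimensional vectors.
rerank = candidates[np.argsort(-(chunk_full[candidates] @ query_full))]
top_k = rerank[:10]
```

Stage 1 pays for similarity in 128 dimensions across the whole index; stage 2 pays full-dimensional cost only for 100 candidates.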

Similarity Metrics

The choice of similarity metric determines how "closeness" is calculated in vector space:

Cosine similarity measures the angle between vectors, ignoring magnitude. Most embedding models are trained with cosine objectives, making it the default choice. Range: -1 to 1 (1 = identical direction).

Dot product is cosine similarity scaled by vector magnitudes. When vectors are L2-normalized (as most embedding APIs return them), dot product equals cosine similarity. Some models intentionally encode relevance in magnitude — for these, dot product captures both semantic alignment and confidence.

Euclidean (L2) distance measures straight-line distance in vector space. Less common for text embeddings because high-dimensional spaces make absolute distances less discriminative than angular measures.

For most RAG systems: use cosine similarity unless the model documentation specifically recommends dot product.
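The relationships between the three metrics can be verified directly. On L2-normalized vectors, dot product equals cosine similarity, and squared Euclidean distance is a monotone function of cosine, so all three rank neighbors identically:

```python
import numpy as np

a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2-normalization, dot product and cosine similarity coincide.
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(an @ bn, cos)

# On unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b),
# so L2 distance orders neighbors the same way cosine does.
l2_sq = np.sum((an - bn) ** 2)
assert np.isclose(l2_sq, 2 - 2 * cos)
```

This is why the metric choice only matters for un-normalized vectors: once everything is normalized, the three metrics agree on ranking.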

Pitfalls

Distribution Shift on Internal Terminology

A model that scores well on MTEB benchmarks may still fail on your domain. MTEB evaluates on general web text — if your corpus uses internal acronyms, product names, or domain jargon absent from the model's training data, embeddings for those terms will be noisy. Queries and documents containing the same jargon may land far apart in vector space.

Detection: compare recall@k on a held-out set of domain-specific queries vs generic queries. A significant gap signals distribution shift.
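A minimal sketch of that comparison. The query sets, chunk ids, and relevance labels here are placeholders; in practice they come from a labeled held-out evaluation set:

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant chunk ids found in the top-k retrieved ids."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall(queries, k=10):
    """queries: list of (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in queries) / len(queries)

# Hypothetical held-out sets; ids are placeholders for real chunk ids.
generic = [(["a", "b", "c"], ["a"]), (["d", "e", "f"], ["e"])]
domain  = [(["x", "y", "z"], ["q"]), (["m", "n", "o"], ["n"])]

gap = mean_recall(generic, k=3) - mean_recall(domain, k=3)
# A large positive gap suggests the model struggles with domain terminology.
```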

Mitigation: finetune with domain data (GPL or continued pre-training on your corpus), or supplement embeddings with keyword search in a hybrid retrieval setup where BM25 handles exact lexical matches.
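One common way to combine the two retrievers in a hybrid setup is reciprocal rank fusion (RRF), which merges ranked lists without needing to calibrate BM25 scores against cosine scores. A sketch, with placeholder document ids:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: combine ranked id lists from multiple
    retrievers (e.g. BM25 and vector search) into one fused ranking.
    k dampens the influence of any single retriever's top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking   = ["doc3", "doc1", "doc7"]  # exact lexical matches
vector_ranking = ["doc1", "doc9", "doc3"]  # semantic matches
fused = rrf([bm25_ranking, vector_ranking])
```

Documents that rank well in both lists rise to the top; documents found by only one retriever still survive, which is exactly the behavior wanted when BM25 covers jargon the embedding model mangles.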

Embedding Model Swap Invalidation

Changing the embedding model — even a minor version — invalidates every stored vector. The new model produces vectors in a different geometric space. Cosine similarity between old and new vectors is meaningless.

This means re-embedding the entire corpus: for a 10M-chunk index at $0.02/1M tokens and 500 tokens/chunk average, that is ~$100 and hours of ingestion time. Key the embedding cache by model name + version to prevent serving stale vectors.
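The cost arithmetic above, made explicit (the per-token price is illustrative; check your provider's current rates):

```python
# Back-of-envelope cost of re-embedding after a model swap.
chunks = 10_000_000
avg_tokens_per_chunk = 500
price_per_million_tokens = 0.02  # USD; illustrative, not current pricing

total_tokens = chunks * avg_tokens_per_chunk          # 5 billion tokens
cost_usd = total_tokens / 1_000_000 * price_per_million_tokens
```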

Benchmark Leaderboard Overfitting

MTEB aggregates scores across multiple task families — retrieval, STS, classification, clustering, bitext mining, and others — each containing many individual datasets. A model ranked #1 overall can rank poorly on retrieval-specific datasets because STS, classification, and clustering scores inflate the average. Always filter MTEB by the Retrieval task category when selecting for RAG, and validate on your own evaluation set.

Multilingual Embedding Collapse

Models trained primarily on English text cluster non-English content into a smaller region of vector space, reducing separation between distinct concepts. A Spanish query about "seguridad informática" and one about "seguridad alimentaria" may land closer together than they should because the model undertrained on Spanish semantic distinctions.

Mitigation: use models with explicit multilingual training (Cohere embed-v3, SBERT multilingual variants), and evaluate recall per language separately.

Tradeoffs

| Factor | Proprietary API | Open-Source Self-Hosted | Domain-Finetuned |
|---|---|---|---|
| Cost at scale | Per-token pricing -- scales linearly | Infrastructure cost -- GPU amortized across volume | Training cost upfront -- inference same as base |
| Recall on general text | High -- trained on massive web corpora | Competitive -- top models match proprietary on MTEB | Depends on base model and training data quality |
| Recall on domain text | Can miss internal terminology | Same limitation as proprietary | Highest -- trained on your distribution |
| Latency | Network round-trip + provider queue | Local inference -- no network hop | Same as self-hosted base model |
| Operational burden | Minimal -- API call | High -- GPU infra, model serving, and updates | Highest -- training pipeline plus serving |
| Vendor lock-in | Model changes break vector index | Full control over versioning | Full control |
| Dimensionality control | Some models support MRL truncation | Full control via model choice | Full control |

Decision rule: start with a proprietary API model to establish baseline recall. Measure per-query-type performance. Switch to domain-finetuning only when domain-specific recall failures dominate over chunking or retrieval issues — finetuning the embedding model cannot fix bad chunks.

Questions

References


What's next