What is an embedding? Definition and business implications
An embedding is a numerical representation of a word, a sentence, or a document, in the form of a vector with several hundred to several thousand dimensions. Two semantically close texts have geometrically close embeddings, which enables search by meaning rather than by keywords.
An embedding model is a specialised neural network, distinct from the LLM, that takes a text as input and produces a fixed-length numerical vector as output. The dimension of this vector varies by model: 1,536 for OpenAI text-embedding-3-small, 3,072 for text-embedding-3-large, 1,024 for Voyage 3 or Mistral Embed. A higher dimension gives the vector more room to capture semantic nuance, at the cost of proportionally higher storage and compute. The fundamental property of an embedding is that the similarity between two vectors (measured by dot product or cosine similarity, which coincide when the vectors are normalised to unit length) reflects the semantic proximity of the corresponding texts. “Defence lawyer” and “legal counsel” are distant in the dictionary but close in the embedding space. That is what allows a RAG (retrieval-augmented generation) system to retrieve relevant passages even when the query uses different words from those of the target document.
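The similarity measure described above can be sketched in a few lines of Python. The three-dimensional vectors below are invented for illustration and were not produced by any real embedding model (real embeddings have hundreds or thousands of dimensions); the names toy_embeddings and rank_by_similarity are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", hand-written for this example only.
toy_embeddings = {
    "defence lawyer": [0.9, 0.8, 0.1],
    "legal counsel": [0.8, 0.9, 0.2],
    "chocolate cake": [0.1, 0.2, 0.9],
}


def rank_by_similarity(query, embeddings):
    """Return the other texts sorted from most to least similar to the query."""
    query_vector = embeddings[query]
    scores = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in embeddings.items()
        if text != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for text, score in rank_by_similarity("defence lawyer", toy_embeddings):
        print(f"{text}: {score:.3f}")
```

With these toy values, “legal counsel” ranks well above “chocolate cake” for the query “defence lawyer”, which is the behaviour a retrieval step relies on. In a real system the vectors would come from an embedding model and the ranking would usually be delegated to a vector index rather than a Python loop.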
Concrete example
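A 30-employee legal SME indexes 5,000 internal contracts so that its lawyers can search by meaning. With OpenAI text-embedding-3-small (1,536 dimensions, 0.02 dollars per million tokens), indexing cost scales linearly with the number of tokens embedded: an initial budget of about 15 euros covers the entire corpus with a wide margin, since that amount pays for several hundred million tokens at this price. Embedding the queries themselves adds a few cents per month. Vector storage, at one vector per document (5,000 documents × 1,536 dimensions × 4 bytes), comes to about 31 MB; splitting contracts into chunks multiplies this figure by the number of chunks per document. With text-embedding-3-large (3,072 dimensions), quality rises by about 2 points on the public MTEB benchmark, for an embedding price 6.5 times higher and twice the storage. The choice depends on how sensitive the business is to semantic recall.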
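The storage and cost arithmetic of this example can be reproduced with a short sketch. The per-million-token price and the 6.5 ratio come from the text; the average contract length (ASSUMED_TOKENS_PER_CONTRACT) is not given in the text and is a purely hypothetical value, so the printed cost illustrates the formula rather than the estimate quoted above. The storage figure assumes one float32 vector per document and ignores any database overhead.

```python
def vector_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of a float32 vector index, ignoring any database overhead."""
    return num_vectors * dimensions * bytes_per_value


def embedding_cost_usd(total_tokens, usd_per_million_tokens):
    """API cost of embedding a given number of tokens."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


NUM_CONTRACTS = 5_000
PRICE_SMALL = 0.02               # dollars per million tokens, from the text
PRICE_LARGE = PRICE_SMALL * 6.5  # derived from the 6.5x ratio in the text
ASSUMED_TOKENS_PER_CONTRACT = 10_000  # hypothetical, not stated in the text

if __name__ == "__main__":
    total_tokens = NUM_CONTRACTS * ASSUMED_TOKENS_PER_CONTRACT
    models = [
        ("text-embedding-3-small", 1_536, PRICE_SMALL),
        ("text-embedding-3-large", 3_072, PRICE_LARGE),
    ]
    for name, dimensions, price in models:
        size_mb = vector_storage_bytes(NUM_CONTRACTS, dimensions) / 1_000_000
        cost = embedding_cost_usd(total_tokens, price)
        print(f"{name}: {size_mb:.2f} MB of vectors, {cost:.2f} USD to index")
```

Changing ASSUMED_TOKENS_PER_CONTRACT, or replacing NUM_CONTRACTS with a chunk count, shows how quickly both figures move once documents are long or split into passages.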
Sources
- OpenAI Embeddings documentation. https://platform.openai.com/docs/guides/embeddings
- Massive Text Embedding Benchmark (MTEB) leaderboard, Hugging Face. https://huggingface.co/spaces/mteb/leaderboard