::synthesis · Tim-Ferriss method
Embeddings (semantic search MED)
::minimum effective dose
An embedding is a vector, a list of typically 384 to 3,072 floating-point numbers, that represents the semantic meaning of a piece of text. Two texts about similar concepts have vectors that point in similar directions (high cosine similarity); unrelated texts score markedly lower (close to orthogonal in the idealized picture, though real models rarely land at exactly zero). That's the entire mechanic. The applications are everything that benefits from finding things that mean something similar rather than things that share keywords: semantic search, deduplication, clustering, recommendation, classification, RAG retrieval.
The MED workflow: (1) Pick one embedding model: OpenAI text-embedding-3-small ($0.02/M tokens), or a local model like nomic-embed-text or bge-large, which is free to run and keeps your data private. (2) Embed your corpus once: every chunk of text becomes a vector, stored in a vector DB (a plain numpy array is enough for small corpora; Postgres with pgvector for medium; Pinecone, Weaviate, or Qdrant for large). (3) At query time, embed the query with the same model, find the top-K nearest vectors by cosine similarity, and return the matching original texts.
The dirty secrets nobody tells beginners: (a) Naive cosine similarity often retrieves chunks that are topically adjacent but wrong, and it can miss exact terms and identifiers; hybrid search (BM25 + vector) typically wins. (b) Chunking strategy matters more than embedding model choice for most workloads. (c) Re-ranking the top-50 with a stronger model often improves results substantially compared with trusting the raw top-K. (d) Embedding models go stale; the quality gap between a 2023 model and a 2025 model is real.
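Step (3) of the workflow can be sketched in a few lines of numpy. This is a minimal illustration, not a production retriever: the three-dimensional vectors below are hand-made stand-ins for real model output, and the names `normalize` and `top_k` are hypothetical helpers introduced here.

```python
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # avoid division by zero

def top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 10):
    """Return (row index, cosine similarity) pairs for the k nearest corpus rows."""
    q = normalize(query_vec.reshape(1, -1))[0]
    scores = normalize(corpus_vecs) @ q
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

# Toy 3-dimensional "embeddings"; a real model would return hundreds of dimensions.
corpus_texts = ["cats and kittens", "dogs and puppies", "quarterly tax filing"]
corpus_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.0],
    [0.0, 0.1, 0.9],
])
query_vec = np.array([1.0, 0.0, 0.0])

for idx, score in top_k(query_vec, corpus_vecs, k=2):
    print(f"{score:.3f}  {corpus_texts[idx]}")
```

For a real corpus, normalizing the corpus matrix once at index time and storing the result avoids repeating that work on every query.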
::DiSSS · deconstruction questions
- 01 What's my chunk size, and have I tested 3-5 alternatives on the same queries?
- 02 Am I doing pure vector or hybrid (vector + keyword), and have I measured both on my data?
- 03 What's my retrieval eval (precision@10, recall@10) on a held-out query set?
- 04 How do I detect when an embedding-model swap would improve results without re-embedding the whole corpus?
- 05 Is my vector store cost actually justified, or could I run this in a 200MB numpy file?
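The retrieval eval in question 03 can be sketched as below. This assumes each held-out query comes with a hand-labeled set of relevant document IDs; the `search` callable, the canned results, and the document IDs are all hypothetical placeholders for your own retriever and data. Precision here divides by k, which is one common convention.

```python
from typing import Callable

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k slots filled by relevant documents (divides by k)."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def evaluate(eval_set: list[tuple[str, set[str]]],
             search: Callable[[str], list[str]],
             k: int = 10) -> dict[str, float]:
    """Average precision@k and recall@k over a held-out query set."""
    precisions, recalls = [], []
    for query, relevant in eval_set:
        retrieved = search(query)  # ranked list of document IDs
        precisions.append(precision_at_k(retrieved, relevant, k))
        recalls.append(recall_at_k(retrieved, relevant, k))
    n = max(len(eval_set), 1)
    return {"precision@k": sum(precisions) / n, "recall@k": sum(recalls) / n}

# Hypothetical canned results standing in for a real retriever.
canned_results = {
    "refund policy": ["doc_7", "doc_2", "doc_9"],
    "reset password": ["doc_4", "doc_1", "doc_3"],
}
eval_set = [
    ("refund policy", {"doc_7", "doc_9"}),
    ("reset password", {"doc_5"}),
]
report = evaluate(eval_set, lambda q: canned_results[q], k=3)
print(report)
```

Running the same `evaluate` call against two configurations (for example pure vector versus hybrid) gives a direct comparison on identical queries.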
::fear-setting
Cost of not learning this: you'll either (a) ignore embeddings and miss the entire class of semantic search and recommendation features that are basically free now, or (b) over-invest in embeddings (Pinecone subscription, sophisticated pipeline) for a corpus that would fit in memory and run faster on a laptop. Cost of getting it wrong: silent retrieval failures, which then propagate into silent RAG failures, which then propagate into 'hallucinations' that are actually retrieval errors. Most operators ship embedding systems with no precision/recall measurement; they discover the system was returning wrong chunks only when a customer points out a wrong answer. By then the trust damage is done. Embeddings are the foundation of a lot of AI features and almost nobody evaluates them as rigorously as the generation layer.
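Whether a corpus "would fit in memory" is plain arithmetic: vectors times dimensions times bytes per number. The sketch below assumes float32 storage (4 bytes per value) and uses 1,536 and 768 as example dimension counts; check the actual output dimension of whichever model you use, since those figures are assumptions here, as are the helper names.

```python
def index_size_mb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of a dense vector matrix in megabytes (10^6 bytes), no index overhead."""
    return num_vectors * dims * bytes_per_value / 1_000_000

def vectors_that_fit(budget_mb: int, dims: int, bytes_per_value: int = 4) -> int:
    """How many vectors of the given dimension fit in a memory budget."""
    return int(budget_mb * 1_000_000 // (dims * bytes_per_value))

# Assumed dimensions for illustration only; verify against your model's documentation.
for dims in (768, 1536):
    print(f"{dims} dims: 500 vectors = {index_size_mb(500, dims):.2f} MB, "
          f"200 MB holds {vectors_that_fit(200, dims):,} vectors")
```

The figures cover only the raw matrix; the original chunk texts and any metadata need their own storage.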
::80 / 20 cut
SKIP: deep dives into transformer-based embedding architecture, exotic dimensionality reduction (UMAP, t-SNE) for production retrieval, the latest embedding-model paper. OBSESS OVER: (1) building a 50-query held-out eval set BEFORE building the system, (2) testing 3-5 chunk sizes on your data, (3) implementing hybrid search (BM25 + vector) as default, not as 'we'll add it later.' The eval is the work; everything else is plumbing.
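Testing several chunk sizes starts with a chunker whose size is a parameter. The sketch below is one simple approach, assuming whitespace-separated words as the unit (real pipelines often count model tokens instead); the function name `chunk_words`, the sample text, and the candidate sizes are illustrative choices, not recommendations from the source.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of chunk_size words, sharing `overlap` words between neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Hypothetical document: 1,000 placeholder words.
document = " ".join(f"word{i}" for i in range(1000))

# Candidate sizes to compare; each variant would be embedded and scored on the same eval set.
for size in (64, 128, 256):
    chunks = chunk_words(document, chunk_size=size, overlap=size // 8)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

Each chunking variant then gets embedded and scored against the same held-out queries, so the comparison isolates chunk size from everything else.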
::tribe of mentors · paraphrased stances
Nils Reimers
Creator of sentence-transformers, one of the most-used open embedding libraries
Nils's stance: the embedding model is the foundation, but it's not where most operators lose. Most losses are in chunking, lack of re-ranking, and absence of hybrid search. Fix the pipeline before fixing the model.
Jo Bergum
Distinguished Engineer at Vespa, deep practitioner on hybrid search at scale
Jo's stance: pure vector search is a downgrade from hybrid search for most real-world workloads. Keyword still matters; rare terms still matter; identifiers still matter. The default should be hybrid, with vector handling the semantic layer and BM25 handling the lexical layer.
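One common way to combine a lexical ranking with a vector ranking is reciprocal rank fusion (RRF). The source does not say which fusion method to use, so treat this as one option among several. The sketch assumes you already have two ranked lists of document IDs, one from BM25 and one from vector search; the IDs below are invented, and `k=60` is the conventional RRF constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical rankings for one query from the two retrievers.
bm25_ranking = ["doc_sku_4417", "doc_b", "doc_c"]    # lexical layer catches the identifier
vector_ranking = ["doc_b", "doc_d", "doc_sku_4417"]  # semantic layer

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.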
Hamel Husain
Builds production RAG and search systems, writes detailed evaluation guides
Hamel's stance: if you can't show me precision@10 and recall@10 on a held-out set of 50+ queries, you don't have a search system — you have a search demo. The eval is the difference between shipping and guessing.
OpenAI embedding team
Authors of text-embedding-3 series, set the practical baseline most operators use
OpenAI's stance, documented in cookbook: text-embedding-3-small is the right default for most workloads; text-embedding-3-large is justified only when your eval shows measurable precision wins. Most operators over-pay for 'large' when small handles their task.
::real-world test · this week
This week: take 500 documents you care about (emails, notes, docs, whatever). Embed them with text-embedding-3-small or a local nomic-embed-text. Store as a numpy array. Write a 20-line script that embeds a query and returns the top 10 by cosine similarity. Run 10 real queries you'd actually ask. Notice which ones returned exactly what you wanted and which returned semantically-adjacent-but-wrong. That gap — between what you got and what you wanted — is the work. The fix is rarely a different model; it's chunking, hybrid search, or re-ranking.
::action items · ranked
- 01 Pick ONE embedding model (text-embedding-3-small or nomic-embed-text) and embed something you care about today
- 02 Build a 50-query held-out eval set with expected results; this is the search system, the rest is plumbing
- 03 Default to hybrid search (BM25 + vector) instead of pure vector; the gain is usually free
- 04 Test 3 chunk sizes on your data and measure retrieval quality at each; chunking dominates model choice
- 05 Add re-ranking with a stronger model for top-50 → top-5 only when your eval shows it's needed
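The re-ranking step in item 05 has a simple shape: take the first-stage candidates, score each (query, text) pair with a stronger model, and keep the best few. In the sketch below the "stronger model" is a pluggable callable; `toy_overlap_score` is a deliberately crude stand-in so the example runs without any model, and both function names are hypothetical.

```python
from typing import Callable

def rerank(query: str,
           candidates: list[str],
           score_fn: Callable[[str, str], float],
           keep: int = 5) -> list[str]:
    """Re-score first-stage candidates with score_fn and keep the highest-scoring ones."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Stable sort: candidates with equal scores keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]

def toy_overlap_score(query: str, text: str) -> float:
    """Placeholder scorer: share of query words present in the text. Not a real re-ranker."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

# Hypothetical first-stage results, in retrieval order (in practice the top 50).
candidates = [
    "reset your router",
    "how to reset a password",
    "password policy overview",
]
print(rerank("reset password", candidates, toy_overlap_score, keep=2))
```

In a real pipeline, `score_fn` would wrap whichever stronger model you choose, and the eval set decides whether the extra latency earns its place.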