Ai Integration

retrieval-strategy

$npx skills add blunotech-dev/agents --skill retrieval-strategy

Implements optimized retrieval strategies for RAG and search pipelines. Use when building or improving document retrieval, choosing between similarity search, hybrid search, reranking, or MMR, diagnosing retrieval quality issues, or tuning recall/precision tradeoffs in LLM-powered search. Trigger for any task involving vector search, embedding retrieval, semantic search, BM25, reranking, or RAG pipeline optimization.

namedescriptioncategory
retrieval-strategyImplements optimized retrieval strategies for RAG and search pipelines. Use when building or improving document retrieval, choosing between similarity search, hybrid search, reranking, or MMR, diagnosing retrieval quality issues, or tuning recall/precision tradeoffs in LLM-powered search. Trigger for any task involving vector search, embedding retrieval, semantic search, BM25, reranking, or RAG pipeline optimization.Ai Integration

retrieval-strategy

Covers the non-obvious decisions, failure modes, and tuning levers across retrieval approaches. Assumes familiarity with embeddings and basic vector DB APIs.


1. Choosing a Strategy

SignalUse
Queries are natural language, corpus is proseSimilarity search alone (baseline)
Queries contain exact terms, model names, IDs, codesHybrid (semantic + keyword)
Top-k recall is good but ranking is wrongAdd reranker
Results are redundant / clustered around one subtopicMMR
Long documents, multi-hop questionsHierarchical retrieval

Don't default to hybrid + rerank everywhere. Each layer adds latency and a new failure surface. Start minimal, instrument, then add layers where metrics show it helps.


2. Similarity Search — Non-Obvious Failure Modes

The curse of high dimensionality in cosine similarity: At 1536+ dims, cosine distances cluster tightly — top-100 and top-1000 results can have near-identical scores. This makes score thresholding unreliable. Use rank cutoff (top_k) not score threshold unless you normalize per-query.

Query-document embedding asymmetry: Most embedding models are trained with short queries paired against longer passages. Embedding a full paragraph as the query degrades recall. If your "query" is a document chunk (e.g., find similar chunks), use a bi-encoder trained for symmetric similarity (e.g., all-mpnet-base-v2) not an asymmetric retrieval model (e.g., text-embedding-3-large).

Chunking strategy matters more than embedding model choice:

  • Overlapping chunks (50% overlap) outperform non-overlapping for boundary-spanning content
  • Chunk at semantic boundaries (paragraphs, sections), not fixed token counts
  • Parent-child chunking: embed small chunks for precision, return parent chunks for context. Non-obvious: store parent_id on each child chunk, then after retrieval do a single bulk fetch of parent chunks — don't embed parents.
// Parent-child retrieval pattern
const childResults = await vectorDB.query(queryEmbedding, { topK: 20 });
const parentIds = [...new Set(childResults.map(r => r.metadata.parent_id))];
const fullChunks = await docStore.getBatch(parentIds); // return these to LLM

3. Hybrid Search (Semantic + Keyword)

Why BM25 catches what semantic misses: Semantic search generalizes — "myocardial infarction" matches "heart attack". That's usually good. But for exact product names, error codes, version numbers, and proper nouns, semantic similarity dilutes specificity. BM25 scores exact token matches.

Reciprocal Rank Fusion (RRF) over score normalization: Score-level fusion requires normalizing across two different distributions (cosine similarity vs. BM25 TF-IDF). RRF avoids this entirely — it only uses rank positions:

def rrf(rankings: list[list[str]], k=60) -> list[str]:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# k=60 is empirically robust; lower k amplifies top ranks more aggressively

Non-obvious: BM25 requires the same tokenizer as your index. If your vector DB tokenizes on ingest, make sure your query-time BM25 uses identical preprocessing (lowercasing, stemming, stopwords). Mismatch is a silent failure — results look plausible but recall drops.

When to weight semantic vs. keyword: Don't tune weights per-query type statically. Instead, detect query type:

  • Query contains quoted strings, version numbers, or uppercase acronyms → boost keyword weight
  • Query is a full sentence or question → boost semantic weight

4. Reranking

Rerankers are cross-encoders; retrievers are bi-encoders. The retriever compares pre-computed embeddings (fast, approximate). The reranker sees (query, document) as a pair and scores relevance jointly (slow, accurate). Never rerank more than you need to — rerank top-20 to 40, not top-200.

Cohere Rerank vs. local cross-encoder:

  • Cohere: lower latency for large batches, handles multilingual, costs per call
  • cross-encoder/ms-marco-MiniLM-L-6-v2 (HuggingFace): free, ~15ms/pair on CPU, good for English
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, candidates, top_n=5):
    pairs = [(query, c['text']) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

Non-obvious: reranker scores are not calibrated probabilities. A score of 0.9 vs. 0.3 doesn't mean 3x more relevant — it's an ordinal rank signal. Don't threshold on reranker scores; always take top-n by rank.

Reranker failure mode — length bias: Cross-encoders trained on MS MARCO tend to prefer longer documents (more token overlap opportunity). If your corpus has mixed-length chunks, normalize chunk length before reranking or accept this bias consciously.


5. MMR (Maximal Marginal Relevance)

When to use: Results are relevant but redundant — you retrieve 5 chunks that all say the same thing from the same section. MMR trades some relevance for diversity.

The formula:

MMR(d) = λ · sim(query, d) − (1−λ) · max(sim(d, already_selected))

Implementation:

import numpy as np

def mmr(query_emb, candidate_embs, candidate_docs, top_k=5, lambda_=0.5):
    selected, remaining = [], list(range(len(candidate_docs)))
    
    while len(selected) < top_k and remaining:
        if not selected:
            # First pick: pure relevance
            best = max(remaining, key=lambda i: np.dot(query_emb, candidate_embs[i]))
        else:
            sel_embs = candidate_embs[selected]
            best = max(
                remaining,
                key=lambda i: (
                    lambda_ * np.dot(query_emb, candidate_embs[i])
                    - (1 - lambda_) * np.max(sel_embs @ candidate_embs[i])
                )
            )
        selected.append(best)
        remaining.remove(best)
    
    return [candidate_docs[i] for i in selected]

λ tuning:

  • λ = 1.0 → pure similarity (no diversity)
  • λ = 0.5 → balanced (good default)
  • λ = 0.3 → aggressive diversity (useful for broad survey questions)

Non-obvious: MMR requires embeddings at rerank time, not just scores. If your vector DB only returns scores (not vectors), you'll need a second embed call for the candidates. Cache candidate embeddings during retrieval instead of re-embedding.


6. Hierarchical / Multi-Stage Retrieval

For long documents or multi-hop questions where a single chunk won't contain the answer:

Stage 1 — document-level retrieval: Embed document summaries or titles, retrieve relevant documents. Stage 2 — passage-level retrieval: Within retrieved documents only, run similarity search on chunks.

This avoids noisy cross-document chunk comparison and dramatically reduces search space for stage 2.

Contextual compression (non-obvious pattern): After retrieval, use an LLM to extract only the sentence(s) within each chunk that are relevant to the query before passing to the final LLM. Reduces context bloat without losing the retrieved chunks.

async def compress(chunk_text, query, client):
    resp = await client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Extract only the sentences relevant to: '{query}'\n\nPassage:\n{chunk_text}\n\nIf nothing is relevant, respond with 'NONE'."
        }]
    )
    result = resp.content[0].text
    return None if result.strip() == "NONE" else result

7. Diagnosing Retrieval Quality

Instrument before tuning. Log for every query:

  • retrieved_ids: what was returned
  • rank_of_relevant: where the known-good doc appeared (requires labeled eval set)
  • mrr: mean reciprocal rank across eval set
  • recall@k: fraction of relevant docs in top-k

Fast eval set construction: Take 20-30 real queries, manually label top-3 relevant chunks. Run retrieval, compute MRR. This catches 80% of real issues.

Common diagnosis patterns:

SymptomLikely causeFix
Relevant doc is rank 8-15, not top-5Retriever recall fine, ranking wrongAdd reranker
Relevant doc not in top-100Embedding mismatch or chunking issueCheck chunk size, try different embedding model
All top-5 results from same documentNo diversityAdd MMR or deduplicate by doc_id
Exact-match queries miss obvious resultsNo keyword componentAdd BM25 hybrid
Good results but LLM says "not found"Chunk too small, context truncatedSwitch to parent-child chunking

8. Latency vs. Quality Tradeoffs

OptimizationLatency impactQuality impact
Reduce top-k from 100 → 20−60% vector searchRecall drops if reranker relies on it
Async parallel retrieval (semantic + BM25 simultaneously)Cuts hybrid latency by ~40%None
Cache embeddings for repeated queriesNear-zero for cache hitsNone
Rerank only if retriever score spread is lowSkips reranker ~30% of callsMinimal — low spread means ranking is uncertain anyway
Use smaller embedding model for ANN indexFaster ingest + searchSlight recall drop; test empirically

Async hybrid retrieval pattern:

const [semanticResults, keywordResults] = await Promise.all([
  vectorDB.query(queryEmbedding, { topK: 40 }),
  bm25Index.search(queryText, { topK: 40 })
]);
const fused = rrf([semanticResults.map(r => r.id), keywordResults.map(r => r.id)]);