
Retrieval Augmented Generation Best Practices

Retrieval and Ranking Matter!

Chunking

  1. Including the section title in your chunks improves retrieval, and so do keywords from the documents (see the sketch after this list)
  2. Use token-efficient separators in your chunks, e.g. ### is a single token in GPT
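
A minimal sketch of the first point, assuming a simple in-memory document structure (the doc dict, field names, and chunk size are illustrative, not any specific library's API):

# Prepend the section title and document keywords to every chunk before embedding.
# `doc` is an illustrative structure: {"keywords": [...], "sections": [{"title": ..., "text": ...}]}
def build_chunks(doc, chunk_size=800):
    chunks = []
    for section in doc["sections"]:
        text = section["text"]
        for start in range(0, len(text), chunk_size):
            body = text[start:start + chunk_size]
            # The title and keywords give the retriever and the LLM extra context,
            # and ### doubles as a token-efficient separator.
            chunks.append(f"### {section['title']}\nKeywords: {', '.join(doc['keywords'])}\n{body}")
    return chunks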

Examples

  1. A few examples are better than no examples
  2. Examples at the start and end carry the most weight; the ones in the middle tend to get forgotten by the LLM

Rerankers

Latency permitting, use a reranker. Cohere, Sentence Transformers, and BGE have decent ones out of the box.
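
A minimal sketch with a Sentence Transformers cross-encoder; the model name and candidate chunks are illustrative, and Cohere's and BGE's rerankers are used the same way through their own clients:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model choice

query = "How do I reset my password?"
candidates = ["chunk one ...", "chunk two ...", "chunk three ..."]  # output of the retriever

# Score every (query, chunk) pair and re-order the chunks by relevance.
scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]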

Embedding

Use the right embedding for the right problem:

GTE and BGE are best for most support, sales, and FAQ kinds of applications.

OpenAI's embeddings are the easiest to use for code embedding.

The E5 family works well for languages beyond English and Chinese.

If you can, finetune the embedding model on your domain. It takes about 20 minutes on a modern laptop or a Colab notebook and improves recall by up to 30-50%.
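
A minimal finetuning sketch with Sentence Transformers and in-batch negatives; the base model and the (query, passage) pairs are placeholders for your own domain data:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # placeholder base model
pairs = [
    ("how do I cancel my order", "To cancel an order, open Orders and ..."),
    ("refund policy", "Refunds are processed within 5-7 business days ..."),
]  # replace with (query, relevant passage) pairs from your domain

train_examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=50)
model.save("bge-base-finetuned")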

Evaluation

Evaluation Driven Development makes your entire "dev" iteration much faster.

Think of these evals as the equivalent of running the code to see if it works.

I strongly recommend using Ragas for this. They have LangChain and LlamaIndex integrations too, which are great for real-world scenarios.
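
A minimal Ragas sketch, assuming a handful of traced question/answer/context records; the column names and metric set vary a little between Ragas versions, so treat this as a shape rather than a recipe:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

rows = {
    "question": ["How do I reset my password?"],
    "answer": ["Click 'Forgot password' on the login page."],
    "contexts": [["To reset your password, use the 'Forgot password' link ..."]],
    "ground_truth": ["Use the 'Forgot password' link on the login page."],
}

result = evaluate(Dataset.from_dict(rows),
                  metrics=[faithfulness, answer_relevancy, context_precision])
print(result)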

Scaling

LLM Reliability

Have a failover LLM for when your primary LLM is down, slow or just not working well. Can you switch to a different LLM in 1 minute or less automatically?
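
A minimal sketch of what "automatic" can mean here; primary and fallback are hypothetical callables wrapping whichever providers you use:

def complete_with_failover(prompt, primary, fallback, timeout_s=10):
    # Try the primary LLM; if it is down, slow, or erroring, switch without a human in the loop.
    try:
        return primary(prompt, timeout=timeout_s)
    except Exception:
        return fallback(prompt, timeout=timeout_s)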

Vector Store

When you're hitting latency and throughput limits on the vector store, consider using scalar quantization with a dedicated vector store like Qdrant or Weaviate.

Qdrant also has Binary Quantization, which lets you scale 30-40x with OpenAI embeddings.
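
A rough sketch of creating a Qdrant collection with binary quantization via qdrant-client; the URL, collection name, and vector size are illustrative, and scalar quantization is configured the same way with models.ScalarQuantization:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # illustrative deployment
client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),
    ),
)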

Finetuning

LLM: OpenAI GPT-3.5 will often be as good as GPT-4 with finetuning.

It needs about 100 records, and you get roughly 30% latency improvement for free.

So quite often worth the effort!
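
A rough sketch of what those ~100 records look like for GPT-3.5 finetuning (chat-format JSONL); the contents are illustrative, and uploading the file and launching the job is done through the OpenAI finetuning API:

import json

records = [
    {"messages": [
        {"role": "system", "content": "You are a support agent for Acme."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Click 'Forgot password' on the login page."},
    ]},
    # ... roughly 100 such examples drawn from your best GPT-4 or human-written answers
]

with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")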

This extends to OSS LLMs. It can't hurt to "pretrain" or finetune your Mistral or Zephyr-7B for $5.

pgvector vs Qdrant: Results from the 1M OpenAI Benchmark

You may have considered using PostgreSQL's pgvector extension for vector similarity search. There are good reasons why this option is strictly inferior to dedicated vector search engines, such as Qdrant.

We ran both benchmarks using ann-benchmarks, a tool solely dedicated to benchmarking vector data processing. The difference in performance is quite staggering.

Query Speed

Final results show that pgvector lags behind Qdrant by a factor of 15 when it comes to throughput.

That is a fifteenfold gap in throughput. However, we shouldn't only consider speed as the main metric when evaluating a database. In terms of accuracy, pgvector also delivers far fewer relevant results than Qdrant.

Workload

Interestingly, these disparities start to surface with as few as 100,000 chunked documents.

As an ardent supporter of PostgreSQL, I find it disheartening that pgvector doesn't just start at under half of Qdrant's QPS at 100,000 vectors; it plunges precipitously beyond that.

Correctness

One might try to rationalize this by assuming that Postgres is slower but more accurate. The data reveals that pgvector is not just slower, it is also ~18% less accurate!

We measure this using the same methodology as the ann-benchmarks codebase: k-NN bruteforce as ground truth.

Latency

Here, Qdrant holds its own. The worst p95 latency for Qdrant is 2.85s, a stark contrast to pgvector, whose best p95 latency is a full 4.02s. Even more astonishing, pgvector's worst p95 latency skyrockets to an unbelievable 45.46s.

Benchmark Specs

The machine we used to run the benchmark: t3.2xlarge, 8 vCPU, 32GB RAM

For the data enthusiasts among us, this Google Sheet details all the numbers for a more in-depth analysis.

Configuration

We use the default configuration for Qdrant and much better parameters for pgvector:

Qdrant(quantization=False, m=16, ef_construct=128, grpc=True, hnsw_ef=None, rescore=True)
PGVector(lists=200, probes=2)

The pgvector-recommended configuration, which would likely perform even worse:

PGVector(lists=1000, probes=1)

There is much more to be tested. We will continue to explore the configuration space for both platforms and update this.

Conversations with the Community

Paul Copplestone (CEO, Supabase) has also shared his thoughts on the matter:

Yup:

  1. Wait 6 months, a lot of development is happening on pgvector
  2. Use hybrid search
  3. Use filters on other indexed columns
  4. Use partitions

And as always, take benchmarks with a grain of salt, they are never as clear-cut as they seem. We’ll publish benchmarks soon too using the latest version of pgvector

Adding my notes here:

pgvector falls back to a full scan when there are filters or hybrid search. This is very slow with 1536-dimensional embeddings: it's O(n), where n is the number of vectors matching the filter.

When there are no filters, pgvector uses IVF. With 1536-dimensional embeddings this is slower than HNSW, and it's also less accurate than Qdrant's HNSW.

Aside: Feel free to check out my Twitter Intro to IVFPQ.

@jobergum, creator of Vespa.ai (a vector search engine), also shared his thoughts:

pgvector is an extension which by default will just search the closest cluster to the query vector, which for most high-dimensional embedding models will return just 2-3 out of 10 real neighbors.

This is a very important point. pgvector is not a vector search engine. It's a vector extension for PostgreSQL, and that involves some tradeoffs which are sometimes not obvious.

There is a US$2000 bounty for anyone who can raise a PR to make the pgvector extension use HNSW instead of IVF.

Acknowledgements

The engineering and dataset were both done by Kumar Shivendu. Most of my contribution was in the form of spotting the bottlenecks, feedback and sponsorship.

These surprising revelations are courtesy of Erik Bernhardsson's ann-benchmarks code.

Breaking into NLP

The bulk of this is borrowed from notes that my teammate and friend on Verloop.io's NLP/ML team made of our conversations. I've taken the liberty of removing our internal slang and some boring stuff.

I want to build a community around me on NLP. How can I get discovered by others?

Broadly speaking, the aim in forming connections can be split into long term and short term. A short-term aim would be one where you receive something immediate out of the connections, or out of a particular connection itself. This could be a collaboration, correspondence, a recommendation or advice, or anything else.

A more long-term, strategic aim would be a well-defined long-term goal that requires multiple steps to achieve. A strategic aim could involve multiple tactical steps. This is also what we like to call friendship in some polite-speak areas of the world.

I have no immediate goals or projects, just need some basic ideas on how to be a part of the ML community.

Find people's interests and do something for them. Many people simply ask questions on Twitter, or you can infer what they are interested in by looking at their LinkedIn, their work, and their personal blogs.

What would be a good starting point for this?

A very easy thing to start with is a literature review, especially for new topics being researched by influential people in the field. A good literature review shows your interest and willingness to help, and opens the door to communication.

A good place to find which topics are missing a decent literature review: go through the NLP subreddit r/LanguageTechnology, or the subreddits for Deep Learning, Machine Learning, and so on.

Or go through Twitter and help people out there. Answer their questions with depth. Do not rush to be the first, but the best. When it comes to technology, almost all platforms behave a bit like StackOverflow: the right answer might not get accepted, but it'll get noticed. By the way, a lot of the Huggingface contributors happen to be active on both GitHub and Twitter. Hanging around on their Slack can't hurt either.

But here's the important thing: try to stick to one medium, the place where you are most at home and which gels with your personality. This could even be YouTube if you're an English-fluent, attractive-looking person.

The other reason you need to stick to one medium is that your audience will spend most of their time on one or two social media channels.

If they see that your content is not yet popular on the other channel, they will do the cross-posting for you. For instance, we've both seen Twitter content within ML, such as the Gary Marcus debate and attack on Yann LeCun, spill over onto Reddit. And of course, people are still posting tweets on TikTok!

Word of mouth will be your biggest friend.

Find problems that many people face. Usually, a simple problem faced by many is a great problem statement. The Python requests library comes to mind as an excellent example of such a challenge. The work by Gensim on shallow vectorization methods like word2vec and GloVe was similar in vein for quite a long time. Of course, the rise of Deep Learning and better tooling has made their work less important, but they stuck in my mind, didn't they?

Why is that a great problem statement?

It's maximising the area under the curve. Solve a trivial problem faced by many or a huge problem faced by some. It has the same impact.

What's something that has worked for you in finding interesting problems?

Find intersections with domains that have little to do with each other. For us, there are domains that have little to do with tech/code and can see great benefits from our involvement.

Marketing yourself has nothing to do with marketing but everything to do with the problems you solve and the solutions you come up with. Make sure the solution is accessible to the wider audience. It should not be that only a certain section of the population can use it. If you plan to market yourself, spend 95% of the time on a quality problem and a quality solution and 5% of time talking about it. This is usually enough if the first 95% is done well.

What medium to talk about these in?

The usual ones are blog posts, social media posts, and so on. But there is an open secret within the community: writing papers is probably the best way to talk about the stuff you've done.

Why so?

Papers have a halo effect: they improve your reputation and make it sticky. People might forget a blog post quickly, but you can get recognition and perks for around two years after writing a paper. There are other secondary gains too. Once you write a paper, you start reading papers differently; you develop a better intuition for reading between the lines to understand the author's intent and point of view. Another obvious benefit is that you get better at writing papers, and your thought process will start coming across much more clearly.