RAG Metrics for Technical Leaders: Beyond Recall
MRR, nDCG, Hit Rate, and Recall: Know Your Retrieval Metrics
If you're working on RAG, search, or anything that touches vector databases, you've probably run into a mess of evaluation metrics: MRR, nDCG, hit rate, recall. Everyone throws these terms around. Few explain them well.
This post is for practitioners who want to go from vague intuition to confident decisions.
If you're just starting out and debugging a hallucinating LLM, use Hit Rate. Once you're ready to get serious, use MRR + Recall during retriever tuning, and nDCG + Hit Rate when tuning a reranker or doing system-level evals.
TL;DR: When to Use What¶
Use Case / Need | Metric to Use | Why |
---|---|---|
You just want to check if any correct result was retrieved | Hit Rate | Binary success metric, useful for RAG "was it in the top-k?" |
You want to know how many of the correct results were found | Recall | Focused on completeness: how much of the signal did you recover |
You want to know how early the 1st correct result appears | MRR | Good for single-answer QA and fast-hit UIs |
You care about ranking quality across all relevant results | nDCG | Ideal for multi-relevance tasks like document or product search |
Understanding Each Metric in Detail¶
Hit Rate¶
- Binary metric: did any relevant doc show up in top-k?
- Doesn't care if it's Rank 1 or Rank 5; it just needs a hit.
Use Hit Rate when: You're debugging RAG. Great for checking if the chunk with the answer even made it through.
Think: "Did we even get one hit in the top-k?"
Recall¶
- Measures what fraction of all relevant documents were retrieved in top-k.
- Penalizes for missing multiple relevant items.
Use Recall when: You want completeness. Think medical retrieval, financial documents, safety-critical systems.
Think: "Did we find enough of what we needed?"
MRR (Mean Reciprocal Rank)¶
- Tells you how early the first relevant document appears.
- If the first correct answer is at Rank 1 → score = 1.0
- Rank 2 → score = 0.5; Rank 5 → score = 0.2
Use MRR when: Only one answer matters (QA, intent classification, slot filling). You care if your system gets it fast.
Think: "Do we hit gold in the first result?"
nDCG (Normalized Discounted Cumulative Gain)¶
- Looks at all relevant docs, not just the first.
- Discounts docs by rank: a relevant doc placed higher contributes more to the score.
- Supports graded relevance ("highly relevant" vs "somewhat relevant").
Use nDCG when: Ranking quality matters. Ideal for search, recsys, anything with many possible good results.
Think: "Did we rank the good stuff higher overall?"
How They Differ¶
Metric | Binary or Graded | 1st Hit Only? | Sensitive to Rank? | Use For... |
---|---|---|---|---|
Hit Rate | Binary | No | No (only the top-k cutoff matters) | RAG debugging, presence check |
Recall | Binary or Graded | No | No | Completeness, coverage |
MRR | Binary | Yes | Yes | Fast hits, QA |
nDCG | Graded | No | Yes | Ranking quality, search |
Retrieval Is Not One Metric¶
People default to one number because it's convenient. But retrieval is multi-objective:
- You want early relevant hits (MRR)
- You want to find most of the relevant hits (Recall)
- You want them ranked well (nDCG)
- You want to know if you're even in the game (Hit Rate)
Choose the metric that matches your product surface.
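To make the "multi-objective" point concrete, here's a small self-contained sketch that scores one toy ranked list against all four metrics at once; the doc ids and graded labels are invented for the example, and the same run can look strong on one metric and weak on another:

```python
import math

retrieved = ["d4", "d1", "d9", "d2", "d7"]   # ranked list from the retriever
relevant = {"d1": 2, "d2": 1, "d3": 1}       # graded labels; keys are the relevant ids
k = 5

top_k = retrieved[:k]

# Hit Rate: did anything relevant show up at all?
hit_rate = 1.0 if any(doc in relevant for doc in top_k) else 0.0

# Recall: what fraction of the relevant docs did we recover?
recall = len(set(top_k) & set(relevant)) / len(relevant)

# Reciprocal rank (MRR for a single query): how early is the first hit?
rr = next((1.0 / rank for rank, doc in enumerate(top_k, 1) if doc in relevant), 0.0)

# nDCG: how well are the graded docs ordered overall?
dcg = sum(relevant.get(doc, 0) / math.log2(rank + 1) for rank, doc in enumerate(top_k, 1))
idcg = sum(g / math.log2(rank + 1)
           for rank, g in enumerate(sorted(relevant.values(), reverse=True)[:k], 1))
ndcg = dcg / idcg if idcg else 0.0

print(f"hit rate {hit_rate:.2f} | recall {recall:.2f} | RR {rr:.2f} | nDCG {ndcg:.2f}")
# hit rate 1.00 | recall 0.67 | RR 0.50 | nDCG 0.54
```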
Pro Tips¶
- Use Hit Rate when you're just starting out and debugging a hallucinating LLM
- Use MRR + Recall during retriever tuning
- Use nDCG + Hit Rate when tuning a reranker or doing system-level evals
In short: the right metric for the right job.
Final Word¶
MRR isn't better than nDCG. Recall isn't cooler than Hit Rate. They just answer different questions.
So the next time someone asks, "What's your retrieval performance?" you can say: "Depends. What do you care about?"