
RAG

5 RAG Query Patterns Every Engineering Leader Should Know

Ever tried building a RAG system that actually works for all the different ways humans ask questions? After years of building and breaking retrieval systems at scale, I've found that most RAG failures happen at the query understanding level.

Here's the thing: not all queries are created equal. The reason your system hallucinates or gives garbage answers often has more to do with the question type than your vector DB settings or chunking strategy.

I've distilled RAG queries into 5 distinct patterns, each requiring different handling strategies. Understanding these will save your team months of confusion and help you diagnose issues before they become production nightmares. These are the most common patterns I've seen in RAG systems, but I don't claim they are the only ones.

tl;dr

  • Synthesis queries: Straightforward factoid retrieval with light transformation
  • Lookup queries: Require specific information retrieval, often with time/comparative elements
  • Multi-hop queries: Need decomposition into sub-questions for complete answers
  • Insufficient context queries: Questions your system should admit it can't answer
  • Creative/generative queries: Where LLM hallucination is actually desired

1. Synthesis Queries: The RAG Sweet Spot

Synthesis queries are the bread and butter of RAG systems - straightforward questions requiring basic factual retrieval and minimal transformation.

Examples:

  • "What were our Q2 earnings?"
  • "What's the maximum dosage for Drug X?"
  • "When was our healthcare policy updated?"

💡 Key insight: Synthesis queries typically map directly to content in your knowledge base, requiring minimal inference from the LLM. These are where RAG truly shines.

These queries typically follow a predictable pattern:

  • A clear, singular subject
  • A specific attribute being requested
  • No complex temporal or conditional elements

Engineering implication: For synthesis queries, retrieval precision matters more than recall. Your system needs to find the exact relevant information rather than gathering broadly related context.

I built a healthcare RAG system where we optimized specifically for synthesis queries by implementing a document-first chunking strategy. This increased our accuracy by 17% for straightforward factual queries while sacrificing performance on more complex questions - a tradeoff we explicitly made based on user behavior analysis.

2. Lookup Queries: Beyond Simple Facts

Lookup queries introduce additional complexity through comparative elements, time components, or the need to process patterns. They often rely on aggregation over attributes such as time or location, so I recommend setting up a metadata index to support them.

Examples:

  • "How did our healthcare costs compare between 2022 and 2023?"
  • "What's the trend in side effect reporting for Drug X over the past 5 years?"
  • "Show me all dividend-paying stocks that increased yield for 3 consecutive quarters"

Look for these patterns in lookup queries:

  • Time-bound components ("during 2023," "over the past five years")
  • Comparative elements ("compared to," "versus")
  • Trend analysis requirements ("pattern," "trend," "over time")

Engineering implication: Lookup queries often require merging information from multiple documents or sources. Your RAG system needs strong reranking capabilities and potentially dedicated retrieval strategies, e.g. text2sql, or preprocessing the corpus into tables that can be queried (h/t Dhruv Anand).

One approach I've found effective is a two-phase retrieval followed by synthesis (sketched below):

  1. Fetch the core entities and facts
  2. Run a separate retrieval for the comparison elements
  3. Let the LLM synthesize both retrieved contexts
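
Here's a minimal sketch of that flow, assuming an OpenAI-style chat client and a hypothetical retrieve() function standing in for your own search layer:

from openai import OpenAI

client = OpenAI()

def answer_lookup_query(question: str, entity_query: str, comparison_query: str) -> str:
    # Phase 1: fetch the core entities and facts
    core_context = retrieve(entity_query)            # retrieve() is your own search function
    # Phase 2: run a separate retrieval for the comparison elements
    comparison_context = retrieve(comparison_query)
    # Finally, let the LLM synthesize both retrieved contexts
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Core facts:\n{core_context}\n\n"
                f"Comparison data:\n{comparison_context}\n\n"
                "Answer the question using only the context above."
            ),
        }],
    )
    return response.choices[0].message.content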

3. Multi-hop Queries: The Reasoning Challenge

These are the questions that require breaking down into sub-questions, with each answer feeding into the next retrieval step.

Examples:

  • "Which of our healthcare plans has the best coverage for the conditions most common among our engineering team?"
  • "What investment strategy would have performed best in the sectors where we saw the highest growth last quarter?"

💡 Key insight: Multi-hop queries can't be solved with a single retrieval operation. They require decomposition, planning, and sequential execution.

Engineering implication: Your system architecture needs to support query planning and multiple retrieval steps. This often means implementing:

  1. A query decomposition module to break complex questions into simpler ones
  2. A retrieval orchestrator to manage multiple search operations
  3. A synthesis component to integrate findings from multiple retrievals

I remember debugging a financial RAG system that kept hallucinating on multi-hop queries. The root cause wasn't the retrieval system - it was the lack of a decomposition step. We implemented a simple query planning stage that improved accuracy by 32% for complex queries.
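
A minimal version of such a planning stage is a structured decomposition call before retrieval. Here's a sketch using Instructor (the same library as the classifier later in this post) and a hypothetical retrieve() helper:

import instructor
from openai import OpenAI
from pydantic import BaseModel

planner = instructor.from_openai(OpenAI())

class QueryPlan(BaseModel):
    sub_questions: list[str]

def plan_and_retrieve(query: str) -> list[tuple[str, str]]:
    # 1. Decompose the complex question into simpler, retrievable sub-questions
    plan = planner.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_model=QueryPlan,
        messages=[{
            "role": "user",
            "content": f"Break this question into simpler sub-questions: {query}",
        }],
    )
    # 2. Retrieve context per sub-question; retrieve() is your own search function
    # 3. A final synthesis call would combine these (sub-question, context) pairs
    return [(q, retrieve(q)) for q in plan.sub_questions]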

4. Insufficient Context Queries: Learning to Say "I Don't Know"

Some questions simply cannot be answered with the information available. The hallmark of a mature RAG system is recognizing these cases.

Examples:

  • "What will our stock price be next quarter?"
  • "Which unreleased drug in our pipeline will have the fewest side effects?"
  • "How will changes to healthcare policy affect our costs in 2026?"

Engineering implication: You need to implement robust confidence scoring and thresholds for when your system should refuse to answer. This requires:

  1. Evaluating retrieval quality (not just semantic similarity)
  2. Assessing whether retrieved content actually addresses the query
  3. Implementing explicit "insufficient information" detection

One technique I've found effective is implementing a self-evaluation prompt after the RAG pipeline generates an answer:

Given the original query "{query}" and the retrieved context "{context}", 
evaluate whether the generated answer "{answer}" is:
1. Fully supported by the retrieved context
2. Partially supported with some unsupported claims
3. Largely unsupported by the context

If the evaluation returns categories 2 or 3, we either refuse to answer or clearly indicate what parts of the response are speculative.
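
Here's a sketch of that guardrail in code, using Instructor for a structured verdict; the category names and refusal message are placeholders:

from typing import Literal

import instructor
from openai import OpenAI
from pydantic import BaseModel

judge = instructor.from_openai(OpenAI())

class SupportVerdict(BaseModel):
    support: Literal["fully_supported", "partially_supported", "unsupported"]

def guard_answer(query: str, context: str, answer: str) -> str:
    verdict = judge.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_model=SupportVerdict,
        messages=[{
            "role": "user",
            "content": (
                f'Given the original query "{query}" and the retrieved context "{context}", '
                f'evaluate whether the generated answer "{answer}" is fully supported, '
                "partially supported, or unsupported by the context."
            ),
        }],
    )
    if verdict.support == "fully_supported":
        return answer
    # Partially supported or unsupported: refuse, or flag the speculative parts
    return "I don't have enough information in the knowledge base to answer that reliably."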

5. Creative/Generative Queries: When Hallucination is a Feature

Some queries explicitly request creative generation where strict factuality isn't the primary goal.

Examples:

  • "Draft a blog post about our healthcare benefits program"
  • "Generate a sample investor pitch based on our financial performance"
  • "Write a description of what our ideal drug delivery mechanism might look like"

💡 Key insight: For creative queries, LLM capabilities should be emphasized over retrieval, using the knowledge base as inspiration rather than constraint.

Engineering implication: Your system needs to:

  1. Identify when a query is creative rather than factual
  2. Adjust the retrieval-generation balance to favor generation
  3. Use broader, more diverse retrieval to spark creativity
  4. Preferably, implement different evaluation metrics for these queries
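
One lightweight way to act on that is to route retrieval breadth and sampling settings by query type; the numbers below are illustrative defaults, not recommendations:

def generation_settings(query_type: str) -> dict:
    """Pick retrieval breadth and sampling temperature per query type."""
    if query_type == "creative":
        # Broader, more diverse retrieval and looser sampling to spark ideas
        return {"top_k_docs": 12, "temperature": 0.9}
    # Factual query types stay tightly grounded in a small, precise context
    return {"top_k_docs": 4, "temperature": 0.1}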

Practical Implementation: Query Type Detection (Evals)

Don't expect users to tell you what type of query they're asking. Your system needs to detect this automatically. I've implemented a simple but effective query classifier that looks something like this:

from typing import Literal

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class QueryClassification(BaseModel):
    category: Literal[
        "synthesis",
        "lookup",
        "multi-hop",
        "insufficient_context",
        "creative",
    ] = Field(description="The query category")
    confidence: float = Field(
        description="Confidence score for the classification",
        ge=0.0,
        le=1.0,
    )


# Patch the OpenAI client so responses are parsed into the Pydantic model
client = instructor.from_openai(OpenAI())


def classify_rag_query(query: str) -> str:
    """
    Classifies a query into one of the five RAG query types using Instructor for structured outputs.
    """
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # any tool-calling capable model works here
        response_model=QueryClassification,
        messages=[{
            "role": "user",
            "content": f"Classify this query: {query}",
        }],
    )
    return result.category

Testing Matrix for Different Query Types

For effective RAG system evaluation, you need a test suite that covers all five query types:

Query Type           | Evaluation Metrics
Synthesis            | Precision, answer correctness
Lookup               | F1 score, completeness
Multi-hop            | Reasoning correctness, factuality
Insufficient context | Refusal rate, hallucination detection
Creative             | Relevance, creativity metrics
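
A practical starting point is a small labelled set and a parametrized test over the classifier from the previous section; the entries below are toy examples, and the per-type metrics in the table would hang off the same harness:

import pytest

# Toy labelled set; in practice, sample and label real user queries.
# Assumes classify_rag_query from the earlier snippet is importable.
EVAL_SET = [
    ("What were our Q2 earnings?", "synthesis"),
    ("How did our healthcare costs compare between 2022 and 2023?", "lookup"),
    ("What will our stock price be next quarter?", "insufficient_context"),
]

@pytest.mark.parametrize("query,expected", EVAL_SET)
def test_query_type_routing(query, expected):
    assert classify_rag_query(query) == expected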

Think About This

How often does your team debug RAG issues without first identifying the query type? Most teams I see spend weeks optimizing retrieval parameters when the real problem is a mismatch between query type and system design.

Next time your RAG system fails, ask: "What type of query is this, and is our system designed to handle this specific type?"

Originally published by Nirant Kasliwal, who builds RAG systems that don't embarrass your brand.

Thanks to Dhruv Anand and Rajaswa Patil for reading drafts of this.

Beyond Basic RAG: What You Need to Know

The Real World of RAG Systems

📒 Picture this: You're a developer who just deployed your first RAG system. Everything seems perfect in testing. Then reality hits - users start complaining about irrelevant results, the inability to do "basic stuff", and occasional hallucinations. Welcome to the world of real-world RAG systems.

The Problem With "Naive RAG"

Let's start with a truth bomb: dumping documents into a vector database and hoping for the best is like trying to build a search engine with just a dictionary - technically possible, but practically useless.

Here's why:

  1. The Embedding Trap: Think embedding similarity is enough? Here's a fun fact - in many embedding models, "yes" and "no" have a similarity of 0.8-0.9. Imagine asking for "yes" and getting a "no" instead in a legal search 😅

  2. The Context Confusion: Large Language Models (LLMs) get surprisingly confused when you give them unrelated information. They're like that friend who can't ignore an app notification while telling a story - everything gets mixed up.

  3. Length Effect: Just as humans get worse at noticing details the longer a story runs, LLMs with large context windows get worse at picking out details as the context grows.

The Three Pillars of Production RAG

1. Query Understanding 🎯

The first step to better RAG isn't about better embeddings - it's about understanding what your users are actually asking for. Here are the basics:

  • Query Classification: Before rushing to retrieve documents, classify the query type. Is it a simple lookup? A comparison? An aggregation? Each needs different handling.
    • NIT: Navigational, Informational, and Transactional are the three classic broad types.
  • Metadata Extraction: Time ranges, entities, filters - extract these before retrieval. Think of it as giving students sample questions before an exam: at query time, the system knows what to pay attention to and gets to it faster.

Metadata Queries

The CEO of a company asks for "last year's revenue"

The CFO asks for "revenue from last year"

The CMO asks for "revenue from the last fiscal year"

Do all these queries mean different things? Not really - the wording barely differs. What changes the query intent is the asker's role, i.e. query metadata.

2. Intelligent Retrieval Strategies 🔍

Here's where most systems fall short. Instead of one-size-fits-all retrieval:

  • Hybrid Search: Combine dense (embedding) and sparse (keyword) retrieval. You can rerank using late interaction, use an LLM as a reranker, or even use both in a cascade. I could write a whole blog post on this, but the tl;dr is that you can combine many retrieval strategies to get the best of precision, recall, cost, and latency.
  • Query Expansion: Don't just search for what users ask - search for what they mean. Example: "Q4 results" should also look for "fourth quarter performance."
  • Context-Aware Filtering: Use metadata to filter before semantic search. If someone asks for "last week's reports," don't rely on embeddings to figure out the time range.
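
For the filtering piece, here's a sketch using Qdrant payload filters (any vector store with metadata filtering works similarly); the collection name, payload field, and embed() call are stand-ins for your own setup:

import time

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
week_ago = time.time() - 7 * 24 * 3600

# "last week's reports": resolve the time range from metadata, not embeddings
hits = client.search(
    collection_name="reports",              # assumed collection name
    query_vector=embed("weekly report"),    # embed() stands in for your embedding call
    query_filter=models.Filter(
        must=[models.FieldCondition(
            key="created_at_ts",            # assumed numeric timestamp in the payload
            range=models.Range(gte=week_ago),
        )]
    ),
    limit=10,
)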

3. Result Synthesis and Validation ✅

The final piece is making sure your responses are accurate and useful:

  • Cross-Validation: For critical information (dates, numbers, facts), validate across multiple sources at ingestion time. It's possible that your ingestion pipeline is flawed and you don't know it.
  • Readability Checks: Use tools like the Flesch-Kincaid score to ensure responses match your user's expertise level.
  • Hallucination Detection: Implement systematic checks for information that isn't grounded in your retrieved documents. Consider evaluating the pipeline offline using tools like Ragas.
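
As a concrete example of a readability check, here's a sketch assuming the textstat package; the grade-level threshold is something you'd tune per audience:

import textstat

def readable_enough(answer: str, max_grade: float = 10.0) -> bool:
    # Flag answers that read above the target Flesch-Kincaid grade level
    return textstat.flesch_kincaid_grade(answer) <= max_grade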

Real-World Example: The Leave Policy Fiasco

Here's a real story that illustrates why naive RAG fails:

The Leave Policy Fiasco

Company X implemented a RAG system for HR queries. When employees asked about leave policies, the system searched the entire company's wiki, including the sales team's pages, and the sales content "ranked" higher because it contained similar keywords.

The result? The entire company was getting sales team vacation policies instead of their own 🤦‍♂️

The solution? They implemented:

  1. Role-based filtering

  2. Document source validation

  3. Query intent classification

Making Your RAG System Production-Ready

Here's your action plan:

  1. Query Understanding: Implement basic query type classification
  2. Ingestion: Extract key metadata (dates, entities, filters)
  3. Retrieval: Begin with metadata filtering
  4. Retrieval: Add keyword-based search or BM25
  5. Retrieval: Top it off with semantic search
  6. Synthesis: Combine results intelligently using a good re-ranker or fusion e.g. RRF
  7. Validation: Cross-check extracted dates and numbers
  8. Validation: Implement a RAG metrics system e.g. Ragas
  9. Validation: Monitor user feedback e.g. using A/B tests and adapt

Reciprocal Rank Fusion

Reciprocal Rank Fusion (RRF) is a technique that combines the results of multiple retrieval systems. It's a powerful way to improve the quality of your search results by leveraging the strengths of different retrieval methods.
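
A bare-bones implementation is only a few lines; k=60 is the constant commonly used in the RRF literature:

from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Fuse several ranked lists of document IDs into a single ranking
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["doc3", "doc1", "doc7"]     # toy keyword-search ranking
dense_ids = ["doc1", "doc9", "doc3"]    # toy embedding-search ranking
print(reciprocal_rank_fusion([bm25_ids, dense_ids]))  # doc1 and doc3 rise to the top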

But it's NOT a silver bullet.

The Challenge

Stop thinking about RAG as just "retrieve and generate."

Start thinking about it as a complex system that needs to understand, retrieve, validate, and synthesize information intelligently.

Your homework: Take one query type that's failing in your system. Implement query classification and targeted retrieval for just that type. Measure the improvement. You'll be amazed at the difference this focused approach makes.


Remember: The goal isn't to build a perfect RAG system (that doesn't exist). The goal is to build a RAG system that improves continuously and fails gracefully.

Your Turn

What's your biggest RAG challenge? Let's solve it together. Let me know on Twitter or email.

Retrieval Augmented Generation Best Practices

Retrieval and Ranking Matter!

Chunking

  1. Including the section title in your chunks improves retrieval, and so do keywords from the documents
  2. Use token-efficient separators between your chunks, e.g. ### is a single token in GPT tokenizers

Examples

  1. A few examples are better than no examples
  2. Examples at the start and end of the prompt carry the most weight; the ones in the middle tend to get forgotten by the LLM

Rerankers

Latency permitting, use a reranker: Cohere, Sentence Transformers, and BGE have decent ones out of the box.

Embedding

Use the right embedding for the right problem:

GTE, BGE are best for most support, sales, and FAQ kind of applications.

OpenAI's embeddings are the easiest to use for code.

The e5 family covers languages beyond English and Chinese.

If you can, finetune the embedding model on your domain - it takes about 20 minutes on a modern laptop or a Colab notebook and improves recall by up to 30-50%.
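
Here's a minimal fine-tuning sketch with sentence-transformers, assuming you have (query, relevant passage) pairs from your domain; the base model and hyperparameters are placeholders:

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # assumed base model

# Positive (query, relevant passage) pairs mined from your own corpus
train_examples = [
    InputExample(texts=["maximum dosage for Drug X", "Drug X should not exceed 20 mg/day..."]),
    InputExample(texts=["Q2 earnings", "Revenue for the second quarter was..."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("bge-finetuned-domain")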

Evaluation

Evaluation Driven Development makes your entire "dev" iteration much faster.

Think of evals as the equivalent of running the code to see if it works.

I strongly recommend using Ragas for this. They have LangChain and LlamaIndex integrations too, which are great for real-world scenarios.

Scaling

LLM Reliability

Have a failover LLM for when your primary LLM is down, slow or just not working well. Can you switch to a different LLM in 1 minute or less automatically?
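
The simplest version is a request timeout plus an ordered list of fallbacks; this sketch fails over between two OpenAI models, but in production you'd more likely fail over across providers and add health checks:

import openai

PRIMARY_MODEL = "gpt-4o"        # assumed primary model
FALLBACK_MODEL = "gpt-4o-mini"  # assumed failover model

def complete_with_failover(messages: list[dict], timeout: float = 10.0) -> str:
    client = openai.OpenAI(timeout=timeout)
    for model in (PRIMARY_MODEL, FALLBACK_MODEL):
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except openai.OpenAIError:
            continue  # primary is down or slow; try the next model
    raise RuntimeError("All configured LLMs failed")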

Vector Store

When you're hitting latency and throughput limits on the Vector Store, consider using scalar quantization with a dedicated vector store like Qdrant or Weaviate

Qdrant also has Binary Quantization which allows you to scale 30-40x with OpenAI Embeddings.
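
For reference, enabling scalar quantization in Qdrant is a collection-level setting; the vector size of 1536 below assumes OpenAI-style embeddings, and the collection name is a placeholder:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",  # assumed collection name
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # ~4x smaller vectors, small recall tradeoff
            quantile=0.99,
            always_ram=True,              # keep quantized vectors in RAM for speed
        )
    ),
)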

Finetuning

LLM: a finetuned OpenAI GPT-3.5 will often be as good as GPT-4.

It needs about 100 records, and you get roughly 30% latency improvement for free.

So quite often worth the effort!

This extends to OSS LLMs: it can't hurt to "pretrain"-finetune your Mistral or Zephyr-7B for $5.