
Beyond Basic RAG: What You Need to Know

The Real World of RAG Systems

📒 Picture this: You're a developer who just deployed your first RAG system. Everything seems perfect in testing. Then reality hits - users start complaining about irrelevant results, an inability to handle "basic stuff", and occasional hallucinations. Welcome to the world of real-world RAG systems.

The Problem With "Naive RAG"

Let's start with a truth bomb: dumping documents into a vector database and hoping for the best is like trying to build a search engine with just a dictionary - technically possible, but practically useless.

Here's why:

  1. The Embedding Trap: Think embedding similarity is enough? Here's a fun fact - in many embedding models, "yes" and "no" have a similarity of 0.8-0.9. Imagine asking for "yes" and getting a "no" instead in a legal search 😅 (see the sketch after this list).

  2. The Context Confusion: Large Language Models (LLMs) get surprisingly confused when you give them unrelated information. They're like that friend who can't ignore an app notification while telling a story - everything gets mixed up.

  3. The Length Effect: Just as humans get worse at noticing details the longer a story runs, LLMs with large context windows get worse at noticing details the longer the input gets.
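
Don't take that similarity claim on faith - it's easy to check yourself. Here's a minimal sketch using sentence-transformers; the model choice is mine, and the exact value varies by model:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
emb = model.encode(["yes", "no"], normalize_embeddings=True)

# Cosine similarity between "yes" and "no" - often surprisingly high,
# though the exact number depends on the embedding model.
print(util.cos_sim(emb[0], emb[1]))
```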

The Three Pillars of Production RAG

1. Query Understanding 🎯

The first step to better RAG isn't about better embeddings - it's about understanding what your users are actually asking for. Here are the basics:

  • Query Classification: Before rushing to retrieve documents, classify the query type. Is it a simple lookup? A comparison? An aggregation? Each needs different handling.
    • NIT: Navigational, Informational, and Transactional are the three broad types.
  • Metadata Extraction: Time ranges, entities, filters - extract these before retrieval. Think of it as giving students sample questions before an exam: at query time, the system already knows what to pay attention to. A minimal sketch follows this list.
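
Here's a hedged sketch of what this pre-retrieval step can look like. The categories, regexes, and ParsedQuery fields are all illustrative; production systems often use a small classifier model or an LLM call instead:

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ParsedQuery:
    text: str
    query_type: str = "informational"   # navigational | informational | transactional
    time_range: Optional[str] = None    # extracted metadata, used as a pre-retrieval filter
    entities: List[str] = field(default_factory=list)

def parse_query(text: str) -> ParsedQuery:
    q = ParsedQuery(text=text)
    # Toy classification rules - stand-ins for a real classifier.
    if re.search(r"\b(open|go to|find page)\b", text, re.I):
        q.query_type = "navigational"
    elif re.search(r"\b(buy|order|book|cancel)\b", text, re.I):
        q.query_type = "transactional"
    # Toy time-range extraction; resolve to concrete dates downstream.
    m = re.search(r"\b(last (week|month|year|fiscal year)|Q[1-4])\b", text, re.I)
    if m:
        q.time_range = m.group(0)
    return q

print(parse_query("revenue from the last fiscal year"))
```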

Metadata Queries

The CEO of a company asks for "last year's revenue"

The CFO asks for "revenue from last year"

The CMO asks for "revenue from the last fiscal year"

Do these queries mean different things? Textually, not really. What changes the intent is the asker's role, i.e. the query metadata.

2. Intelligent Retrieval Strategies 🔍

Here's where most systems fall short. Instead of one-size-fits-all retrieval:

  • Hybrid Search: Combine dense (embedding) and sparse (keyword) retrieval. You can rerank with late interaction, use an LLM as a reranker, or even use both in a cascade. I could write a whole blog post on this, but the tl;dr is that combining several retrieval strategies lets you balance precision, recall, cost, and latency.
  • Query Expansion: Don't just search for what users ask - search for what they mean. Example: "Q4 results" should also look for "fourth quarter performance" (a toy sketch follows this list).
  • Context-Aware Filtering: Use metadata to filter before semantic search. If someone asks for "last week's reports," don't rely on embeddings to figure out the time range.
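
As promised, a toy sketch of query expansion. The synonym map is a stand-in; in practice you'd generate paraphrases with an LLM or maintain a domain thesaurus:

```python
# Illustrative expansion table - replace with LLM-generated paraphrases
# or a curated domain thesaurus.
EXPANSIONS = {
    "q4": ["fourth quarter"],
    "revenue": ["sales", "top line"],
}

def expand_query(query: str) -> list:
    variants = {query.lower()}
    for term, alternatives in EXPANSIONS.items():
        if term in query.lower():
            for alt in alternatives:
                variants.add(query.lower().replace(term, alt))
    return sorted(variants)

print(expand_query("Q4 results"))  # ['fourth quarter results', 'q4 results']
```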

3. Result Synthesis and Validation ✅

The final piece is making sure your responses are accurate and useful:

  • Cross-Validation: For critical information (dates, numbers, facts), validate across multiple sources at ingestion time. It's possible that your ingestion pipeline is flawed and you don't know it.
  • Readability Checks: Use tools like the Flesch-Kincaid score to ensure responses match your users' expertise level (see the sketch after this list).
  • Hallucination Detection: Implement systematic checks for information that isn't grounded in your retrieved documents. Consider evaluating the pipeline offline with tools like Ragas.
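
For the readability check specifically, a small sketch using the textstat package; the grade threshold is an assumption you'd tune to your audience:

```python
# pip install textstat
import textstat

def readability_ok(answer: str, max_grade: float = 10.0) -> bool:
    """Flag answers that read above the target US grade level."""
    return textstat.flesch_kincaid_grade(answer) <= max_grade

print(readability_ok("Quarterly revenue grew by 12 percent, driven by new subscriptions."))
```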

Real-World Example: The Leave Policy Fiasco

Here's a real story that illustrates why naive RAG fails:

Company X implemented a RAG system for HR queries. When employees asked about leave policies, the system searched the entire company wiki -- including the sales team's pages. And sales "ranked" higher because it contained similar keywords.

The result? The entire company was getting the sales team's vacation policies instead of their own 🤦‍♂️

The solution? They implemented:

  1. Role-based filtering (sketched after this list)

  2. Document source validation

  3. Query intent classification
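
Role-based filtering can be as simple as a payload filter applied at search time. A hedged sketch with Qdrant; the collection name, the team payload field, and the placeholder vector are illustrative:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

query_embedding = [0.0] * 384  # placeholder; use your real query embedding

# The filter is applied alongside the vector search, so other teams'
# documents never reach the LLM in the first place.
hits = client.search(
    collection_name="hr_wiki",
    query_vector=query_embedding,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="team", match=models.MatchValue(value="engineering"))]
    ),
    limit=5,
)
```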

Making Your RAG System Production-Ready

Here's your action plan:

  1. Query Understanding: Implement basic query type classification
  2. Ingestion: Extract key metadata (dates, entities, filters)
  3. Retrieval: Begin with metadata filtering
  4. Retrieval: Add keyword-based search or BM25
  5. Retrieval: Top it off with semantic search
  6. Synthesis: Combine results intelligently using a good reranker or fusion, e.g. RRF
  7. Validation: Cross-check extracted dates and numbers
  8. Validation: Implement a RAG metrics system, e.g. Ragas
  9. Validation: Monitor user feedback, e.g. via A/B tests, and adapt

Reciprocal Rank Fusion

Reciprocal Rank Fusion (RRF) is a technique that combines the results of multiple retrieval systems. It's a powerful way to improve the quality of your search results by leveraging the strengths of different retrieval methods.

But it's NOT a silver bullet.
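
For the curious, RRF fits in a few lines. A minimal sketch; k=60 is the conventional smoothing constant from the original RRF paper, and the doc IDs are made up:

```python
from collections import defaultdict

def rrf(rankings, k: int = 60):
    """Fuse multiple ranked lists of doc IDs into one."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc3", "doc1", "doc7"]   # e.g. from embedding search
sparse = ["doc1", "doc9", "doc3"]   # e.g. from BM25
print(rrf([dense, sparse]))         # docs ranked well by both lists rise to the top
```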

The Challenge

Stop thinking about RAG as just "retrieve and generate."

Start thinking about it as a complex system that needs to understand, retrieve, validate, and synthesize information intelligently.

Your homework: Take one query type that's failing in your system. Implement query classification and targeted retrieval for just that type. Measure the improvement. You'll be amazed at the difference this focused approach makes.


Remember: The goal isn't to build a perfect RAG system (that doesn't exist). The goal is to build a RAG system that improves continuously and fails gracefully.

Your Turn

What's your biggest RAG challenge? Let's solve it together. Let me know on Twitter or email.

Retrieval Augmented Generation Best Practices

Retrieval and Ranking Matter!

Chunking

  1. Including the section title in your chunks improves retrieval; so does including keywords from the document (a sketch follows this list)
  2. Use token-efficient separators in your chunks, e.g. "###" is a single token in GPT tokenizers
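
A sketch combining both ideas; the helper and its inputs are illustrative:

```python
def make_chunks(section_title, keywords, paragraphs):
    """Prefix every chunk with the section title and document keywords,
    using '###' as a token-efficient separator."""
    prefix = f"{section_title} ### {', '.join(keywords)} ### "
    return [prefix + p for p in paragraphs]

chunks = make_chunks(
    "Leave Policy",
    ["vacation", "PTO", "sick leave"],
    ["Employees accrue 1.5 days of paid leave per month.",
     "Unused leave carries over up to a maximum of 30 days."],
)
```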

Examples

  1. A few examples are better than no examples
  2. Examples at the start and end of the prompt carry the highest weight; the ones in the middle are largely forgotten by the LLM

Rerankers

Latency permitting, use a reranker: Cohere, Sentence Transformers, and BGE have decent ones out of the box.
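
A minimal sketch with a Sentence Transformers cross-encoder; the model name is one that's commonly used off the shelf:

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list, top_k: int = 5) -> list:
    # Score every (query, doc) pair, then keep the highest-scoring docs.
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```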

Embedding

Use the right embedding for the right problem:

GTE and BGE are best for most support, sales, and FAQ applications.

OpenAI's is the easiest to use for code embedding.

The e5 family works well for languages beyond English and Chinese.

If you can, finetune the embedding model on your domain. It takes about 20 minutes on a modern laptop or a Colab notebook and can improve recall by up to 30-50%.
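
A hedged sketch of that finetuning loop with sentence-transformers (the classic model.fit API); the base model and the (query, passage) pairs are placeholders for your own choices and data:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Your domain data: (query, relevant_passage) pairs. In-batch negatives
# come free with MultipleNegativesRankingLoss.
pairs = [
    ("how do I reset my password?", "To reset your password, open Settings and ..."),
    # ... a few hundred pairs work well
]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # illustrative base model
train_data = [InputExample(texts=[q, p]) for q, p in pairs]
loader = DataLoader(train_data, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("bge-small-finetuned")
```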

Evaluation

Evaluation Driven Development makes your entire "dev" iteration much faster.

Think of evals as the equivalent of "running the code to see if it works."

I strongly recommend using Ragas for this. They have LangChain and LlamaIndex integrations too, which are great for real-world scenarios.
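
A sketch of an offline Ragas run; column and metric names follow the Ragas docs (most metrics also need an LLM key, e.g. OPENAI_API_KEY, set in your environment), and the one-row dataset is obviously illustrative:

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

data = Dataset.from_dict({
    "question": ["How many leave days do employees get?"],
    "answer": ["Employees get 18 paid leave days per year."],
    "contexts": [["Employees accrue 1.5 days of paid leave per month."]],
    "ground_truth": ["18 days per year."],
})

print(evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision]))
```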

Scaling

LLM Reliability

Have a failover LLM for when your primary LLM is down, slow, or just not working well. Can you switch to a different LLM in one minute or less, automatically?
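
One way to make that switch automatic is a provider-agnostic failover loop. Everything below is a sketch; providers is a list of (name, callable) pairs you wire up to your actual LLM clients:

```python
import time

def complete_with_failover(prompt: str, providers, max_latency_s: float = 10.0):
    """Try each provider in order; fall back on errors or slow responses."""
    for name, call in providers:
        start = time.monotonic()
        try:
            answer = call(prompt)
            if time.monotonic() - start <= max_latency_s:
                return name, answer
            # Response arrived but too slowly - treat as a soft failure.
        except Exception:
            continue  # provider down or erroring; try the next one
    raise RuntimeError("All LLM providers failed")
```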

Vector Store

When you're hitting latency and throughput limits on the vector store, consider scalar quantization with a dedicated vector store like Qdrant or Weaviate.

Qdrant also has Binary Quantization, which lets you scale 30-40x with OpenAI embeddings.
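
A hedged sketch of turning binary quantization on when creating a Qdrant collection; the collection name and vector size (1536 matches OpenAI's ada-002) are illustrative:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    # Keep the compact binary index in RAM; original vectors stay on disk.
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),
    ),
)
```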

Finetuning

LLM: with finetuning, OpenAI's GPT-3.5 will often be as good as GPT-4.

It needs about 100 records, and you get the ~30% latency improvement (GPT-3.5 being faster than GPT-4) for free.

So quite often worth the effort!

This extends to OSS LLMs too. It can't hurt to "pretrain" finetune your Mistral or Zephyr-7B for $5.
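
For the OpenAI route, kicking off the finetune itself is short. A sketch assuming the openai-python v1 client and a train.jsonl of roughly 100 chat-format records:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=f.id, model="gpt-3.5-turbo")
print(job.id)  # poll this job until it finishes, then use the resulting model
```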

AI4Humans aka Software x LLMs

AI4Bharat, IIT Madras, July 2023

Namaste! 🙏 I'm Nirant and here's a brief of what we discussed in our session.

Why Should You Care?

I have a track record in the field of NLP and machine learning, including a paper at ACL 2020 on Hinglish, the first Hindi LM, and an NLP book with over 5,000 copies sold. I've contributed to IndicGLUE by AI4Bharat, built and deployed systems used by Nykaa, and consulted for healthcare enterprises and YC companies. I also run India's largest GenAI community, with regular meetups since February 2023.

Here's my GitHub.

AI4Humans: Retrieval Augmented Generation for India

We dived into two main areas:

  1. Retrieval Augmented Generation: Examples of RAG for India, engineering choices, open problems, and how to improve it
  2. LLM Functions: Exploring tool augmentation and "perfect" natural language parsing

Retrieval Augmented Generation (RAG)

RAG is a popular pattern in AI. It's used in various applications like FAQ on WhatsApp, customer support automation, and more. It's the backbone of services like Kissan.ai, farmer.chat and Bot9.ai.

However, there are several open problems in RAG, such as text splitting, improving ranking/selection of top K documents, and embedding selection.

Adding Details to RAG

We can improve RAG by integrating models like OpenAI's GPT-4 and Ada-002 embeddings. We can also enhance the system by adding a cross-encoder and two-pass search.

RAG Outline

Despite these improvements, challenges remain in areas like evaluation, monitoring, and handling latency/speed. For instance, we discussed how to evaluate answers automatically, monitor model degradation, and improve system latency.

Using LLM to Evaluate

An interesting application of LLMs is using them for system evaluation. For example, we can use an LLM to auto-generate a QA test set and auto-grade the results of the specified QA chain. Check out this auto-evaluator as an example.

Addressing Open Problems

We discussed the best ways to improve system speed, including paged attention, caching, and simply throwing more compute at it. We also touched on security concerns, such as the need for separation of data and the use of Role Based Access Control (RBAC).

LLM "Functions"

We explored how LLMs can be used for tool augmentation and converting language to programmatic objects or code. The Gorilla LLM family is a prime example of this, offloading tasks to more specialized, reliable models.

In the context of AgentAI, we discussed how it can help in converting text to programmatic objects, making it easier to handle complex tasks. You can check out the working code here.

Thank you for attending the session! Feel free to connect with me: Twitter, LinkedIn or learn more about me here.

References

Images in this blog are taken from the slides presented during the talk.