
Retrieval Augmented Generation Best Practices

Retrieval and Ranking Matter!

Chunking

  1. Including the section title in your chunks improves retrieval, and so do keywords from the documents (see the sketch after this list)
  2. Use token-efficient separators in your chunks, e.g. ### is a single token in GPT
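
A minimal sketch of the first point, assuming a simple in-memory document structure (the doc dict, field names, and chunk size are illustrative, not any specific library's API):

# Prepend the section title and document keywords to every chunk before embedding.
# `doc` is an illustrative structure: {"keywords": [...], "sections": [{"title": ..., "text": ...}]}
def build_chunks(doc, chunk_size=800):
    chunks = []
    for section in doc["sections"]:
        text = section["text"]
        for start in range(0, len(text), chunk_size):
            body = text[start:start + chunk_size]
            # The title and keywords give the retriever and the LLM extra context,
            # and ### doubles as a token-efficient separator.
            chunks.append(f"### {section['title']}\nKeywords: {', '.join(doc['keywords'])}\n{body}")
    return chunks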

Examples

  1. A few examples are better than no examples
  2. Examples at the start and end carry the most weight; the ones in the middle tend to get forgotten by the LLM

Rerankers

Latency permitting, use a reranker. Cohere, Sentence Transformers, and BGE have decent ones out of the box.
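
A minimal sketch with a Sentence Transformers cross-encoder; the model name and candidate chunks are illustrative, and Cohere's and BGE's rerankers are used the same way through their own clients:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model choice

query = "How do I reset my password?"
candidates = ["chunk one ...", "chunk two ...", "chunk three ..."]  # output of the retriever

# Score every (query, chunk) pair and re-order the chunks by relevance.
scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]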

Embedding

Use the right embedding for the right problem:

GTE and BGE are best for most support, sales, and FAQ kinds of applications.

OpenAI's embeddings are the easiest to use for code embedding.

The E5 family works well for languages beyond English and Chinese.

If you can, finetune the embedding model on your domain. It takes about 20 minutes on a modern laptop or a Colab notebook and improves recall by up to 30-50%.
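
A minimal finetuning sketch with Sentence Transformers and in-batch negatives; the base model and the (query, passage) pairs are placeholders for your own domain data:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # placeholder base model
pairs = [
    ("how do I cancel my order", "To cancel an order, open Orders and ..."),
    ("refund policy", "Refunds are processed within 5-7 business days ..."),
]  # replace with (query, relevant passage) pairs from your domain

train_examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=50)
model.save("bge-base-finetuned")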

Evaluation

Evaluation Driven Development makes your entire "dev" iteration much faster.

Think of these evals as the equivalent of running the code to see if it works.

I strongly recommend using Ragas for this. They have LangChain and LlamaIndex integrations too, which are great for real-world scenarios.
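
A minimal Ragas sketch, assuming a handful of traced question/answer/context records; the column names and metric set vary a little between Ragas versions, so treat this as a shape rather than a recipe:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

rows = {
    "question": ["How do I reset my password?"],
    "answer": ["Click 'Forgot password' on the login page."],
    "contexts": [["To reset your password, use the 'Forgot password' link ..."]],
    "ground_truth": ["Use the 'Forgot password' link on the login page."],
}

result = evaluate(Dataset.from_dict(rows),
                  metrics=[faithfulness, answer_relevancy, context_precision])
print(result)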

Scaling

LLM Reliability

Have a failover LLM for when your primary LLM is down, slow or just not working well. Can you switch to a different LLM in 1 minute or less automatically?
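
A minimal sketch of what "automatic" can mean here; primary and fallback are hypothetical callables wrapping whichever providers you use:

def complete_with_failover(prompt, primary, fallback, timeout_s=10):
    # Try the primary LLM; if it is down, slow, or erroring, switch without a human in the loop.
    try:
        return primary(prompt, timeout=timeout_s)
    except Exception:
        return fallback(prompt, timeout=timeout_s)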

Vector Store

When you're hitting latency and throughput limits on the vector store, consider using scalar quantization with a dedicated vector store like Qdrant or Weaviate.

Qdrant also has Binary Quantization, which lets you scale 30-40x with OpenAI embeddings.
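
A rough sketch of creating a Qdrant collection with binary quantization via qdrant-client; the URL, collection name, and vector size are illustrative, and scalar quantization is configured the same way with models.ScalarQuantization:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # illustrative deployment
client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),
    ),
)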

Finetuning

LLM: OpenAI GPT-3.5 will often be as good as GPT-4 with finetuning.

It needs about 100 records, and you get roughly 30% latency improvement for free.

So quite often worth the effort!
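
A rough sketch of what those ~100 records look like for GPT-3.5 finetuning (chat-format JSONL); the contents are illustrative, and uploading the file and launching the job is done through the OpenAI finetuning API:

import json

records = [
    {"messages": [
        {"role": "system", "content": "You are a support agent for Acme."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Click 'Forgot password' on the login page."},
    ]},
    # ... roughly 100 such examples drawn from your best GPT-4 or human-written answers
]

with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")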

This extends to OSS LLMs. It can't hurt to "pretrain" or finetune your Mistral or Zephyr-7B for $5.

pgvector vs Qdrant: Results from the 1M OpenAI Benchmark

You may have considered using PostgreSQL's pgvector extension for vector similarity search. There are good reasons why this option is strictly inferior to dedicated vector search engines, such as Qdrant.

We ran both benchmarks using ann-benchmarks, a tool solely dedicated to benchmarking vector data processing. The difference in performance is quite staggering.

Query Speed

Final results show that pgvector lags behind Qdrant by a factor of 15 when it comes to throughput.

That is a fifteenfold gap in throughput. However, we shouldn't only consider speed as the main metric when evaluating a database. In terms of accuracy, pgvector also delivers far fewer relevant results than Qdrant.

Workload

Interestingly, these disparities start to surface with as few as 100,000 chunked documents.

As an ardent supporter of PostgreSQL, I find it disheartening that pgvector doesn't just start at under half of Qdrant's QPS at 100,000 vectors; it plunges precipitously beyond that.

Correctness

One might try to rationalize this by assuming that Postgres is slower but more accurate. The data reveals that pgvector is not just slower, it is also ~18% less accurate!

We measure this using the same methodology as the ann-benchmarks codebase: k-NN bruteforce as ground truth.

Latency

Here, Qdrant holds its own. The worst p95 latency for Qdrant is 2.85s, a stark contrast to pgvector, whose best p95 latency is a full 4.02s. Even more astonishing, pgvector's worst p95 latency skyrockets to an unbelievable 45.46s.

Benchmark Specs

The machine we used to run the benchmark: t3.2xlarge, 8 vCPU, 32GB RAM

For the data enthusiasts among us, this Google Sheet details all the numbers for a more in-depth analysis.

Configuration

We use the default configuration for Qdrant and much better parameters for pgvector:

Qdrant(quantization=False, m=16, ef_construct=128, grpc=True, hnsw_ef=None, rescore=True)
PGVector(lists=200, probes=2)

The pgvector-recommended configuration, which would likely perform even worse:

PGVector(lists=1000, probes=1)

There is much more to be tested. We will continue to explore the configuration space for both platforms and update this.

Conversations with the Community

Paul Copplestone (CEO, Supabase) has also shared his thoughts on the matter:

Yup:

  1. Wait 6 months, a lot of development is happening on pgvector
  2. Use hybrid search
  3. Use filters on other indexed columns
  4. Use partitions

And as always, take benchmarks with a grain of salt, they are never as clear-cut as they seem. We’ll publish benchmarks soon too using the latest version of pgvector

Adding my notes here:

pgvector falls back to a full scan when there are filters or hybrid search. This is very slow with 1536-dimensional embeddings: it's O(n), where n is the number of vectors matching the filter.

When there are no filters, pgvector uses IVF. With 1536-dimensional embeddings this is slower than HNSW, and it's also less accurate than Qdrant's HNSW.

Aside: Feel free to check out my Twitter Intro to IVFPQ.

@jobergum, creator of Vespa.ai (a vector search engine), also shared his thoughts:

pgvector is an extension which by default will just search the closest cluster to the query vector, which for most high-dimensional embedding models will return just 2-3 out of 10 real neighbors.

This is a very important point. pgvector is not a vector search engine. It's a vector extension for PostgreSQL, and that involves some tradeoffs which are sometimes not obvious.

There is a US$2000 bounty for anyone who can raise a PR to make the pgvector extension use HNSW instead of IVF.

Acknowledgements

The engineering and dataset were both done by Kumar Shivendu. Most of my contribution was in the form of spotting the bottlenecks, feedback and sponsorship.

These surprising revelations are courtesy of Erik Bernhardsson's ann-benchmarks code.

Breaking into NLP

The bulk of this is borrowed from notes that my teammate and friend on Verloop.io's NLP/ML team made of our conversations. I've taken the liberty of removing our internal slang and some boring stuff.

I want to build a community around me on NLP. How can I get discovered by others?

Broadly speaking, the aim in forming connections can be split into long term and short term. A short-term aim would be one where you receive something immediate out of the connections, or out of a particular connection itself. This could be a collaboration, correspondence, a recommendation or advice, or anything else.

A more long-term, strategic aim would be a well-defined long-term goal that requires multiple steps to achieve. A strategic aim could involve multiple tactical steps. This is also what we like to call friendship in some polite-speak areas of the world.

I have no immediate goals or projects, just need some basic ideas on how to be a part of the ML community.

Find people's interests and do something for them. Many people simply ask questions on Twitter, or you can infer what they are interested in by looking at their LinkedIn, their work, and their personal blogs.

What would be a good starting point for this?

A very easy thing to start with is a literature review, especially for new topics being researched by influential people in the field. A good literature review shows your interest and willingness to help, and opens the door to communication.

A good place to find which topics are missing a decent literature review: go through the NLP subreddit r/LanguageTechnology, or the subreddits for Deep Learning, Machine Learning, and so on.

Or go through Twitter and help people out there. Answer their questions with depth. Do not rush to be the first, but the best. When it comes to technology, almost all platforms behave a bit like StackOverflow: the right answer might not get accepted, but it'll get noticed. By the way, a lot of the Huggingface contributors happen to be active on both GitHub and Twitter. Hanging around on their Slack can't hurt either.

But here's the important thing: try to stick to one medium, the place where you are most at home and which gels with your personality. This could even be YouTube if you're an English-fluent, attractive-looking person.

The other reason you need to stick to one medium is that your audience will spend most of their time on one or two social media channels.

If they see that your content is not yet popular on the other channel, they will do the cross-posting for you. For instance, we've both seen Twitter content within ML, such as the Gary Marcus debate and attack on Yann LeCun, spill over onto Reddit. And of course, people are still posting tweets on TikTok!

Word of mouth will be your biggest friend.

Find problems that many people face. Usually, a simple problem faced by many is a great problem statement. The Python requests library comes to mind as an excellent example of such a challenge. The work by Gensim on shallow vectorization methods like word2vec and GloVe was similar in vein for quite a long time. Of course, the rise of Deep Learning and better tooling has made their work less important, but they stuck in my mind, didn't they?

Why is that a great problem statement?

It's maximising the area under the curve. Solve a trivial problem faced by many or a huge problem faced by some. It has the same impact.

What's something that has worked for you in finding interesting problems?

Find intersections with domains that have little to do with each other. For us, there are domains that have little to do with tech/code and can see great benefits from our involvement.

Marketing yourself has nothing to do with marketing but everything to do with the problems you solve and the solutions you come up with. Make sure the solution is accessible to the wider audience. It should not be that only a certain section of the population can use it. If you plan to market yourself, spend 95% of the time on a quality problem and a quality solution and 5% of time talking about it. This is usually enough if the first 95% is done well.

What medium to talk about these in?

The usual ones are blog posts, social media posts, and so on. But there is an open secret within the community: writing papers is probably the best way to talk about the stuff you've done.

Why so?

Papers have a halo effect: they improve your reputation and make it sticky. People might forget a blog post quickly, but you can get recognition and perks for around two years after writing a paper. There are other secondary gains too. Once you write a paper, you start reading papers differently; you develop a better intuition for reading between the lines to understand the author's intent and point of view. Another obvious benefit is that you get better at writing papers, and your thought process will start coming across much more clearly.