machine-learning

June 30, 2023
in RAG, production, machine-learning
4 min read

pgvector vs Qdrant: Results from the 1M OpenAI Benchmark

You may have considered using PostgreSQL's pgvector extension for vector similarity search. There are good reasons why this option is strictly inferior to dedicated vector search engines, such as Qdrant.

We benchmarked both using ann-benchmarks, a benchmarking tool dedicated solely to processing vector data. The difference in performance is quite staggering.

Query Speed

Final results show that pgvector lags behind Qdrant by a factor of 15 when it comes to throughput.

Put differently, pgvector delivers roughly 1/15, or about 7%, of Qdrant's throughput. However, speed shouldn't be the only metric we consider when evaluating a database. In terms of accuracy, pgvector also returns far fewer relevant results than Qdrant.

Workload

Interestingly, these disparities start to surface with as few as 100,000 chunked documents.

As an ardent supporter of PostgreSQL, I find it disheartening that pgvector doesn't just start at under half of Qdrant's QPS at 100,000 vectors; it plunges precipitously beyond that.

Correctness

One might try to rationalize this by assuming that Postgres is slower but more accurate. The data shows that pgvector is not just slower, but also ~18% less accurate!

We measure this using the same methodology as the ann-benchmarks codebase: k-NN bruteforce as ground truth.
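To make the accuracy measurement concrete, here is a minimal sketch of recall against a brute-force ground truth. It is not the ann-benchmarks code itself: the function names, the random toy data and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(vectors, query, k):
    """Exact k nearest neighbours: score every vector, keep the k closest ids."""
    order = sorted(range(len(vectors)), key=lambda i: euclidean(vectors[i], query))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbours that the approximate result contains."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical toy data standing in for the real embeddings.
random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(200)]
query = [random.random() for _ in range(8)]

exact = brute_force_knn(vectors, query, k=10)

# Stand-in for an engine's answer: 8 true neighbours plus 2 wrong ids.
wrong = [i for i in range(len(vectors)) if i not in exact][:2]
approx = exact[:8] + wrong

print(recall_at_k(approx, exact))  # 0.8
```

An engine that returned every true neighbour would score 1.0 here, so "less accurate" in this post means a lower fraction of the brute-force neighbours recovered.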

Latency

Here, Qdrant holds its own. The worst p95 latency for Qdrant is 2.85s, a stark contrast to pgvector, whose best p95 latency is a full 4.02s. Even more astonishing, pgvector's worst p95 latency skyrockets to an unbelievable 45.46s.
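For readers less familiar with tail latencies, the sketch below shows one common way to compute a p95, the nearest-rank method. The sample latencies are made up, and the benchmark harness may well use a different percentile definition (for example, one that interpolates between samples).

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical per-query latencies in milliseconds: 100, 200, ..., 2000.
latencies_ms = list(range(100, 2100, 100))

print(percentile_nearest_rank(latencies_ms, 50))  # 1000
print(percentile_nearest_rank(latencies_ms, 95))  # 1900
```

The p95 is driven by the slowest few queries, which is why it can sit far above the median.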

Benchmark Specs

The machine we used to run the benchmark: t3.2xlarge, 8 vCPU, 32GB RAM

For data enthusiasts among us, this Google Sheet details all the numbers for a more in-depth analysis: Google Sheet

Configuration

We use the default configuration for Qdrant, and parameters for pgvector that should perform better than its recommended ones:

Qdrant(quantization=False, m=16, ef_construct=128, grpc=True, hnsw_ef=None, rescore=True)

PGVector(lists=200, probes=2)

The pgvector recommendation, which would possibly perform worse:

PGVector(lists=1000, probes=1)
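As a rough guide to what these two knobs are, the sketch below renders a (lists, probes) pair as pgvector SQL. This is an assumption about how the labels above map onto pgvector, not the benchmark's own setup code. The helper name, the table name `items`, the column name `embedding` and the cosine operator class are all hypothetical.

```python
def ivfflat_statements(table, column, lists, probes):
    """Return the SQL that would apply one (lists, probes) pair.

    `lists` is fixed when the IVFFlat index is built; `probes` is a
    per-session setting read at query time.
    """
    return [
        f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
        f"WITH (lists = {lists});",
        f"SET ivfflat.probes = {probes};",
    ]


# Hypothetical table and column names, for illustration only.
for statement in ivfflat_statements("items", "embedding", lists=200, probes=2):
    print(statement)
```

Because `lists` is baked into the index while `probes` is set per session, changing `probes` needs no rebuild, whereas changing `lists` means recreating the index.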

There is much more to be tested. We will continue to explore the configuration space for both platforms and update this.

Conversations with the Community

Paul Copplestone (CEO, Supabase) has also shared his thoughts on the matter:

Yup: 1. Wait 6 months, a lot of development is happening on pgvector 2. Use hybrid search 3. Use filters on other indexed columns 4. Use partitions

And as always, take benchmarks with a grain of salt, they are never as clear-cut as they seem. We’ll publish benchmarks soon too using the latest version of pgvector

Adding my notes here:

pgvector uses a full scan when there are filters or hybrid search. This is a very slow algorithm when using 1536-dimensional embeddings. It's O(n), where n is the number of vectors matching the filter.

When there are no filters, pgvector uses IVF. This is a slower algorithm when using 1536-dimensional embeddings, and it’s less accurate than Qdrant's HNSW.

Aside: Feel free to check out my Twitter Intro to IVFPQ.

@jobergum, creator of Vespa.ai (a vector search engine) also shared his thoughts:

pgvector is an extension which default will just search the closest cluster to the query vector which for most high dimensional embedding models will return just 2-3 out of 10 real neighbors.

This is a very important point. pgvector is not a vector search engine. It's a vector extension for PostgreSQL, and that involves some tradeoffs which are sometimes not obvious.
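The "closest cluster" point can be seen in a toy IVF index. The sketch below is not pgvector's implementation: it picks random vectors as centroids instead of running k-means, uses Euclidean distance, and works on small random data. All names and sizes are assumptions chosen for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, lists, rng):
    """Pick `lists` vectors as centroids (no k-means refinement, for brevity)
    and bucket every vector id under its nearest centroid."""
    centroids = rng.sample(vectors, lists)
    buckets = [[] for _ in range(lists)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(lists), key=lambda c: dist(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets


def ivf_search(vectors, centroids, buckets, query, k, probes):
    """Scan only the `probes` buckets whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [i for c in probe_order[:probes] for i in buckets[c]]
    return sorted(candidates, key=lambda i: dist(vectors[i], query))[:k]


def brute_force(vectors, query, k):
    """Exact search over every vector, used here as ground truth."""
    return sorted(range(len(vectors)), key=lambda i: dist(vectors[i], query))[:k]


rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(1000)]
query = [rng.gauss(0, 1) for _ in range(32)]

centroids, buckets = build_ivf(vectors, lists=20, rng=rng)
exact = brute_force(vectors, query, k=10)

for probes in (1, 5, 20):
    found = ivf_search(vectors, centroids, buckets, query, k=10, probes=probes)
    recall = len(set(found) & set(exact)) / len(exact)
    print(f"probes={probes:>2} recall@10={recall:.1f}")
```

Probing every list degenerates into a full scan and recovers all true neighbours. Probing fewer lists is faster, but misses any neighbour that landed in an unvisited bucket. That is the speed versus accuracy tradeoff behind the `probes` setting.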

There is a US$2000 bounty for anyone who can raise a PR to make the pgvector extension use HNSW instead of IVF.

Acknowledgements

The engineering and dataset were both done by Kumar Shivendu. Most of my contribution was in the form of spotting the bottlenecks, feedback and sponsorship.

These surprising revelations are courtesy of Erik Bernhardsson's ann-benchmarks code.

December 7, 2021
in careers, machine-learning
4 min read

Breaking into NLP

The bulk of this is borrowed from notes that my teammate and friend on Verloop.io's NLP/ML team made of our conversations. I've taken the liberty of removing our internal slang and some boring stuff.

I want to build a community around me on NLP. How can I get discovered by others?

Broadly speaking, the aim in forming connections can be split into long-term and short-term. A short-term aim is one where you can get something immediate out of your connections, or out of one particular connection. This could be a collaboration, correspondence, a recommendation or advice, or anything else.

A long-term, strategic aim is a well-defined goal that requires multiple tactical steps to achieve. This is also what we like to call friendship in some polite-speak areas of the world.

I have no immediate goals or projects, just need some basic ideas on how to be a part of the ML community.

Find out what people are interested in and do something for them. Many people simply ask questions on Twitter, or you can infer what they are interested in by looking at their LinkedIn profile, their work and their personal blogs.

What would be a good starting point for this?

A very easy thing to start with is a literature review, especially for new topics being researched by influential people in the field. A good literature review shows your interest and willingness to help, and opens the door to communication.

A good place to find which topics are missing a decent literature review: go through the NLP subreddit r/LanguageTechnology, or the subreddits for Deep Learning, Machine Learning and so on.

Or go through Twitter and help people out there. Answer their questions in depth. Do not rush to be the first; aim to be the best. When it comes to technology, almost all platforms behave a bit like StackOverflow: the right answer might not get accepted, but it'll get noticed. Btw, a lot of the Hugging Face contributors happen to be active on both GitHub and Twitter. Hanging around on their Slack can't hurt either.

But one important thing: try to stick to one medium, the place where you are most at home and which gels with your personality. This could even be YouTube if you're an English-fluent, attractive looking person.

The other reason you need to stick to one medium is that your audience will spend most of their time on 1 or 2 social media channels.

If they see that your content is not that popular on the other channel, they will do the cross-posting for you. For instance, we've both seen Twitter content, even within ML, such as the Gary Marcus debate and the attack on Yann LeCun, spill over onto Reddit. And of course, people are still posting Tweets on TikTok!

Word of mouth will be your biggest friend.

Find problems that many people face. Usually a simple problem faced by many is a great problem statement. The Python requests library comes to my mind as an excellent example of such a challenge. The work by gensim around shallow vectorization methods like word2vec and GloVe was also in a similar vein for quite a long time. Of course, the rise of Deep Learning and better tooling makes their work less important, but they stuck in my mind, didn't they?

Why is that a great problem statement?

It's maximising the area under the curve. Solve a trivial problem faced by many or a huge problem faced by some. It has the same impact.

What's something that has worked for you in finding interesting problems?

Find intersections with domains that have little to do with each other. For us, there are domains that have little to do with tech/code and can see great benefits from our involvement.

Marketing yourself has nothing to do with marketing but everything to do with the problems you solve and the solutions you come up with. Make sure the solution is accessible to the wider audience. It should not be that only a certain section of the population can use it. If you plan to market yourself, spend 95% of the time on a quality problem and a quality solution and 5% of time talking about it. This is usually enough if the first 95% is done well.

What medium to talk about these in?

The usual options are blog posts, social media posts and so on. But there is an open secret within the community: writing papers is probably the best way to talk about stuff you’ve done.

Why so?

Papers have a halo effect: a paper improves your reputation and makes it sticky. People might forget a blog post quickly, but you can get recognition/perks for around 2 years or so after writing a paper. There are other secondary gains from doing this too. Once you write a paper, you start reading papers differently. You have a better intuition for reading between the lines to understand the author’s intent/pov. Another obvious benefit is that you get better at writing papers. Your thought process will start coming across much more clearly.

August 2, 2021
in machine-learning, tech
8 min read

Data Science Org Design for Startups

While there is plenty of good advice on making ML work and making a career as a Data Scientist, I think very little discussion happens on the organization design for Data Science itself.

This blog will hopefully help folks not just build their team, but also understand the ecosystem from which they are hiring.

Organization Design is determined by these 3 broad categories:

Software Engineer vs Research: To what extent is the Machine Learning team responsible for building or integrating with software? How important are Software Engineering skills on the team?

Data Ownership: How much control does the Machine Learning team have over data collection, warehousing, labeling, and pipelining?

Model Ownership: Is the Machine Learning team responsible for deploying models into production? Who maintains the deployed models?

--- Josh Tobin at Full Stack Deep Learning

It's harder for ML/DS teams to succeed than your typical product and platform engineering functions in startups. This is because:

Good folks are hard to retain as their skills are highly transferable across multiple roles (i.e., possibly high team attrition)

Management is unclear on what "success" looks like for ML as a function

These were the two key pitfalls which I wanted to solve for when designing the ML team and how it sits in the larger org. The most well known ways in which companies organise Machine Learning Teams are these:

1. Research & Development Labs

Of these, an R&D Lab is ideal for most well capitalized businesses because it enables them to attract talent, which can in turn work on long term business/tech challenges.

Uber AI Labs, Google AI Research, DeepMind, and FAIR are some examples off the top of my head. Microsoft Research, with contributions across almost all of Computer Science, should probably be the gold standard for this. I have personally spent some time working in such an org design¹

The limitation with this org design is that R&D teams don't own the pipeline to their work, i.e. inputs (data) and outputs (model performance in production). To clarify, R&D teams do own their data inputs in some cases, which makes the process more end to end, but often the deployment in production is still not under them. This makes this org design all but useless for pre/near Product Market Fit startups.

2. Embedded within Business & Product Teams

By far the most popular org design in Indian startups is one where a Data Scientist is embedded into an engineering team along with an analyst, and that team is then assigned to a business or product team. Off the top of my head, this is how some of the best Data Science orgs, like Airbnb, Flipkart, and Facebook, organize their ML teams.

I strongly considered this org design, but ultimately opted against this because it would not play to my strengths at all.

I expected these challenges:

Hard to maintain uniform data and engineering standards across the org

In this org structure, the primary, and sometimes only stakeholder for each Data person is their Product manager. There is a lot of work which is repeated e.g. data pipelines, cleaning, pre-processing. The larger organisations enforce some degree of uniformity via their Data Platforms or equivalent. In the early stages, this effort is not worth the decoupling speed.

Management Complexity, in terms of the ever increasing breadth of problems across different features

In the embedded space, each team could itself be working on a wide variety of “small” ML problems e.g. demand forecasting, text embedding, and sentiment analysis could be all worked on by a single Data Scientist.

Since the Product Manager doesn’t have the technical skill to evaluate whether the solution approach was apt or not, it falls on the Data Science Manager to have a lot of breadth and assist several IC Data Scientists across multiple problems at the same time.

3. Data Science Consultant

Small businesses which themselves have a services arm or revenue love this.

Business or product teams bring specific problems to a data science lead, who then scopes out a plan, defines success criteria and hands it off to a Data Scientist or Machine Learning Engineer within the team.

There are so many understated but commonly known limitations of this:

Less Engaged Team: Since the problem solving and implementation are separated - the engineer feels less creative and empowered to make changes and is less invested in getting the small details right. There is no single owner of the data or models, and thereby no single person responsible for the technical outcomes of the project.
Communication overhead in terms of energy and time both, which happens 2x: first, when the consultant understands the problem from the person on the team and second when the consultant transfers/shares the proposed solution. This is not just slow, it’s error-prone and expensive.

This makes completing feedback loops even harder, since no one person has all the necessary context which can be carried forward to the next project. This was dropped fairly early as a candidate for that specific reason.

4. [Near Future] Productized Data Science Team

When I studied more modern teams, especially in B2B SaaS or eCommerce from outside, I felt they made a small but important change in this model: Instead of a matrix where the Data Scientist was ultimately responsible to their own product/pod and nothing else, they had a central Data Science function to which all Data Scientists reported.

Some teams created new designations, such as "Chief Data Scientist" or "VP, Machine Learning", to reflect this increased autonomy and status within the org. Notice that this is quite similar to how some Design teams are organized.

While I had not worked under this org design, I had interned at a small company (10-50 employees), so I could appreciate the limitations of this org design when people described them to me.

The most common warning was about the amount of context which any Data Science lead/manager has to keep once the company goes beyond a certain project count. I expect that the Verloop.io ML Team will evolve into this over the next 12-24 months. I'm estimating this on the basis of the problem complexity and the headcount needed for both the engineering and data science teams. If we can have ICs reporting to both the Product Manager and a Data Science org, the added management complexity would be worth it for the faster shipping speed via shared tooling and context.

5. [Verloop Today] Full Stack Data Science

This is the org design at Verloop.io ML today. The defining characteristic of this org design is that every ML person does things end to end - there is no division of labour.

There is a brilliant explanation on this from StitchFix: https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/

The goal of data science is not to execute. Rather, the goal is to learn and develop new business capabilities. … There are no blueprints; these are new capabilities with inherent uncertainty. … All the elements you’ll need must be learned through experimentation, trial and error, and iteration. – Eric Colson

As Eric calls out, Data Science, unlike, say, Platform Engineering functions, is not a pure "execution" function. It's an iteration and discovery function of the organisation. In business terms, you might call this Market Research, but where technology is applied to develop new capabilities.

This full cycle development seems to be endorsed by Netflix Tech officially and Data Science folks at Lazada as well.

Case Study: ML Org at Verloop.io

I hope the above gives you a sense of common data science team organisations. If you’ve seen them in the past, now we both have a shared vocabulary to talk about it.

As a case study, let me share some of the operational things we had at Verloop.io. This was mostly as part of our 0 to 1 journey as a B2B SaaS product.

These are not recommendations, but just how things shaped up in the early days. I hope this gives you a case study to concretely think about what we just discussed.

The ML function reported directly to the CEO for the longest time. The CEO directly brings the business context and drives quick wins. The ML Lead needs to negotiate continuously on organisational goals, constantly query for added context, and make long term bets.

Part of the ML Product Manager role also got absorbed into what I'd been doing as a Machine Learning Lead/Manager because we did not have a full time Product Manager in the company for more than 6 months.

Attracting young talent was easier by giving them quite high autonomy. The team also owns model performance and deployment.

The deployment ownership is made possible by an ML System Design decision as well: the strong adherence to multi-tenant models instead of client-specific models.

Talent pool is smaller + retention is hard

We had a smaller talent pool for at least 2 reasons: A few data science candidates refused to join the team because they were not interested in engineering, and wanted to focus on modeling tasks exclusively.

In some other cases, the conversation broke down because we couldn’t match their pay expectations.

We managed to make our retention hard because of good intentions, but with bad outcomes:

The engineering org in our early stages did a lot of ad hoc development in smaller, demo-driven sprint cycles. We assumed that separating ML from the rest of the engineering org would allow us to ship faster. It would also allow us to focus longer on one project, without being distracted by ad hoc tasks. This did work to a certain extent.

In hindsight, this was a mistake. It definitely empowered us to ship faster, but teammates felt isolated, and it was hard to complete the feedback loop with our end users via the Product Manager alone. Additionally, if we needed engineering’s help to ship something, they’d pick “their” work over integrating our shipped work. This slowed down our shipping pace itself over a longer duration. This in turn, hurt the morale of the team, and made retention much harder.

I’d do this differently the next time around.

There are 3 things I’d do differently:

Remove the middleman (i.e. me): The PM and the Data Scientist should work directly with each other, instead of the information flowing through, or being gathered by, me as the nodal person.
Better Retrospectives: We did a few reviews, i.e. what went well or wrong, but not enough of “How does this inform our future?”
Add Front End, DevOps Skills: A lot of our releases would not reach the end user because the interface was designed, but not implemented. Engineering teams would quite obviously pick their own OKRs above ours. The short term fix is to add Front End and DevOps skills.

Even something as simple as being able to build+deploy Gradio or Streamlit demos would go a long way in convincing the org to prioritise the shipped work.

Ending Note

The terms are borrowed from the amazing blog by Pardis Noorzad on Medium: Models for integrating data science teams within companies

Thanks to Eugene Yan and Maneesh Mishra for taking the time to review this piece. A lot of the improvements are thanks to their comments.

Photo by Rahul Chakraborty on Unsplash

Notes

¹ Advanced Technologies Lab at Samsung Research Institute, Bengaluru