Skip to content

machine-learning

Deepseek R1 Ideas for GPU Poor and Middle Class

The internet is abuzz right now with DeepSeek. I want to here suggest some ideas and opportunities which engineers have - most of them are exciting to do for engineering curiosity.

Agents with Better Planning

R1 is exceptional at planning and significantly cheaper than O1/O3, and we expect the prices for reasoning models to go continuously cheaper. So, with that in mind, I want to suggest some ideas which benefit from better planning agents.

Multi-file Code Editing

Early evidence: R1 Sonnet set SOTA on Aider's Polyglot Benchmark

R1 as architect with Sonnet as editor has set a new SOTA of 64.0% on the aider polyglot benchmark. They achieve this at 14X less cost compared to the previous o1 SOTA result. o1 paired with Sonnet didn’t produce better results than just using o1 alone.

This is a very good result and as part of the broader trend, there's enough evidence to pursue this direction.

There is an opportunity to improve upon Cursor's interfaces for multi-file cold code editing as it stands today in Q1 2025. For instance, what would it look like if you could directly interact between a user feature request in their own words with the code base to generate a pull request and run an A/B test over for that specific feature?

Browser Agents

Early evidence: John Rush benchmarked multiple LLMs for browser usage.

tl;dr: One could build a better benchmark for evaluating your LLM's ability to do reasoning with function calling in addition to WebVoyager. This is a great start for anyone wanting to build a LLM for a specialized use case (not workflow or task oriented) at a lower cost.

Open Source Document Inlining

The core idea here is that you can convert any text based LLM into a Vision LLM. Most folks like try to do this right now by doing some sort of thin OCR. But I think a more promising approach would be to lightly fine tune and update the weights to work with the Vision component. And I guess that is what Fireworks has been doing. And this is very promising when combined with Reasoning LLMs.

Early evidence: Fireworks Document Inlining

Document Inlining

Today, we are excited to launch a public preview of our first use case, Document Inlining, a compound system that automatically turns any LLM into a vision model to ingest images or PDFs for document-based vision tasks. Document Inlining parses images and pipes them directly into an LLM of your choice to deliver:

Higher quality - Achieve better reasoning and generation capabilities by utilizing any LLM of choice or specialized/fine-tuned models Input flexibility - Automatically transform multiple file types like PDFs and screenshots. We can also handle rich document structures like tables/charts Ultra-simple usage - Our API is OpenAI compatible. Enable this capability by editing 1-line to specify “#transform=inline” alongside your file

Reasoning Distilation

Early evidence: Bespoke Labs and Sky T1

tl;dr: This is part of a broader evidence that distillation of all kinds really works!

How? They took the reasoning traces from R1, collected over a large set of math and code problems and distilled it into Qwen. They use it to beat the previous SOTA on math and code problems.

We already knew this for deep learning based models, but given the higher utility and cost savings — it's great to know that it holds for LLMs too!

Sky T1 was fine-tuned on a $450 run, so it's quite cheap.

Reasoning models with Structured Outputs (JSON/XML)

Opportunity: Open source reasoning models currently don't prioritize function calling and structured outputs. Even less so when used with images, scans and pdf-images.

We have abundant training data available: - GorillaBench datasets - ComplexFuncBench is another resource for evaluating your LLM's ability to do reasoning with function calling.

While closed source models have so far excelled at function calling, this advantage may be diminishing: - We now have access to direct traces from R1 - These traces can be used for training

How to do this?

Approach 1: You need the traces:

  1. Prompt R1 to generate XML traces of outputs
  2. Verify these traces against existing datasets
  3. Fine-tune reasoning models to produce structured JSON/XML

Approach 2: You don't need the traces:

  1. Use reasoning models to generate structured JSON/XML
  2. Fine-tune reasoning models to produce structured JSON/XML

Retrieval Augmented Generation Best Practices

Retrieval and Ranking Matter!

Chunking

  1. Including section title in your chunks improves that, so does keywords from the documents
  2. Different token-efficient separators in your chunks e.g. ### is a single token in GPT

Examples

  1. Few examples are better than no examples
  2. Examples at the start and end have the highest weight, the middle ones are kinda forgotten by the LLM

Re Rankers

Latency permitting — use a ReRanker — Cohere, Sentence Transformers and BGE have decent ones out of the box

Embedding

Use the right embedding for the right problem:

GTE, BGE are best for most support, sales, and FAQ kind of applications.

OpenAI is the easiest for Code Embedding to use.

e5 family does outside English and Chinese

If you can, finetune the embedding to your domain — takes about 20 minutes on a modern laptop or Colab notebook, improves recall by upto 30-50%

Evaluation

Evaluation Driven Development makes your entire "dev" iteration much faster.

Think of these as the "running the code to see if it works"

Strongly recommend using Ragas for something like this. They've Langchain and Llama Index integrations too which are great for real world scenarios.

Scaling

LLM Reliability

Have a failover LLM for when your primary LLM is down, slow or just not working well. Can you switch to a different LLM in 1 minute or less automatically?

Vector Store

When you're hitting latency and throughput limits on the Vector Store, consider using scalar quantization with a dedicated vector store like Qdrant or Weaviate

Qdrant also has Binary Quantization which allows you to scale 30-40x with OpenAI Embeddings.

Finetuning

LLM: OpenAI GPT3.5 will often be as good as GPT4 with finetuning.

Needs about 100 records and you get the 30% latency improvements for free

So quite often worth the effort!

This extends to OSS LLM models. Can't hurt to "pretrain" finetune your Mistral or Zephyr7B for $5

pgvector vs Qdrant- Results from the 1M OpenAI Benchmark

You may have considered using PostgreSQL's pgvector extension for vector similarity search. There are good reasons why this option is strictly inferior to dedicated vector search engines, such as Qdrant.

We ran both benchmarks using the ann-benchmarks solely dedicated to processing vector data. The difference in performance is quite staggering.

Query Speed

Final results show that pgvector lags behind Qdrant by a factor of 15 when it comes to throughput.

That is a 1500% deficit in speed. However, we shouldn't only consider speed as the main metric when evaluating a database. In terms of accuracy, pgvector delivers way fewer relevant results than Qdrant.

Workload

Interestingly, these disparities start to surface with as few as 100,000 chunked documents.

As an ardent supporter of PostgreSQL, it is disheartening to witness that pgvector doesn't just commence at under half the QPS at 100,000 vectors, when compared to Qdrant - it plunges precipitously beyond that.

Correctness

One might try to rationalize this by assuming that Postgres is slower, but more accurate? Data reveals that pgvector is not just slower, but also ~18% less accurate!

We measure this using the same methodology as the ann-benchmarks codebase: k-NN bruteforce as ground truth.

Latency

Here, Qdrant holds its own. The worst p95 latency for Qdrant is 2.85s, a stark contrast to pgvector, whose best p95 latency is a full 4.02s. Even more astonishing, pgvector's worst p95 latency skyrockets to an unbelievable 45.46s.

Benchmark Specs

The machine we used to run the benchmark: t3.2xlarge, 8 vCPU, 32GB RAM

For data enthusiasts among us, this Google Sheet details all the numbers for a more in-depth analysis: Google Sheet

Configuration

We use the default configuration for Qdrant and much better parameters for pgvector:

Qdrant(quantization=False, m=16, ef_construct=128, grpc=True, hnsw_ef=None, rescore=True)
PGVector(lists=200, probes=2)

The pgvector recommendation which'd be possibly worse performance-wise:

PGVector(lists=1000, probes=1)

There is much more to be tested. We will continue to explore the configuration space for both platforms and update this.

Conversations with the Community

Paul Copplestone (CEO, Supabase) has also shared his thoughts on the matter:

Yup: 1. Wait 6 months, a lot of development is happening on pgvector 2. Use hybrid search 3. Use filters on other indexed columns 4. Use partitions

And as always, take benchmarks with a grain of salt, they are never as clear-cut as they seem. We’ll publish benchmarks soon too using the latest version of pgvector

Adding my notes here:

pgvector uses full-scan when there are filters or hybrid search. This is a very slow algorithm when using 1536 embeddings. It's O(n) where n -> number of vectors matching the filter.

When there are no filters, pgvector uses IVF. This is a slower algorithm when using 1536 embeddings, and it’s less accurate than Qdrant's HNSW.

Aside: Feel free to check out my Twitter Intro to IVFPQ.

@jobergum, creator of Vespa.ai (a vector search engine) also shared his thoughts:

pgvector is an extension which default will just search the closest cluster to the query vector which for most high dimensional embedding models will return just 2-3 out of 10 real neighbors.

This is a very important point. pgvector is not a vector search engine. It's a vector extension for PostgreSQL, and that involves some tradeoffs which are sometimes not obvious.

There is a US$2000 bounty for anyone who can raise a PR to make the pgvector extension use HNSW instead of IVF.

Acknowledgements

The engineering and dataset were both done by Kumar Shivendu. Most of my contribution was in the form of spotting the bottlenecks, feedback and sponsorship.

These surprising revelations are courtesy of Erik Bernhardsson's ann-benchmarks code.