
Airbnb's Metric Store: Minerva

Data lineage is a problem because at most companies, data flows through several tables and queries before humans consume it!

This has well-known challenges: changes do not propagate downstream from the source, and reliable (fresh, up-to-date, or complete) data is not always available.

What does Minerva do?

I was expecting Minerva to be a database (a collection of tables), but it turns out that Minerva is what I'll call a Data Transformation Manager.

It overlaps quite a bit with dbt, but it's not a pure execution layer: it also stores metadata, orchestrates the DAG itself, and provides a way to query the data (Hive/Druid here).

Minerva solves one major problem in the analytics space: time to insight, as [[Julie Zhuo]] has mentioned several times at Sundial.

Minerva 1.0

This is a brief preview of the past: what problems they solved first, what was left undone, and some tooling/technology choices.

Pre-computation engine

Quite similar to how we were building Sundial till very recently.

  1. De-normed measures and dimensions
  2. Segments and behaviours were both separated out

De-norming measures can be quite expensive but useful. We converged on this design across multiple clients while working with event-level data. We also see some of our clients maintaining a similar table: a "User Segments Table".
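To make the pattern concrete, here's a minimal sketch of such a denormalized table; the table and column names are hypothetical, not Airbnb's or any client's.

```python
# Hypothetical sketch: event-level measures denormalized with a
# "User Segments Table". All names are illustrative.
DENORM_SQL = """
CREATE TABLE metrics_denorm AS
SELECT
    e.ds,
    e.user_id,
    e.bookings,          -- measure
    e.searches,          -- measure
    s.country,           -- dimension from the user segments table
    s.signup_cohort      -- dimension
FROM fct_events AS e
LEFT JOIN dim_user_segments AS s USING (user_id)
"""
```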

Tradeoffs in the Precomputing Approach

  1. Cubing: Minerva already knows upfront what SQL query to run, across which segments and durations. This means it can leverage CUBE operations.

Some people believe the OLAP CUBE has fallen out of use, but that's clearly not true here. As companies get larger, "old" pressures on compute and storage reappear, and so do already-known solutions like cubing (see the sketch after this list).

  2. Fast query time: since the results are precomputed, queries are fast at serving time

  3. Exponential query cost: backfills are damn expensive and waste both time and money

Everything has to be asked for ahead of time, so you end up calculating too many things.
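To make the cubing idea concrete, here's a minimal sketch using DuckDB (not Airbnb's actual stack); the table and column names are made up.

```python
# Minimal cubing sketch with DuckDB; table and columns are hypothetical.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE bookings AS
    SELECT * FROM (VALUES
        ('2022-01-01', 'US', 'web',    3),
        ('2022-01-01', 'US', 'mobile', 5),
        ('2022-01-01', 'IN', 'web',    2)
    ) AS t(ds, country, platform, nights)
""")

# One pass computes the metric for every combination of the known
# dimensions (including subtotals and the grand total), which is what
# knowing the queries, segments and durations upfront lets you exploit.
rows = con.execute("""
    SELECT country, platform, SUM(nights) AS nights
    FROM bookings
    GROUP BY CUBE (country, platform)
""").fetchall()
print(rows)
```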

Minerva 2.0

This is what a truly "modern" data transformation manager should look like in my opinion.

Here are some of the design choices:

  1. On the fly joins
  2. On the fly aggregations
  3. Optional denorm and cubing
  4. Precompute that can be turned on when needed

The way I see it, this strikes a balance between flexibility (the ability to do on-the-fly joins and aggregations) and cost (the ability to precompute with denorm and cubing).
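As a sketch of what that balance could look like in a metric definition (field names are invented for illustration, not Minerva's actual config format):

```python
# Hypothetical metric definition; field names are made up and are not
# Minerva's actual config format.
metric = {
    "name": "nights_booked",
    "expression": "SUM(nights)",
    "source": "fct_bookings",
    "dimensions": ["market", "device_type"],  # joined/aggregated on the fly
    "precompute": {
        "enabled": True,     # opt in to denorm + cubing for hot metrics
        "cube": ["market"],  # cube only the high-traffic dimension
    },
}
```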

Engineering Choices

Moved from Druid to StarRocks

Why StarRocks?

Minerva is a SQL generation tool, not a Druid ingestion tool.

Minerva now has a SQL interface; earlier it was JSON-based.

SQLGlot, a Python SQL parser and transpiler: this is very similar to how dbt generates SQL using a parser and transpiler. SQLGlot is open source: https://github.com/tobymao/sqlglot
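Here's a minimal sketch of the parse/transpile workflow with SQLGlot. The Hive-to-StarRocks direction mirrors the migration above, assuming your sqlglot version ships both dialects.

```python
# Minimal SQLGlot sketch: transpile a Hive query to StarRocks SQL.
# Assumes a sqlglot version that includes the "starrocks" dialect.
import sqlglot
from sqlglot import exp

hive_sql = "SELECT ds, COUNT(DISTINCT user_id) AS dau FROM events GROUP BY ds"
print(sqlglot.transpile(hive_sql, read="hive", write="starrocks")[0])

# The parsed AST can be inspected or rewritten before rendering, which
# is what makes programmatic SQL generation tractable.
ast = sqlglot.parse_one(hive_sql, read="hive")
print([col.name for col in ast.find_all(exp.Column)])
```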

Near Real Time Metrics

Summary of changes made for the 2.0 release

The major change is that SQL is now a first-class citizen.

This is quite important. We should resist the temptation to invent a Python transformation layer. While some Python is inevitable for more interesting things like forecasting, using Python to calculate ratios is overkill. We should instead try pushing the limits of SQL for the same.

SQL is not only more widely spoken, it'd also be a lot more efficient and scalable. The downside? It's a less general-purpose language, and we'd have to write some tooling to make SQL work like Python.
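For instance, a ratio metric that might tempt you into Python stays perfectly manageable in SQL (hypothetical table and column names):

```python
# A ratio metric kept in SQL; table and column names are hypothetical.
CONVERSION_SQL = """
SELECT
    ds,
    SUM(bookings) * 1.0 / NULLIF(SUM(searches), 0) AS booking_conversion
FROM metrics_daily
GROUP BY ds
"""
# NULLIF handles the divide-by-zero case you'd otherwise special-case
# in Python.
```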


These are some notes and screen grabs from a talk I found on YouTube. Thanks to Kunal Kundu for finding the talk link which I'd lost!

Beyond First 90 Days

This one's gonna be brief and echoes 2 less obvious ideas I'd share with the younger me.

I am assuming that you already know the hygiene factors: make few promises, keep most of them, and exceed at least a few. Get into the top 5% at the skill of effort estimation, at the very least for your own work. And so on.

Contribute to Developer Ecosystem

Improving any part of the developer ecosystem is useful and visible at the same time. For instance, let's say you add tests for a code path on which 10 developers are working. You've made the lives of 10 developers easier. They'll remember this when you come to them for help.

For some projects/teams, even the build is slow and error-prone. Any improvement there also saves a lot of contributor and developer time.

As Joel Spolsky (the person behind Stack Overflow) wrote, there is more than one way to help:

  • Maintain an issue tracker
  • Write a decent functional specification

You get the gist. Get creative and figure out points of leverage: low effort, high return on your time.

Engineering Brand Efforts

You already know the 1-2 things your team is best at, e.g. speed, scale, cadence, or software quality.

Take those 2 topics and write down 5 reasons or points of evidence for why you think your team is best at them. For instance, if I were writing about "speed", one of my points would look like: we make 20 releases a week to almost 500K users. Or: we have fewer than 20 bugs per release, thanks to our amazing testing and QA friends.

Now expand these 5 points into a short, bullet-point essay like this one. Ask your manager and other senior engineers for advice. Say something like, "Hey, I wrote down what our team does best. Do you think I captured the essence and reasoning?"

Done?

Great, now go write this up as an internal and external blog post. Submit it to a technical conference that cares about the dimension on which your team is best. Bringing accolades to the team, with their blessing, has much higher returns than reading 10 Medium blogs.

Dos and Don'ts for ML Hiring

This is primarily for my future self. These are observations based on my own experience of 2 years at Verloop.io and helping a few companies hire for similar roles.

Do

  • Seniormost hire first: Start by hiring the most senior person you plan to hire, e.g. the ML Lead (assuming you already have a CTO)
  • Have a means to tell whether your investment in data science is working out
  • Closest to User first: Hire the person who will consume the data to build for the user first
  • Sourcing: Begin sourcing early and over-emphasise two channels: Referrals and Portfolios
    • Typically, in India, expect:
      • ~2 months to close a full time role at early career (0-3 years) and
      • ~3 months to close a mid career (3-7 years) and
      • 6+ months to close a senior hire
  • If a developer has open source contributions in the last 2-3 years, consider waiving the coding or algorithmic challenge to speed up the interview process
  • Pay above market cash salaries
    • In 12-18 months from now, when your ML Engineer has internalised all the requirements and company culture and built a bunch of important tooling, she will get offers at 2-3x today's salary. If you're already paying above market, a 20-30% jump is quite often enough to retain many folks
  • Have at least 3 versions of your shipping timeline
  • Do hire full-stack data science people/teams with T-shaped skills. If you're hiring the early members of your team, this is a practical necessity.

Don't

  • Don't rely on HR or your usual backend engineering hiring channels to work well for you, in general
  • Don't hire the person who builds the means to move data before hiring at least 1-2 stakeholders in ML (i.e. hire ML before data engineering)
    • Why? Because it's cheaper (and often faster) to change the ML modeling approach than to make changes in data engineering pipelines
  • Don't start by hiring an intern to implement papers or take things to production before you've done them
  • Don't expect data science to deliver or ship at the same "user value" pace as Product Engineering
    • Why? Data Science suffers from the twin problems of being new and experiment-driven
  • Don't assume that because you have so much data and all of it is queryable, all of it is usable
  • Don't hire ultra-specialists (e.g. post-docs and PhDs) too early, barring products which require invention rather than application