

Data Science Org Design for Startups

While there is plenty of good advice on making ML work and on building a career as a Data Scientist, there is very little discussion on organization design for Data Science itself.

This blog will hopefully help folks not just build their team, but also understand the ecosystem from which they are hiring.

Organization Design is determined by these 3 broad categories:

  1. Software Engineer vs Research: To what extent is the Machine Learning team responsible for building or integrating with software? How important are Software Engineering skills on the team?
  2. Data Ownership: How much control does the Machine Learning team have over data collection, warehousing, labeling, and pipelining?
  3. Model Ownership: Is the Machine Learning team responsible for deploying models into production? Who maintains the deployed models?

--- Josh Tobin at Full Stack Deep Learning

It's harder for ML/DS teams to succeed than your typical product and platform engineering functions in startups. This is because:

  • Good folks are hard to retain as their skills are highly transferable across multiple roles (i.e., potentially high team attrition)
  • Management is unclear on what "success" looks like for ML as a function

These were the two key pitfalls I wanted to solve for when designing the ML team and deciding how it sits in the larger org. The most well-known ways in which companies organise Machine Learning teams are these:

1. Research & Development Labs

Of these, an R&D Lab is ideal for most well-capitalized businesses because it enables them to attract talent, which can in turn work on long-term business/tech challenges.

Uber AI Labs, Google AI Research, DeepMind, and FAIR are some examples off the top of my head. Microsoft Research - with contributions across almost all of Computer Science - should probably be the gold standard for this. I have personally spent some time working in such an org design.¹

The limitation of this org design is that R&D teams don't own the pipeline to their work, i.e. the inputs (data) and outputs (model performance in production). To be fair, R&D teams in some places do own the data inputs, which makes the process more end to end - but the deployment in production is often still not under them. This makes the org design all but useless for a pre/near Product Market Fit startup.

2. Embedded within Business & Product Teams

By far the most popular org design in Indian startups: a Data Scientist is embedded into an engineering team along with an analyst, and that team is then assigned to a business or product team. Off the top of my head, this is how some of the best Data Science teams, like Airbnb, Flipkart, and Facebook, organize their ML teams.

I strongly considered this org design, but ultimately opted against this because it would not play to my strengths at all.

I expected these challenges:

  1. Hard to maintain uniform data and engineering standards across the org

In this org structure, the primary, and sometimes only, stakeholder for each Data person is their Product Manager. A lot of work gets repeated, e.g. data pipelines, cleaning, and pre-processing. Larger organisations enforce some degree of uniformity via their Data Platforms or equivalent. In the early stages, that platform effort is not worth the speed you gain from decoupling.

  2. Management Complexity, in terms of the ever-increasing breadth of problems across different features

In the embedded setup, each team could itself be working on a wide variety of “small” ML problems, e.g. demand forecasting, text embedding, and sentiment analysis could all be worked on by a single Data Scientist.

Since the Product Manager doesn’t have the technical skill to evaluate whether the solution approach was apt or not, it falls on the Data Science Manager to have a lot of breadth and assist several IC Data Scientists across multiple problems at the same time.

3. Data Science Consultant

Small businesses which themselves have a services arm or services revenue love this.

Business or product teams bring specific problems to a data science lead, who then scopes out a plan, defines success criteria, and hands it off to a Data Scientist or Machine Learning Engineer within the team.

There are several understated but commonly known limitations to this:

  1. Less Engaged Team: Since problem solving and implementation are separated, the engineer feels less creative, less empowered to make changes, and less invested in getting the small details right. There is no single owner of the data or models, and thereby no single person responsible for the technical outcomes of the project.

  2. Communication Overhead: this costs both energy and time, and it happens twice: first, when the consultant understands the problem from the person on the team, and second, when the consultant transfers/shares the proposed solution. This is not just slow, it's error-prone and expensive.

This makes completing feedback loops even harder, since no one person has all the context needed to carry forward to the next project. This option was dropped fairly early as a candidate for that specific reason.

4. [Near Future] Productized Data Science Team

When I studied more modern teams from the outside, especially in B2B SaaS or eCommerce, I felt they made a small but important change to this model: instead of a matrix where the Data Scientist was ultimately responsible only to their own product/pod, they had a central Data Science function to which all Data Scientists reported.

Some teams created new "Chief Data Scientist" or "VP, Machine Learning" designations to reflect this increased autonomy and status within the org. Notice that this is quite similar to how some Design teams are organized.

While I had not worked under this org design, I had interned at a small company (10-50 employees), and I could understand the limitations of this org design from what I was told there.

The most common warning was the amount of context any Data Science lead/manager has to keep once the company crosses a certain project count. I expect that the Verloop.io ML Team will evolve into this over the next 12-24 months. I'm estimating this on the basis of the problem complexity and the headcount needed for both the engineering and data science teams. If we can have ICs reporting to both the Product Manager and a Data Science org, the added management complexity would be worth the faster shipping speed via shared tooling and context.

5. [Verloop Today] Full Stack Data Science

This is the org design at Verloop.io ML today. The defining characteristic of this org design is that every ML person does things end to end - there is no division of labour.

There is a brilliant explanation on this from StitchFix: https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/

The goal of data science is not to execute. Rather, the goal is to learn and develop new business capabilities. … There are no blueprints; these are new capabilities with inherent uncertainty. … All the elements you’ll need must be learned through experimentation, trial and error, and iteration. – Eric Colson

As Eric calls out, Data Science - unlike, say, Platform Engineering - is not a pure "execution" function. It's an iteration and discovery function of the organisation. In business terms, you might call this Market Research, but where technology is applied to develop new capabilities.

This full-cycle development approach seems to be officially endorsed by Netflix Tech, and by Data Science folks at Lazada as well.

Case Study: ML Org at Verloop.io

I hope the above gives you a sense of common data science team organisations. If you've seen them in the past, now we both have a shared vocabulary to talk about them.

As a case study, let me share some of the operational things we had at Verloop.io. This was mostly as part of our 0 to 1 journey as a B2B SaaS product.

These are not recommendations, but just how things shaped up in the early days. I hope this gives you a case study to concretely think about what we just discussed.

  1. The ML function reported directly to the CEO for the longest time. The CEO directly brings the business context and drives quick wins. The ML Lead needs to negotiate continuously on organisational goals, constantly query for added context, and make long-term bets.

Part of the ML Product Manager role also got absorbed into what I'd been doing as a Machine Learning Lead/Manager because we did not have a full time Product Manager in the company for more than 6 months.

  2. Attracting young talent was easier by giving them quite high autonomy. The team also owns model performance and deployment.

Deployment ownership is made possible by an ML System Design decision as well: strong adherence to multi-tenant models instead of client-specific models.
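
To make that concrete, here is a rough sketch (not our actual code) of what multi-tenancy means at serving time: a single shared model takes the client identifier as context, instead of a registry of client-specific models. The class and label maps below are hypothetical.

```python
# Illustrative sketch only: one shared model serves all clients (tenants),
# with the client id passed as context, instead of one model per client.
from typing import Dict


class MultiTenantIntentModel:
    """A single model artifact shared across all clients."""

    def __init__(self, client_label_maps: Dict[str, Dict[int, str]]):
        # Per-client configuration (e.g. label names) is data, not a separate model.
        self.client_label_maps = client_label_maps

    def predict(self, client_id: str, text: str) -> str:
        class_id = self._shared_forward(text)  # same weights for every client
        labels = self.client_label_maps.get(client_id, {0: "other"})
        return labels.get(class_id, "other")

    def _shared_forward(self, text: str) -> int:
        # Placeholder for the real shared-model inference call.
        return 0


# One deployment to own and maintain, instead of N client-specific deployments.
model = MultiTenantIntentModel({"acme": {0: "greeting"}, "globex": {0: "hello"}})
print(model.predict("acme", "hi there"))
```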

  3. Talent pool is smaller + retention is hard

We had a smaller talent pool for at least two reasons. First, a few data science candidates refused to join the team because they were not interested in engineering and wanted to focus exclusively on modeling tasks.

Second, in some other cases, the conversation broke down because we couldn't match their pay expectations.

Our retention problems were largely of our own making - good intentions, but bad outcomes:

The engineering org in our early stages did a lot of ad hoc development in smaller, demo-driven sprint cycles. We assumed that separating ML from the rest of the engineering org would allow us to ship faster, and to focus longer on one project without being distracted by ad hoc tasks. This did work to a certain extent.

In hindsight, this was a mistake. It definitely empowered us to ship faster, but teammates felt isolated, and it was hard to close the feedback loop with our end users via the Product Manager alone. Additionally, if we needed engineering's help to ship something, they'd pick “their” work over integrating our shipped work. This slowed down our shipping pace over a longer duration, which in turn hurt the morale of the team and made retention much harder.

There are 3 things I'd do differently the next time around:

  1. Remove the middleman (i.e. me): the PM and the Data Scientist should work directly with each other, instead of information flowing through me as the nodal person.
  2. Better Retrospectives: We did a few reviews i.e. what went well or wrong, but not enough of “How does this inform our future?”
  3. Add Front End and DevOps Skills: A lot of our releases would not reach the end user because the interface was designed but not implemented. Engineering teams would quite obviously pick their own OKRs above ours. The short-term fix is to add Front End and DevOps skills.

Even something as simple as being able to build+deploy Gradio or Streamlit demos would go a long way in convincing the org to prioritise the shipped work.
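
For a sense of scale, here is a minimal Gradio sketch of the kind of demo a full stack Data Scientist could ship without any front-end help; the `classify_intent` function is a hypothetical stand-in for whatever model the team actually serves.

```python
# Minimal Gradio demo sketch (hypothetical model wrapper).
import gradio as gr


def classify_intent(message: str) -> str:
    # Placeholder for a real model call, e.g. an intent classifier.
    return "greeting" if "hello" in message.lower() else "other"


demo = gr.Interface(
    fn=classify_intent,
    inputs=gr.Textbox(label="User message"),
    outputs=gr.Textbox(label="Predicted intent"),
    title="Intent classifier demo",
)

if __name__ == "__main__":
    demo.launch()  # serves a shareable web UI with no front-end code
```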

Ending Note

The terms are borrowed from the amazing blog by Pardis Noorzad: Models for integrating data science teams within companies | by Pardis Noorzad | Medium

Thanks to Eugene Yan and Maneesh Mishra for taking the time to review this piece. A lot of the improvements are thanks to their comments.

Photo by Rahul Chakraborty on Unsplash

Notes


  1. Advanced Technologies Lab at Samsung Research Institute, Bengaluru 

MLOps for Startups

Start your development by writing the overall impact and feature overview in the Press Release doc and README

If your time to ship is more than 2 weeks, write a functional spec

In case of bug fixes, add bug details or link to Asana/Github Issues

Always do trunk-based development. Don't restrict deployment triggers to specific people. As soon as you are done, go ahead and deploy, and let others deploy too.

SERVICE DETAILS

DOCS

  1. Please provide API documentation for your service, e.g. via an API definition (a minimal sketch follows this list)
  2. Add auto-generated engineering docs in HTML/Markdown
  3. Who is the maintaining team/person at this moment for the service? All maintainers and committers should be listed
  4. The repo README should include instructions to set up this repo for development
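
As one hedged example (assuming a FastAPI stack, which the checklist does not mandate), both the OpenAPI definition and the HTML docs can be auto-generated from type hints and docstrings:

```python
# Sketch: a FastAPI service whose API docs are auto-generated
# at /docs (Swagger UI) and /openapi.json (the API definition).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="intent-service", version="0.1.0")


class PredictRequest(BaseModel):
    client_id: str
    text: str


class PredictResponse(BaseModel):
    intent: str


@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    """Classify a user message into an intent (placeholder logic)."""
    intent = "greeting" if "hello" in req.text.lower() else "other"
    return PredictResponse(intent=intent)
```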

Service Component — DATABASE

  1. List down all the database changes: whether you added or removed any columns or tables (a sketch of such a migration follows this list).
  2. What kind of indexes do you have? If you added a new column, does it require an index? If yes, why? If no, why not?
  3. Do you make changes to records? Do you do frequent deletes or updates?
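
For teams on SQLAlchemy/Alembic (an assumption, not a requirement of this checklist), a migration that documents the column and index decisions above might look like this sketch; the table and column names are made up.

```python
# Hypothetical Alembic migration: adds a column plus the index it needs,
# with comments answering the "why" questions from the checklist.
from alembic import op
import sqlalchemy as sa


def upgrade():
    # New column: which client a conversation belongs to.
    op.add_column("conversations", sa.Column("client_id", sa.String(64), nullable=True))
    # Index added because most reads filter by client_id (multi-tenant lookups).
    op.create_index("ix_conversations_client_id", "conversations", ["client_id"])


def downgrade():
    op.drop_index("ix_conversations_client_id", table_name="conversations")
    op.drop_column("conversations", "client_id")
```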

Service Component — MONITORING

  1. List all the services you own and, for each, list its server monitoring parameters (see the sketch below)
  2. Alerts for service uptime, service performance degradation e.g. latency, throughput
  3. Alerts for service machine disk/CPU/memory — what’s the threshold and how are they triggered

Please include today’s screenshots for each of them e.g. StackDriver. We need to make sure that you have proper monitoring in place.
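
If the service is in Python, one way (among many) to expose the latency and throughput numbers these alerts rely on is the prometheus_client library; the metric names below are illustrative.

```python
# Sketch: exposing request count and latency from a Python service
# so alerting (Prometheus, StackDriver, etc.) has something to fire on.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("ml_requests_total", "Total prediction requests")
LATENCY = Histogram("ml_request_latency_seconds", "Prediction latency in seconds")


def handle_request(text: str) -> str:
    REQUESTS.inc()
    with LATENCY.time():
        time.sleep(0.01)  # placeholder for real model inference
        return "other"


if __name__ == "__main__":
    start_http_server(9100)  # metrics exposed at :9100/metrics for scraping
    while True:
        handle_request("hello")
```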

Service Component — DEPLOYMENT

TODO: Add step-by-step documentation on how to write your first service and deploy it to the dev/stage/production environments for your org

CODE STYLING

PYTHON

  1. Add a pre-commit hook with black, isort, and flake8 to your code. Follow the happy path convention. Add type hints

  2. Start writing code by writing APIs, then tests, and then implement the code. We use tests as API docs

  3. Add proper liveness and/or readiness checks for your Kubernetes deployment (a sketch follows below):

The liveness check is for Kubernetes to know whether your application is running.

The readiness check is for you to specify when traffic should be sent to the pod. For example, if your pod needs some operations to be done before it can take traffic (such as downloading a dataset), your readiness check should send back 200 only once it's completely ready to take traffic.
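
A minimal sketch of what this can look like in a Python service (the framework and endpoint paths are assumptions; Kubernetes only cares that the configured probes return success):

```python
# Sketch: separate liveness and readiness endpoints for Kubernetes probes.
from fastapi import FastAPI, Response, status

app = FastAPI()
model_ready = False  # flipped to True once startup work (e.g. dataset download) is done


@app.on_event("startup")
def load_artifacts():
    global model_ready
    # Placeholder for downloading datasets / loading model weights.
    model_ready = True


@app.get("/healthz")
def liveness():
    # Liveness: the process is up and able to respond at all.
    return {"status": "alive"}


@app.get("/readyz")
def readiness(response: Response):
    # Readiness: return 200 only once the pod can actually take traffic.
    if not model_ready:
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
        return {"status": "warming up"}
    return {"status": "ready"}
```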

Docs:

  1. Kubernetes Manual
  2. Blog on Best Practices

DATA BEST PRACTICES

VERSION CONTROL

Use dvc.org to version all your datasets to Cloud Storage Buckets, and include your DVC files in your GitHub repo
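
Once a dataset is tracked this way, code can pull the exact version it was trained on through the `dvc.api` helpers; the repo URL, path, and tag below are placeholders.

```python
# Sketch: reading a DVC-versioned dataset pinned to a specific git revision.
# Assumes the dataset was added with `dvc add` and pushed to a cloud remote.
import dvc.api

data = dvc.api.read(
    path="data/intents.csv",                 # hypothetical path tracked by DVC
    repo="https://github.com/example/repo",  # hypothetical repo holding the .dvc file
    rev="v1.2.0",                            # git tag/commit pinning the dataset version
)
print(len(data))
```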

EXPERIMENTS

  1. Aim for reproducible experiments by re-using actively maintained APIs

  2. There is no mandated tooling for reproducible experiments, but consider using something simple like Sacred or Hydra (a short Sacred sketch follows)
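
For instance, a Sacred experiment (one of the two tools mentioned above) captures config and run metadata with very little ceremony; the hyperparameter values are placeholders.

```python
# Minimal Sacred experiment sketch: config and run metadata are written
# to ./runs so the experiment can be inspected and reproduced later.
from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment("intent_classifier")
ex.observers.append(FileStorageObserver("runs"))


@ex.config
def config():
    lr = 3e-4               # hyperparameters live in the config, not in code
    epochs = 5
    dataset_rev = "v1.2.0"  # ties the run back to a versioned dataset


@ex.automain
def train(lr, epochs, dataset_rev):
    # Placeholder training loop; the return value is stored with the run.
    return {"val_accuracy": 0.0, "lr": lr, "epochs": epochs}
```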

How to Read a Deep Learning Paper

Who is this for?

Practitioners who are looking to level up their game in Deep Learning

Why Do We Need Instructions on How to Read a Deep Learning Paper?

Quantity: There are more papers than we can humanly read, even within our own niche. For instance, consider EMNLP - arguably the most popular Natural Language Processing conference - which selects more than 2K papers across a variety of topics. And NLP is just one area!

Some people read academic papers like they read novels: open the link, read the text, scroll down, finish, close the tab. Some read them like a math book with problems, obsessing over every detail; their Zotero or hypothes.is accounts are filled with annotations which they are probably never going to revisit in their lifetimes. Others skim, but without a coherent structure. All of these are valid ways to read a paper.

Here, I am trying to distill and form a better structure for myself to improve the return on my very limited energy.

Four Types of Reading

  • In his cult classic book, "How to Read a Book", Mortimer J Adler explains his Four Types of Reading, mainly keeping a non-fiction book in mind. I am adapting these to the context of Deep Learning papers for us.

Elementary

This is where you are when you finish a 101 course in Machine Learning. You know the key terms and vocabulary, e.g. convergence, loss functions, and optimizers. You can understand what the words in the paper mean, read them, and maybe follow the narrative, but not much more. Since you're reading this blog, I assume you are already reading at a level above this.

Inspectional

This is basically skimming. You look at the headings, read the beginning and end of some sections, and some of the statements in bold. The intent is to get a fast and superficial sense of what the author is trying to say.

Intelligently Skimming

  • The first type of inspectional reading is systematic skimming, which you can easily put into practice today. This is most useful when you're reading within a topic you have some familiarity with. For instance, within most topics around intent classification in dialogue systems -- this is how I would read. Here's how you start:

  • Read the title and abstract. This might seem obvious, but authors do put in the effort to compress their key ideas, findings, or contributions in these places. This effect is even more amplified since these are the most important fields on arXiv. If you spend a minute of full attention, you should get a feel for the intent and scope of the work. This will not only prime you for what you might read next, but also mentally map this work to ideas you might already be familiar with.

  • Skim the Section Headings, which will give you a feel for the organization of the paper. Is the paper emphasizing datasets? A new architecture? Or is it an empirical work which is basically throwing compute at a problem and sharing that "See, it works!"? I am always a little annoyed when I am discussing a paper and it turns out that the reader has not even understood why the paper's sections are organized in a specific way. Obviously, many conferences have specific templates which make it even easier to discover the structure the authors actually wanted you to pay attention to.

  • Get a sense of the context. This means skimming the Related Work section. The intent isn't to read every paper or idea mentioned in this section, but only to note the topics mentioned here. This will help you get a sense of the jargon used, the variety of topics, and what the authors consider adjacent problems/areas.

  • Read the Conclusion. Authors generally do a good job of summarizing their work in the last few pages. This is where they sum up what they think is most important about their work. Just jump to this first.

💡Pro Tip💡: Check out the authors' interview, podcast, oral presentation, Twitter thread, or poster. While this has nothing to do with the actual paper, these can be a great way to get the gist of a paper in 30 minutes or so. Authors do so much promotion now that it's relatively easy to find interviews. Many selected papers have oral presentations. And of course, the authors use the best examples from the paper in these appearances.

Superficial Reading

This is most useful when you're reading outside your usual comfort zone. Here is the key idea: read without stopping.

If you read a lot of papers, you will find that there are some things you don't understand. If you stop and try to figure out what each one means, it will take a long time to finish the paper. But if you keep reading, the next thing that comes up will often help explain what the first thing meant. And so on.

You might get very little of what is being said in the first pass and that's fine. You now know the lay of the land, and when you make a second pass -- you can connect the dots much better and faster.

Analytical Reading

This is where you really dive into a text. You read slowly and closely, you take notes, you look up words or references you don’t understand, and you try to get into the author’s head in order to be able to really get what’s being said.

Don't Google Too Early. If there is a math formula, concept, or word you don't know, first look at the context to try to discern its meaning. See if the author explains when it is used or why they used it. Warm up and use your brain to get started. If it's something you simply can't get past, or the term is clearly too important to gloss over, then check the citations. If even that isn't enough, then finally Google it. The main point is that you can use the tools around you, but don't lean on them. Let your brain work a little before letting Google work for you.

Get a sense of the authors' background. Look at what institutions they mention. Are they from academia? An applied AI lab like Apple or GoogleAI? Or an academic-style lab sponsored by industry, like DeepMind/FAIR? Two examples of how this can inform your reading:

    1. There are some companies/labs where a person has to write a certain number of papers every year in order to get promoted (or even retain their jobs) -- they typically have narrow ideas which solve a specific problem incredibly well, but are mostly not adaptable to another domain or context.
    2. Teams and labs have distinct flavors and sometimes work on specific themes. This can help you quickly get a sense of whether the paper is part of a longer series and see the papers before and after the one you're reading.

Answer the 4 Key Questions

  • This, Adler says, is actually the key to analytical reading. Being able to answer these questions shows that you have at least some understanding of the paper and what you've read. If you can't answer them, you probably haven't paid close enough attention. I also find it personally helpful to actually write (or type) these answers out. Consider it to be like a book journal. It'll stay with you and become much more ingrained than if you just answer them in your head.

What is the paper about, as a whole? This is essentially the abstract or conclusion. You could cheat, but that's not going to be very helpful. Instead, use your own words and write down the highlights of what you can recall about the paper. See if you can connect it to the wider knowledge base you've built from past reading.

What is being said in detail, and how? This is where you start to dig a little deeper. Briefly go back and skim through the paper, jogging your memory of the key points, formulae, section headings, graphs, and tables with results. With most papers, outlining is pretty straightforward since the section headings do the bulk of the job for you. For short papers, this could be as short as 5-10 lines. Pay special attention to the datasets, experiment configurations, and ablation results if they're mentioned.

Is the paper true - in whole or in part? If you're reading within your own comfort zone, you'll begin to see by now the scenarios/tasks/areas where the paper falls short. For instance, if you're reading up on Long Range Transformers -- based on your knowledge of pre-trained Transformer models like BERT, RoBERTa, or T5 -- you should expect them to do better at summarization and Q&A tasks than those models. If the paper falls short, you can quickly jot that down as a question to ask, ponder upon, or experiment with yourself. This is true for both peer-reviewed and pre-print papers - they can often have glaring errors and mistakes which you might notice.

So what? What's the significance? Most papers are incremental in their contribution to the world. This is not necessarily a bad thing. As long as the paper made you see the field or area in a new light, or gave you even a small nugget of knowledge, it was helpful. We should aim to read papers which at least give us something valuable in perspective, knowledge (e.g. empirical facts), or methods. This is by far the most useful question to ask, since it helps contextualize the contributions of the author against your own personal context, understanding, and knowledge.

The core idea here is that reading is not a passive process. You have to actively engage with the text and think about what you read. It's natural to just scan over the text without actually retaining anything. To counteract this, make a conscious effort to stop and think about what the author is saying. A good way of doing this is to write down a list of questions about what the author says; this will force you to stop and think about the content. When answering these questions, write the answers in your own words. This means you can't just parrot the author's words back at them; instead, rephrase the ideas in your own words, which will help you engage with them in a more meaningful way.

💡Pro Tip💡: Generate questions about the content of the reading. For example, convert headings and sub-headings into questions, and then look for answers in the content of the text. Other more general questions may also be formulated: - What is this paper about? - What question is this section trying to answer? - How does this information help me?

Optional: critique and share your thoughts with others. This step is dead last. Only after having read the entire paper, or sets of papers, and thoughtfully answering the 4 Key Questions can you critique or have meaningful discussions about the paper.

  • For starters, a reasonable critique asks more questions ("Did they assume X?", "What would happen if I replaced method M1 with M3?") than it makes verdicts ("The paper is amazing", "This is stupid").
  • Second, fight the temptation to write a paper summary instead of a critique. That's the fad these days but isn't quite meaningful enough. Writing a measly, annoying Medium blog might feel like an achievement (thanks to cheap applause) but doesn't improve your understanding as much as writing a critique will.
  • Also, you don't have to take everything the authors claim as God's Gospel. Having a disagreement is completely fine and valid. But give them the benefit of the doubt and ask questions instead of making assumptions. It goes without saying that you don't have to agree or disagree with every part of the paper. You can freely love one part and ignore the rest. There is no need (or advantage) to have an opinion about everything.

💡Pro Tip💡: You can use the Question Generation idea even during "Intelligently Skimming", especially for topics where you're comfortable. This will save you a lot of time and energy during Analytical Reading.

Syntopical

This is mostly used by researchers and professors. It's where you read multiple papers on a single subject and form a thesis or original thought by comparing and contrasting various authors' thoughts. This is time and research intensive, and it's not likely that you'll do this type of reading very much, unless your day job is paying you to read and write papers. I do not have the relevant expertise to help you with this.

To quickly recap:

  • Use Inspectional Reading when you're first reading a paper
  • Use Analytical Reading and answer the 4 questions when you're looking to get a deeper, better gist of the paper

Four Questions You Should Be Able to Answer

  • What is this paper about?
  • What is being said in detail, and how?
  • Is this paper true in whole or in part?
  • So what?

Resources

Thanks to Gokula Krishnan and Pratik Bhavasar for reviewing early versions of this.