How to Read a Deep Learning Paper

Who is this for?

Practitioners who are looking to level up their game in Deep Learning

Why Do We Need Instructions on How to Read a Deep Learning Paper?

Quantity: There are more papers than we can humanly read, even within our own niche. For instance, consider EMNLP, arguably the most popular Natural Language Processing conference, which selects more than 2K papers across a variety of topics. And NLP is just one area!

Some people read academic papers like they read novels: Open link. Read the text. Scroll Down. Finish. Close tab. Some people read like a math book with problems, obsessing over every detail. Their Zotero or hypothes.is accounts are filled with annotations which they are probably never going to revisit in their lifetimes. Others skim but without a coherent structure. All of these are valid ways to read a paper.

Here, I am trying to distill and form a better structure for myself to improve the return on my very limited energy.

Four Types of Reading

  • In his cult classic book, "How to Read a Book", Mortimer J. Adler explains his Four Types of Reading, mainly keeping a non-fiction book in mind. I am adapting these to the context of Deep Learning papers.

Elementary

This is where you are when you finish a 101 course in Machine Learning. You know the key terms and vocabulary, e.g. convergence, loss functions and optimizers. You can understand what the words in the paper mean and read them, maybe follow the narrative, but not much more. Since you're reading this blog, I assume you are already reading at a level above this.

Inspectional

This is basically skimming. You look at the headings, read the beginning and end of some sections, some of the statements in bold. The intent is to get a fast & superficial sense of what the author is trying to say.

Intelligently Skimming

  • The first type of inspectional reading is systematic skimming, which you can easily put into practice today. This is most useful when you're reading within a topic you've some familiarity with. For instance, within most topics around intent classification in dialogue systems -- this is how I would read. Here’s how you start:

  • Read the title and abstract. This might seem obvious, but authors do put in the effort to compress their key ideas, findings or contributions in these places. This effect is even more amplified since these are the most important fields on arXiv. If you spend a minute of full attention, you should get a feel for the intent and scope of the work. This will not only prime you for what you might be reading next, but also mentally map this work to ideas which you might be already familiar with.

  • Skim the Section Headings, which will give you a feel for the organization of the paper. Is the paper emphasizing datasets? A new architecture? Or is it an empirical work which basically throws compute at a problem and reports "See, it works!"? I am always a little annoyed when I am discussing a paper and it turns out that the reader has not even grasped why the paper's sections are organized in a specific way. Many conferences have specific templates, which make it even easier to discover the structure the authors actually wanted you to pay attention to.

  • Get a sense of the context. This means skimming the Related Work section. The intent isn't to read every paper or idea mentioned in this section, but only to note the topics mentioned here. This will help you get a sense of the jargon used, the variety of topics and what the authors consider adjacent problems/areas.

  • Read the Conclusion. Authors generally do a good job summarizing their work in the last few pages. This is where they sum up what they think is most important about their work. Just jump to this first.

💡Pro Tip💡: Check out the authors' interviews, podcasts, oral presentations, Twitter threads or posters. While these are not the paper itself, they can be a great way to get the gist of a paper in 30 minutes or so. Authors do so much promotion now that it's relatively easy to find interviews. Many selected papers have oral presentations. And of course, they use the best examples from the paper in these interviews.

Superficial Reading

This is most useful when you’re reading outside your usual comfort zone. Here is the key idea: Read without stopping

If you read a lot of papers, you will find that there are some things that you don’t understand. If you stop and try to figure out what it means, it will take a long time to finish the paper. But if you keep on reading, the next thing that happens will help explain what the first thing meant. And so on.

You might get very little of what is being said in the first pass and that's fine. You now know the lay of the land, and when you make a second pass -- you can connect the dots much better and faster.

Analytical Reading

This is where you really dive into a text. You read slowly and closely, you take notes, you look up words or references you don’t understand, and you try to get into the author’s head in order to be able to really get what’s being said.

Don't Google Too Early. If there is a math formula, concept, or word you don’t know, first look at the context to try to discern its meaning. See if the author explains what happens when or why they used it. Warm up and use your brain to get started. If it’s something you simply can’t get past, or the word is clearly too important for you to glance over, then check the citations. If even that isn't enough, then finally Google it. The main point is that you can use the tools around you, but don’t lean on them. Let your brain work a little bit before letting Google work for you.

Get a sense of the author's background. Look at which institutions they mention. Are they from academia? An applied AI lab like Apple or GoogleAI? Or an academic-style lab sponsored by industry, like DeepMind/FAIR? Two examples of how this can inform your reading:

    1. There are some companies/labs where a person has to write a certain number of papers every year in order to get promoted (or even retain their jobs) -- they typically have narrow ideas which solve a specific problem incredibly well, but are mostly not adaptable to another domain or context.
    2. Teams and labs have distinct flavors and sometimes work on specific themes. This can help you quickly get a sense of whether the paper is part of a longer series and see the papers before and after the one you're reading.

Answer the 4 Key Questions

  • This, Adler says, is actually the key to analytical reading. Being able to answer these questions shows that you have at least some understanding of the paper and what you've read. If you can’t answer them, you probably haven’t paid attention well enough. I also find it helpful to actually write (or type) these answers out. Consider it to be like a book journal. It’ll stay with you and become much more ingrained than if you just answer them in your head.

What is the paper about, as a whole? This is essentially the abstract or conclusion. You could cheat, but that's not going to be very helpful. Instead, use your own words and write the highlights of what you can recall about the paper. See if you can connect it to the wider knowledge base which you've read in the past.

What is being said in detail, and how? This is where you start to dig a little deeper. Briefly go back and skim through the paper, jogging your memory of the key points, formulae, section headings, graphs and tables with results. With most papers, outlining is pretty straightforward since the section headings do the bulk of the job for you. For short papers, this could be as short as 5-10 lines. Pay special attention to the datasets, experiment configurations and ablation results, if they're mentioned.

Is the paper true - in whole or in part? If you're reading within your own comfort zone, you'll begin to see by now the scenarios/tasks/areas where the paper falls short. For instance, if you're reading up on Long Range Transformers -- based on your knowledge of pre-trained Transformer models like BERT, RoBERTa or T5, you should expect the long-range models to do better at summarization and Q&A tasks than these. If the paper falls short, you can quickly jot that down as a question to ask, ponder or experiment with yourself. This is true for both peer-reviewed and pre-print papers - they can often have glaring errors and mistakes which you might notice.

So what? What’s the significance? Most papers are incremental in their contribution to the world. This is not necessarily a bad thing. As long as the paper made you see the field or area in a new light, or gave you even a new nugget of knowledge - it was helpful. We should aim for reading papers which at least give us something valuable in either perspective, knowledge (e.g. empirical facts) or methods. This is by far the most useful question to ask since it helps contextualize the contributions of the author against your own personal context, understanding and knowledge.

The core idea here is that reading is not a passive process. You have to actively engage with the text and think about what you read. It's natural to just scan over the text without actually retaining anything. To counteract this, you need to make a conscious effort to stop and think about what the author is saying. A good way of doing this is to write down a list of questions about what the author says. This will force you to stop and think about the content. When answering these questions, you need to write the answers in your own words. This means that you can't just parrot the author's words back at them. Instead, you need to rephrase the ideas in your own words. This will help you engage with the ideas in a more meaningful way.

💡Pro Tip💡: Generate questions about the content of the reading. For example, convert headings and sub-headings into questions, and then look for answers in the content of the text. Other more general questions may also be formulated:

  • What is this paper about?
  • What question is this section trying to answer?
  • How does this information help me?

Optional: critique and share your thoughts with others. This step is dead last. Only after having read the entire paper, or sets of papers, and thoughtfully answering the 4 Key Questions can you critique or have meaningful discussions about the paper.

  • For starters, a reasonable critique asks more questions ("Did they assume X?", "What would happen if I replaced method M1 with M3?") than it makes verdicts ("The paper is amazing", "This is stupid").
  • The second point is to fight the temptation to write a paper summary instead of a critique. That's the fad these days but isn't quite meaningful enough. Writing a measly, annoying Medium blog might feel like an achievement (thanks to cheap applause) but doesn't improve your understanding as much as writing a critique will.
  • Also, you don't have to take everything that the authors claim as gospel. Having a disagreement is completely fine and valid. But give them the benefit of the doubt and ask questions instead of making assumptions. It goes without saying that you don't have to disagree or agree with every part of the paper. You can love one part completely and ignore the rest. There is no need (or advantage) to have an opinion about everything.

💡Pro Tip💡: You can use the Question Generation idea even during "Intelligently Skimming", especially for topics where you're comfortable. This will save you a lot of time and energy during Analytical Reading.

Syntopical

This is mostly used by researchers and professors. It’s where you read multiple papers on a single subject and form a thesis or original thought by comparing and contrasting various other authors’ thoughts. This is time and research intensive, and it’s not likely that you’ll do this type of reading very much, unless your day job is paying you to read and write papers. I do not have the relevant expertise to help you with this.

To quickly recap:

  • Use Inspectional Reading when you're first reading a paper
  • Use Analytical Reading and answer the 4 questions when you're looking to get a deeper, better gist of the paper

Four Questions You Should Be Able to Answer

  • What is this paper about, as a whole?
  • What is being said in detail, and how?
  • Is this paper true in whole or in part?
  • So what?

Resources

Thanks to Gokula Krishnan and Pratik Bhavasar for reviewing early versions of this.

Building a Data Science Team at a Startup

Hello!

If we are meeting for the first time, a short version of my story so far: After doing research engineering for almost 4 years across startups and a BigCo, I joined Verloop.io, a B2B startup that makes customer support automation SaaS, as an early machine learning engineer in 2019. I was there till April 2021.

We were directly responsible for most Natural Language Processing needs within the business.

While there is plenty of good advice on making ML work and making a career as a Data Scientist, I wanted to write about my experience as a manager/early engineer who built out an ML team at a startup.

I hope you find it useful!

Who is this for?

People with these questions:

  • "What would it be like to be a Data Science Manager?"
  • "How can I build a Machine Learning team from scratch?"
    • This is typically founders e.g. CEO/CTO looking to build out a Machine Learning function

Bonus: Beliefs

Hiring

Hiring for Machine Learning Engineering is hard.

This is true despite what the media might tell you about the glut of data science talent. There are too many people who can model churn well or do the Titanic dataset right with Logistic Regression, and not enough people who can come up with a simple TF-IDF model in a language they have never worked with, e.g. one that is not English.

In the summer of 2019, I did over 120 introductory calls between February and August, averaging 6 calls every week from April to June. This converted to 1 summer intern and 3 full-time people.

One Role to Rule Them All

For >1 year, we have hired for exactly ONE role: Machine Learning Engineer.

As explained in our MLE Prep Guide, we collapsed several roles into one. This means we don't hire a specialized Data Scientist, Researcher and then separate Product and Data Engineers to productionize the models.

At Verloop, the same engineer takes the entire system live from research to production, and occasionally several months into maintenance and growth as well.

This allows us to keep the hiring process straightforward, highly repeatable, while still maintaining enough faith in the process.

T Shaped Skill Maps

Our hiring process evaluates all candidates on primarily 3 skills:

  1. Low Level API Design for Web Services
  2. Programming Mindset and Quality
  3. Natural Language Processing/Machine Learning Skills (see Prep Guide if you're curious)

I biased the top of the funnel of the interview process to hire a team of complements around my weaknesses. Each complement should bring in a skill which is my weakness, but that individual's strength.

Each of the first 3 full-time hires was at least 2x better than me on their skill. This is important as each person has a multiplier effect on the productivity/quality of other developers on the team.

To give you a sense of these:

  1. Person A has clear, proven strengths in High Level System Design and Databases
  2. Person B has clear, proven strengths in DevOps and Cloud Deployments
  3. Person C has clear, proven strengths in "hacking" with Deep Learning, optimizing for speed

This input biasing seems to work amusingly well because it enables developers to always keep learning from each other.

For instance, one of the things which I have asked from each team is a load test of their service - this led one developer to simply build an internal wrapper around the popular locustio which works with our service design out of the box.

It frankly makes my job a lot easier if developers do amazing work without me pushing them to do it - especially by observing each other. I can honestly sit back and simply steer their curiosity, instead of investing my attention into solving the "motivation" problem.

Process

Our hiring process consists of two primary rounds: a programming challenge and an ML challenge. You can actually take a deeper look here: Verloop ML/NLP Interview Prep Guide.

The programming challenge is a take home exercise which is focussed on low level design and straightforward API design. We typically give 2-3 days to the candidate for this.

For the ML Challenge, we share a dataset for a take home challenge and then discuss the same with the candidate. This round has no right answer. It is deliberately open ended.

It gives us a lot of signals on a wide variety of things we care about, for instance:

  • how the candidate formulates the problem,
  • measures the model performance,
  • thinks about model selection and the importance of loss functions,
  • prior/acquired experience with real world datasets,
  • literature review/comfort with close to research work,
  • ability to write clean, readable code,
  • whether they include failed experiments, which indicates their confidence in sharing honest results

What has surprised me most is the number of otherwise skilled people who use "this is what I saw on Medium" as a valid explanation for selecting a particular approach. This lack of agency (autonomy?) is a red flag.

Onboarding

I am almost devilish when it comes to designing the best possible onboarding experience that I can. In the case of 1 intern, where I let it go off my radar - I think we didn't do our best as a team. The onboarding at Verloop ML consists of two specific pieces:

1. Before Joining Verloop

We share a 6-8 week learning calendar focussed mostly on Deep Learning and Natural Language Processing.

Each week is expected to take 8-10 hours of your time (but interns have told me that it takes closer to 30 hours) - and then I get on a 1-1 call with them and discuss what they did well and what they missed. I share context from research or our own systems when relevant. One core byproduct of this onboarding calendar is that the candidate should get very comfortable writing tons of experimental code of varying quality.

This onboarding calendar is custom designed for each candidate, depending on their strengths, which I should ideally understand during the interview process.

2. After Joining Verloop

The onboarding process for each engineer is customized to their weaknesses. For instance, an engineer coming from a stronger systems background will first be given a project which is mostly data cleaning and benchmarking ML models, so that they can get a deeper, more intimate understanding of how experiment design and evaluation works.

Similarly, freshers out of college, who typically come from weaker engineering backgrounds (but stronger DL skills) - will spend the first few months paying off tech debt, learning to read legacy code or building new CRUD services.

This is obviously in addition to the one heuristic which I've tried to follow: Ensure that every engineer gets one release into production within 4 weeks of joining Verloop.

Stakeholders

Managing Up

A common refrain from most senior ML people I spoke to is that: Leadership does not understand AI/Data Science

To me, this has always been more of an opportunity than a handicap. The most inspirational work to me in this sense is that of the likes of DJ Patil. He is credited with coining the term "Data Scientist" and was the Chief Data Scientist of the United States of America under the Obama administration.

I somehow think that an analytical & numerical leadership can be worked with, independent of their own training within the domain. For instance, I don't think Barack Obama can tell a random forest from a convolutional neural network.

Communicating with and convincing non-experts outside your domain is always hard, painstaking and tedious. I should clarify that I don't think this is an easy challenge and, hopefully, our peers in Design will agree. I think that it's worth the effort.

Here is what I would want to do in the future to make this better:

  1. Highlight opportunities e.g. this can be our moat/IP or unlock new value with the caveat: "if it works"
  2. State assumptions e.g. cost, development time -- this gives you a feedback loop on your assumptions as well
  3. Call out checkpoints in "state of work" and not timeline. E.g. better to say, once we have done 5 experiments instead of 5 weeks, since you might end up realising that cleaning the data itself is going to take 3x as long

It'd be extremely stupid to assume that any of this would have been possible without the high degree of support and autonomy from the CEO, Gaurav himself.

As much as I'd like to say/think that I earned that unfailing trust, Lord knows that I have made some messes which he had to clean up.

Managing People

Of all things, I have received more support here than I deserved, and I'm truly grateful for that. I think I have made quite a few people/psychology mistakes here. For instance, assuming that people want to be pushed and given the maximum autonomy possible, when they would rather be led and build mastery of their craft.

I also found myself being extremely angry quite a lot. Although I'd read Andy Grove's notion of Task Relevant Maturity, I don't think I did anywhere near a decent job of implementing it.

Managing my own mental state has been more work than I'd expected. My blast radius is much larger -- and deeper than I'd expected. I am sure there are plenty of people with far more nuance, patience and empathy who'd have done a better job at this.

Early on, I'd decided that I'd not repeat any of the mistakes that my previous managers had made. I ended up making a different version of the same f**king mistakes anyway.

After >1 year, the only thing which works for me is to listen to what people want and then do that.

Reading Reccos

There is so much good writing about People Management and Engineering Management in general that I'd be stupid to add to that clutter. Instead, I should point out books that have shaped how I think about Engineering/ML management as well:

  • The Manager's Path by Camille Fournier
  • The Effective Engineer by Edmond Lau

On Managing Myself:

  • Managing Oneself by Peter Drucker
  • Standout 2.0 by Marcus Buckingham

Books which are highly recommended, but didn't help me enough:

  • An Elegant Puzzle by Will Larson
  • Radical Candor by Kim Scott

Books which will probably have high impact, but in the future:

  • HBR's 10 Must Reads: On Communication
  • The Effective Executive by Peter Drucker
  • Nonviolent Communication by Marshall Rosenberg

Data Science Management

What we did:

  1. Encouraged every team to manage the project on their own by a few key metrics
    • Separated out goal and minimum metrics
  2. Measuring the metrics at some cadence, even if this was erratic in the beginning
  3. Each team drives their entire process from research to production to deployment
    • This encourages teams to think about engineering challenges pretty early and gives them reasonably high autonomy

What we should have also done:

  1. Encourage every team member to spend time data-ing i.e. exploring datasets, building a mental model around it, tagging it on their own
  2. Introduce Engineering Practices Early: Stronger emphasis on software engineering practices once they joined the team, e.g. doing TDD and code hygiene
  3. Have every

Peer Management

While I'd expected this to be easy, this turned out to be quite emotionally hard. Since I transitioned out of backend engineering roles pretty early in my career - there are entire topics and concepts which I am not great at. This is made worse by the fact that I'm familiar with them, but not comfortable.

So if 2 developers are discussing something, I can very well follow their conversation - but I don't have anything to contribute.

This is quite frustrating. The sheer, persistent feeling of incompetency. Luckily, I don't have too much of a self respect/ego to care about this. I've always gravitated towards what I can do best -- and what others cannot do well. At Verloop, that is Machine Learning/Data Science Management right now.

That said, within Verloop, ML has been the first adopter of almost all new dev tooling:

  • alerting and monitoring systems
  • an in-house logging library to improve our microservice observability
  • porting from the previous Proto serving solution to twirpy

This is at least partially because I didn't want the devs in my team to suffer.

Beliefs

Don't be Clever

Machine Learning is a game where 87% of models never go to production. Almost every model we have picked has gone to production.

I've worked in B2B SaaS companies, doing Machine Learning research, engineering and deployment for almost 4 years. Hence, I've some opinions on what doesn't work. I don't have strong opinions on what works though.

I encourage my teammates to come up with new ideas and propose everything from the annotation process to the alerting and monitoring configuration. The only place where I really intervene is if the rigour is absent, or they're solving for a problem without complete context.

So far, this mindset of simply trying to avoid mistakes, instead of trying to be clever, has worked well. I suspect as long as ML deployment is a failure-prone business, this mindset will serve me well. If something like GPT3/GPT4 lowers the risk considerably, I will have to adopt a different mindset altogether.

This has not always worked. We've made some stupid (in hindsight) bets which didn't work for a multitude of reasons, including our own overconfidence in our technical skill and market reasons.

For those who are into the risk vs uncertainty nuance, I should add that I think because we deal with narrow Machine Learning problems - we mostly deal with technology risk and not so much uncertainty. It can be quantified, estimated and analyzed - it's just that I don't have the training to do so formally.

Premortems

Pre-mortems are a habit/tool which is still somewhat better known in Product Management than in Data Science. In fact, they are better understood in Military Strategy than in Data Science:

He who knows his enemy ... - Sun Tzu

I typically list down the top 3 causes which will kill a project - and then actively monitor them till the project is so stable that I can pay attention to something else. Despite my best attempts, I have failed to inculcate this mindset in devs working with me.

It seems to me that discipline and optimism are fairly orthogonal mindsets in software engineering culture at large. I remind the devs that Microsoft Teams, TikTok, IRCTC and Instagram have shipped better than some of our B2B "Scaling" Engineering teams. In a world lost to chaos, I think the disciplined optimism ethos is basically a competitive advantage in my line of work.

Do Less

I am a low key fan of Auren Hoffman's advice that great things come from focus and not from building optionality. In the design and selection of tasks which our team has picked, this has been my guiding idea.

Almost every Data Science team can be considered to be built to serve two "purposes": Analytics and Product

Analytics teams contribute in two ways:

  1. Inform decision makers
  2. Measure and monitor internal metrics

At Verloop, the Analytics function is purely owned by Product Management. We assist but don't own any outcomes.

We build a reasonable number of ETL and Data Exploration tools for our own use (e.g. a Metabase installation) - which we make available to every Verloop employee, but we don't own the outcomes.

This also has a direct bearing on our team size, scope and skill set: We don't need to hire anyone to handle your common churn, forecasting or similar insights problems. Everyone is a competent, contributing NLP Engineer.

In Hindsight

In the last 18 months, our team has grown from 1 engineer (me) to 6 engineers. When I joined, ML was a blocker for the wider org with both latency and performance challenges, which quickly compounded because of legacy code and engineering exits.

We were behind the curve where Machine Learning was seen as a cost center.

Today, almost 18 months later, we're almost definitely ahead of the curve in terms of shipping. In the best case scenario, we can also become a profit center in as early as 6-12 months.

Similarly, there is still a lot of room to improve our ability to explore quickly and prototype production-grade software faster. We can shrink this from the present 6-8 weeks to 1-3 weeks in the best case scenarios.

A large part of the impact comes from our excellent customer support team, product and engineering. Machine learning is a mere amplifier of what they already do well.

This has been one of the most fulfilling and hard things I have done.

If you are considering a career in Data Science, I hope this helps you see beauty and effort beyond our love for data and ever increasing technical intricacies.

Till we meet again,

Nirant

ML Model Monitoring

Mayank asked on Twitter:

Some ideas/papers/tools on monitoring models in production. A use case would be, say, a classification task over large inputs. I want to visualise how the predicted values or even confidence scores vary over time. (paraphrased)

Quick Hacks

pandas-profiling

If you are logging confidence scores, you can begin there. The quickest hack is to visualize with pandas-profiling: https://github.com/pandas-profiling/pandas-profiling/

Rolling means

Calculate rolling aggregates (e.g. mean, variance) of your confidence scores. This is built into pandas and quite quick to do. Add them to your set of monitoring and alerting product metrics.

A better version of this would be to do it on cohort level. Actually, doing all the following analysis on cohort level makes sense.
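Here is a minimal sketch of rolling aggregates per cohort, written in plain Python so that it runs without dependencies. The cohort names, scores and window size are made up for illustration. With pandas, `Series.rolling(window).mean()` and `.var()` cover the same ground (pandas defaults to the sample variance, while this sketch uses the population variance).

```python
from collections import defaultdict, deque
from statistics import fmean, pvariance


def rolling_stats(scores, window):
    """Return (mean, variance) over the last `window` scores, once the window is full."""
    recent = deque(maxlen=window)
    out = []
    for score in scores:
        recent.append(score)
        if len(recent) == window:
            out.append((fmean(recent), pvariance(recent)))
    return out


def rolling_stats_by_cohort(records, window):
    """Group (cohort, score) records in arrival order, then roll within each cohort."""
    by_cohort = defaultdict(list)
    for cohort, score in records:
        by_cohort[cohort].append(score)
    return {cohort: rolling_stats(scores, window) for cohort, scores in by_cohort.items()}


# Hypothetical prediction log: (cohort, confidence score), oldest first.
records = [
    ("web", 0.9), ("app", 0.6), ("web", 0.8), ("app", 0.5),
    ("web", 0.4), ("app", 0.7), ("web", 0.3),
]
stats = rolling_stats_by_cohort(records, window=2)
web_means = [round(mean, 2) for mean, _ in stats["web"]]
print(web_means)  # the "web" cohort's rolling mean is falling
```

A falling rolling mean in one cohort, while the others stay flat, is the kind of signal that an overall average would hide.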

Confidence Scores and Thresholds

One of the most common mistakes is to use static threshold(s) on a confidence score(s).

If you hear someone say that they do not use thresholds for a classification problem, stop and think. They are using a threshold, usually the 0.5 default from within the ML library they are using.

This is sub-optimal. The better option would be to use a holdout validation set and determine the threshold from that.
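As a rough sketch of what "determine the threshold from a holdout set" can look like for a binary classifier: try each observed score as a candidate threshold and keep the one with the best F1. Choosing F1 as the metric is my assumption here; use whatever metric your product cares about. The scores and labels are made up.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the positive class when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_threshold(scores, labels):
    """Try every observed score as a candidate threshold and keep the best F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical holdout set: confidence for the positive class, and the true labels.
val_scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10]
val_labels = [1, 1, 1, 0, 1, 0, 0, 0]

threshold = pick_threshold(val_scores, val_labels)
default_f1 = f1_at_threshold(val_scores, val_labels, 0.5)
tuned_f1 = f1_at_threshold(val_scores, val_labels, threshold)
print(threshold, round(default_f1, 3), round(tuned_f1, 3))
```

On this toy data the tuned threshold lands below 0.5 and beats the library default, which is the whole point of not trusting the default.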

Tagging Data

It is obvious that you will tag the predictions for which the model is least confident -- so that the model can learn.

What you should also do is this:

  • Find samples which have high confidence and tag them first. This is a form of negative sample mining

  • For multi-class classification: Find samples which did not clear your threshold, but for which the prediction is correct. Add these back to your new training+validation set

  • Tag samples which are too close to the threshold. This will help you understand your model and dataset's margin of separation better
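The buckets above can be sketched as simple filters over logged predictions. This is only an illustration: the sample ids, the `margin` and `top_k` knobs, and the label names are hypothetical, not something from our production setup.

```python
def tagging_queues(predictions, threshold, margin=0.05, top_k=2):
    """Split logged (sample_id, confidence) pairs into tagging buckets.

    `margin` and `top_k` are illustrative knobs; tune them to your tagging budget.
    """
    by_conf = sorted(predictions, key=lambda p: p[1])
    return {
        "least_confident": [sid for sid, _ in by_conf[:top_k]],
        "most_confident": [sid for sid, _ in by_conf[-top_k:][::-1]],
        "near_threshold": [sid for sid, c in predictions if abs(c - threshold) <= margin],
    }


def rescued(tagged, threshold):
    """Below-threshold samples whose prediction turned out to be correct once tagged.

    `tagged` holds (sample_id, confidence, predicted_label, true_label) tuples.
    """
    return [sid for sid, conf, pred, true in tagged if conf < threshold and pred == true]


preds = [("a", 0.99), ("b", 0.72), ("c", 0.68), ("d", 0.31), ("e", 0.12), ("f", 0.97)]
queues = tagging_queues(preds, threshold=0.70)

# After a human has tagged "c" and "d":
tagged = [("c", 0.68, "billing", "billing"), ("d", 0.31, "refund", "billing")]
add_back = rescued(tagged, threshold=0.70)
print(queues, add_back)
```

The `add_back` list is what goes into the new training+validation set for the multi-class case.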

Training-Serving

The most common cause of trouble in production ML models is training-serving skew, i.e. differences between what the model saw in training and what it sees when serving.

The differences can be on 3 levels: Data, Features, Predictions

Data Differences

Data differences can be of several types. The most frequent are these:

  • Schema change: someone dropped a column!
  • Class Distribution Change: when did this 10% training class start getting 20% of predictions?
  • Data Input Drift: users have started typing instead of copy-pasting!

Schema skew (from Google's ML Guide)

Training and serving input data do not conform to the same schema. The format of the serving data changes while your model continues to train on old data.

Solution? Use the same schema to validate training and serving data. Ensure you separately check for statistics not checked by your schema, such as the fraction of missing values.
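A minimal sketch of running one schema against both training and serving batches, plus a missing-value statistic that the schema alone would not catch. The column names and rows are hypothetical, and this is plain Python rather than any particular validation library's API.

```python
# Hypothetical schema, shared by the training and serving pipelines.
EXPECTED_SCHEMA = {"text": str, "channel": str, "num_tokens": int}


def schema_errors(rows, schema):
    """Return human-readable schema violations for a batch of dict rows."""
    errors = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        extra = row.keys() - schema.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            errors.append(f"row {i}: unexpected columns {sorted(extra)}")
        for col, expected in schema.items():
            value = row.get(col)
            if value is not None and not isinstance(value, expected):
                found = type(value).__name__
                errors.append(f"row {i}: {col} is {found}, expected {expected.__name__}")
    return errors


def missing_fraction(rows, col):
    """Fraction of rows where `col` is absent or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(col) is None) / len(rows)


train_rows = [
    {"text": "hi", "channel": "web", "num_tokens": 1},
    {"text": "refund please", "channel": "app", "num_tokens": 2},
]
serving_rows = [
    {"text": "hi", "channel": "web", "num_tokens": 1},
    {"text": "help", "num_tokens": "1"},  # dropped column, and a str where int is expected
    {"text": None, "channel": "app", "num_tokens": 1},  # passes the schema, but text is missing
]

train_errors = schema_errors(train_rows, EXPECTED_SCHEMA)
serving_errors = schema_errors(serving_rows, EXPECTED_SCHEMA)
print(serving_errors, missing_fraction(serving_rows, "text"))
```

Note that the third serving row raises no schema error at all, which is why the missing-value fraction needs its own check.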

Class Distribution check with Great Expectations

Training and serving input data should conform to the same class frequency distribution. Confirm this. If not, update the model by training with updated class frequency distribution.

For monitoring these first two, check out: https://github.com/great-expectations/great_expectations
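If you want to see the class distribution check without any tooling, here is a small sketch that compares training label frequencies against served prediction frequencies. Using total variation distance and the 0.05 alert level are my assumptions for illustration; this is not the Great Expectations API.

```python
from collections import Counter


def class_distribution(labels):
    """Map each class to its share of the labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def total_variation(p, q):
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


# Hypothetical data: "other" is 10% of training labels but 20% of served predictions.
train_labels = ["billing"] * 6 + ["refund"] * 3 + ["other"] * 1
served_preds = ["billing"] * 5 + ["refund"] * 3 + ["other"] * 2

drift = total_variation(class_distribution(train_labels), class_distribution(served_preds))
ALERT_AT = 0.05  # hypothetical alert level, tune on your own history
should_alert = drift > ALERT_AT
print(round(drift, 3), should_alert)
```

Run this on a schedule over a recent window of predictions, ideally per cohort, and alert when the distance crosses your chosen level.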

For understanding data drift, you need to visualize the data itself. This is too data-domain specific (e.g. text, audio, image). More often than not, it is just as good to visualize features or vectors.

Feature Viz for Monitoring

Almost all models for high dimensional data (images or text) vectorize data. I am using features and vectorized embedding as loosely synonymous here.

Let's take text as an example:

Class Level with umap

Use any dimensionality reduction like PCA or umap (https://github.com/lmcinnes/umap) for your feature space. Notice that these are on class level.

[Image: umap-tweet-plots]

Plot similar plots for both training and test, and see if they have similar distributions.

Prediction Viz for Monitoring

Here you can get lazy, but I'd still recommend that you build data-domain specific explainers

Sample Level with LIME

Consider this for text:

[Image: lime-viz]

Check out other black box ML explainers: https://lilianweng.github.io/lil-log/2017/08/01/how-to-explain-the-prediction-of-a-machine-learning-model.html by the amazing @lilianweng

Class Level

You can aggregate your predictions across multiple samples on a class level:

[Image: agg-lime-viz]

Training Data Checks

Expanding on @aerinykim's tweet

Robustness

Adding in-domain noise or perturbations should not materially change either model training or inference.
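A small sketch of the inference half of this check: perturb each input with noise that is plausible for the domain and count how often the prediction flips. The `toy_model` is a stand-in keyword classifier and the perturbations are examples for chat text. Both are hypothetical, so swap in your own model and noise.

```python
def toy_model(text):
    """Stand-in classifier (hypothetical); replace with your model's predict function."""
    tokens = text.lower().split()
    return "refund" if "refund" in tokens else "other"


def perturbations(text):
    """In-domain noise for chat text: casing, stray whitespace, trailing punctuation."""
    return [text.upper(), "  " + text + "  ", text.replace(" ", "  "), text + "!!"]


def flip_rate(model, texts):
    """Share of perturbed inputs whose prediction differs from the clean input."""
    flips = 0
    total = 0
    for text in texts:
        clean = model(text)
        for noisy in perturbations(text):
            total += 1
            flips += model(noisy) != clean
    return flips / total


texts = ["i want a refund", "where is my order"]
rate = flip_rate(toy_model, texts)
print(rate)  # 1 of 8 perturbed inputs flips: "refund!!" no longer matches the keyword
```

Here the check exposes that the toy model is brittle to trailing punctuation. For the training half, the analogous check is to retrain on the perturbed data and compare validation metrics.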

Citations and Resources

[1] Machine Learning Testing in Production: https://developers.google.com/machine-learning/testing-debugging/pipeline/production

[2] Recommended by DJ Patil as "Spot On, Excellent": http://www.unofficialgoogledatascience.com/2016/10/practical-advice-for-analysis-of-large.html

[3] Practical NLP by Ameisen: https://bit.ly/nlp-insight. The images for umap, LIME, and aggregated LIME are all from nlp-insight

[4] Machine Learning: The High-Interest Credit Card of Technical Debt: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf