

ML Model Monitoring

Mayank asked on Twitter:

Some ideas/papers/tools on monitoring models in production. A use case would be, say, a classification task over large inputs. I want to visualise how the predicted values, or even the confidence scores, vary over time. (paraphrased)

Quick Hacks


If you are logging confidence scores, you can begin there. The quickest hack is to visualize with pandas-profiling:

Rolling means

Calculate rolling aggregates (e.g. mean, variance) of your confidence scores. pandas has these built in, and they are quick to compute. Add them to your set of monitoring and alerting product metrics.

A better version of this would be to do it at the cohort level. In fact, it makes sense to do all of the following analysis at the cohort level.
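Here is a minimal sketch of the rolling aggregates, first over all traffic and then per cohort. The log layout (`timestamp`, `cohort`, `confidence` columns), the cohort names, and the 7-row window are assumptions made up for illustration; swap in whatever your prediction log actually contains.

```python
import numpy as np
import pandas as pd

# Hypothetical prediction log: one row per prediction.
rng = np.random.default_rng(0)
log = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=200, freq="D"),
    "cohort": np.tile(["web", "mobile"], 100),
    "confidence": rng.uniform(0.5, 1.0, size=200),
}).set_index("timestamp")

# Rolling mean and variance over the last 7 rows, across all traffic.
rolling = log["confidence"].rolling(window=7)
overall = pd.DataFrame({"mean": rolling.mean(), "var": rolling.var()})

# The same rolling mean, computed separately inside each cohort.
per_cohort = (
    log.groupby("cohort")["confidence"]
    .rolling(window=7)
    .mean()
    .rename("rolling_mean")
)
```

`per_cohort` is indexed by (cohort, timestamp), so `per_cohort.loc["web"]` gives the series you would chart or alert on for that cohort.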

Confidence Scores and Thresholds

One of the most common mistakes is to use a static threshold on confidence scores.

If you hear someone say that they do not use thresholds for a classification problem, stop and think. They are using a threshold, usually 0.5, set inside the ML library that they are using.

This is sub-optimal. The better option would be to use a holdout validation set and determine the threshold from that.
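As a rough sketch of picking the threshold from a holdout set: sweep candidate thresholds and keep the one with the best F1 on validation data. The choice of F1 as the metric, the candidate grid, and the toy labels and scores are all assumptions for illustration; use the metric your product actually cares about.

```python
import numpy as np

def pick_threshold(y_true, scores, candidates=None):
    """Return the candidate threshold with the highest F1 on a holdout set."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        pred = scores >= t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Hypothetical holdout labels and confidence scores.
y_val = [0, 0, 0, 1, 0, 1, 1, 1]
s_val = [0.12, 0.22, 0.31, 0.37, 0.41, 0.47, 0.62, 0.91]
threshold, f1 = pick_threshold(y_val, s_val)
```

On this toy data the sweep settles near 0.35 rather than 0.5, which is the whole point: the default is rarely the best cut for your data.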

Tagging Data

It is obvious that you will tag the predictions for which the model is least confident -- so that the model can learn.

What you should also do is this:

  • Find samples which have high confidence and tag them first. This is a form of negative sample mining

  • For multi-class classification: Find samples which did not clear your threshold even though the prediction is correct. Add these back to your new training+validation set

  • Tag samples which are too close to the threshold. This will help you understand your model and dataset's margin of separation better
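The selection rules above can be sketched in a few lines, assuming you have one confidence score per prediction. The function name, `k`, and the sample scores are hypothetical. The multi-class rule (below threshold but correct) is left out because it needs ground-truth labels.

```python
import numpy as np

def pick_for_tagging(confidence, threshold, k=2):
    """Return indices of samples worth tagging, grouped by reason."""
    confidence = np.asarray(confidence, dtype=float)
    order = np.argsort(confidence)  # ascending confidence
    return {
        "least_confident": order[:k].tolist(),
        "most_confident": order[::-1][:k].tolist(),
        "near_threshold": np.argsort(np.abs(confidence - threshold))[:k].tolist(),
    }

# Hypothetical confidence scores for seven predictions.
conf = [0.97, 0.55, 0.12, 0.72, 0.69, 0.99, 0.30]
picks = pick_for_tagging(conf, threshold=0.7, k=2)
```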


The most common cause of trouble in production ML models is training-serving skew, i.e. differences between what the model saw in training and what it sees when serving.

The differences can be on 3 levels: Data, Features, Predictions

Data Differences

Data differences can be of several types. The most frequent are these: schema change (someone dropped a column!), class distribution change (when did this 10% training class start getting 20% of predictions?), and data input drift (users have started typing instead of copy-pasting!).

Schema skew (from Google's ML Guide)

Training and serving input data do not conform to the same schema. The format of the serving data changes while your model continues to train on old data.

Solution? Use the same schema to validate training and serving data. Separately check the statistics that your schema does not cover, such as the fraction of missing values.
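A minimal sketch of such a check with plain pandas, assuming both training and serving data can be loaded as DataFrames. The function name, the report keys, and the toy frames are made up for illustration; a dedicated validation tool would do more.

```python
import pandas as pd

def schema_report(train: pd.DataFrame, serving: pd.DataFrame) -> dict:
    """Compare column names, dtypes, and missing-value fractions."""
    train_cols, serving_cols = set(train.columns), set(serving.columns)
    shared = sorted(train_cols & serving_cols)
    return {
        "missing_columns": sorted(train_cols - serving_cols),
        "unexpected_columns": sorted(serving_cols - train_cols),
        "dtype_mismatches": [
            c for c in shared if train[c].dtype != serving[c].dtype
        ],
        # A statistic a plain schema does not cover: fraction of missing values.
        "missing_fraction_delta": {
            c: float(serving[c].isna().mean() - train[c].isna().mean())
            for c in shared
        },
    }

# Hypothetical data: serving lost the "length" column and has a missing text.
train = pd.DataFrame({"text": ["a", "b", "c", "d"], "length": [1, 2, 3, 4]})
serving = pd.DataFrame({"text": ["a", None, "c", "d"]})
report = schema_report(train, serving)
```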

Class Distribution check with Great Expectations

Training and serving input data should conform to the same class frequency distribution. Confirm this. If not, update the model by training with updated class frequency distribution.
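To make the class frequency check concrete, here is a sketch in plain pandas. It is not the Great Expectations API, just the underlying comparison. The class names, the counts, and the 5-point alert cutoff are assumptions for illustration.

```python
import pandas as pd

def class_share_shift(train_labels, served_predictions):
    """Per-class difference between served and training class shares."""
    train_share = pd.Series(train_labels).value_counts(normalize=True)
    served_share = pd.Series(served_predictions).value_counts(normalize=True)
    # Classes absent on one side count as a share of 0.
    return served_share.sub(train_share, fill_value=0.0).sort_index()

# Hypothetical labels: a 10% training class now gets 20% of predictions.
train_labels = ["spam"] * 10 + ["ham"] * 90
served = ["spam"] * 20 + ["ham"] * 80
shift = class_share_shift(train_labels, served)
alerts = shift[shift.abs() > 0.05]
```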

For monitoring these first two, check out:

For understanding data drift, you need to visualize the data itself. This is too data-domain specific (e.g. text, audio, image). More often than not, it is just as good to visualize features or vectors.

Feature Viz for Monitoring

Almost all models for high dimensional data (images or text) vectorize data. I am using features and vectorized embedding as loosely synonymous here.

Let's take text as an example:

Class Level with umap

Use any dimensionality reduction technique, like PCA or umap, on your feature space. Notice that these are on class level.


Make the same plots for both training and test data, and see if they have similar distributions.
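Here is a sketch of that comparison using PCA done with NumPy's SVD (umap would slot in the same way). The axes are fit on training features only and both sets are projected onto them, so the two scatter plots share coordinates. The synthetic 16-dimensional features and the centroid-gap summary are assumptions added for illustration.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Return the mean and top principal axes of X (rows are samples)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, axes):
    """Project samples onto the fitted principal axes."""
    return (X - mean) @ axes.T

def class_centroids(points, labels):
    """Mean 2-D position of each class."""
    return {c: points[labels == c].mean(axis=0) for c in np.unique(labels)}

rng = np.random.default_rng(0)
# Hypothetical 16-dimensional features for two classes.
y_train = np.repeat([0, 1], 100)
X_train = rng.normal(size=(200, 16)) + 3.0 * y_train[:, None]
y_test = np.repeat([0, 1], 50)
X_test = rng.normal(size=(100, 16)) + 3.0 * y_test[:, None]

mean, axes = fit_pca(X_train)
train_2d = project(X_train, mean, axes)
test_2d = project(X_test, mean, axes)

train_c = class_centroids(train_2d, y_train)
test_c = class_centroids(test_2d, y_test)
# Distance between the train and test centroid of each class.
centroid_gap = {c: float(np.linalg.norm(train_c[c] - test_c[c])) for c in train_c}
```

Scatter `train_2d` and `test_2d` coloured by class to eyeball the distributions. A large `centroid_gap` for a class is a cheap numeric hint that its features have moved.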

Prediction Viz for Monitoring

Here you can get lazy, but I'd still recommend that you build data-domain specific explainers

Sample Level with LIME

Consider this for text:


Check out other black box ML explainers: by the amazing @lilianweng

Class Level

You can aggregate your predictions across multiple samples on a class level:
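One way to do this aggregation, as a sketch: assume each sample's explanation is a list of (word, weight) pairs, which is the kind of output a LIME-style text explainer produces, and average the weights per word across samples of a class. The function name, the class "refund", and the weights are made up for illustration.

```python
from collections import defaultdict

def aggregate_word_weights(explanations, top_k=3):
    """Average per-word weights across the samples of one class."""
    totals, counts = defaultdict(float), defaultdict(int)
    for word_weights in explanations:
        for word, weight in word_weights:
            totals[word] += weight
            counts[word] += 1
    means = {w: totals[w] / counts[w] for w in totals}
    # Rank by absolute weight so strong negative words also surface.
    return sorted(means.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]

# Hypothetical per-sample explanations for the class "refund".
explanations = [
    [("refund", 0.5), ("please", 0.1)],
    [("refund", 0.25), ("order", -0.25)],
    [("order", -0.25), ("please", 0.1)],
]
top = aggregate_word_weights(explanations, top_k=2)
```

Averaging only over the samples where a word appears is a design choice. Dividing by the total number of samples instead would favour frequent words.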


Training Data Checks

Expanding on @aerinykim's tweet


Adding in-domain noise or perturbations should change neither model training nor inference.
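A sketch of the inference half of this check: perturb the inputs slightly and measure how often the predictions stay the same. Gaussian noise on numeric features is an assumption here; for text or images, "in-domain" noise would be things like typos or small crops. The toy model is a hypothetical stand-in for your own `predict` function.

```python
import numpy as np

def prediction_stability(predict, X, noise_scale=0.01, n_trials=5, seed=0):
    """Fraction of predictions that stay the same under small Gaussian noise."""
    rng = np.random.default_rng(seed)
    base = predict(X)
    agree = [
        np.mean(predict(X + rng.normal(scale=noise_scale, size=X.shape)) == base)
        for _ in range(n_trials)
    ]
    return float(np.mean(agree))

# Hypothetical stand-in model: predicts 1 when the feature sum is positive.
def toy_predict(X):
    return (X.sum(axis=1) > 0).astype(int)

X = np.array([[2.0, 1.0], [-3.0, 0.5], [4.0, -1.0], [-1.0, -2.0]])
stability = prediction_stability(toy_predict, X)
```

A stability well below 1.0 at noise levels your users produce naturally is worth investigating before it shows up in production.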

Citations and Resources

[1] Machine Learning Testing in Production:

[2] Recommended by DJ Patil as "Spot On, Excellent":

[3] Practical NLP by Ameisen: The images for umap, LIME, and aggregated LIME are all from nlp-insight

[4] Machine Learning: The High-Interest Credit Card of Technical Debt:

The Silent Rise of PyTorch Ecosystem

While TensorFlow has made peace with Keras as its high level API and MXNet now supports Gluon, PyTorch is the bare matrix love.

PyTorch has seen rapid adoption in academia and all the industrial labs that I have spoken to as well. One of the reasons people (especially engineers doing experiments) like PyTorch is the ease of debugging.

What I don’t like about PyTorch is its incessant requirement of debugging because of inconsistent dimensions problems. In fact, one of the most recommended speed hacks for faster development is: assert tensor shapes!

This is something which Keras abstracts out really well. Additionally, PyTorch has no high level abstraction which picks good defaults for most common problems.
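A sketch of what asserting shapes can look like. The helper `assert_shape` is hypothetical, not a PyTorch function. It only reads `.shape`, so it should work the same on a `torch.Tensor`; NumPy arrays are used here just to keep the example dependency-light.

```python
import numpy as np

def assert_shape(x, expected, name="tensor"):
    """Raise if x.shape does not match expected; None matches any size."""
    actual = tuple(x.shape)
    ok = len(actual) == len(expected) and all(
        e is None or a == e for a, e in zip(actual, expected)
    )
    assert ok, f"{name}: expected shape {expected}, got {actual}"
    return x

# Hypothetical batch laid out as (batch, sequence, embedding).
batch = np.zeros((32, 10, 300))
assert_shape(batch, (None, 10, 300), name="embedded batch")
# Mean-pool over the sequence axis and check the result in the same line.
pooled = assert_shape(batch.mean(axis=1), (None, 300), name="pooled")
```

Returning `x` lets you wrap any intermediate expression in place, so the check sits right where the shape bug would otherwise appear.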

This leads us to the observation that there are three niche problems unsolved in the PyTorch ecosystem:

Unsolved Problems

  • General Purpose Abstraction: Over PyTorch similar to Keras or tf.learn
  • Adoption: Something to help traditional ML practitioners adopt PyTorch more easily
  • Production Use Cases: Something which allows engineers to take PyTorch code as-is to production, or port it to Caffe2 with minimal effort. I like Gluon for this: it has no community support, but it is backed by both MSFT and AWS.

A few specialized efforts, like AllenAI’s NLP library (built for NLP) or PyTorch’s torchvision and torchtext, are domain specific instead of a generic abstraction similar to Keras. They deserve their own discussion space, separate from here.

The Better Alternatives

fastai has outrageously good defaults for both vision and NLP. It has several amazing implementations of Cyclic Learning Rate, learning rate schedulers, and data augmentation, along with decent API design, interesting dataloaders, and most important: it is extremely extensible!

It has seen some rather great adoption among Kagglers and beginners alike for faster experimentation. It is also helped by their amazing MOOC.


Ignite helps you write compact but full-featured training loops in a few lines of code. It is fairly extensible, and results in a lot of compact code. There is no peeking under the hood. This is the best contender for Keras for PyTorch power users.

I do not know of any power users of Ignite, despite its elegant design. Nor have I seen its adoption in the wild.

PTL: PyTorch-Lightning

Built by folks over at NYU and FAIR, Lightning gives you the skeleton to flesh out your experiments. It is the best contender to Keras for researchers. The built-in mixed precision support (via apex) and distributed training are definitely helpful.

The biggest value add, I guess, will be explicit decisions, all in one class, instead of the scattered pieces we see with PyTorch. Yay Reproducibility!

The lib is still very new, and that shows in its lack of adoption, but it is getting a lot of stars in its first week of launch!

Check out the detailed comparison between Lightning and Ignite from the creator of Lightning


skorch is attacking the adoption problem above: bringing ML people to Deep Learning.

skorch is a scikit-learn style wrapper (with metrics and pipelines support!) for PyTorch by a commercial entity invested in its adoption. It is being developed fairly actively (most recent master commit is less than 15 days old) and marching to v1.

Summary

  • fastai: researchers, rapid iterators like Kagglers
  • skorch: welcome people coming from more traditional Machine Learning backgrounds
  • PyTorch Lightning: custom built for DL experts looking for experimentation tooling

Ending Note: What are you using for deep learning experiments? Have you seen the light with PyTorch, or are you still going with TensorFlow? Tell me @nirantk

How to prepare for a Data Science job from college?

A Getting Started Guide

Let us get our facts straight, shall we?

I am writing from my non-existent but probably relevant experience. I worked in a Machine Learning role at Samsung Research, Bengaluru. It is one of only 4 research enterprises which hire Machine Learning researchers from Indian colleges, the others being Microsoft, Xerox, and IBM Watson.

I am now in an even more Computer Vision focused role for a small enterprise tech company. Here are some pointers:

Forget the courses

I am from BITS Pilani, Pilani Campus. College courses and even a lot of popular MOOCs are mostly useless in getting a Machine Learning or Data Science role. At most colleges they don’t offer enough of a learning curve, neither in theory nor in programming skills.

Build a project worth noting

Have you done any decent Machine Learning projects? What is the largest data size that you have handled? What is the most complex data set that you have handled? How important to society was the problem that you applied Machine Learning to? If you don’t have good answers to these questions, participate in Kaggle competitions and hackathons.

Intern in your summers

Use summer and semester internship programs to work in a Machine Learning or Data Science role. I did my semester internship at a startup and skipped Amazon, against a lot of prevailing (and probably correct) wisdom at the time. I was grilled on my intern project in my campus interview.

Share your results

Share like a madman: write a Medium blog, put your code on GitHub, and get a paper published. It is easier (and more tedious) than most people think. My friend’s first paper was in the reputed Springer Lecture Notes in Computer Science. He did not get any guidance from any Professor.

Demo or Die

Build projects that live on the web, or that can be demo'd using a video or something similar. Essentially, build a portfolio that you can showcase to potential recruiters. I walked into an interview with a video of my previous project on my phone.

LinkedIn India hires as Software Engineers but allows you to grow into a Data Science role. Microsoft Research is among the best research organisations in Computer Science in India. I’d love to work there.

Organisations like IBM Research and Xerox tend to prefer Masters and PhD students over plain undergraduates. You might want to bring that to the table. A Masters in CS can also give you the time to polish your Machine Learning portfolio.

The simplified formula to get to a Data Science role is this: Build, build more, share and sell

A 2016 version of this is available on Medium