A Practitioner’s Guide to Monitoring Machine Learning Applications

How to prioritize your monitoring efforts while avoiding alert fatigue

Lina Weichbrodt
6 min read · Oct 26, 2022

The machine learning monitoring landscape is evolving fast. You may be tempted to use the latest tool and hope that it works out of the box. However, this could lead to receiving many false alerts or missing issues.

I present an easy-to-implement prioritization approach that you can use with either your own backend monitoring tools or a vendor monitoring tool. It is based on more than 30 large-scale models I have run in production over the last ten years.

Note: As the image below shows, machine learning monitoring should be added on top of typical backend monitoring. If you are an engineer or data scientist without production experience, read the gentle introduction to backend monitoring first.

Machine Learning monitoring is added on top of traditional backend monitoring

Traditional Software Monitoring is not Sufficient for Machine Learning Applications

Google’s paper on Machine Learning Stack Evaluation shows the higher complexity of ML-Ops.

When you apply only traditional backend monitoring to machine learning applications, you will experience silent failures. These failures have a massive negative impact on the quality of your application’s responses, which in turn hurts user experience or the company’s revenue.

Some examples of silent failures I’ve personally observed are:

  • Changes in input data: The client changed the unit in a fraud model from sec to msec.
  • Business rules are overly aggressive: A new filter rule is unexpectedly aggressive. A rule works well when it is created but is too aggressive during the sale season.
  • Bugs in our code: Confusing “get last ten orders” with “get last ten bought articles”.
  • Model performance: The model was automatically trained and released, but its performance was worse.
  • Dependency updates: We got a faulty version of TensorFlow because we failed to pin the version.
  • Client changes how the product works without notifying us: The wishlist no longer requires users to be logged in.

These examples are by no means exhaustive, but they do highlight the need for additional monitoring. To make matters worse, many of these bugs are permanent, degrading our performance by as much as 10 or 15%, while only massive problems (>50% worse) will be detected by users or stakeholders.

How to prioritize monitoring for impact and avoid alert fatigue

“We introduced a machine learning observability tool and now we get several alerts per week that an input field’s distribution changed. The reasons are mostly upstream business changes or unexplained changes in the input data. We did not take any action on the alerts.”
Source: Data scientist working at a market-leading financing platform

Alerts that don’t prompt any clear actions will be ignored in the not-so-distant future. Many tools and vendors offer input data monitoring as a major component. While input data monitoring is valuable, I advise against it as a first step or a standalone measure.

Instead, I advocate taking a page out of site reliability engineering’s book and recommend focusing on customer impact-based metrics. You prioritize backward from the output:

  1. Evaluation Metrics in Production and Stakeholder Concerns
  2. Service Response
  3. Model Predictions
  4. Calculated Features
  5. Input Data

Symptom-based monitoring: prioritize backward from output

I will cover the top priorities in this article. For a longer form of this article with all the steps, watch my PyData talk.

1. Measure Evaluation Metrics in Production

For some machine learning applications, you get to know the true value of your prediction, usually with a delay. For example, if you predict the delivery time of food, you can compare your prediction to the actual observed value once the food arrives. The metrics are then calculated over many examples, and you can compare them to the metrics measured on historical data during model development.

To monitor the evaluation metrics in production, take the following steps:

  • Store the prediction for each request and later the observed actual value.
  • Run a job that joins predictions and actual values and calculates the same metrics used during model training and evaluation. Schedule the job every 10 minutes, hourly, or daily; shorter intervals give you faster detection.
  • Add metrics to a dashboard and create an alert (see the article Backend Monitoring Basics).

Dashboard with real-time recall and precision for a loan rejection prediction model at DKB bank.
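
The steps above can be sketched in a few lines of Python. This is a minimal illustration for the food delivery example, not a production job. The `Prediction` class, the function names, the in-memory storage, and the 20% tolerance are all assumptions made for the sketch. In practice the predictions and actual values would live in a database or log store, and the resulting metric would be pushed to your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """A stored prediction (hypothetical schema for this sketch)."""
    request_id: str
    predicted_minutes: float


def join_predictions_with_actuals(predictions, actuals):
    """Pair each stored prediction with its observed value, if it has arrived."""
    return [
        (p.predicted_minutes, actuals[p.request_id])
        for p in predictions
        if p.request_id in actuals
    ]


def mean_absolute_error(pairs):
    """Same metric as in offline evaluation, computed on production pairs."""
    if not pairs:
        return None  # no actual values yet, nothing to report
    return sum(abs(pred - actual) for pred, actual in pairs) / len(pairs)


def is_degraded(production_mae, offline_mae, tolerance=0.2):
    """True if production MAE is more than `tolerance` worse than offline MAE."""
    return production_mae is not None and production_mae > offline_mae * (1 + tolerance)


predictions = [
    Prediction("r1", 30.0),
    Prediction("r2", 45.0),
    Prediction("r3", 20.0),  # food has not arrived yet, so no actual value
]
actuals = {"r1": 35.0, "r2": 40.0}  # observed delivery times in minutes

pairs = join_predictions_with_actuals(predictions, actuals)
production_mae = mean_absolute_error(pairs)
should_alert = is_degraded(production_mae, offline_mae=4.0)
print(production_mae, should_alert)
```

Requests without an observed value are skipped rather than counted as errors, so the metric only reflects predictions that can already be checked.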

2. Monitor the distribution of your response value

You should also monitor your application’s response distribution. The response is the return value after all postprocessing steps and business rules. For classification models, this can be a prediction score. For regression models, it is also a numerical value.

The response value is an excellent proxy for quality monitoring. Unlike evaluation metrics, it does not measure how well the model fits its target function. However, it does change when the quality goes down (e.g., an aggressive filter removes many good quality predictions or an important input variable changes the output score drastically).

Measuring the response distribution offers many significant benefits:

  • available in real-time which allows fast detection, e.g. during deployment
  • easy to collect compared to the actual outcome or downstream user interactions like clicks
  • superior detection power since it has less statistical noise, e.g. a click rate reduction is only detectable for huge problems while shifts in the response score are detectable for smaller changes like 5 or 10%

How to collect the response value:

  • Record the returned score(s) of each request to a histogram metric, e.g. as a Prometheus histogram or using the metrics library of your choice. Monitor the whole traffic as well as important segments by adding labels to your Prometheus metrics (e.g. the country, customer segments, Android/iOS/browser).
  • Display the histograms over time. Pick a quantile like 80% to 95% depending on how much traffic you have and how stable it is. If you prefer to compare the whole distribution instead of one quantile, you can use distribution comparison metrics like Google’s D1 metric.
  • Alert on the metric.
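
To make the idea concrete, here is a small in-memory sketch of the collection steps above. It is a stand-in for a real histogram metric, which you would normally record with your metrics library and query from your monitoring system. The `ResponseMonitor` class, the segment names, the nearest-rank quantile, and the 10% threshold are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


class ResponseMonitor:
    """Collects returned scores per segment for one time window (illustrative)."""

    def __init__(self):
        self.scores = defaultdict(list)

    def record(self, score, segment):
        # Track the whole traffic under "all" and the segment separately.
        self.scores["all"].append(score)
        self.scores[segment].append(score)

    def quantile(self, percent, segment="all"):
        """Nearest-rank quantile of the recorded scores, or None if empty."""
        values = sorted(self.scores.get(segment, []))
        if not values:
            return None
        rank = math.ceil(percent * len(values) / 100)
        return values[max(rank, 1) - 1]


def has_shifted(current, baseline, max_relative_change=0.1):
    """True if the quantile moved by more than `max_relative_change`."""
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > max_relative_change


def fill(monitor, scores):
    # Alternate segments only to have labeled example traffic.
    for i, score in enumerate(scores):
        monitor.record(score, segment="android" if i % 2 == 0 else "ios")


baseline = ResponseMonitor()  # e.g. the same window one week ago
fill(baseline, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

current = ResponseMonitor()  # e.g. the window right after a deployment
fill(current, [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.6, 0.7])

baseline_p90 = baseline.quantile(90)
current_p90 = current.quantile(90)
alert = has_shifted(current_p90, baseline_p90)
print(baseline_p90, current_p90, alert)
```

Comparing against a baseline window is only one option. A fixed threshold on the quantile also works if your traffic is stable.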

Bonus tip: Monitor negative user experiences in a separate metric, e.g. your service returns empty, a low certainty response, or a fallback. Brainstorm proxies for your use case. Alert on the percentage of bad responses. Downside monitoring is very important. Don’t wait until your stakeholders or users notify you.
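
A downside metric like this can be as simple as the sketch below. The response fields (`items`, `certainty`, `is_fallback`) and both thresholds are hypothetical. Replace them with whatever proxies for a bad experience fit your use case.

```python
def is_bad_response(response, min_certainty=0.5):
    """Hypothetical proxies for a negative user experience."""
    is_empty = not response.get("items")
    is_fallback = bool(response.get("is_fallback", False))
    is_uncertain = response.get("certainty", 1.0) < min_certainty
    return is_empty or is_fallback or is_uncertain


def bad_response_rate(responses):
    """Share of responses in the window that count as bad."""
    if not responses:
        return 0.0
    return sum(is_bad_response(r) for r in responses) / len(responses)


responses = [
    {"items": ["a", "b"], "certainty": 0.9},
    {"items": [], "certainty": 0.9},                          # empty response
    {"items": ["c"], "certainty": 0.2},                       # low certainty
    {"items": ["d"], "certainty": 0.8, "is_fallback": True},  # fallback served
    {"items": ["e"], "certainty": 0.7},
]

rate = bad_response_rate(responses)
should_alert = rate > 0.25  # threshold is an assumption, tune it to your traffic
print(rate, should_alert)
```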

The Limit of Today’s Machine Learning Monitoring

It is worth mentioning that today’s machine learning monitoring methods will not alert you to every individual bad prediction. They work on the whole traffic or on segments instead. If you run anything where even a single failure is potentially catastrophic, like health-related predictions, consider measures such as easy-to-find objection mechanisms for end users, partial automation over full automation, and humans in the loop.

Key Takeaways

  • Prioritize monitoring output metrics (user impact!) like response monitoring and evaluation metrics in production.
  • You can use your company’s existing backend monitoring tools to get started and invest in a vendor tool when you are at the maturity where you need it.

Did I forget anything? Leave your comment below or drop me a line at lina@linaweichbrodt.com.

Thank you to @bocytko, Eval Simpson, and Vlad Minzatu for their great reviews.

About the author

Hi, I am Lina Weichbrodt. I am a machine learning consultant with 10+ years of experience developing scalable machine learning models for millions of users and running them in production. Say hi at lina@linaweichbrodt.com