# Estimating Performance of Regression Models Without Ground-Truth

## Sounds like magic, but it’s actually quite simple.

Deploying a machine learning model to production is just the first step in the model’s lifecycle. After the go-live, we need to continuously monitor the model’s performance to make sure the quality of its predictions stays high. This is relatively simple if we know the ground-truth labels of the incoming data. Without them, the task becomes much more challenging. In one of my previous articles, I’ve shown how to monitor classification models in the absence of ground truth. This time, we’ll see how to do it for regression tasks.

# Performance monitoring, again

You might have heard me use this metaphor before, but let me repeat it once again since I find it quite illustrative. Just like in financial investment, the past performance of a machine learning system is no guarantee of future results. The quality of machine learning models in production tends to deteriorate over time, mainly because of data drift. That’s why it is essential to have a system in place for monitoring the performance of live models.

## Monitoring with known ground-truth

In cases when the ground-truth labels are available, performance monitoring is no rocket science. Think about demand prediction. Each day, your system predicts the sales for the next day. Once the next day is over, you have the actual ground-truth to compare against the model’s predictions. This way, with just a 24-hour delay, you can calculate whichever performance metrics you deem important, such as the Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE). Should the model’s quality start to deteriorate, you will be alerted on the following day.
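
As a minimal sketch of this daily check, here is how both metrics could be computed in plain Python. The sales figures are made up for illustration, and the helper names `mae` and `rmse` are my own, not taken from any library.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more heavily."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Illustrative numbers only: earlier forecasts vs. the sales that materialized.
predicted_sales = [100.0, 150.0, 200.0, 250.0]
actual_sales = [110.0, 140.0, 200.0, 270.0]

daily_mae = mae(actual_sales, predicted_sales)
daily_rmse = rmse(actual_sales, predicted_sales)
print(f"MAE: {daily_mae:.2f}, RMSE: {daily_rmse:.2f}")
```

Tracking these two numbers day by day is all the monitoring needed when targets arrive with a short delay.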

## Monitoring with direct feedback

In other scenarios, we might not observe the ground-truth directly, but we receive some other form of feedback on the model’s performance instead. Performance monitoring is then still a relatively easy task. Think about a recommender system that you use to suggest to your users the content they would like best. Here, you don’t know whether each user enjoyed each piece of content that was suggested to them. Actually, measuring the concept of enjoyment is quite a challenge on its own. But you can easily monitor how often the users consume the suggested content. If this frequency stays constant over time, the model’s quality is likely stable. As soon as the model breaks, you can expect the users to start ignoring the suggested content more often.

## Monitoring without ground-truth: classification

Then, there are situations when ground truth is not available at all, or at least not for a very long time. In one of my previous projects, my team and I were predicting users’ locations to present them with relevant marketing offers that they could take advantage of on the spot. Some business metrics could be computed based on how often the users found the offers interesting, but there were no ground-truth targets for the model: we never actually *knew* where each user was.

Performance monitoring in this scenario has long been a challenge. Recently, a clever method called Confidence-Based Performance Estimation (CBPE) for classification tasks has been proposed by NannyML, an open-source library for post-deployment data science. Based on the assumption that the classifier is calibrated, it delivers reliable performance metrics even when no ground-truth labels are available. I have explained CBPE in detail in a previous article — do check it out if you have missed it.

## Monitoring without ground-truth: regression

Unfortunately, the CBPE approach is specific to classification problems. It works thanks to the fact that calibrated classifiers provide a probability distribution (or a predictive posterior, as a Bayesian would have it). In other words, we know all possible outcomes and the associated probabilities. Most commonly used regression models don’t provide such insights. That’s why regression problems require a different approach.

# Direct Loss Estimation

In order to use CBPE for regression, the question to be answered is how to obtain the probability distribution of the prediction in a regression task. An obvious approach that comes to mind is to use Bayesian methods. NannyML’s developers have tested this approach but it turned out it had some convergence issues and, as is typical for Bayesian posterior sampling, it took a long time to produce the results. Finally, one would be limited to only using Bayesian models should one want to do performance estimation, which is quite restrictive. Luckily, they came up with a much simpler, faster, and more reliable approach which they dubbed Direct Loss Estimation or DLE.

## The DLE algorithm

The DLE method is brilliant in its simplicity. It boils down to training another model, referred to as a *nanny model*, to predict the loss of the model being monitored, or the *child model*. If it brings gradient boosting to your mind, you are quite right. The idea is similar, but with one twist. The nanny model predicts the *loss* of the child model, rather than its *error* as boosting methods would. It will become clear why this is the case shortly. But first, let’s go through the algorithm step by step.

We need three subsets of data. First, there is *training data*, on which the child model is trained. Then, there is *reference data*, which will be used for training the nanny model. Both training and reference data have targets available. Finally, there is *analysis data*, which is the data fed into the child model in production. For these, there are no targets.

First, we train the child model on the training data. The child can be any model solving a regression task. You can think of linear regression, gradient-boosted decision trees, and whatnot.

Then, we pass reference features to the child model to get the predictions for the reference set.

Next, we train the nanny model. It can be any regression model. In fact, it can even be the same type of model as the child, e.g. linear regression or gradient-boosted decision trees. As training features, we pass it the features from the reference set as well as the child’s predictions for the reference set. The target is the child’s loss on the reference set, which can be expressed as the absolute or squared error, for instance. Notice that as a result, the nanny model is able to predict the child’s loss based on its predictions themselves and the features used to generate them.

Once the analysis features are available in production, they are passed to our child model and we receive the predictions. We would like to know how good these predictions are, but there are no targets to compare them against.

In the final step, we pass the child’s predictions for the analysis data, together with the analysis features, to the nanny model. What we obtain as output is the predicted loss for the analysis data. Notice that even though we don’t know `y_analysis`, we can predict the monitored model’s loss. Magic!
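
The five steps above can be sketched end to end with nothing but NumPy. This is an illustration under my own assumptions, not NannyML’s implementation: the data are simulated, both models are plain least-squares regressions, and the helpers `make_data`, `fit_linear` and `predict_linear` are hypothetical names. The analysis targets exist here only because the data are synthetic, which lets us check the estimate at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Simulated data: linear signal, noise that grows with the feature."""
    x = rng.uniform(0, 10, size=n)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.5 + 0.5 * x)
    return x.reshape(-1, 1), y

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

X_train, y_train = make_data(2000)
X_ref, y_ref = make_data(2000)
X_analysis, y_analysis = make_data(2000)  # targets unknown in production

# Step 1: train the child model on the training data.
child = fit_linear(X_train, y_train)

# Step 2: get the child's predictions for the reference set.
y_pred_ref = predict_linear(child, X_ref)

# Step 3: train the nanny on reference features plus child predictions,
# with the child's absolute error as the target. For a linear child the
# prediction column is redundant; lstsq copes with that collinearity.
nanny_X_ref = np.column_stack([X_ref, y_pred_ref])
nanny = fit_linear(nanny_X_ref, np.abs(y_ref - y_pred_ref))

# Step 4: in production, the child predicts on the analysis features.
y_pred_analysis = predict_linear(child, X_analysis)

# Step 5: the nanny predicts the child's loss for each analysis row.
nanny_X_analysis = np.column_stack([X_analysis, y_pred_analysis])
estimated_loss = predict_linear(nanny, nanny_X_analysis)

estimated_mae = estimated_loss.mean()
# Only possible here because the data are simulated:
realized_mae = np.abs(y_analysis - y_pred_analysis).mean()
print(f"estimated MAE: {estimated_mae:.3f}, realized MAE: {realized_mae:.3f}")
```

Averaging the per-row predicted absolute losses gives the estimated MAE for the analysis data, and the simulated targets confirm it lands close to the realized value.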

All the steps of the algorithm are pretty straightforward, with the possible exception of the third one, training the nanny model. How come this model is able to accurately predict the child’s loss? And how is it possible that the nanny doesn’t need to be more complex than the child to predict it? Let’s find out!

## Why the loss and not the error

The key trick of the DLE approach is the realization that as long as we are using absolute or squared performance metrics such as MAE, MSE, RMSE, and the like, we are not actually interested in the model’s *error* at all, but rather in its *loss*.

The error is simply the difference between the ground-truth target value and the model’s prediction. It is signed: positive when the target is larger than the prediction, and negative when it is smaller. The loss, on the other hand, is unsigned. Absolute loss metrics such as the MAE remove the sign by taking the absolute value of the error, while squared loss metrics such as the MSE or RMSE raise the error to the second power, which always yields a non-negative result.

Predicting the unsigned loss is a lot easier than predicting the signed error. Here is one way to think about it. Being able to accurately predict a model’s error is the same as predicting the ground-truth: since the error is the ground-truth minus the prediction, if we knew the error, we could add it to the prediction and voila, there we have the targets. No need to have two models at all. Usually, however, predicting the error is not possible.

Predicting the loss, on the other hand, only requires guessing *how wrong* the model was, not *in which direction* it was wrong. The error provides more information about the model’s performance than the loss, but this information is not needed for the purpose of RMSE-based or MAE-based performance estimation.
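
To make the distinction concrete, here is a tiny sketch with made-up numbers. It shows that a known error pins down the target exactly, while a known loss leaves two candidates on either side of the prediction.

```python
# Illustrative values: three observations with the same prediction.
y_true = [5.0, -1.0, 2.0]
y_pred = [2.0, 2.0, 2.0]

# Signed error: ground truth minus prediction.
errors = [t - p for t, p in zip(y_true, y_pred)]
# Unsigned losses: the sign is discarded.
abs_losses = [abs(e) for e in errors]
sq_losses = [e ** 2 for e in errors]

# Knowing the error is as good as knowing the target itself.
recovered = [p + e for p, e in zip(y_pred, errors)]

# Knowing only the loss leaves the direction open: two candidate targets.
candidates = [(p - l, p + l) for p, l in zip(y_pred, abs_losses)]

print("errors:    ", errors)
print("abs losses:", abs_losses)
print("recovered: ", recovered)
print("candidates:", candidates)
```

The first two observations have opposite errors but identical losses, which is exactly the information the nanny model is allowed to ignore.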

## Go nanny yourself

As we have said, loss prediction can be achieved by the same type of model as the one originally used. Let’s illustrate how it works with a simple example.

Take a look at these randomly generated data. We have one feature, `x1`, and the target `y`. The data were generated in such a way that there is a linear relationship between the feature and the target, but the larger the feature’s value, the stronger the noise. We can fit a linear regression model to these data and it captures the linear trend pretty well. Notice, however, that the model’s errors are small for small values of `x1` and larger when `x1` is large.

Imagine we would like to predict these *errors* using another linear regression model, based on the predictions (the red regression line) and `x1`. This would not be possible (go ahead and verify it yourself!). For instance, for an example where `x1` is 1, and the model’s prediction is around 2, the ground truth can be as different as 5 or -1. There is no way to predict the error.

Now think about predicting the *loss* instead. Say, the absolute error that feeds into the MAE. It’s as though all the blue dots below the red line had been mirrored above it. This task is pretty simple! The larger `x1`, the higher the loss, and the relationship is linear; a regression line would fit in quite well. If you’re interested in proof, you will find it in this NannyML tutorial, where they actually generate the data and fit two linear regression models to show how the latter can easily predict the former’s loss.
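
If you would like to verify this yourself, here is a minimal NumPy sketch. It is my own simulation rather than the code from the tutorial: the data-generating process, the seed, and the helpers `fit_predict` and `r_squared` are assumptions made for illustration. A second linear regression is fitted twice on `x1` and the child’s predictions, once with the signed error as the target and once with the absolute error.

```python
import numpy as np

rng = np.random.default_rng(42)

# One feature; the noise gets stronger as x1 grows.
x1 = rng.uniform(0, 10, size=2000)
y = 2.0 * x1 + rng.normal(0, 1.0, size=2000) * x1

def fit_predict(features, target):
    """Fit least squares with an intercept and return in-sample predictions."""
    A = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ coef

def r_squared(target, fitted):
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Child model: linear regression of y on x1.
y_pred = fit_predict(x1, y)
error = y - y_pred          # signed
abs_loss = np.abs(error)    # unsigned

# Nanny candidates: same model type, same inputs, different targets.
nanny_features = np.column_stack([x1, y_pred])
r2_error = r_squared(error, fit_predict(nanny_features, error))
r2_loss = r_squared(abs_loss, fit_predict(nanny_features, abs_loss))

print(f"R^2 when predicting the signed error:  {r2_error:.4f}")
print(f"R^2 when predicting the absolute loss: {r2_loss:.4f}")
```

The residuals of a least-squares fit are uncorrelated with its inputs, so the regression on the signed error explains essentially nothing, while the regression on the absolute loss picks up the trend in `x1`.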

To sum up, any regression model can “nanny itself”, that is: the same type of model can be used both as the child model (the one in production, to be monitored) and the nanny model (the one predicting the child’s loss).

## Danger zone: assumptions

You know there are no free lunches, right? Just like most other statistical algorithms, the DLE comes with some assumptions that need to hold for the performance estimation to be reliable.

First, the algorithm works as long as there is no *concept drift*. Concept drift is the data drift’s even-more-evil twin. It’s a change in the relationship between the input features and the target. When it occurs, the data patterns learned by the model are no longer applicable to new data.

DLE only works in the absence of concept drift.
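
A small simulation makes this failure mode visible. It is a sketch under my own assumptions (simulated data, least-squares child and nanny, hypothetical helpers `fit`, `predict` and `sample`): the child and the nanny are fitted while the slope between feature and target is 2, and then the slope silently changes to 3 in production.

```python
import numpy as np

rng = np.random.default_rng(7)

def fit(features, target):
    A = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef

def predict(coef, features):
    return np.column_stack([np.ones(len(features)), features]) @ coef

def sample(n, slope):
    """Same input distribution every time; only the relationship changes."""
    x = rng.uniform(0, 10, size=n)
    return x, slope * x + rng.normal(0, 1.0, size=n)

x_train, y_train = sample(2000, slope=2.0)
x_ref, y_ref = sample(2000, slope=2.0)
x_drift, y_drift = sample(2000, slope=3.0)  # concept drift in production

# Child and nanny both learn from the pre-drift world.
child = fit(x_train, y_train)
ref_pred = predict(child, x_ref)
nanny = fit(np.column_stack([x_ref, ref_pred]), np.abs(y_ref - ref_pred))

# After the drift, the nanny still sees familiar-looking inputs.
drift_pred = predict(child, x_drift)
nanny_inputs = np.column_stack([x_drift, drift_pred])
estimated_mae = predict(nanny, nanny_inputs).mean()
realized_mae = np.abs(y_drift - drift_pred).mean()

print(f"estimated MAE: {estimated_mae:.2f}")
print(f"realized MAE:  {realized_mae:.2f}")
```

The features look exactly as they did in the reference set, so the nanny keeps reporting the old loss level while the realized loss is several times higher.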

Second, predictive models are often more accurate for some combinations of feature values than for others. If this is the case for our child model, then DLE will need enough data in the reference set for the nanny model to learn this pattern.

DLE only works when the nanny has enough data to learn the feature combinations for which the child is more and less accurate.

For example, when predicting house prices based on square footage, say our model works better for small houses (where each square foot adds much to the value) than for very large ones (where the exact square footage is a weaker price driver). We need enough examples of small and large houses of various prices in the reference set so that the nanny model can learn the relationship between the house’s square footage and the loss in the child model’s price prediction.

# DLE with NannyML

Let’s get our hands dirty with some data and models to see how easy it is to perform the DLE with the NannyML package!

For this demonstration, we will be using the Steel Industry Energy Consumption dataset available freely from the UCI Machine Learning Repository. The dataset contains more than 35,000 observations of electricity consumption in a Korean steel-industry company. The electricity consumption in kilowatt-hours (kWh) is measured every 15 minutes for the entire year of 2018. The task is to predict it with a set of explanatory features such as reactive power indicators, the company’s load type, day of the week, and others.

Let’s start with the boring but necessary part: loading and cleaning the data. We will use the first nine months for training, then two months for reference, and treat the last month of December as the analysis set.
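
The split itself can be sketched with the standard library alone. The snippet below does not load the actual dataset; it generates synthetic 15-minute timestamps for 2018 as a stand-in and assigns each one to a subset by month. The helper name `assign_split` is hypothetical, and with the real data you would apply the same month boundaries to the dataset’s own timestamp column.

```python
from datetime import datetime, timedelta

# Stand-in for the dataset's timestamps: one row every 15 minutes in 2018.
start = datetime(2018, 1, 1)
timestamps = [start + timedelta(minutes=15 * i) for i in range(365 * 96)]

def assign_split(ts):
    """January-September: training, October-November: reference, December: analysis."""
    if ts.month <= 9:
        return "training"
    if ts.month <= 11:
        return "reference"
    return "analysis"

splits = {"training": [], "reference": [], "analysis": []}
for ts in timestamps:
    splits[assign_split(ts)].append(ts)

for name, rows in splits.items():
    print(f"{name:<10} {len(rows):>6} rows, {rows[0]} to {rows[-1]}")
```

Splitting by time rather than at random matters here: the analysis set has to come after the reference set, just as production data comes after deployment.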

Now we can fit our child model to the training data and make predictions for the reference and analysis sets. We’ll use linear regression for simplicity.

Finally, we can fit NannyML’s DLE estimator. By default, it will use LightGBM as the nanny model. As arguments, we need to pass the original feature names, the features holding ground-truth and predicted values for the reference set, the metrics we are interested in (let’s go with RMSE), and optionally the time feature, which will be used for plotting.

Once the estimator has been fitted, we can neatly visualize the estimated performance with its `plot` method.

As we can see from the plot, the estimated performance in the analysis period is quite similar to what the model has shown before. Actually, it even depicts a slight improvement trend (recall that the lower the RMSE, the better). Okay, but how good is this performance estimation? Let’s find out!

We can do it quite simply in this case since, in reality, we do have the ground-truth target values for the analysis period — we just didn’t use them. So we can calculate the actual, realized RMSE and plot it against the DLE estimation.
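
To show what such a comparison involves, here is a self-contained NumPy sketch on simulated data. It is not NannyML’s code and makes several assumptions of its own: a least-squares child and nanny, a nanny trained on the squared error, a per-chunk estimate taken as the square root of the mean predicted squared loss, and analysis chunks whose inputs drift toward smaller values so that the true RMSE improves over time.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(x):
    """Linear signal; the noise variance grows linearly with x."""
    return 2.0 * x + rng.normal(0, 1.0, size=x.size) * np.sqrt(0.5 + x)

def fit(features, target):
    A = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef

def predict(coef, features):
    return np.column_stack([np.ones(len(features)), features]) @ coef

x_train = rng.uniform(0, 10, 3000)
y_train = simulate(x_train)
x_ref = rng.uniform(0, 10, 3000)
y_ref = simulate(x_ref)

child = fit(x_train, y_train)
ref_pred = predict(child, x_ref)
# For an RMSE estimate, the nanny's target is the squared error.
nanny = fit(np.column_stack([x_ref, ref_pred]), (y_ref - ref_pred) ** 2)

# Four analysis chunks; the inputs shift toward smaller x over time.
upper_bounds = [10.0, 10.0, 7.0, 4.0]
estimated_rmse, realized_rmse = [], []
for upper in upper_bounds:
    x_chunk = rng.uniform(0, upper, 1000)
    y_chunk = simulate(x_chunk)  # withheld from the nanny, used only to validate
    pred = predict(child, x_chunk)
    est_sq_loss = predict(nanny, np.column_stack([x_chunk, pred]))
    estimated_rmse.append(float(np.sqrt(est_sq_loss.mean())))
    realized_rmse.append(float(np.sqrt(np.mean((y_chunk - pred) ** 2))))

for i, (est, real) in enumerate(zip(estimated_rmse, realized_rmse), start=1):
    print(f"chunk {i}: estimated RMSE {est:.2f}, realized RMSE {real:.2f}")
```

In this simulation the estimate follows the realized RMSE downward in the later chunks, even though the nanny never sees the analysis targets.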

To compute the realized performance, we can use `nannyml`’s `PerformanceCalculator`. The plotting part is slightly more involved, as it requires us to write some glue code, but I expect the package developers to make it easier in future releases.

As we can see, the DLE performance estimation is pretty decent. The algorithm has even correctly predicted the performance improvement at the end of the analysis period.
