Estimating Performance of Regression Models Without Ground-Truth

Sounds like magic, but it’s actually quite simple.

Michał Oleszak
12 min read · Sep 30, 2022


Deploying a machine learning model to production is just the first step in the model’s lifecycle. After the go-live, we need to continuously monitor the model’s performance to make sure the quality of its predictions stays high. This is relatively simple if we know the ground-truth labels of the incoming data. Without them, the task becomes much more challenging. In one of my previous articles, I’ve shown how to monitor classification models in the absence of ground truth. This time, we’ll see how to do it for regression tasks.

Performance monitoring, again

You might have heard me use this metaphor before, but let me repeat it once again since I find it quite illustrative. Just like in financial investment, the past performance of a machine learning system is no guarantee of future results. The quality of machine learning models in production tends to deteriorate over time, mainly because of data drift. That’s why it is essential to have a system in place for monitoring the performance of live models.

Monitoring with known ground-truth

In cases when the ground-truth labels are available, performance monitoring is no rocket science. Think about demand prediction. Each day, your system predicts the sales for the next day. Once the next day is over, you have the actual ground-truth to compare against the model’s predictions. This way, with just a 24-hour delay, you can calculate whichever performance metrics you deem important, such as the Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE). Should the model’s quality start to deteriorate, you will be alerted on the following day.
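To make the demand prediction example concrete, here is a minimal sketch of such a daily check. The sales figures, the reference MAE, and the 20% tolerance are made-up assumptions for illustration, and should_alert is a hypothetical helper rather than part of any library.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean Absolute Error between actuals and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root Mean Squared Error between actuals and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def should_alert(metric_value, reference_value, tolerance=0.2):
    """Hypothetical rule: alert when the error exceeds the reference
    level (e.g. the error measured at deployment) by more than `tolerance`."""
    return metric_value > reference_value * (1 + tolerance)


# Made-up numbers: predictions issued yesterday for three products,
# and the actual sales observed once the day was over.
yesterdays_predictions = [100.0, 150.0, 80.0]
actual_sales = [110.0, 140.0, 95.0]

daily_mae = mae(actual_sales, yesterdays_predictions)
daily_rmse = rmse(actual_sales, yesterdays_predictions)

# Assumed reference MAE, e.g. measured on a held-out set before go-live.
reference_mae = 8.0
print(f"MAE: {daily_mae:.2f}, RMSE: {daily_rmse:.2f}")
print("Alert!" if should_alert(daily_mae, reference_mae) else "All good.")
```

Running a check like this once a day, as soon as the actuals arrive, is all it takes in this setting. The alerting rule is deliberately simplistic; in practice, you would pick a threshold that reflects how much degradation your use case can tolerate.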

Monitoring with direct feedback

In other scenarios, we might not observe the ground-truth directly, but we instead receive some other form of feedback on the model’s performance. Performance monitoring is then still a relatively easy task. Think about a recommender system that you use to suggest to your users the content they would like best. In this case, you don’t know whether each user enjoyed…