Monitoring Machine Learning: Interview with Oren Razon

This is the third post in a 3-part blog series about monitoring machine learning models in production. In this post I had the opportunity to interview Oren Razon, Co-Founder and CTO of superwise.ai. We discussed the challenges of operationalizing machine learning, the different types of model drift that can occur, and who owns ML monitoring in organizations. Be sure to check out Part 1 and Part 2 of the series.

In a previous post we introduced the topic of monitoring machine learning models in production. We established that machine learning applications have specific monitoring needs distinct from those of traditional software applications because of the non-deterministic nature of ML. This non-determinism, as well as the complex infrastructure required to operate and support deployed models, requires novel monitoring and testing solutions.

But despite its importance, most companies deploying ML today are just beginning to think about monitoring. Rather than plan monitoring needs ahead of time, companies deploying ML typically create ad-hoc monitoring solutions after they run into problems. While some companies plan ahead by automating procedures like model retraining, even these steps can be insufficient when guarding against production issues. According to Oren Razon: “Retraining ‘in the dark’ is not enough. Once there’s data drift or a performance incident, you need to be able to investigate the underlying change, and understand what actually happened.”

Oren Razon is the Co-Founder and CTO of superwise.ai, a startup building a machine learning monitoring platform. Before co-founding the company, Oren led ML activities at Intel and ran a machine learning consultancy helping organizations across industries like finance, marketing, and gaming build and deploy machine learning applications. I had the chance to interview Oren about the challenges of monitoring machine learning in industry today.

Interview with Oren Razon

1. What are the main challenges you see in organizations seeking to operationalize their machine learning efforts?

There’s a consensus among ML professionals today that there’s a gap in the industry. As more and more models are moving to the deployment or production phase, ML professionals lack a clear and practical understanding of the KPIs necessary to monitor the health and performance of their models.

But metrics are not the only missing piece in the MLOps puzzle. Monitoring ML in production also requires bridging an organizational gap: one that exists between the needs of the data science teams for clear KPIs, and the needs of the operational/business teams for more visibility into, and better understanding of, the processes that lead to predictions being made. This gap can only be bridged with a more encompassing view, not solely through the eyes of the data science teams, especially if companies want to scale their ML activities.

At the end of the day, the main challenges I’ve witnessed are around these two gaps: the technical one that focuses on the needs of the data science teams, and the organizational one that’s about ownership and the question of whose responsibility it is to manage the AI and its results in the organization. Because once the models go live, they have a life of their own that’s closer to the business and the operations.

For instance, for one of our customers, a productivity app leader, the value of monitoring impacts their marketing operations as much as it impacts their data science teams. As we progress with more use cases, the operational team is learning how to leverage AI assurance as a way to gain more value through better granularity: by being alerted on overlooked segments or on a model’s weak spots. The data science team benefits by being able to focus on their models, rather than being pulled into firefighting tasks trying to troubleshoot them.

2. Of the companies that succeed in operationalizing machine learning, how many of them implemented monitoring solutions? And do you notice any correlations between when teams begin planning for monitoring and their levels of success? For instance, if a team starts planning for monitoring at the project planning stage, do they succeed more often than teams that planned for monitoring only after they had initially deployed solutions?

I’m sorry to say that I have yet to see anyone planning it in advance.

Some of the teams I’ve worked with have decided NOT to deploy to production at the last minute once they realized they were about to “lose control”. Those who go ahead with deployment usually create ad-hoc local solutions without any dedicated planning or strategic vision. They just address a specific need for their use case without being too forward thinking.

In the last piece I published with ML in Production, I mention some of those specific steps and how they fail to provide a view of the big picture. It can be about focusing only on performance metrics, overlooking specific segments due to a lack of granularity, or simply believing that automatic retraining is enough; the list is long.

Although these steps are sometimes necessary, they aren’t sufficient to assure that models have the impact they were designed for. Take the example of GPT-3 that’s been extensively discussed in the past few weeks. Even the most sophisticated models need to be monitored when applied in specific use cases to ensure that value is provided over time.

I do believe that as more teams move to the production phase and as conversations about the best way to manage AI and MLOps continue, monitoring will become an integral part of the production planning process, and a critical measure of success.

3. Based on your experiences, what opportunities did you see that influenced your decision to create an ML monitoring platform?

Identifying the opportunity started with an empirical observation: models break in production, and much more often than people think.

Machine learning models depend on the data they’re fed, and data can change for various reasons. We usually collect data from multiple sources, both from within and outside the organization, and these sources are unaware of the downstream models that rely on them. This makes it highly probable that critical changes can occur in some part of the data feeding the model at any given time.

Once we observed the vulnerability caused by this reliance on data sources, it was easy to see that it would be an issue for companies across all verticals and sizes. Even though monitoring practices and needs differ across use cases, the essence of what should be done is quite general.

As AI becomes more ubiquitous and organizations look to scale their use of AI, there’s a clear need for tools that help data science teams be data-driven and help the operational teams generate more value. We have customers telling us that they were operating in the dark until they deployed a monitoring solution that helps them scale their AI activities. That’s the main takeaway – the opportunity is to help shape the operationalization of AI at scale.

4. From my own experience as a practitioner, I know there are very few monitoring solutions available today, both in terms of open source tools and 3rd party vendors. For the companies you worked with that did invest in monitoring, how did they do this? I imagine you saw a lot of hack-y, ad-hoc solutions.

These ad-hoc solutions are usually planned, executed, and maintained by the DS team. They can range from a set of scripts to dashboards that leverage BI systems but focus only on a very small set of KPIs. Paradoxically, ad-hoc solutions are always very high level: they describe few KPIs, take a very broad view (without any segmentation), and can’t trigger smart alerts that notify users when selected parameters deviate.

There’s also the pitfall of treating ML monitoring as just another monitoring problem. Some of our customers initially chose to adopt a regular alerting system, only to realize that it didn’t give them the insights they needed to optimize their ML operations. Such alerts just add noise: a flock of alarms triggered by naive thresholds that don’t consider the nature and the temporality of the data the model is working on. For example, triggering an alert whenever a feature has missing values, even though the feature naturally has 0-20% missing values depending on the day and on the specific distribution of the data. The lack of aggregation of the different alerts into a single cohesive health status only adds to the inefficiency.
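To make the temporality point concrete, here is a minimal sketch (not superwise’s implementation) contrasting a naive fixed threshold with a rule that learns what is normal for a given weekday. The `history` DataFrame, its columns, and the function names are assumptions made for the illustration.

```python
import pandas as pd


def naive_missing_alert(today_missing_ratio: float, threshold: float = 0.05) -> bool:
    """Naive rule: alert whenever the missing-value ratio crosses a fixed threshold."""
    return today_missing_ratio > threshold


def temporal_missing_alert(today_missing_ratio: float,
                           history: pd.DataFrame,
                           weekday: int) -> bool:
    """Time-aware rule: alert only when today's ratio is unusual for this weekday.

    `history` is assumed to hold one row per past day with columns
    "weekday" (0-6) and "missing_ratio" (that day's share of missing values).
    """
    same_weekday = history.loc[history["weekday"] == weekday, "missing_ratio"]
    upper = same_weekday.quantile(0.99)  # e.g. anywhere from 0% to 20% can be normal
    return today_missing_ratio > upper
```

The naive rule fires on every seasonal spike; the time-aware rule only fires when the ratio is extreme relative to that weekday’s own history.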

5. The monitoring use-case that’s most familiar is that of detecting drift. How do you think about detecting drift?

It’s interesting, because usually people think of “drift” as a long, slow, and gradual change. But actually there are many different types of drift.

Here are the most common ones: Gradual, Sudden, Recurrent (or seasonal), and Blips.

[Figure: Different types of model drift]

It’s important to recognize them, because different statistical processes are needed to detect each drift type, and more importantly, each type usually implies a different possible root cause. For example, “blips” are usually due to technical issues, while gradual drifts classically occur due to external changes. Understanding the root cause of the drift is the first and most important step to detect and correct it.
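As an illustration only, a common starting point for detection is to compare each incoming window of a feature against a training-time reference with a two-sample test, and to require several drifted windows in a row before calling it sustained drift rather than a blip. This is a generic sketch, not superwise’s approach; the windowing scheme and function names below are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp


def window_drifted(reference: np.ndarray, window: np.ndarray, alpha: float = 0.01) -> bool:
    """True if the window's values differ significantly from the reference distribution."""
    _, p_value = ks_2samp(reference, window)
    return p_value < alpha


def classify_drift(reference: np.ndarray, windows: list, patience: int = 3) -> str:
    """Rough heuristic: one isolated drifted window looks like a blip; several in a row
    look like sudden or gradual drift and call for a root-cause investigation."""
    flags = [window_drifted(reference, w) for w in windows]
    streak = 0
    for drifted in flags:
        streak = streak + 1 if drifted else 0
        if streak >= patience:
            return "sustained drift"
    return "blip" if any(flags) else "stable"
```

Recurrent or seasonal drift would need a baseline that is itself seasonal (for example, comparing against the same weekday or month), which is exactly why distinguishing the drift types matters.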

6. One debate in the ML tooling space is end-to-end platforms vs specialized components. Do you think that monitoring will become a built-in part of any AI/ML platform?

There’s a need for both.

We’ve already seen platform players like AWS SageMaker and DataRobot invest in features around monitoring. These platform offerings usually serve the needs of small and medium businesses (SMBs) that are looking for a one-stop shop. Larger data science teams and enterprises usually prefer a more “best of breed” type of approach.

But as the ML deployment and serving ecosystem becomes more fragmented, organizations need a truly platform-agnostic solution that specializes in monitoring and isn’t tied to a specific deployment vendor, to avoid blind spots.

Additionally, as compliance requirements ramp up in the coming years, we’re even more convinced that organizations will need a 3rd party monitoring tool to act as an “objective” player watching over the deployed solution.

7. Infrastructure challenges aside, what are your thoughts on using machine learning methods to monitor other machine learning models? I’ve written that as companies operationalize more and more models, effective monitoring won’t be possible without utilizing advanced statistical techniques. Do you agree or disagree?

Personally I totally agree, and that’s exactly what we do at superwise. We call it the AI for AI.

Extracting the relevant KPIs at the right granularity is one part of the solution. But having too many KPIs doesn’t make sense either, as that could distract users from what’s really important. Moreover, due to the nature of ML data (many features, each with its own temporality), it’s almost impossible to manually configure alerts.
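To illustrate why manual configuration doesn’t scale, here is a minimal sketch of deriving alert bounds for every feature automatically from its own history rather than hand-tuning thresholds. It is a rough stand-in, not superwise’s method; the DataFrame layout and parameter values are assumed.

```python
import pandas as pd


def auto_bounds(history: pd.DataFrame, k: float = 4.0) -> pd.DataFrame:
    """Per-feature robust bounds (median +/- k * MAD), learned from each feature's history.

    `history` is assumed to hold one row per day and one column per feature statistic.
    """
    median = history.median()
    mad = (history - median).abs().median()
    return pd.DataFrame({"lower": median - k * mad, "upper": median + k * mad})


def flag_anomalies(today: pd.Series, bounds: pd.DataFrame) -> list:
    """Return the names of features whose value today falls outside their learned bounds."""
    outside = (today < bounds["lower"]) | (today > bounds["upper"])
    return list(outside[outside].index)
```

The point is that the bounds come from the data itself and are recomputed as history grows, so adding the hundredth feature doesn’t mean hand-tuning a hundredth threshold.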

That’s why we’ve invested so much time to embed our domain expertise together with a wide set of ML capabilities, to automatically extract and detect insights as well as alert on different types of anomalies. Our system can detect issues that may have big negative impacts, and do so before it’s too late.

8. You mentioned the organizational gap earlier. In terms of organizational process, do you observe any practices emerging around who “owns” monitoring ML models in a company?

Yes, ownership is clearly a gap in the operationalization of AI. And because we’re still in the very early days of the production ML era, there aren’t clear processes or best practices in place. However, we do see a growing trend in organizations that have more mature AI activities, especially in cases, like fraud detection, where the AI system makes critical business decisions.

In such verticals, existing or emerging operational teams take responsibility for the health of the AI decision-making process. These teams consist mainly of business and data analyst personas. They are usually highly connected to the business stakeholders and responsible for the health of the entire business process that the AI model supports. At the same time, these people are still relatively technical and data-focused, in the sense that they can understand and analyze the nature of the data that runs through the system.

9. What’s the role of the compliance teams in this stage of the life cycle?

In highly regulated industries such as health or finance, compliance aspects are part of moving models to production. For instance, I’m thinking about the Fair Credit Act for underwriting. We know that elements that enable us to assure fairness and manage risks are essential.

For less regulated industries, compliance teams are much less invested in AI governance, as the regulations are non-prescriptive when it comes to principles of transparency – such as GDPR, CCPA, or the Algorithmic Accountability Act.

Conclusion

What are your thoughts on monitoring machine learning systems? Whether you’ve built your own monitoring solutions or experienced monitoring challenges, I’d love to hear from you! Share your experiences by commenting below!
