With great ML comes great responsibility: 5 key model development questions


The rapid growth in machine learning (ML) capabilities has led to an explosion in its use. Natural language processing and computer vision models that seemed far-fetched a decade ago are now commonly used across multiple industries. We can make models that generate high-quality complex images from never-before-seen prompts, deliver cohesive textual responses from just a simple initial seed, or even carry out fully coherent conversations. And it’s likely we are just scratching the surface.

Yet as these models grow in capability and their use becomes widespread, we need to be mindful of their unintended and potentially harmful consequences. For example, a model that predicts creditworthiness must not discriminate against certain demographics. Nor should an ML-based search engine return image results of only a single demographic when a user searches for pictures of leaders and CEOs.

Responsible ML, also called responsible AI, is a set of practices to avoid these pitfalls and ensure that ML-based systems deliver on their intent while guarding against unintended or harmful consequences. At its core, responsible AI requires reflection and vigilance throughout the model development process to ensure you achieve the right outcome.

To get you started, we’ve listed out a set of key questions to ask yourself during the model development process. Thinking through these prompts and addressing the concerns that come from them is core to building responsible AI.

1. Is my chosen ML system the best fit for this task?

While there is a temptation to go for the most powerful end-to-end automated solution, sometimes that may not be the right fit for the task. There are tradeoffs that need to be considered. For example, while deep learning models with a massive number of parameters have a high capacity for learning complex tasks, they are far more challenging to explain and understand relative to a simple linear model where it’s easier to map the impact of inputs to outputs. Hence when measuring for model bias or when working to make a model more transparent for users, a linear model can be a great fit if it has sufficient capacity for your task at hand. 

Additionally, if your model’s outputs carry some level of uncertainty, it will likely be better to keep a human in the loop rather than move to full automation. In this structure, instead of producing a single output or prediction, the model produces a richer result (e.g., multiple ranked options or confidence scores) and then defers to a human to make the final call. This shields against outlier or unpredictable results, which can be important for sensitive tasks (e.g., patient diagnosis).
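As a rough illustration of the human-in-the-loop structure described above, here is a minimal Python sketch. The `route_prediction` function, the `Decision` type and the 0.90 threshold are all hypothetical names and values chosen for this example; a real threshold would have to be tuned for the task at hand.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical cutoff: below this confidence the model defers to a person.
REVIEW_THRESHOLD = 0.90


@dataclass
class Decision:
    label: Optional[str]            # None when the model defers
    confidence: float               # probability of the top-ranked option
    needs_human_review: bool
    ranked_options: List[Tuple[str, float]]  # all options, most likely first


def route_prediction(class_probs: Dict[str, float],
                     threshold: float = REVIEW_THRESHOLD) -> Decision:
    """Auto-accept confident predictions; send the rest to a human reviewer."""
    ranked = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_prob = ranked[0]
    if top_prob >= threshold:
        return Decision(top_label, top_prob, False, ranked)
    # Not confident enough: return the ranked options and let a human decide.
    return Decision(None, top_prob, True, ranked)
```

In this sketch the reviewer still sees the ranked options and their scores, so the model informs the decision without making it.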

2. Am I collecting representative data (and am I collecting it in a responsible way)?

To guard against situations where your model treats certain demographic groups unfairly, it’s important to start with training data that is as representative and free of bias as possible. For example, a model trained to improve image quality should use a training data set that reflects users of all skin tones to ensure that it works well across the full user base. Analyzing the raw data set can be a useful way to find and correct for these biases early on.
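One simple way to start analyzing a raw data set is to count how often each group appears. The sketch below assumes each record is a dictionary with a group attribute; the `representation_report` function, the `skin_tone` field used in the usage note and the 10% minimum share are illustrative assumptions, not a standard.

```python
from collections import Counter


def representation_report(records, group_key, min_share=0.10):
    """Summarize how often each group appears in a data set.

    `min_share` is a hypothetical cutoff below which a group is flagged
    as underrepresented; choose it based on your own user base.
    """
    counts = Counter(record[group_key] for record in records)
    total = sum(counts.values())
    return {
        group: {
            "count": n,
            "share": n / total,
            "underrepresented": (n / total) < min_share,
        }
        for group, n in counts.items()
    }
```

Calling `representation_report(records, "skin_tone")` on a list of labeled examples would flag any group that makes up less than 10% of the data, which is a prompt to collect more examples rather than a proof of fairness.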

Beyond the data itself, its source matters as well. Data used for model training should be collected with user consent, so that users understand that their information is being collected and how it is used. Labeling of the data should also be completed in an ethical way. Often datasets are labeled by manual raters who are paid marginal amounts, and then the data is used to train a model which generates significant profit relative to what the raters were paid in the first place. Responsible practices ensure a more equitable wage for raters.

3. Do I (and do my users) understand how the ML system works?

With complex ML systems containing millions of parameters, it becomes significantly more difficult to understand how a particular input maps to the model outputs. This increases the likelihood of unpredictable and potentially harmful behavior. The ideal mitigation is to choose the simplest possible model that achieves the task. If the model is still complex, it’s important to do a robust set of sensitivity tests to prepare for unexpected contexts in the field. Then, to ensure that your users actually understand the implications of the system they are using, it is critical to implement explainable AI in order to illustrate how model predictions are generated in a manner which does not require technical expertise. If an explanation is not feasible (e.g. reveals trade secrets), offer other paths for feedback so that users can at least contest or have input in future decisions if they do not agree with the results.
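A sensitivity test can be as simple as nudging one input at a time and recording how much the output moves. The sketch below is one possible form of such a test, assuming a model exposed as a function over a dictionary of numeric features; `toy_score` is a made-up stand-in for a real model, and the 5% step is an arbitrary choice.

```python
def sensitivity(predict, features, rel_step=0.05):
    """Change in model output when each feature is increased by `rel_step`.

    `predict` is any callable that maps a dict of numeric features to a score.
    """
    baseline = predict(features)
    deltas = {}
    for name, value in features.items():
        perturbed = dict(features)          # copy so the input is not mutated
        perturbed[name] = value * (1 + rel_step)
        deltas[name] = predict(perturbed) - baseline
    return deltas


def toy_score(features):
    """Hypothetical stand-in for a real model: a simple linear score."""
    return 0.5 * features["income"] - 2.0 * features["debt"]
```

Features whose small perturbations cause large swings in the output are the ones worth probing further before the model meets unexpected contexts in the field.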

4. Have I appropriately tested my model?

To ensure your model performs as expected, there is no substitute for testing. With respect to issues of fairness, the key factor to test is whether your model performs well across all groups within your user base, ensuring there is no intersectional unfairness in model outputs. This means collecting (and keeping up to date) a gold standard test set that accurately reflects your user base, and regularly doing research and getting feedback from all types of users.
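To make the idea of testing across groups concrete, the following sketch computes accuracy for every combination of group attributes in a test set, so that intersections (not just single attributes) are checked. The field names `label` and `prediction` and both function names are assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_group(examples, group_keys):
    """Accuracy for each combination of the attributes in `group_keys`.

    Each example is a dict holding the group attributes plus
    'label' (ground truth) and 'prediction' (model output).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for example in examples:
        group = tuple(example[key] for key in group_keys)
        total[group] += 1
        correct[group] += int(example["prediction"] == example["label"])
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group_accuracy):
    """Difference between the best- and worst-served groups."""
    values = list(per_group_accuracy.values())
    return max(values) - min(values)
```

Accuracy is only one possible metric here; the same grouping could be applied to false positive rates or any other measure that matters for the task.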

5. Do I have the right monitoring in production?

Model development does not end at deployment. ML models require continuous monitoring and retraining throughout their entire lifecycle. This guards against risks such as data drift, where the data distribution in production starts to differ from the data set the model was initially trained on, causing unexpected and potentially harmful predictions. A best practice is to use a model performance management platform to set automated alerts on model performance in production, helping you respond at the first sign of deviation and perform root-cause analysis to understand what is driving the drift. Critically, your monitoring needs to segment across different groups within your user base to ensure that performance is maintained across all users.
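One common way to quantify data drift for a single numeric feature is the population stability index (PSI), which compares the binned distribution of training data against production data. The sketch below is a minimal standard-library version, not the method of any particular monitoring platform; the bin count, the smoothing constant and the alert threshold are assumptions (0.25 is a frequently quoted rule of thumb, not a universal standard).

```python
import math

# Hypothetical alert level; a PSI above roughly 0.25 is often read as major drift.
PSI_ALERT_THRESHOLD = 0.25


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a training sample (`expected`) and a production sample (`actual`)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0   # fall back to 1.0 if all values are equal

    def binned_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)   # clamp out-of-range values
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = binned_shares(expected), binned_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_alert(expected, actual, threshold=PSI_ALERT_THRESHOLD):
    """True when the production sample has drifted past the threshold."""
    return population_stability_index(expected, actual) > threshold
```

To follow the advice about segmenting, the same check can be run separately on the production data for each user group, so that drift affecting only one group is not averaged away.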

By asking yourself these questions, you can better incorporate responsible AI practices into your MLOps lifecycle. Machine learning is still in its early stages, so it’s important to continue to seek out and learn more; the items listed here are just a starting point on your path to responsible AI.

Krishnaram Kenthapadi is the chief scientist at Fiddler AI.