Six ways to manage missing data
Missing data is messy. Here are six ways to declutter and clean up your data sets and prepare them for model training.
Data science things to know before you dive in:
Feature – a measurable piece of data that can be used for analysis, such as name, age, sex, or ticket number. Features are also sometimes referred to as “variables” or “attributes” in data sets.
So you just signed with an AI-automated underwriting company like Zest AI, and you’re wondering what happens next. Generally, this is the time when model production really kicks into gear and lending teams want to know what’s going on under the hood. And though it may seem like the reverse order of steps in a project, the first thing that needs to be done in the modeling process is a bit of cleaning up.
Data cleanup, that is. To be accurate, models for AI-based lending have to be trained on good data, so we can’t have data sets with missing or incorrect values. Luckily for us, data scientists have several simple fixes for missing data.
It’s not a “Where’s Waldo” situation when it comes to loan underwriting
Sometimes missing data is just that: missing. Realistically, most datasets contain missing data, incorrectly encoded data, or other data that cannot be used for modeling. For example, a feature may have no actual value at all, or "missing" may be encoded with a special keyword or string. Some common keywords you'd stumble across are NA, N/A, None, and -1.
In order to be able to use this data set, you need to fix those missing fields so they can be used for analysis and modeling.
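As a rough sketch of that first cleanup step, the snippet below turns the keywords mentioned above into a single missing marker (Python's None). The marker list, the function name normalize_missing, and the applicant records are all made up for illustration. In a real data set, -1 may be a legitimate value for some features, so the list should be checked feature by feature.

```python
# Hypothetical set of markers; which strings or numbers mean "missing"
# depends on the data set and should be confirmed per feature.
MISSING_MARKERS = {"", "NA", "N/A", "None", "-1"}


def normalize_missing(rows, markers=MISSING_MARKERS):
    """Return copies of the rows with missing-value markers replaced by None."""
    cleaned = []
    for row in rows:
        cleaned.append({
            name: None if value is None or str(value).strip() in markers else value
            for name, value in row.items()
        })
    return cleaned


# Made-up applicant records, for illustration only.
applicants = [
    {"age": 34, "income": "N/A", "state": "CA"},
    {"age": -1, "income": 52000, "state": "None"},
    {"age": 41, "income": 61000, "state": "TX"},
]

cleaned = normalize_missing(applicants)
for row in cleaned:
    print(row)
```

Once every flavor of "missing" looks the same, the rest of the cleanup only has to handle one case.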
Now, having missing values in your data is not necessarily a setback. In fact, you can probably glean a lot of useful insights from missing values. You might be able to understand more about the data set (or the group of things it's based on) if data is missing, or simply unavailable. But take those insights with a grain of salt: if a feature in the dataset has a really high percentage of missing values, then that feature (just like any other low-variance feature) should be dropped.
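To make "a really high percentage" concrete, here is a minimal sketch that measures the share of missing values per feature and drops the features above a cutoff. The 50% cutoff, the function names, and the sample rows are assumptions for illustration, not a recommended setting; the right threshold depends on the data set.

```python
def missing_fraction(rows, feature):
    """Share of rows in which the feature is None (i.e. missing)."""
    return sum(1 for row in rows if row.get(feature) is None) / len(rows)


def drop_sparse_features(rows, max_missing=0.5):
    """Drop every feature whose missing share exceeds max_missing."""
    features = sorted({name for row in rows for name in row})
    keep = [f for f in features if missing_fraction(rows, f) <= max_missing]
    return [{f: row.get(f) for f in keep} for row in rows]


# Made-up rows; None marks a missing value.
rows = [
    {"age": 34, "income": 48000, "second_phone": None},
    {"age": None, "income": 52000, "second_phone": None},
    {"age": 41, "income": 61000, "second_phone": None},
    {"age": 29, "income": None, "second_phone": "555-0100"},
]

for name in ("age", "income", "second_phone"):
    print(name, missing_fraction(rows, name))

# second_phone is missing in 3 of 4 rows, so it is dropped at a 0.5 cutoff.
reduced = drop_sparse_features(rows, max_missing=0.5)
print(sorted(reduced[0]))
```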
Here are some common ways of dealing with missing data:
- Encode NAs as -1 or -9999. This works reasonably well for numerical features that are predominantly positive in value and, in general, for tree-based models. It was a more common method when out-of-the-box machine learning libraries and algorithms were not very adept at working with missing data.
- Case-wise deletion of missing data. In the case of a very large dataset with very few missing values, this approach could potentially work really well. Here, you simply drop all cases/rows from the dataset that contain missing values. However, if the missing values are in cases that are also otherwise statistically distinct, this method may seriously skew the predictive model for which this data is used. Another major problem is that a model built this way will be unable to process any future data that contains missing values. If your predictive model is designed for production, this could create serious issues in deployment.
- Replace missing values with the mean/median value of the feature in which they occur. This works for numerical features. The choice between mean and median usually depends on the shape of the feature's distribution. For skewed data, the median may be more appropriate, while for symmetrical and more normally distributed data, the mean could be a better choice.
- Label encode NAs as another level of a categorical variable. This works with tree-based models and with other models if the feature can be numerically transformed (one-hot encoding, frequency encoding, etc.). Plain label encoding on its own does not work well with logistic regression, which would treat the integer codes as ordered quantities.
- Run predictive models that impute the missing data. This should be done in conjunction with some kind of cross-validation scheme in order to avoid leakage. This can be very effective and can help with the final model.
- Use the number of missing values in a given row to create a new engineered feature. As mentioned above, missing data can often have lots of useful signal in its own right, and this is a good way to encode that information.
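Several of the approaches above fit in a few lines each. The sketch below uses plain Python and made-up records, with None marking a missing value. All function and field names are hypothetical, and the model-based imputation approach is left out because it needs a full modeling library. Note that the fill value is learned in one step and applied in another: when cross-validating, the first step should only ever see the training rows, which is one simple way to avoid leakage.

```python
from statistics import mean, median

# Made-up records; None marks a missing value.
rows = [
    {"income": 40000, "years_employed": 2, "housing": "rent"},
    {"income": None, "years_employed": 5, "housing": "own"},
    {"income": 55000, "years_employed": None, "housing": None},
    {"income": 250000, "years_employed": 10, "housing": "own"},
]


def drop_incomplete(rows):
    """Case-wise deletion: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fit_fill_value(train_rows, feature, strategy=median):
    """Learn a fill value (median by default) from the training rows only."""
    return strategy([r[feature] for r in train_rows if r[feature] is not None])


def fill_missing(rows, feature, fill_value):
    """Replace missing values of one feature with a fixed value.

    The fixed value can be a sentinel such as -9999, a learned mean or
    median, or an extra category level such as "MISSING".
    """
    return [{**r, feature: fill_value if r[feature] is None else r[feature]} for r in rows]


def add_missing_count(rows):
    """Engineered feature: how many values are missing in each row."""
    return [{**r, "n_missing": sum(v is None for v in r.values())} for r in rows]


with_counts = add_missing_count(rows)  # count before anything is filled in
complete_only = drop_incomplete(rows)

# One large income skews the mean far more than the median.
income_median = fit_fill_value(rows, "income")
income_mean = fit_fill_value(rows, "income", strategy=mean)
print(income_median, income_mean)

filled = fill_missing(rows, "income", income_median)    # median imputation
filled = fill_missing(filled, "housing", "MISSING")     # extra category level
filled = fill_missing(filled, "years_employed", -9999)  # sentinel encoding
for row in filled:
    print(row)
```

In this toy data the single high income pulls the mean well above the median, which is why the median is often the safer fill value for skewed features.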
There is no single “silver bullet” that will work equally well for all datasets and domains. Like many other aspects of data science, there is a fair amount of art and skill involved with how to deal with missing data.
Part of the modeling process involves experimenting and evaluating different approaches to dealing with missing values. Determining which one to use will depend on several criteria, including the performance of the predictive model, the agility and reliability of feature processing, and the speed of execution. Deployed models raise an additional concern: they run on a data stream whose features shift and change over time. The most appropriate way of handling this is through monitoring, with modifications when necessary.
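One simple thing to monitor is whether a feature goes missing more (or less) often in new data than it did in the training data. The sketch below is one possible check, not a description of any particular monitoring product. The function names, the sample batches, and the 10-percentage-point tolerance are assumptions for illustration.

```python
def missing_rate(rows, feature):
    """Share of rows in which the feature is None (i.e. missing)."""
    return sum(1 for r in rows if r.get(feature) is None) / len(rows)


def drifted_features(baseline_rows, new_rows, features, tolerance=0.10):
    """Return features whose missing rate moved by more than `tolerance`."""
    flagged = {}
    for feature in features:
        before = missing_rate(baseline_rows, feature)
        after = missing_rate(new_rows, feature)
        if abs(after - before) > tolerance:
            flagged[feature] = (before, after)
    return flagged


# Made-up batches: income goes missing far more often in the new data.
training_batch = [
    {"age": 34, "income": 48000},
    {"age": 52, "income": None},
    {"age": 41, "income": 61000},
    {"age": 29, "income": 39000},
]
new_batch = [
    {"age": 37, "income": None},
    {"age": 45, "income": None},
    {"age": 23, "income": 42000},
    {"age": 61, "income": None},
]

flagged = drifted_features(training_batch, new_batch, ["age", "income"])
print(flagged)
```

A flagged feature is a prompt to look closer, for example at an upstream data source that changed, before deciding whether the missing-value handling or the model itself needs to be modified.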
Luckily for you, Zest AI has a team of data scientists and client success folks who are ready to help you take on the challenges of modeling, so that your organization can use AI credit decisioning technology to enhance lending practices and create an overall better underwriting experience for your lending teams.