Six ways to manage missing data

Zest AI
September 22, 2023

Missing data is messy. Here are six methods for decluttering and cleaning up your datasets to prepare them for model training.

So you just signed with an AI-automated underwriting company like Zest AI, and you’re wondering what happens next. Generally, this is the time when model production kicks into gear, and lending teams want to know what’s going on under the hood. And though it may seem like the reverse order of steps in a project, the first thing that needs to be done in the modeling process is the clean-up around missing data.

Different models handle missing data differently (linear versus tree-based models, for example), but the objective of handling missing data remains the same: to capture any signal the missingness represents.

Data cleanup is a critical step in creating accurate models for AI-based lending. Models have to be trained on good data, so you can't build on missing or incorrect datasets. To help you navigate this, our data scientists have several simple approaches to missing data.

What does missing data mean in loan underwriting? 

Sometimes, missing data is just that: missing. Realistically, most datasets contain missing data, incorrectly encoded data, or other data that cannot be used for modeling. For example, an applicant may have a blank in the field "amount past due on credit cards," which could mean any of the following:

  • The applicant does not have any amount past due on credit cards.
  • The applicant does not have any credit cards.
  • The applicant does not have any tradelines or accounts.

The challenge in data processing and cleanup is creating a representation of missingness that captures the correct type of missingness at the right level.

In order to do this, you need to fix those missing fields so they can be used for analysis and modeling.

Now, having missing values in your data is not necessarily a setback. In fact, you can probably glean a lot of useful insights from missing values. You might be able to understand more about the dataset (or the population it describes) if data is missing or simply unavailable. But take those insights with a grain of salt: if a feature* in the dataset has a very high percentage of missing values, then that feature (just like other low-variance features) may need to be excluded.

*Feature: a measurable piece of data that can be used for analysis, such as: credit card balances, amount past due on accounts, number of recent inquiries, and so on. Features are also sometimes referred to as “variables” or “attributes” in data sets.
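As a minimal sketch of that screening step, using pandas with hypothetical feature names and a hypothetical 70% cutoff (the right threshold depends on your data and domain):

```python
import pandas as pd

# Toy dataset; column names are illustrative only.
df = pd.DataFrame({
    "credit_card_balance": [500.0, None, 300.0, 700.0],
    "amount_past_due": [None, None, None, 50.0],
})

# Fraction of missing values per feature.
missing_frac = df.isna().mean()

# Keep only features at or below the (hypothetical) 70% missingness threshold.
keep = missing_frac[missing_frac <= 0.7].index
df = df[keep]
```

Here `amount_past_due` is 75% missing, so it would be dropped, while `credit_card_balance` (25% missing) survives for one of the imputation strategies below.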

Here are some common ways of dealing with missing data:

1. Encode "Not Available" or "Not a Number" with out-of-range values (e.g., -1 or -9999).

This works reasonably well for numerical features that are predominantly positive in value and for tree-based models, in general. 

Technical note: Care must be taken with monotonicity constraints (common in credit risk modeling), since an extreme sentinel value can distort the feature's relationship with the target.
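A minimal sketch of sentinel encoding with pandas, assuming a hypothetical non-negative feature where -9999 is safely out of range:

```python
import pandas as pd

# Toy applicant data; "amount_past_due" is an illustrative feature name.
df = pd.DataFrame({"amount_past_due": [250.0, None, 0.0, None, 1200.0]})

# Replace missing values with an out-of-range sentinel. -9999 works here
# because the feature is predominantly non-negative, so a tree-based model
# can isolate the sentinel with a single split.
df["amount_past_due"] = df["amount_past_due"].fillna(-9999)
```

After this step the column contains no NaNs, and tree-based models can treat -9999 as its own branch.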

2. Case-wise (listwise) deletion of rows with missing data.

In the case of a very large dataset with very few missing values, this approach could potentially work really well. Here, you simply drop all cases/rows from the dataset that contain missing values. 

Technical note: If the rows with missing values are otherwise statistically distinct, this method may seriously skew the resulting predictive model. This is especially true in credit underwriting, where missingness often carries a signal the model is trying to learn (e.g., no reported tradelines).
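A minimal sketch of case-wise deletion with pandas, on a toy dataset with illustrative column names:

```python
import pandas as pd

# Hypothetical dataset where only a couple of rows have gaps.
df = pd.DataFrame({
    "credit_score": [720, 680, None, 810],
    "num_inquiries": [1, 3, 2, None],
})

# Drop every row that contains at least one missing value.
complete = df.dropna()
```

Only the first two rows survive; on a large dataset with sparse missingness the loss is negligible, but on a small one this can discard a meaningful share of your training data.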

3. Replace missing values with the mean or median

Replace missing values with the mean/median value of the feature in which they occur. This works for numerical features. 

Technical note: The choice of median or mean is often related to the distribution of the data—median may be more appropriate for skewed or imbalanced data, while mean can be a better choice for symmetrical, normally distributed data. This approach also works best when the missingness is random and the average value is the best estimate for the missing entries.
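The mean-versus-median distinction is easy to see on a skewed toy example, sketched here with pandas (the balance values are invented for illustration):

```python
import pandas as pd

# Skewed hypothetical balances: one large value pulls the mean far upward.
balances = pd.Series([100.0, 150.0, 120.0, None, 9000.0, None])

# Median imputation is robust to the outlier...
median_filled = balances.fillna(balances.median())

# ...while mean imputation inherits the skew.
mean_filled = balances.fillna(balances.mean())
```

The median of the observed values is 135.0, a plausible fill for the typical account, whereas the mean is 2342.5, dominated by the single large balance.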

4. Label or target encoding for missing categorical values

Label encode NAs as another level of the categorical variable. This works with tree-based models, and with other models if the feature can be numerically transformed (one-hot encoding, frequency encoding, etc.).

Technical note: This technique does not work well with logistic regression, which treats label-encoded categories as ordered numeric values.
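A minimal sketch of treating missingness as its own categorical level, using pandas with a hypothetical "housing status" feature:

```python
import pandas as pd

# Hypothetical categorical feature with gaps.
housing = pd.Series(["own", "rent", None, "own", None], dtype="object")

# Make "missing" an explicit level, then label-encode all levels.
housing = housing.fillna("missing")
codes, levels = pd.factorize(housing)
```

`factorize` assigns integer codes in order of appearance (own=0, rent=1, missing=2), so a tree-based model can split on the "missing" level directly; for linear models you would one-hot encode the levels instead.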

5. Predictive model-based imputations

Run predictive models that impute the missing data. 

Technical note: This should be done in conjunction with some kind of cross-validation scheme in order to avoid leakage. It can be very effective and can improve the final model.
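One concrete form of model-based imputation is nearest-neighbor imputation; here is a minimal sketch with scikit-learn's `KNNImputer` on an invented two-feature matrix (the column meanings are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny hypothetical matrix: columns might be credit score and inquiry count.
X = np.array([
    [720.0, 1.0],
    [680.0, 3.0],
    [np.nan, 2.0],
    [810.0, np.nan],
])

# Each gap is estimated from its 2 nearest complete-enough neighbors.
# In practice, fit the imputer inside each cross-validation fold so the
# imputation model never sees the held-out rows (avoiding leakage).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

The same fit/transform split applies to any model-based imputer: fit on training folds only, then transform validation and production data with the fitted object.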

6. Engineer missingness as a new feature

Create engineered features that can capture elements of missingness. As mentioned above, missing data can often have lots of useful signals in its own right, and this is a good way to encode that information.

Technical note: This is especially helpful when missingness correlates with credit risk (e.g., no credit bureau data may signal thin-file applicants). Engineered missingness features often improve performance and interpretability by explicitly modeling patterns that would otherwise be hidden. 
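The simplest engineered missingness feature is a binary indicator flag; a minimal pandas sketch with a hypothetical feature name:

```python
import pandas as pd

# Toy data; "amount_past_due" is an illustrative feature name.
df = pd.DataFrame({"amount_past_due": [250.0, None, 0.0, None]})

# Binary flag that captures the missingness pattern itself, so the model
# can learn from *whether* the value was reported, not just its magnitude.
df["amount_past_due_missing"] = df["amount_past_due"].isna().astype(int)
```

This pairs well with any of the imputation methods above: impute the original column for the value signal, and keep the flag for the missingness signal.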

There is no single “silver bullet” that will work equally well for all datasets and domains. Like many other aspects of data science, there is a fair amount of art and skill involved with how to deal with missing data.

Part of the modeling process involves experimenting with and evaluating different approaches to missing values. Which one to use depends on several criteria, including the performance of the predictive model, the agility and reliability of feature processing, and the speed of execution. Deployed models raise additional concerns: they run on data streams that change over time, and as features shift, the most appropriate way to handle this is through monitoring and modification when necessary.

The good news is, Zest AI’s team of data scientists and client success experts are here to guide you through the complexities of modeling. We help lenders tackle challenges like handling missing data, adapting to shifting features, and keeping models reliable over time—so your organization can enhance lending practices and deliver a smoother, smarter underwriting experience with AI-driven credit decisioning.

To learn more about how Zest AI can help you, schedule a call with our team.
