Most real-world datasets contain missing data, incorrectly encoded data, or other values that cannot be used directly for modeling. Sometimes missing data is just that: missing. There is no actual value in a given field, for example an empty string in a CSV file. Other times missing data is encoded with a special keyword or string; common encodings include NA, N/A, None, and -1. Before you can use data with missing fields, you need to transform those fields so they can be used for analysis and modeling. There are machine learning algorithms and packages that can automatically detect and deal with missing data, but it is still good practice to transform that data yourself.
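As a minimal sketch of that first transformation step, here is how you might tell pandas which tokens count as missing when reading a CSV. The file contents, column names, and the choice of -1 as a sentinel are all illustrative assumptions, not from any particular dataset:

```python
import io
import pandas as pd

# Hypothetical CSV snippet: missing values appear as an empty string,
# "N/A", "None", and the numeric sentinel -1 (assumed for illustration).
raw = io.StringIO(
    "age,income,city\n"
    "34,55000,Boston\n"
    ",N/A,None\n"
    "29,-1,Chicago\n"
)

# na_values tells pandas which string tokens to parse as missing.
df = pd.read_csv(raw, na_values=["NA", "N/A", "None"])

# -1 is only a missing-data sentinel by assumption here, so it has to be
# converted explicitly; pandas cannot know it is not a real value.
df["income"] = df["income"].replace(-1, float("nan"))

print(df.isna().sum())  # per-column count of missing values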
As with many other aspects of data science, there is a fair amount of art and skill involved in dealing with missing data; in some ways it is more art than science. Understanding the data and the domain it comes from is very important. For instance, calculating a mean makes more sense for certain features and domains than for others.
Having missing values in your data is not necessarily a setback. In fact, you can often glean a lot of useful information from missing values, and they can be used for feature engineering. One caveat: if a feature in the dataset has a very high percentage of missing values, then that feature, like any other low-variance feature, should usually be dropped.
Here are some common ways of dealing with missing data:
- Encode NAs as -1 or -9999. This works reasonably well for numerical features that are predominantly positive in value, and for tree-based models in general. This used to be a more common method in the past, when out-of-the-box machine learning libraries and algorithms were not very adept at working with missing data.
- Casewise deletion of missing data. Here you simply drop all cases/rows that contain missing values. For a very large dataset with very few missing values, this approach can work well. However, if the cases with missing values are otherwise statistically distinct from the rest of the data, this method may seriously skew the resulting predictive model. Another major problem with this approach is that the model will be unable to process any future data that contains missing values; if your predictive model is headed for production, this can create serious deployment issues.
- Replace missing values with the mean/median value of the feature in which they occur. This works for numerical features. The choice between median and mean is often related to the shape of the data's distribution: for skewed data the median may be more appropriate, while for symmetric, more normally distributed data the mean can be a better choice.
- Label encode NAs as another level of a categorical variable. This works with tree-based models and other models if the feature can be numerically transformed (one-hot encoding, frequency encoding, etc.). This technique does not work well with logistic regression.
- Run predictive models that impute the missing data. This should be done in conjunction with some kind of cross-validation scheme in order to avoid leakage. This can be very effective and can improve the final model's performance.
- Use the number of missing values in a given row to create a new engineered feature. As mentioned above, missing data can often have lots of useful signal in its own right, and this is a good way to encode that information.
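Several of the simpler strategies above can be sketched in a few lines of pandas. The column names, values, and the "MISSING" label below are illustrative assumptions, not prescriptions:

```python
import pandas as pd

# Toy data with missing values in a numerical and a categorical column.
df = pd.DataFrame({
    "income": [55000.0, None, 72000.0, None, 48000.0],
    "city":   ["Boston", None, "Chicago", "Boston", None],
})

# Engineered feature: number of missing values per row, computed before
# any imputation so the signal is preserved.
df["n_missing"] = df.isna().sum(axis=1)

# Numerical feature: impute with the median (more robust for skewed data).
df["income"] = df["income"].fillna(df["income"].median())

# Categorical feature: treat missingness as its own level.
df["city"] = df["city"].fillna("MISSING")

# Casewise deletion would instead be: df = df.dropna()
print(df)
```

After this, `city` can be one-hot or frequency encoded with the "MISSING" level included, which is the label-encoding idea from the list above.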
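Model-based imputation with leakage protection can be sketched with scikit-learn by placing the imputer inside a pipeline, so it is refit on each training fold during cross-validation. The synthetic data and the choice of `IterativeImputer` with `Ridge` are assumptions for illustration; note that `IterativeImputer` is still flagged as experimental in scikit-learn, hence the enabling import:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 20% of the entries to simulate missing data.
mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

# The imputer lives inside the pipeline, so cross_val_score fits it only
# on each training fold -- the validation fold never leaks into imputation.
pipe = Pipeline([
    ("impute", IterativeImputer(random_state=0)),
    ("model", Ridge()),
])
scores = cross_val_score(pipe, X_missing, y, cv=5, scoring="r2")
print(scores.mean())
```

Fitting the imputer on the full dataset before splitting would be the leakage the list item warns about.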
There are additional concerns for models that need to be deployed in production on a data stream that is changing over time. As some features shift and change, the most appropriate way of dealing with them during the training process (including how you dealt with the missing data) needs to be reevaluated and potentially modified.
Similar to most other modeling techniques, there is no single “silver bullet” that will work equally well for all datasets and domains. Part of the modeling process involves experimenting and evaluating different approaches to dealing with missing values. Determining which one to use will depend on several criteria, including the performance of the predictive model, the agility and reliability of feature processing, and the speed of execution. In addition to these criteria, there may be some other domain-specific considerations that influence the solution you choose.
Bojan Tunguz is a Machine Learning Engineer at ZestFinance. By training he is a Theoretical Physicist, with degrees from Stanford and the University of Illinois. He is a Kaggle Triple Master, and also enjoys reading, hiking, and digital photography.