Answering your machine learning and data questions

Zest AI

January 28, 2019

When machine learning powers credit decisions, more data run through the right models means better outputs. Clients using our machine learning (ML) modeling tools employ 10 to 100 times as many variables as they did when they built models with traditional math (linear regression).

That vastly improves their ability to make more good loans and fewer bad ones. But we also know that switching to ML models raises lots of data questions: how much, what kind, how clean, etc. Here are the most common questions we get about what kind of data we look for, how we use that data, and whether that data needs to leave your data center (spoiler alert: it doesn’t).

How do you deal with missing data or unlabeled data?

ML models are exceedingly good at relating even messy and missing data in a meaningful way. They’re also good at handling unstructured data, such as search and browsing history, and structured time-series data (CRM records and transactions). You can come up with all kinds of creative ways to join these types of datasets in an ensembled machine learning model that unearths meaning from the mix. For example, rather than merely checking whether a file contains a bankruptcy or two (about the most that a linear model can handle), an ML model can look at all the information for each bankruptcy event, the time between events, and the length of time since the most recent event. Our datasets typically include large swaths of missing data: fields that are populated only when the customer applied online, for instance, or fields that went missing after an upgrade or change in the data source. These gaps are par for the course for us.
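
To make the bankruptcy example concrete, here is a minimal Python sketch of how event dates might be turned into model inputs. It is an illustration only, not Zest AI's tooling: the function name, the feature names, and the choice to represent a missing value as `None` (rather than forcing it to zero) are all assumptions made for this example.

```python
from datetime import date
from typing import Optional


def bankruptcy_features(filing_dates: list[date], as_of: date) -> dict[str, Optional[float]]:
    """Summarize bankruptcy events (hypothetical feature names).

    None marks a value that is genuinely missing, so a downstream model
    that handles missing inputs can treat it differently from zero.
    """
    # Ignore any events dated after the decision date.
    events = sorted(d for d in filing_dates if d <= as_of)
    # Days between each consecutive pair of events.
    gaps = [(later - earlier).days for earlier, later in zip(events, events[1:])]
    return {
        "bankruptcy_count": float(len(events)),
        "days_since_latest": float((as_of - events[-1]).days) if events else None,
        "min_days_between": float(min(gaps)) if gaps else None,
    }


features = bankruptcy_features(
    [date(2012, 3, 1), date(2016, 9, 15)], as_of=date(2019, 1, 28)
)
no_history = bankruptcy_features([], as_of=date(2019, 1, 28))
```

An applicant with no bankruptcies gets a count of zero and `None` for the two timing features, which keeps "never happened" distinct from "happened very recently".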

Do I need to bring huge amounts of outside data into my model?

Generally not. We see great results from using application and bureau data you already have or pay for, which can generate hundreds of features or variables. If it makes sense, we can help you identify outside data sources that can add lift to a model, such as LexisNexis or telecom payment histories. Our tools can help you calculate the economic impact of any variable or feature to decide what’s worth including and what’s not.

Do you use social media data?

No! Despite a lot of hype around insights that can be gained by stalking someone on Facebook or Twitter, the truth is that social data provides nowhere near the signal that typical credit variables do. Besides, it’s creepy. Our bank customers and the credit bureaus have plenty of great data that can build an accurate profile of a customer without having to find out where they went on their last vacation or if they’re fond of IPAs.

How do you clean up the data?

Spending time with the data is a key part of the modeling process. Our tools are designed to reveal inconsistencies and oddities that indicate issues our customers didn’t even know they had. The process typically starts with a manual inspection of the data along several dimensions, including distribution, rate of missingness, and outlier occurrence. It produces thousands of visualizations of the data, which drive modeling decisions based on each customer’s business needs. For example, while working with one customer, we noticed that one variable they provided, the age of a credit account on file, exhibited a wildly different distribution in the training data than it did in the validation data. We flagged the conflict for the customer, and they corrected it.
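
Two of the checks described above, rate of missingness and a mismatch between training and validation distributions, can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the process Zest AI uses: it swaps in the population stability index (PSI), a common drift measure in credit modeling, and the function names, bin count, and sample values are all hypothetical.

```python
import math
from typing import Optional, Sequence


def missing_rate(values: Sequence[Optional[float]]) -> float:
    """Share of records where the field is absent (None)."""
    return sum(v is None for v in values) / len(values)


def psi(train: Sequence[float], valid: Sequence[float], bins: int = 5) -> float:
    """Population stability index over equal-width bins spanning the training range."""
    lo, hi = min(train), max(train)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all training values are equal

    def shares(sample: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            # Values outside the training range are clamped into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(sample), 1e-6) for c in counts]

    p, q = shares(train), shares(valid)
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


# Made-up account ages: the validation values look like a different unit or source.
train_age = [12, 30, 48, 60, 84, 96, 120, 150, 200, 240]
valid_age = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12]

same = psi(train_age, train_age)      # identical samples give 0.0
shifted = psi(train_age, valid_age)   # a large value signals a mismatch
gap = missing_rate([24.0, None, 60.0, None])
```

A PSI above roughly 0.25 is often treated as a significant shift, though that threshold is a rule of thumb, not a hard limit. A check like this only flags the variable; working out why the two samples differ still takes a conversation with whoever owns the data.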

Does the data need to leave my data center?

Nope. You can build your models with Zest AI’s tools entirely on your own premises. If you’d like to have us build models with you, we can work on dedicated hardware within your data center with VPN access.

What’s your approach to creating new variables from the data we have?

We can take lots of different types of complex data sources (such as tradelines, CRM, and application data) and build them into your ML models, inputting as much data as possible to generate the most predictive model. Some of the engineered variables sound more complex than they are, such as the ratios of types of closed accounts to other types of closed accounts or to the total number of closed accounts. The result is improved accuracy and performance. For one customer, we converted raw tradeline data from three different credit bureaus into features such as the rate of increased use of revolving credit products. This variable, combined with credit limit data, enabled us to identify a population that was on its way to running out of credit on existing products and therefore represented a much higher credit risk. The model was able to make use of data that was already present but was not being used by existing models.
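
The two engineered variables mentioned above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the features built for any customer: the function names, the account-type labels, the monthly balance and limit inputs, and the straight-line projection are all hypothetical.

```python
from collections import Counter
from typing import Optional


def closed_account_shares(closed_types: list[str]) -> dict[str, float]:
    """Share of each account type among all closed accounts."""
    counts = Counter(closed_types)
    total = sum(counts.values())
    return {f"closed_{kind}_share": n / total for kind, n in counts.items()}


def utilization_trend(balances: list[float], limits: list[float]) -> Optional[float]:
    """Average month-over-month change in revolving utilization (balance / limit)."""
    util = [b / l for b, l in zip(balances, limits) if l > 0]
    if len(util) < 2:
        return None  # not enough history to measure a trend
    return (util[-1] - util[0]) / (len(util) - 1)


def months_until_maxed(balances: list[float], limits: list[float]) -> Optional[float]:
    """Straight-line estimate of months until utilization reaches 100%."""
    trend = utilization_trend(balances, limits)
    if trend is None or trend <= 0 or limits[-1] <= 0:
        return None  # flat or falling usage never reaches the limit
    current = balances[-1] / limits[-1]
    return max(1.0 - current, 0.0) / trend


shares = closed_account_shares(["revolving", "revolving", "installment", "mortgage"])
trend = utilization_trend([1000, 1500, 2000, 2500], [5000, 5000, 5000, 5000])
runway = months_until_maxed([1000, 1500, 2000, 2500], [5000, 5000, 5000, 5000])
```

In this made-up example, utilization climbs about ten percentage points a month from 20% to 50%, so the borrower would exhaust the limit in roughly five months if the trend held. Combining the trend with the limit is what turns a raw balance history into a signal about running out of credit.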