Data Science & AI

The Real World of Predictive Modeling: Best Practices for the New Data Scientist

Zest AI team

April 16, 2018

Whether you’re a recent college graduate or looking to switch careers, moving to the real world of predictive modeling can be intimidating for the new data scientist. It means the models you build will actually be put into production. Every decision you make, from defining the problem to deploying the model, becomes critically important, especially in high-stakes industries like finance and healthcare.

Here’s a set of guidelines to help you successfully translate your new skills into the real world of 24/7 automated decisioning using predictive modeling and analytics.
Defining the problem clearly

Prior to starting a real-world data science project, you need to identify your objective and understand the needs of key stakeholders. Answering the following questions can help set you on the right course:

  • What is the business or product objective?
  • Why is that objective important?
  • What are your goals and timeline for completion?
  • What makes the problem compelling?
  • What are the perceived costs and benefits of solving the problem?
  • Which problems are you not trying to solve?

Identifying the data set

Data is abundant. But it can be expensive to collect and process, whether you pay for third-party data or implement an app to gather data. Data can also take a long time to collect. For example, it can take a mortgage lender years to find out whether a customer will default on a 30-year loan.

To identify your data set, ask yourself the following questions:

  • Which data is needed to solve the problem?
  • Which data do you have access to?
  • Do the data sets contain a data dictionary that describes the semantic meaning of each field?
  • What transformations, if any, have been performed on the data?
  • Which third-party data sources are relevant? How much do they cost? Quantify the value of each data source prior to acquisition.
  • How much budget do you have to acquire third-party data?
  • How much data do you need?
  • How should you request access (via authorization and authentication) to data sources of interest?
  • How much Extract Transform Load (ETL) work is needed to create a training data set ready for modeling?
  • How long does it take to process the data?
  • How is missing data represented? (A quick audit like the sketch after this list can help answer this.)
  • How is each field computed? Has the algorithm used to compute it changed over time?

Determining how the model will be evaluated

For real-world projects, take extra care in choosing the target metric and deciding how the model will be evaluated. Some things to consider:

  • Legality: Just because a feature is useful doesn’t mean you can use it.
      • Is there bias in the model? In the United States, practices in employment, housing, and other areas that adversely affect people in a protected class more than others are considered disparate impact. It’s your responsibility as a data science professional to ensure that the model conforms to the law (one simple check is sketched after this list).
      • Which tools or methodologies will you use to trust machine learning model outputs? Which metrics do you plan to track to ensure transparency and fairness throughout the product life cycle?
  • Compliance with regulations: You can’t use black boxes in high-stakes use cases.
      • For example, in financial services, lenders are required to give declined applicants adverse action notifications, which help prospective borrowers understand why they’ve been declined for a loan. To provide these notifications, you need to understand which signals your model is taking into account, and to what extent.
  • Business
      • What are the economic benefits of this new model vs. incumbent or existing processes?
      • What are the costs of prediction errors?
  • Marketing
      • Suppose your newly deployed model, along with its predictions, appeared on the front page of The New York Times. What would be the impact on your company’s brand?
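
One common way to put a number on the disparate impact concern above is the adverse impact ratio: each group’s approval rate divided by the highest group’s approval rate, with 0.8 often cited as a rule-of-thumb threshold. The sketch below assumes a hypothetical scored data set with a protected-group column kept only for fairness monitoring; it is a starting point for analysis, not a compliance procedure.

```python
import pandas as pd

def adverse_impact_ratio(scored: pd.DataFrame, group_col: str, approved_col: str) -> pd.Series:
    """Approval rate per group divided by the highest group's approval rate."""
    rates = scored.groupby(group_col)[approved_col].mean()
    return rates / rates.max()

# Hypothetical scored applications: model decisions plus a protected attribute
# used only for monitoring, never as a model input.
scored = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   0,   0,   1,   0],
})

air = adverse_impact_ratio(scored, "group", "approved")
print(air)
# Ratios well below ~0.8 for any group are a signal to investigate the
# model's features and thresholds before deployment.
```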

Gauging the speed of prediction

In school, it probably didn’t matter how long it took for your models to make predictions. However, for most businesses, end users expect little or no delay when interacting with a product.

Here are some considerations with respect to speed of inference:

  • Does this model need to make predictions in real time, or can it score in batches? (A quick latency measurement, sketched after this list, can make the trade-off concrete.)
  • Feature engineering is a common source of latency at prediction time. Determine whether incorporating additional data has a significant impact on your evaluation results.
  • Complex models take longer to make predictions due to ensembling, stacking, and more elaborate feature engineering. On the other hand, they often improve evaluation metrics (like accuracy).
  • A simpler model might be more explainable to regulators and sufficient for your purposes. Simpler models reduce prediction latency, at the cost of some predictive performance.
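
To make the real-time versus batch question concrete, measure per-row and batched prediction latency directly. The sketch below uses a scikit-learn gradient boosting classifier on synthetic data purely as a stand-in for whatever model you are actually evaluating.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a real training set and model.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

# Real-time path: score one applicant at a time.
start = time.perf_counter()
for row in X[:200]:
    model.predict_proba(row.reshape(1, -1))
per_row_ms = (time.perf_counter() - start) / 200 * 1000

# Batch path: score everyone at once.
start = time.perf_counter()
model.predict_proba(X)
batch_ms = (time.perf_counter() - start) * 1000

print(f"~{per_row_ms:.2f} ms per single-row prediction")
print(f"~{batch_ms:.1f} ms to score {len(X)} rows in one batch")
```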

Deploying the model

In real-world predictive modeling, a project is considered successful only once the model is operating in production. When taking your model from theoretical to applied, ask yourself:

  • What does the production software stack look like?
  • Which open or closed source technologies are used? How will the model be hosted in this stack?
  • What does the production data pipeline look like?
  • Do you have access to all variables in production that were used during training? If not, then you can’t use those features during scoring.
  • Model robustness: What happens if a feature (or set of features) suddenly becomes unavailable due to a data outage?
  • What is the feature distribution drift?
  • What happens if the range of values for the features you used in training changes over time? Define and test feature invariants (see the drift sketch after this list).
  • How will you detect this change?
  • What is the impact of this change on your model?
  • Which code library packages does the model depend on? Package management is a crucial part of ensuring that model results are repeatable and reproducible.
  • Training and prediction code paths must compute the same feature transformations.
  • DevOps must be able to roll forward as well as roll back model deployments.
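
Feature invariants and distribution drift can both be monitored by comparing production inputs against the training data. The sketch below uses the population stability index (PSI), one common drift measure; the 0.2 alert threshold is a widely used rule of thumb, and the income feature is a hypothetical example.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population stability index between training and production values."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e = np.histogram(expected, cuts)[0] / len(expected)
    a = np.histogram(np.clip(actual, cuts[0], cuts[-1]), cuts)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Hypothetical: an income feature at training time vs. last week's production traffic.
train_income = np.random.lognormal(mean=10.5, sigma=0.4, size=10_000)
prod_income = np.random.lognormal(mean=10.8, sigma=0.5, size=2_000)

# Invariant check: how often do production values fall outside the training range?
out_of_range = (prod_income < train_income.min()) | (prod_income > train_income.max())
print(f"{out_of_range.mean():.1%} of production values are outside the training range")

# Drift check: a PSI above ~0.2 is a common signal to investigate before trusting new scores.
print(f"PSI = {psi(train_income, prod_income):.3f}")
```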

Data scientists are incredibly significant to the success of an organization. They extract actionable insights out of large volumes of data and help executive teams make informed decisions. Education and even competing in data science competitions will help you hone your skills, as long as you can put those skills into practice while avoiding the typical pitfalls described above.

Armen Donigian is the Director of Predictive Modeling Tools & Explainability at ZestFinance. He has over a decade of experience as a software/data engineer with degrees in Computer Science from UCLA & USC. He is a lifelong learner and also enjoys teaching, traveling, and spending quality time with family and friends.
