Ensuring consistent adverse action reasons for ML models

Machine learning has been adopted by lenders of all sizes for credit underwriting. And for good reason: machine learning models are more accurate than traditional credit scoring models, and technologies like adversarial debiasing can be used to make them more fair and inclusive. Even so, it’s important to accurately and consistently explain the recommendations coming from these models. While there are best practices for doing so, nuanced differences in how those practices are applied, and small differences in the models they are applied to, can lead to differences in explanations that appear significant but really aren’t. In this article, we explain these nuances and their implications.
Methods developed by game theorists in decades past have made it possible to analyze how models work. This is important because lenders must comply with laws that require them to provide consumers with the principal reasons for denial or other adverse action. Game-theoretic analysis methods do just that: Shapley values uniquely quantify the contribution of each variable to a model-based decision. Efficient algorithms based on the methods of Shapley and his colleagues have been implemented, enabling practitioners to accurately explain model-based outcomes.
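As a toy illustration of the underlying idea, the sketch below computes Shapley values for a hypothetical two-variable model by averaging each variable’s marginal contribution over every order in which the variables can be “added.” The variable names and scores are made up, and production implementations use far more efficient algorithms; this is only meant to show what is being quantified.

```python
# A toy sketch of the Shapley computation for a hypothetical two-variable model.
# Each variable's Shapley value is its marginal contribution to the score,
# averaged over every order in which variables can be "added" to the model.
from itertools import permutations

# Hypothetical model scores as a function of which variables are "present"
# (absent variables are imputed with a reference value).
SCORES = {
    frozenset(): 0.50,                                   # reference score
    frozenset({"inquiries"}): 0.35,
    frozenset({"utilization"}): 0.40,
    frozenset({"inquiries", "utilization"}): 0.20,
}

def shapley(variable, variables=("inquiries", "utilization")):
    contributions = []
    for order in permutations(variables):
        before = frozenset(order[: order.index(variable)])
        contributions.append(SCORES[before | {variable}] - SCORES[before])
    return sum(contributions) / len(contributions)

print(shapley("inquiries"))    # -0.175
print(shapley("utilization"))  # -0.125
# Efficiency property: the two values sum to SCORES[both] - SCORES[neither] = -0.30.
```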
Application of Shapley value-based methods can vary from practitioner to practitioner. A 2022 FinRegLab/Stanford study found that Shapley value-based methods could be used to explain machine learning underwriting model decisions with high fidelity. That is, the reasons the Shapley-based methods give for approve/deny decisions are accurate. However, the same study found inconsistencies in the reasons provided from implementation to implementation. For some, these inconsistencies raised concerns about the safety and soundness of these methods.
On closer inspection, however, these inconsistencies are understandable given the fine grain size at which explanations were evaluated and the various technical differences between the model developers’ explanation techniques, which we describe in more detail below. With a better understanding of the reasons for differences, we can see how discrepancies arise and how to better control for them in experiments.
Reason 1: If the analysis is too fine-grained, differences in explanations can be overstated
Models are built based on data that includes many predictive variables. Often, variations of variables are developed, and the “best” of the variables are selected during the model estimation process. Different variables can often encode similar information. Some variables can actually have identical values, despite being constructed with different logic. For example, the variable “count of inquiries in the last 90 days” might always be the same as “count of inquiries from 30 days ago to 90 days ago” given the time it takes for credit bureaus to update their records.
This is one reason why practitioners group variables into categories that correspond to easier-to-understand “reasons” for a model-based decision. Continuing with the example above, “count of inquiries in the last 90 days” and “count of inquiries from 30 days ago to 90 days ago” would both map to a higher-level reason, “inquiries.” Thus, if explanation A says the most important variable leading to a denial was “count of inquiries in the last 90 days” and explanation B says the most important variable leading to a denial was “count of inquiries from 30 days ago to 90 days ago,” both explanations still indicate that the reason the applicant was denied is “inquiries,” and they should be considered equivalent for the purposes of evaluating consistency. Otherwise, explanations might appear less consistent than they actually are.
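To make this concrete, here is a minimal sketch of grouping variable-level Shapley values into reason categories before comparing explanations. The variable names, reason mapping, and attribution values are illustrative, and we assume the sign convention that negative contributions push the score toward denial.

```python
# Minimal sketch: aggregate per-variable Shapley values into reason categories,
# then compare explanations at the reason level. Names and values are illustrative;
# negative contributions are assumed to push the score toward denial.
REASON_MAP = {
    "count_of_inquiries_last_90_days": "inquiries",
    "count_of_inquiries_30_to_90_days_ago": "inquiries",
    "revolving_utilization": "utilization",
}

def top_reasons(shap_values, n=2):
    """Sum contributions within each reason category and return the n categories
    that pushed the score most strongly toward denial (most negative totals)."""
    totals = {}
    for variable, value in shap_values.items():
        reason = REASON_MAP.get(variable, "other")
        totals[reason] = totals.get(reason, 0.0) + value
    return sorted(totals, key=totals.get)[:n]

# Two explanations that disagree at the variable level...
explanation_a = {"count_of_inquiries_last_90_days": -0.40, "revolving_utilization": -0.25}
explanation_b = {"count_of_inquiries_30_to_90_days_ago": -0.40, "revolving_utilization": -0.25}

# ...agree once evaluated at the reason level: ["inquiries", "utilization"] for both.
assert top_reasons(explanation_a) == top_reasons(explanation_b)
```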
To understand this better, consider a straightforward analogy. Let’s say you wanted to determine whether hand-painted vases coming off a factory assembly line are consistent enough to be sold at market. If you use an electron microscope, you will notice all kinds of inconsistencies. The microscope will detect differences in the fingerprints of the various artisans who handled the vase. These differences are not meaningful to consumers. What would be meaningful is a crack or a mistake in the design. But those inconsistencies are more easily detected with the naked eye.
Having the right resolution or grain size when considering matters of consistency is really important. Too fine-grained, and meaningless inconsistencies can be exaggerated.
Reason 2: Seemingly similar models might actually be different, and so inconsistencies in explanations are to be expected
When two variables encode the same information, as in the example above, it is often left to chance which variable will be selected by a learning algorithm, even when the same algorithm is used. As a result, even models trained on the same data with the same algorithm and similar predictive performance can have differences, even major ones. This phenomenon was dubbed the “Rashomon Effect” by University of California, Berkeley statistician Leo Breiman, who cleverly exploited it in his random forests algorithm to generate more accurate predictions. That idea of combining many differing models eventually gave rise to the modern ensemble methods we use today, like XGBoost.
Shapley-based methods tell us which variables are most important to a difference in score according to a given model. They do not necessarily tell us much about how the world works; rather, they tell us how the model works. Therefore, when we have different models, we should expect to see different explanations.
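The sketch below illustrates the point under simple assumptions: two gradient-boosted models trained on the same synthetic data, differing only in random seed and row/column subsampling, can achieve similar accuracy while crediting different (correlated) variables. The dataset, hyperparameters, and seeds are illustrative, not a prescription.

```python
# Sketch of the "Rashomon Effect": same data, same algorithm, similar accuracy,
# yet different explanations, because redundant features split the credit.
import numpy as np
import shap
import xgboost
from sklearn.datasets import make_classification

# Synthetic data with deliberately redundant (correlated) features.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)

def fit(seed):
    model = xgboost.XGBClassifier(n_estimators=200, max_depth=3, subsample=0.7,
                                  colsample_bytree=0.7, random_state=seed)
    return model.fit(X, y)

model_a, model_b = fit(1), fit(2)

# Global importance (mean |SHAP| per feature) for each model.
shap_a = shap.TreeExplainer(model_a).shap_values(X)
shap_b = shap.TreeExplainer(model_b).shap_values(X)
importance_a = np.abs(shap_a).mean(axis=0)
importance_b = np.abs(shap_b).mean(axis=0)

# The rankings can differ even though the models score applicants similarly.
print("Top feature, model A:", importance_a.argmax())
print("Top feature, model B:", importance_b.argmax())
```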
Variations in models often arise when second-line model risk management teams build their own challenger models to determine whether a model proposed for use is truly the best it can be. Challenger models can sometimes reveal new variables that can be used in subsequent model development efforts. But variations are a fact of life when we are dealing with data that contains an imperfect picture of the causal relationships between a borrower’s economic circumstances and their probability of default, as is the case in every practical predictive modeling setting.
As such, it is important to ensure the model itself is held constant when evaluating the consistency of explanations, and that we don’t confuse two very different tasks: (1) explaining how the model makes predictions versus (2) explaining causal relationships in the world.
Reason 3: There are many implementations of Shapley’s methods, and some are inherently inconsistent. Use the wrong one and you will get inconsistent results.
There are variations among Shapley-based approaches that make them more or less appropriate for various types of models and use cases. As of the writing of this post (October 2022), the SHAP package, one of the most popular implementations, contains six different Shapley-based explainers, each appropriate for a different task.
For example, TreeSHAP (2017) is appropriate for tree-based models, as it implements a very fast method for computing Shapley values for predictive models such as XGBoost. In contrast, KernelSHAP (2017) estimates Shapley values by fitting a weighted linear model over sampled feature coalitions, a sampling approach that is inherently variable (as the authors have acknowledged). KernelSHAP does have the virtue of being model-agnostic, but this comes at a cost in accuracy and consistency. At Zest, we prefer to use the faster and more consistent TreeSHAP method for explaining tree-based models.
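As a rough sketch of how the two explainers can be compared on the same model, the example below computes both for a single row. The dataset, model, and sample sizes are illustrative; note that TreeSHAP here attributes the model’s margin (log-odds) output while KernelSHAP is applied to predicted probabilities, so rankings rather than raw magnitudes are the fair comparison.

```python
# Sketch: TreeSHAP (exact, deterministic for tree ensembles) vs. KernelSHAP
# (model-agnostic, sampling-based) on the same XGBoost model and row.
import numpy as np
import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
model = xgboost.XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

background = shap.sample(X, 100)   # reference sample shared by both explainers
row = X[:1]                        # the applicant to explain

# TreeSHAP: exact Shapley values for tree ensembles (margin / log-odds units).
tree_vals = shap.TreeExplainer(model, background).shap_values(row)

# KernelSHAP: weighted linear model fit over sampled feature coalitions
# (probability units); rerunning with a different nsamples or seed can
# change the estimates, unlike TreeSHAP.
kernel = shap.KernelExplainer(lambda d: model.predict_proba(d)[:, 1], background)
kernel_vals = kernel.shap_values(row, nsamples=500)

# Compare which features each method ranks as most important for this row.
print("TreeSHAP ranking:  ", np.argsort(-np.abs(tree_vals[0])))
print("KernelSHAP ranking:", np.argsort(-np.abs(np.asarray(kernel_vals)[0])))
```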
Other models require different explanation techniques. For artificial neural networks, there are several competing methods, including: (1) DeepSHAP, which is based on DeepLIFT (2017), and (2) Integrated Gradients (2017). Because Integrated Gradients most closely follows the game-theoretic approach Shapley and his colleagues developed for games with infinitely many infinitesimal players (non-atomic games), we use this method when explaining neural networks.
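The sketch below approximates Integrated Gradients for a tiny fixed-weight network using only NumPy, to show the mechanics: gradients are averaged along the straight-line path from a reference point to the input, and the attributions sum to the change in score. The network, weights, and reference point are made up; real models would use a framework’s automatic differentiation.

```python
# Sketch of Integrated Gradients on a tiny fixed-weight network (NumPy only).
# Attributions = (x - baseline) * average gradient along the straight-line path
# from the baseline (reference) to the input x.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # hypothetical hidden layer
W2, b2 = rng.normal(size=4), 0.0                # hypothetical output layer

def score(x):
    """Model score: a one-hidden-layer network with tanh activations."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

def grad(x):
    """Analytic gradient of the score with respect to the inputs."""
    h = np.tanh(x @ W1 + b1)
    return W1 @ ((1.0 - h ** 2) * W2)

def integrated_gradients(x, baseline, steps=100):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = np.array([grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)

x = np.array([1.0, -0.5, 2.0])   # hypothetical applicant features
baseline = np.zeros(3)           # reference point (e.g., an average "good" applicant)
attributions = integrated_gradients(x, baseline)

# Completeness check: attributions sum to score(x) - score(baseline).
print(attributions, attributions.sum(), score(x) - score(baseline))
```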
Similarly, for ensembles of trees and neural networks, a different Shapley-based approach is needed. For this we developed the Generalized Integrated Gradients method, which allows even complex ensembles to be explained. As we show in the GIG paper, and as others have shown elsewhere, the choice of Shapley-based explainer can yield different results due to the underlying assumptions and optimizations used in the explanation algorithm.
As such, it is important to understand whether the selected explainer is expected to generate consistent results, and to control for the implementation when evaluating consistency of explanations.
Reason 4: Changing the reference will result in different explanations.
The original Shapley values evaluated the importance of a player in a multiplayer game. Implicitly, the method used a zero-based reference: when nobody played (the null game), the score was always zero. In credit risk contexts, we are often not trying to evaluate from zero. Instead, we are trying to evaluate one person versus someone else (or one person versus a group, or one group versus another). The following diagram depicts several of these cases, one of which is adverse action. In each case, the reference population is different, because we are seeking to answer different questions.

By comparing “all applicants” with “all applicants” we can determine the average marginal contribution of the variables in the model. By selecting the reference group of “good borrowers” we can determine the reasons a model generated a score that resulted in a denial (by quantifying the extent to which each variable caused the model to score the denied borrower differently from the good borrowers, on average). By comparing protected applicants with unprotected applicants, we can determine the drivers of differences in model score between groups. The choice of reference will yield different results, which is by design because we are trying to understand different things.
Regulation B, 12 C.F.R. § 1002.9, requires creditors provide consumers with the principal reasons for a denial of credit or other adverse action. The specific reasons disclosed must “relate to and accurately describe the factors actually considered or scored by a creditor.” 12 C.F.R. part 1002, Supp. I, ¶ 9(b)(2)-2. If a creditor bases the denial or other adverse action on a credit scoring system, the reasons disclosed must relate only to those factors actually scored in the system and the actual reasons for denial must be disclosed. Id. at ¶ 9(b)(2)-4.
The commentary to Regulation B clarifies that the “regulation does not require that any one method be used,” and that “[v]arious methods will meet the requirements of the regulation.” 12 C.F.R. part 1002, Supp. I, ¶ 9(b)(2)-5. The commentary then provides examples of two methods, with different reference groups, that meet those requirements. Id. It goes on to state that, among the various methods that meet the regulatory requirements, “[a]ny other method that produces results substantially similar to either of these methods is also acceptable under the regulation.” Id. In other words, this interpretation allows the use of any demonstrably accurate method that involves comparing denied applicants to a reference group norm and determining the principal reasons that caused the model to score the applicant lower than the reference. Which reference group to use is not specified; in fact, multiple examples are given.
In our experience working with scores of lenders, we have observed several variations in the choice of reference group. Some lenders use the set of approved applicants. Others use a set of applicants close to the decision boundary. Still others use the best of the approved applicants.
As expected, different choices of reference population yield different explanations. To see this, consider the following experiment, which was conducted on a real XGBoost underwriting model. The model and test datasets were held constant, as was the explanation method (in this case, TreeSHAP). We evaluated explanations based on different choices of reference datasets. To determine the level of agreement between explanations generated from different references, we computed Spearman’s rank correlation coefficient; a value closer to 100% means closer to perfect agreement between the rankings. In the first experiment, we compared explanations when using the best 10% versus the worst 10% of borrowers by model score as the reference population (very different reference populations). In the second experiment, we compared the best 10% versus the best 20% of borrowers by score (more similar populations).

As expected, when we compare explanations generated based on more divergent reference populations (those with greater differences in predicted default risk) the explanations agree less, in this case, only 64%. Likewise, when the reference populations are more similar in terms of predicted risk, the explanations agree more, in this case, 95%.
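For readers who want to run a similar comparison, here is a minimal sketch of the structure of such an experiment on synthetic data (it is not the experiment reported above): hold the model and explainer constant, swap only the reference population, and measure agreement with Spearman’s rank correlation. The dataset, model, and population cutoffs are illustrative.

```python
# Sketch: hold the model and explainer (TreeSHAP) fixed, vary only the reference
# population, and measure how well the resulting explanations agree.
import numpy as np
import shap
import xgboost
from scipy.stats import spearmanr
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
model = xgboost.XGBClassifier(n_estimators=200, max_depth=3).fit(X, y)

# Rank applicants by model score and form two reference populations.
scores = model.predict_proba(X)[:, 1]          # here, higher = riskier
order = np.argsort(scores)
best_10 = X[order[: len(X) // 10]]             # lowest-risk 10% of applicants
worst_10 = X[order[-(len(X) // 10):]]          # highest-risk 10% of applicants

row = X[:1]                                    # the applicant to explain

vals_best = shap.TreeExplainer(model, best_10).shap_values(row)[0]
vals_worst = shap.TreeExplainer(model, worst_10).shap_values(row)[0]

# Agreement between the two explanations' feature rankings.
rho, _ = spearmanr(vals_best, vals_worst)
print(f"Spearman rank correlation: {rho:.0%}")
```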
Thus, one would expect to see inconsistencies in the explanations generated by Shapley-based methods when the choice of reference is not consistent.
Conclusion
We show above how Shapley-based explanations can be correct yet inconsistent, depending on several choices practitioners make. These choices include the grain size of the analysis, variations in the models themselves, the choice of explanation algorithm and implementation, and the choice of reference population. Making reasonable choices and controlling for variation is key to ensuring that conclusions about the consistency of explanations from Shapley-based methods are valid. When practitioners take care to document and justify their choices, more predictive and fair machine learning models can safely be adopted.