Unlocking the Secrets of Machine Learning: SHAP for Interpretable Predictions


What is SHAP?

SHAP (SHapley Additive exPlanations) is a powerful tool that makes machine learning models more explainable by quantifying and visualizing how much each feature contributes to a prediction. Rather than being a single algorithm, SHAP unifies several earlier attribution methods, including LIME, Shapley sampling values, DeepLIFT, QII, and more.

The key component of SHAP is the Shapley value, a game-theoretic way of allocating credit that supports consistent attribution and local explanations. Shapley values distribute a prediction among the features that produced it, giving an accurate and interpretable picture of the model's output.
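For reference, the SHAP paper frames an explanation as a simple additive model over simplified inputs:

g(z') = φ0 + φ1·z'1 + φ2·z'2 + ... + φM·z'M

where φ0 is the base (expected) value, each φi is the Shapley value credited to feature i, and z'i indicates whether feature i is present in the explanation.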

One notable advantage of SHAP is its compatibility with popular modeling libraries such as scikit-learn, PySpark, TensorFlow, Keras, and PyTorch. These libraries are widely used for building models, but their outputs are often hard to interpret. By leveraging SHAP, we can make those outputs comprehensible even for users without a machine learning background, and we get effective visualizations along the way. Let's look at how to install and use SHAP.

Installing SHAP

To install the SHAP tool, use the following pip command:

!pip install shap

Output:

Successfully installed shap

With SHAP installed, we can now proceed to create models using simple data.

Simple Implementation of SHAP

As mentioned earlier, SHAP works with a wide range of modeling libraries. In this section, we'll show how straightforward it is to use SHAP to make a simple model more interpretable. We'll begin by loading the data: the shap package ships with a few ready-made datasets, and for this example we'll use the classic Iris dataset for classification.

Loading the data

import shap
X, y = shap.datasets.iris(display=True)

Splitting the data

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Checking the data

# Colab-only: render DataFrames as interactive tables
from google.colab import data_table
data_table.enable_dataframe_formatter()
X_train

Output:

| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|-------------------|------------------|-------------------|------------------|
| 6.4               | 2.8              | 5.6               | 2.1              |
| 5.0               | 3.0              | 1.6               | 0.2              |
| 5.4               | 3.9              | 1.7               | 0.4              |
| ...               | ...              | ...               | ...              |

For classification, we'll use the SVM classifier from scikit-learn.

Importing and fitting the model

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svc = SVC(kernel='linear', probability=True)
svc.fit(X_train, Y_train)
y_pred = svc.predict(X_test)
accuracy_score(Y_test, y_pred)

Output:

1.0

Here, we obtained an accuracy of 100% on the test split. Now, let's use SHAP to explain and visualize the predictions on the test set.

Explaining Predictions with SHAP

To explain the predictions using SHAP, we need an explainer.

explainer = shap.KernelExplainer(svc.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)
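Before plotting, it can help to sanity-check the additivity property. A minimal sketch, assuming the list-per-class format returned by KernelExplainer here (the same format the force plot below relies on): for each class, the base value plus the row-wise sum of SHAP values should reproduce the predicted probability.

import numpy as np

# For each class: base value + sum of SHAP values ≈ predict_proba
proba = svc.predict_proba(X_test)
for k in range(len(shap_values)):
    reconstructed = explainer.expected_value[k] + shap_values[k].sum(axis=1)
    print("class", k, "max abs diff:", np.abs(reconstructed - proba[:, k]).max())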

Plotting the prediction

shap.initjs()
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test)

Output:

(Figure: interactive SHAP force plot for the test-set predictions.)

The force plot shows how each feature pushes the model's output for a given instance away from the base value: features shown in red push the prediction higher, while features shown in blue push it lower.
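The same call can also be narrowed to a single prediction. A minimal sketch, arbitrarily picking the first test row as the instance of interest:

# Force plot for a single test instance (class 0)
shap.force_plot(explainer.expected_value[0], shap_values[0][0, :], X_test.iloc[0, :])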

The example above shows a general application of SHAP to a fitted model. Now, let's take a closer look at Shapley values themselves and how they explain a model.

Explaining Models with Shapley Values

In this section, we will explore how Shapley values can make machine learning models more explainable. To illustrate this, we will fit a simple linear regression model to the Iris data we loaded earlier.

Let’s begin by training the model on the loaded data.

import sklearn.linear_model

model = sklearn.linear_model.LinearRegression()
model.fit(X, y)

Output:

LinearRegression()

Examining the Model Coefficients

A common way to explain a linear model is to examine the coefficients it learns for each feature. A coefficient tells us how much the model's output changes when the corresponding feature changes by one unit, with the other features held fixed.

# Print the learned coefficient for each feature
for name, coef in zip(X.columns, model.coef_):
    print(name, "=", round(coef, 4))

This is the traditional way of inspecting a linear model: the coefficients tell us how strongly the output responds to changes in each feature.

Having SHAP at our disposal allows us to create a clearer picture using partial dependence plots.

Partial Dependence Plots

The importance of a feature can be gauged by looking at its impact on the model's output and at how its values are distributed in the data. Plotting the model's dependence on a feature and that feature's distribution in a single graph gives us both views at once. Let's see how to do this with SHAP, using a small background sample of the data.

# Draw a background sample of 50 rows to keep the plot fast
X50 = shap.utils.sample(X, 50)

shap.plots.partial_dependence(
    "petal length (cm)", model.predict, X50, ice=False,
    model_expected_value=True, feature_expected_value=True
)

Output:

(Figure: partial dependence plot for petal length, with the expected-value reference lines.)

In the resulting plot, the gray histogram along the x-axis shows the distribution of the feature, and the blue line shows the average model output as the feature value changes. The line passes through the intersection of the two expected-value reference lines: the expected feature value on the x-axis and the expected model output on the y-axis.
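If you want the numbers behind those two reference lines, both expectations can be computed directly from the background sample. A quick sketch using the X50 sample defined above; the plot's lines should sit roughly at these values:

# Expected model output and expected feature value over the background sample
print("E[f(X)]:", model.predict(X50).mean())
print("E[petal length (cm)]:", X50["petal length (cm)"].mean())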

This plot also shows what SHAP values are: Shapley values applied to a conditional expectation function of the model.

To compute them, we need a background distribution. We can reuse the 50-instance sample drawn above as that background, build an explainer from it, and evaluate the SHAP values for every row of the data.

explainer = shap.Explainer(model.predict, X50)
shap_values = explainer(X)

Partial dependence plot

sample_ind = 18
shap.partial_dependence_plot(
    "petal length (cm)", model.predict, X50, model_expected_value=True,
    feature_expected_value=True, ice=False,
    shap_values=shap_values[sample_ind:sample_ind+1, :]
)

Output:

(Figure: partial dependence plot for petal length with the SHAP value of instance 18 overlaid.)

The SHAP value for this instance lines up closely with the partial dependence curve: for a linear model, the SHAP value of a feature is essentially a mean-centered version of its partial dependence, measured relative to the background sample.
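Because the model is linear and the explainer uses an independent background distribution, each SHAP value should come out close to the coefficient times the feature's deviation from its background mean. A quick check, a sketch assuming the Explanation object supports the indexing used below (as in recent shap releases):

# For a linear model: SHAP value ≈ coef * (x_i - mean of x_i over the background)
feature = "petal length (cm)"
j = list(X.columns).index(feature)
manual = model.coef_[j] * (X.iloc[sample_ind][feature] - X50[feature].mean())
print("manual estimate:", manual)
print("SHAP value:", shap_values[sample_ind, feature].values)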

Now, let’s examine the distribution of the SHAP values.

shap.plots.scatter(shap_values[:, "petal length (cm)"])

Output:

(Figure: scatter plot of the SHAP values for petal length.)

The scatter plot shows how the SHAP value changes with the feature's value. For this linear model it traces a straight line, and the spread of the points mirrors the distribution of petal length.
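The scatter plot also takes an optional color argument; passing the whole Explanation object lets shap pick the feature with the strongest apparent interaction for the coloring (for a purely linear model this mostly adds context rather than revealing interactions):

# Color the points by another feature chosen automatically by shap
shap.plots.scatter(shap_values[:, "petal length (cm)"], color=shap_values)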

Waterfall Plot

The SHAP values of all input features sum to the difference between the model's output for an instance and the expected (baseline) model output. The waterfall plot illustrates exactly this decomposition for a single prediction.

shap.plots.waterfall(shap_values[sample_ind])

Output:

(Figure: waterfall plot for instance 18.)

By examining the waterfall plot, we can see, feature by feature, how the prediction moves from the base value to the model's final output.
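The same arithmetic the waterfall plot visualizes can be checked by hand on the Explanation object; a minimal sketch:

# Base value plus the sum of SHAP values should match the model output for this row
row = shap_values[sample_ind]
print("base value:", row.base_values)
print("base + SHAP sum:", row.base_values + row.values.sum())
print("model prediction:", model.predict(X.iloc[[sample_ind]])[0])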

Final Words

In this article, we explored SHAP (SHapley Additive exPlanations) and its ability to make machine learning models more explainable. We demonstrated how easily SHAP can be applied to fitted models to make their outputs interpretable, and how Shapley values themselves help explain what a model has learned.

By utilizing SHAP, data scientists and analysts can gain deeper insights into their models and effectively communicate the impact of different features on predictions. This tool empowers users to build more transparent and trustworthy machine learning models.