Model Interpretability Made Easy with SHAP in Python
Introduction
Machine learning models are often black boxes, making it hard to understand their predictions. SHAP (SHapley Additive exPlanations) helps explain individual predictions and feature importance using game theory.
What is SHAP?
SHAP assigns each feature a contribution value for a prediction, based on Shapley values from cooperative game theory. It works across models and provides both global (overall) and local (per prediction) explanations.
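Put more concretely, SHAP expresses each prediction as an additive decomposition:

f(x) = base value + φ_1 + φ_2 + … + φ_M

where the base value is the model's expected output over a background dataset and each φ_i is feature i's contribution, positive when the feature pushes the prediction up and negative when it pushes it down.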
Installing SHAP
SHAP can be installed from PyPI:

```bash
pip install shap
```
Data Description
We’ll use the California Housing dataset from sklearn.datasets, which contains housing and location data from the 1990 census. The target is the median house value, and the features include income, average rooms, house age, population, etc.
| Column | Description |
|---|---|
| MedInc | Median income in the district |
| HouseAge | Median age of the houses |
| AveRooms | Average number of rooms per household |
| AveBedrms | Average number of bedrooms per household |
| Population | Total population in the district |
| AveOccup | Average number of people per household |
| Latitude | Latitude coordinate of the district |
| Longitude | Longitude coordinate of the district |
| Target | Median house value (in $100,000s) |
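If you want to eyeball these columns before modeling, the dataset can also be loaded as a pandas DataFrame; the as_frame=True flag is a standard scikit-learn option, and the variable name and inspection calls below are just one quick way to look at the data:

```python
from sklearn.datasets import fetch_california_housing

# Load the dataset as a pandas DataFrame for a quick look
housing_df = fetch_california_housing(as_frame=True).frame
print(housing_df.head())      # first few rows: the eight features plus the MedHouseVal target
print(housing_df.describe())  # per-column summary statistics
```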
Train a Model for Explanation
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import xgboost as xgb
import shap

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost model
model = xgb.XGBRegressor()
model.fit(X_train, y_train)
```
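Before explaining the model, it is worth a quick sanity check that it actually fits the held-out data; the choice of R² here is just one reasonable metric, not something the rest of the workflow depends on:

```python
from sklearn.metrics import r2_score

# Quick sanity check of the trained model on the test split
preds = model.predict(X_test)
print(f"Test R^2: {r2_score(y_test, preds):.3f}")
```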
Apply SHAP Explainer
```python
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_test)
```
Explainer: The tool that calculates SHAP values by analyzing how each feature influences a model’s prediction.
shap_values: The output containing Shapley values, which explain the contribution of each feature to individual predictions.
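The result is a shap.Explanation object. One informal way to poke at it, and to confirm the additive decomposition described earlier, is to check that the base value plus the per-feature contributions reproduces the model's prediction; the printed check should come out True up to numerical tolerance (the tolerance below is an arbitrary choice):

```python
import numpy as np

print(shap_values.values.shape)     # (n_test_samples, n_features): one Shapley value per feature per sample
print(shap_values.base_values[:3])  # expected model output over the background data

# Additivity check: base value + sum of contributions ~= model prediction
reconstructed = shap_values.base_values + shap_values.values.sum(axis=1)
print(np.allclose(reconstructed, model.predict(X_test), atol=1e-3))
```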
SHAP Visualizations
Summary Plot
The shap.summary_plot function generates a SHAP summary plot, which provides a high-level overview of feature importance and impact.
```python
shap.summary_plot(shap_values, X_test, feature_names=housing.feature_names)
```
🔍 What it shows:
Each dot represents a prediction for one instance.
X-axis shows the SHAP value — how much that feature pushed the prediction higher or lower.
Color indicates the feature’s actual value (red = high, blue = low).
Features are sorted top-to-bottom by overall importance (mean absolute SHAP value).
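If you prefer SHAP's newer object-based plotting API, the equivalent beeswarm-style summary can be drawn directly from the Explanation object; note that this call takes its feature names from the Explanation itself, so they may show up as generic indices when the data was passed as a plain NumPy array:

```python
# Same beeswarm-style summary via the newer plots API
shap.plots.beeswarm(shap_values)
```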
Bar Plot
The shap.plots.bar function creates a bar plot showing the average absolute SHAP value for each feature.
```python
shap.plots.bar(shap_values)
```
🔍 What it shows:
Each bar represents a feature.
The length of the bar shows how much that feature contributes to predictions on average (i.e., feature importance).
Features are sorted from most to least important.
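To connect the bar plot back to raw numbers, the same global importance scores can be computed by hand as the mean absolute SHAP value per feature; this is a small sketch reusing the variables defined earlier:

```python
import numpy as np

# Global importance = mean |SHAP value| per feature, the quantity behind the bar plot
importance = np.abs(shap_values.values).mean(axis=0)
ranking = sorted(zip(housing.feature_names, importance), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name:>10}: {score:.3f}")
```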
Pros and Cons
Pros:
Works with any ML model
Provides both local and global explanations
Offers intuitive and rich visualizations
Cons:
Can be slow for large datasets (especially KernelExplainer); a simple mitigation is sketched after this list
Interpretation still needs domain knowledge
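When explanation is slow, one common workaround is to hand the explainer a subsampled background dataset instead of all of X_train. The shap.sample utility ships with SHAP for exactly this purpose; the sample size of 100 below is just an illustrative choice:

```python
# Use a small background sample to speed up the explainer
background = shap.sample(X_train, 100, random_state=42)
explainer_fast = shap.Explainer(model, background)
shap_values_fast = explainer_fast(X_test)
```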
Conclusion
SHAP provides a transparent way to understand machine learning models by attributing the contribution of each feature to the prediction. It enhances model interpretability, builds trust, and helps identify important features, making it an essential tool for ethical and responsible AI deployment.