How to Calculate Feature Importance in Python
Key Insights
- Tree-based models provide built-in feature importance, but permutation importance offers a more reliable, model-agnostic alternative that works across all algorithms
- SHAP values deliver both local and global interpretability by calculating each feature’s contribution to individual predictions, making them ideal for stakeholder communication
- Always scale features before interpreting linear model coefficients, and be cautious with correlated features as they can split importance arbitrarily across related variables
Introduction to Feature Importance
Feature importance tells you which input variables have the most influence on your model’s predictions. This matters for three critical reasons: you can identify which features to focus on during feature engineering, reduce dimensionality by removing low-importance features, and explain model behavior to stakeholders who need to trust your predictions.
The methods fall into two categories. Model-specific approaches like tree-based importance or linear coefficients are fast and built into the algorithms. Model-agnostic methods like permutation importance and SHAP values work with any model but require more computation. Choose based on your needs: use built-in methods for quick iteration, switch to model-agnostic approaches when you need reliability or are comparing multiple models.
Tree-Based Model Feature Importance
Random Forests and Gradient Boosting models calculate feature importance during training. The default method measures how much each feature decreases impurity (Gini importance for classification, variance for regression) across all trees. Features that create cleaner splits get higher scores.
The problem with Gini importance: it’s biased toward high-cardinality features and can inflate the importance of features that aren’t actually predictive. Use it for quick exploration, but verify with permutation importance before making decisions.
```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

# Load dataset
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Extract feature importance
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

# Visualize
plt.figure(figsize=(10, 6))
plt.title("Feature Importance - Random Forest")
plt.bar(range(X_train.shape[1]), importances[indices])
plt.xticks(range(X_train.shape[1]),
           [wine.feature_names[i] for i in indices],
           rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Print top features
print("Top 5 Features:")
for i in range(5):
    print(f"{wine.feature_names[indices[i]]}: {importances[indices[i]]:.4f}")
```
This gives you immediate feedback on which features the model considers most valuable. For the wine dataset, you’ll typically see features like flavanoids and color intensity ranking high.
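To see the bias from the previous section in practice, here's a small sketch that appends a pure-noise column to the wine data: Gini importance still awards it a nonzero score, while permutation importance on held-out data scores it near zero.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Append a pure-noise feature that carries no signal about the target
rng = np.random.RandomState(42)
wine = load_wine()
X_noisy = np.column_stack([wine.data, rng.normal(size=len(wine.data))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, wine.target, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)

# Gini importance assigns the noise column a nonzero score...
gini_noise = rf.feature_importances_[-1]

# ...while permutation importance on held-out data stays near zero
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm_noise = perm.importances_mean[-1]

print(f"Gini importance of noise feature:        {gini_noise:.4f}")
print(f"Permutation importance of noise feature: {perm_noise:.4f}")
```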
Permutation Feature Importance
Permutation importance works by shuffling each feature’s values and measuring how much the model’s performance drops. If accuracy plummets when you scramble a feature, that feature was important. If nothing changes, the model doesn’t rely on it.
This approach is more reliable than Gini importance because it measures actual predictive power on held-out data. It works with any model—neural networks, SVMs, whatever you’re using. The downside is computational cost: you need to re-evaluate the model multiple times for each feature.
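To make the mechanism concrete, here's a minimal hand-rolled version of the idea: shuffle one column of the held-out data, re-score the model, and record the accuracy drop. It retrains a Random Forest on the wine data so the snippet stands alone; in practice you'd use scikit-learn's `permutation_importance`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

def manual_permutation_importance(model, X, y, n_repeats=10, seed=42):
    """Mean accuracy drop after shuffling each column, over n_repeats."""
    rng = np.random.RandomState(seed)
    baseline = model.score(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])  # break the feature-target link
            drops[j] += baseline - model.score(X_perm, y)
    return drops / n_repeats

imp = manual_permutation_importance(model, X_test, y_test)
best = wine.feature_names[int(np.argmax(imp))]
print("Most important feature:", best)
```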
```python
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Train multiple models
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
xgb = XGBClassifier(random_state=42, eval_metric='logloss')
xgb.fit(X_train, y_train)

# Calculate permutation importance for Random Forest
perm_importance_rf = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42
)

# Calculate for Logistic Regression
perm_importance_lr = permutation_importance(
    lr, X_test, y_test, n_repeats=10, random_state=42
)

# Compare results
print("\nPermutation Importance - Random Forest:")
for i in perm_importance_rf.importances_mean.argsort()[::-1][:5]:
    print(f"{wine.feature_names[i]}: "
          f"{perm_importance_rf.importances_mean[i]:.4f} "
          f"+/- {perm_importance_rf.importances_std[i]:.4f}")

print("\nPermutation Importance - Logistic Regression:")
for i in perm_importance_lr.importances_mean.argsort()[::-1][:5]:
    print(f"{wine.feature_names[i]}: "
          f"{perm_importance_lr.importances_mean[i]:.4f} "
          f"+/- {perm_importance_lr.importances_std[i]:.4f}")
```
The standard deviation tells you how stable the importance estimate is. High variance means the feature’s contribution is inconsistent across different permutations.
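One way to act on that stability information is a simple screen: keep only features whose mean importance exceeds twice its standard deviation, a rule of thumb that also appears in scikit-learn's documentation. A self-contained sketch:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
perm = permutation_importance(rf, X_test, y_test, n_repeats=10,
                              random_state=42)

# Keep only features whose importance is clearly above its own noise level
reliable = [
    wine.feature_names[i]
    for i in range(len(wine.feature_names))
    if perm.importances_mean[i] - 2 * perm.importances_std[i] > 0
]
print("Reliably important features:", reliable)
```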
SHAP (SHapley Additive exPlanations)
SHAP values come from game theory. They calculate each feature’s contribution to moving the prediction from a baseline (typically the mean prediction) to the actual prediction. This gives you both local explanations (why did the model predict this specific instance?) and global importance (which features matter most overall?).
SHAP is the gold standard for model interpretability. It’s theoretically sound, provides detailed insights, and generates compelling visualizations for non-technical audiences. The tradeoff is computation time, especially for large datasets.
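Before reaching for the library, the Shapley idea itself fits in a few lines: average each player's marginal contribution over every possible ordering. This toy sketch uses a hypothetical three-feature payoff function with an interaction bonus, not a real estimator; shap approximates the same computation efficiently for actual models.

```python
from itertools import permutations

# Hypothetical payoffs: each feature's solo value, plus a bonus when
# 'a' and 'b' appear together (an interaction effect)
def value(coalition):
    solo = {'a': 10, 'b': 20, 'c': 5}
    total = sum(solo[f] for f in coalition)
    if 'a' in coalition and 'b' in coalition:
        total += 6
    return total

features = ['a', 'b', 'c']
shapley = {f: 0.0 for f in features}

# Average each feature's marginal contribution over all 3! orderings
orderings = list(permutations(features))
for order in orderings:
    seen = []
    for f in order:
        before = value(tuple(seen))
        seen.append(f)
        shapley[f] += (value(tuple(seen)) - before) / len(orderings)

print(shapley)  # the interaction bonus is split evenly between a and b
```

Note the "efficiency" property: the three Shapley values sum exactly to the full coalition's value, which is why SHAP values for a real model sum to the gap between the baseline and the actual prediction.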
```python
import shap

# Create SHAP explainer for tree-based model
explainer_rf = shap.TreeExplainer(rf)
shap_values_rf = explainer_rf.shap_values(X_test)

# For multiclass models, older shap versions return a list of per-class
# arrays, while newer versions return one (n_samples, n_features, n_classes)
# array. Normalize to the list form so the class indexing below works
# either way.
if not isinstance(shap_values_rf, list):
    shap_values_rf = [shap_values_rf[:, :, k]
                      for k in range(shap_values_rf.shape[2])]

# Summary plot (global importance) for class 1
shap.summary_plot(shap_values_rf[1], X_test,
                  feature_names=wine.feature_names,
                  show=False)
plt.tight_layout()
plt.show()

# Force plot for a single prediction (local explanation)
shap.initjs()
shap.force_plot(
    explainer_rf.expected_value[1],
    shap_values_rf[1][0],
    X_test[0],
    feature_names=wine.feature_names
)

# Bar plot of mean absolute SHAP values
shap.summary_plot(shap_values_rf[1], X_test,
                  feature_names=wine.feature_names,
                  plot_type="bar", show=False)
plt.tight_layout()
plt.show()
```
The summary plot encodes three things at once: features are ranked by importance from top to bottom, each point’s horizontal position shows its SHAP value (the effect on that prediction), and color indicates the feature’s value. You can see not just which features matter, but how they influence predictions—whether high values of a feature push predictions up or down.
Coefficient-Based Importance for Linear Models
Linear and logistic regression coefficients directly represent feature importance—with a critical caveat. Coefficients are only comparable when features are on the same scale. Always standardize your features before interpreting coefficients.
The absolute value of a coefficient tells you how much a one-standard-deviation change in that feature affects the prediction. For logistic regression, coefficients represent log-odds, but the relative magnitudes still indicate importance.
```python
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression on scaled data
lr_scaled = LogisticRegression(max_iter=1000, random_state=42)
lr_scaled.fit(X_train_scaled, y_train)

# Extract coefficients (for multiclass, average magnitudes across classes)
if len(lr_scaled.coef_.shape) > 1:
    coef_importance = np.abs(lr_scaled.coef_).mean(axis=0)
else:
    coef_importance = np.abs(lr_scaled.coef_[0])

# Sort and visualize
coef_indices = np.argsort(coef_importance)[::-1]
plt.figure(figsize=(10, 6))
plt.title("Feature Importance - Logistic Regression Coefficients")
plt.bar(range(len(coef_importance)), coef_importance[coef_indices])
plt.xticks(range(len(coef_importance)),
           [wine.feature_names[i] for i in coef_indices],
           rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("\nTop 5 Features by Coefficient Magnitude:")
for i in range(5):
    print(f"{wine.feature_names[coef_indices[i]]}: "
          f"{coef_importance[coef_indices[i]]:.4f}")
```
This approach is fast and interpretable. For linear models, it’s often sufficient. Just remember: coefficients assume linear relationships and independence between features.
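Following up on the log-odds point: exponentiating a standardized coefficient gives the multiplicative change in odds per one-standard-deviation increase in the feature. A sketch on a binary slice of the wine data (restricting to two classes is an illustrative choice so each feature has a single coefficient):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

wine = load_wine()
# Keep only classes 0 and 1 so coef_ has a single row
mask = wine.target < 2
X, y = wine.data[mask], wine.target[mask]

X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_scaled, y)

# exp(coef) = multiplicative change in odds per 1-std-dev increase
odds_ratios = np.exp(clf.coef_[0])
top3 = sorted(zip(wine.feature_names, odds_ratios),
              key=lambda t: abs(np.log(t[1])), reverse=True)[:3]
for name, oratio in top3:
    print(f"{name}: odds ratio {oratio:.2f} per 1 SD")
```

An odds ratio of 2.0 means a one-standard-deviation increase doubles the odds of class 1; a ratio of 0.5 halves them. Ratios far from 1 in either direction indicate important features.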
Comparing and Visualizing Results
Different methods highlight different aspects of feature importance. Tree-based importance reflects what the model learned during training. Permutation importance shows predictive power. SHAP values reveal contribution to specific predictions. Compare multiple methods to get a complete picture.
```python
import pandas as pd

# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Feature': wine.feature_names,
    'RF_Gini': rf.feature_importances_,
    'RF_Permutation': perm_importance_rf.importances_mean,
    'LR_Permutation': perm_importance_lr.importances_mean,
    'LR_Coefficient': coef_importance,
    'SHAP': np.abs(shap_values_rf[1]).mean(axis=0)
})

# Normalize each method to a 0-1 scale for comparison
for col in comparison_df.columns[1:]:
    comparison_df[col] = (comparison_df[col] - comparison_df[col].min()) / \
                         (comparison_df[col].max() - comparison_df[col].min())

# Plot comparison; sort once by a reference method so every panel shows
# the same feature order (with sharey=True, re-sorting per panel would
# not actually reorder the shared categorical axis)
sorted_df = comparison_df.sort_values('RF_Permutation', ascending=True)
fig, axes = plt.subplots(1, 5, figsize=(20, 4), sharey=True)
methods = ['RF_Gini', 'RF_Permutation', 'LR_Permutation',
           'LR_Coefficient', 'SHAP']
for idx, method in enumerate(methods):
    axes[idx].barh(sorted_df['Feature'], sorted_df[method])
    axes[idx].set_title(method.replace('_', ' '))
    axes[idx].set_xlabel('Normalized Importance')
axes[0].set_ylabel('Features')
plt.tight_layout()
plt.show()
```
Look for consensus across methods. Features that rank high across multiple approaches are genuinely important. Discrepancies reveal interesting model behavior worth investigating.
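Agreement can also be quantified rather than eyeballed: a Spearman rank correlation between two importance vectors measures how similarly they order the features. A self-contained sketch comparing Gini and permutation importance for the same Random Forest:

```python
from scipy.stats import spearmanr
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
perm = permutation_importance(rf, X_test, y_test, n_repeats=10,
                              random_state=42)

# Spearman correlation compares the rankings, not the raw magnitudes,
# so differently-scaled importance measures can be compared directly
rho, pval = spearmanr(rf.feature_importances_, perm.importances_mean)
print(f"Rank agreement between Gini and permutation importance: {rho:.2f}")
```

A rho near 1 means the two methods agree on the ordering; values near 0 signal the kind of discrepancy worth investigating.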
Practical Considerations and Best Practices
Correlated features split importance between them. If you have two highly correlated features, each might show moderate importance even though together they’re critical. Use correlation matrices to identify these situations, and consider grouping correlated features in your analysis.
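The splitting effect is easy to demonstrate by making the correlation extreme: append an exact copy of the strongest feature and retrain. The sketch below shows the original's score dropping as credit is shared with the duplicate.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(wine.data, wine.target)
top = int(np.argmax(rf.feature_importances_))
solo_importance = rf.feature_importances_[top]

# Append an exact copy of the strongest feature and retrain
X_dup = np.column_stack([wine.data, wine.data[:, top]])
rf_dup = RandomForestClassifier(n_estimators=200, random_state=42)
rf_dup.fit(X_dup, wine.target)

# The credit is now split between the original and its copy
orig_share = rf_dup.feature_importances_[top]
copy_share = rf_dup.feature_importances_[-1]
print(f"{wine.feature_names[top]} alone: {solo_importance:.3f}")
print(f"after duplication: original {orig_share:.3f} "
      f"+ copy {copy_share:.3f}")
```

Neither column is any less predictive after duplication; the importance just gets divided. The same dilution happens, more subtly, with merely correlated features.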
Choose your method based on context. For quick iteration during development, use built-in tree importance. For model comparison or critical decisions, use permutation importance. For stakeholder presentations or debugging specific predictions, use SHAP. For simple linear models, coefficients are sufficient.
Watch out for these pitfalls: forgetting to scale features before interpreting linear coefficients, using Gini importance as your only measure, calculating importance on training data instead of test data, and ignoring the computational cost of SHAP on large datasets.
Computational costs vary dramatically. Tree-based importance is essentially free—it’s calculated during training. Permutation importance requires n_features × n_repeats model evaluations. SHAP can take minutes or hours depending on model complexity and dataset size. Budget accordingly.
The most important practice: use multiple methods and look for agreement. Feature importance isn’t a single number—it’s a multifaceted concept. Different methods capture different aspects. When they agree, you can be confident. When they disagree, you’ve learned something valuable about your model.