How to Use Permutation Importance in Python
Key Insights
- Permutation importance measures feature relevance by randomly shuffling feature values and observing model performance degradation—it works with any model type, unlike tree-based importance metrics
- The method captures feature interactions and provides more reliable importance scores than default feature importances, but requires careful interpretation when features are correlated
- Use multiple permutations (n_repeats=10+) to get stable estimates with confidence intervals, and consider grouped permutation for correlated features to avoid misleading results
Introduction to Permutation Importance
Permutation importance answers a straightforward question: how much does model performance suffer when a feature contains random noise instead of real data? By shuffling a feature’s values and measuring the resulting performance drop, you get a model-agnostic measure of feature importance that works with any scikit-learn compatible estimator.
Unlike tree-based feature importance (which counts how often a feature is used for splits), permutation importance directly measures predictive power. Tree-based methods can be misleading—they favor high-cardinality features and don’t account for feature interactions properly. Permutation importance sidesteps these issues by evaluating the trained model’s actual predictions.
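The high-cardinality bias is easy to demonstrate. The sketch below (the appended `noise` column and variable names are illustrative, not from any standard example) adds a purely random feature to the wine data: impurity-based importance still credits it with a nonzero share, while permutation importance on held-out data stays near zero.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Append a purely random column to the wine features
X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.random((X.shape[0], 1))])

X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importance gives the noise column a nonzero share,
# because it is computed from training-set splits
print("impurity importance of noise:", rf.feature_importances_[-1])

# Permutation importance on held-out data stays near zero for noise
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
print("permutation importance of noise:", perm.importances_mean[-1])
```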
The algorithm is simple: take a trained model, shuffle one feature’s values in the validation set, make predictions, and measure performance degradation. Repeat for all features. Features that cause large performance drops when shuffled are important; those that cause minimal impact aren’t.
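The loop below is a minimal from-scratch sketch of that algorithm (the variable names are mine, not part of scikit-learn's API); in practice you would use `sklearn.inspection.permutation_importance`, shown in the next section.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_val, y_train, y_val = train_test_split(wine.data, wine.target, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

rng = np.random.default_rng(42)
baseline = model.score(X_val, y_val)  # accuracy on untouched validation data

drops = []
for j in range(X_val.shape[1]):
    X_shuffled = X_val.copy()
    # Shuffle column j, breaking its relationship to the target
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    drops.append(baseline - model.score(X_shuffled, y_val))

# Features with the largest drops matter most
for name, drop in sorted(zip(wine.feature_names, drops), key=lambda t: -t[1]):
    print(f"{name:30s} {drop:+.4f}")
```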
When to Use Permutation Importance
Permutation importance shines in several scenarios. First, it’s model-agnostic—use it with neural networks, support vector machines, or any black-box model. Second, it captures feature interactions naturally since it evaluates the complete trained model. Third, it uses out-of-sample data, giving you a realistic view of feature importance for generalization.
The main advantages include reliability (based on actual model performance), interpretability (intuitive concept), and flexibility (works with any metric). However, be aware of limitations. Computational cost scales linearly with features and permutations. Correlated features create interpretation challenges—when features are correlated, shuffling one breaks the correlation structure, potentially overstating importance. For datasets with thousands of features, computation time becomes significant.
Use permutation importance when you need model-agnostic explanations, when comparing feature importance across different model types, or when tree-based importance seems unreliable. Avoid it as your only tool for highly correlated feature sets without additional analysis.
Basic Implementation with scikit-learn
Let’s start with a straightforward example using the wine quality dataset and a Random Forest classifier:
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
# Load data
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Compute permutation importance
perm_importance = permutation_importance(
    rf_model, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)
# Create results dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm_importance.importances_mean,
    'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)
print(importance_df)
The permutation_importance function returns an object with importances_mean (average performance drop across permutations), importances_std (standard deviation), and importances (raw scores from each permutation). Setting n_repeats=10 performs 10 permutations per feature, providing stable estimates. Use n_jobs=-1 to parallelize computation across CPU cores.
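The raw `importances` array lets you report an empirical interval instead of just mean ± std. A sketch (with only 10 repeats the percentile interval is coarse; use more repeats for tighter bounds):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

# result.importances has shape (n_features, n_repeats): one raw score per shuffle
lo, hi = np.percentile(result.importances, [2.5, 97.5], axis=1)
for name, mean, a, b in zip(wine.feature_names, result.importances_mean, lo, hi):
    print(f"{name:30s} {mean:+.4f}  [{a:+.4f}, {b:+.4f}]")
```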
Interpreting the Results
Permutation importance scores represent the decrease in model performance when a feature is shuffled. Higher scores indicate more important features. The standard deviation reveals consistency—low std means the feature’s importance is stable across permutations; high std suggests variability or potential interactions.
Visualize results to make interpretation easier:
import matplotlib.pyplot as plt
# Get top 10 features
top_features = importance_df.head(10)
# Create horizontal bar plot with error bars
fig, ax = plt.subplots(figsize=(10, 6))
y_pos = np.arange(len(top_features))
ax.barh(y_pos, top_features['importance_mean'],
        xerr=top_features['importance_std'],
        align='center', alpha=0.7, capsize=5)
ax.set_yticks(y_pos)
ax.set_yticklabels(top_features['feature'])
ax.invert_yaxis()
ax.set_xlabel('Permutation Importance (Mean ± Std)')
ax.set_title('Top 10 Feature Importances')
plt.tight_layout()
plt.show()
Look for features with high mean importance and low standard deviation; these are reliably important. Features with high std might interact with other features or have non-linear effects. Zero importance means the model doesn't rely on the feature for held-out predictions. Slightly negative scores mean the shuffled version happened to score marginally better; this is usually just noise around zero, though consistently negative scores can signal overfitting to that feature.
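A simple filter that follows this reading keeps only features whose mean importance clears zero by two standard deviations (the two-std threshold is a common convention, not a formal statistical test):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Keep features whose importance is clearly above zero
reliable = [
    name
    for name, mean, std in zip(
        wine.feature_names, result.importances_mean, result.importances_std
    )
    if mean - 2 * std > 0
]
print(reliable)
```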
Advanced Usage: Custom Scoring and Multiple Models
Compare permutation importance across models to understand which features different algorithms rely on:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, f1_score
# Custom scorer for multi-class F1
f1_scorer = make_scorer(f1_score, average='weighted')
# Train multiple models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}
importance_comparison = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    perm_imp = permutation_importance(
        model, X_test, y_test,
        n_repeats=15,
        random_state=42,
        scoring=f1_scorer,
        n_jobs=-1
    )
    importance_comparison[name] = pd.DataFrame({
        'feature': feature_names,
        'importance': perm_imp.importances_mean
    }).sort_values('importance', ascending=False)
# Compare top 5 features for each model
for name, imp_df in importance_comparison.items():
    print(f"\n{name} - Top 5 Features:")
    print(imp_df.head())
This reveals model-specific feature dependencies. Random Forests might rely on different features than Gradient Boosting, informing your feature engineering and model selection decisions.
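One way to put the two rankings side by side is a merged table with one importance column per model (the short model keys and column suffixes below are my choices, not a fixed convention):

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)

frames = {}
for key, model in {
    'rf': RandomForestClassifier(n_estimators=100, random_state=42),
    'gb': GradientBoostingClassifier(n_estimators=100, random_state=42),
}.items():
    model.fit(X_train, y_train)
    imp = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
    frames[key] = pd.DataFrame(
        {'feature': wine.feature_names, f'imp_{key}': imp.importances_mean}
    )

# One row per feature, one importance column per model
comparison = frames['rf'].merge(frames['gb'], on='feature')
print(comparison.sort_values('imp_rf', ascending=False).head())
```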
Handling Real-World Scenarios
For large datasets, computing permutation importance on all data is expensive. Sample strategically:
# For large datasets, sample test set
rng = np.random.default_rng(42)  # seed the sampler so results are reproducible
sample_size = min(1000, len(X_test))
sample_indices = rng.choice(len(X_test), sample_size, replace=False)
X_test_sample = X_test[sample_indices]
y_test_sample = y_test[sample_indices]
perm_importance = permutation_importance(
    rf_model, X_test_sample, y_test_sample,
    n_repeats=10,
    random_state=42
)
For correlated features, implement grouped permutation to shuffle related features together:
from sklearn.metrics import accuracy_score

def grouped_permutation_importance(model, X, y, feature_groups, n_repeats=10):
    """
    Compute permutation importance for groups of correlated features.

    feature_groups: dict mapping group names to lists of feature indices
    """
    baseline_score = accuracy_score(y, model.predict(X))
    group_importances = {}
    for group_name, feature_indices in feature_groups.items():
        scores = []
        for _ in range(n_repeats):
            X_permuted = X.copy()
            # Shuffle all features in the group with the same permutation,
            # preserving their within-group correlation structure
            permutation = np.random.permutation(len(X))
            X_permuted[:, feature_indices] = X[permutation][:, feature_indices]
            permuted_score = accuracy_score(y, model.predict(X_permuted))
            scores.append(baseline_score - permuted_score)
        group_importances[group_name] = {
            'mean': np.mean(scores),
            'std': np.std(scores)
        }
    return group_importances
# Example: group correlated features
feature_groups = {
    'alcohol_related': [0, 9],  # alcohol and color_intensity
    'acidity': [1, 2, 3]        # malic_acid, ash, alcalinity_of_ash
}
group_imp = grouped_permutation_importance(
    rf_model, X_test, y_test, feature_groups, n_repeats=10
)
print(group_imp)
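If you don't know the groups in advance, one common recipe (used in scikit-learn's multicollinear-features example) is to cluster features on their Spearman rank correlations. The distance threshold `t=1.0` below is an assumption you should tune for your data:

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import load_wine

X = load_wine().data

# Spearman correlation between features, symmetrized for numerical safety
corr = spearmanr(X).correlation
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1.0)

# Treat 1 - |corr| as a distance and cluster hierarchically
distance = squareform(1 - np.abs(corr), checks=False)
linkage = hierarchy.ward(distance)
cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion='distance')

# Collect feature indices by cluster id
feature_groups = {}
for idx, cid in enumerate(cluster_ids):
    feature_groups.setdefault(f'group_{cid}', []).append(idx)
print(feature_groups)
```

Each resulting group can then be passed straight to the grouped permutation function above.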
Best Practices and Takeaways
Set n_repeats based on your needs—10 is a reasonable default, but use 30+ for publication-quality results or when importance scores are close. Always use held-out test data, never training data, to avoid overfitting bias.
Combine permutation importance with other interpretability methods. Use SHAP for instance-level explanations and permutation importance for global feature ranking. Use partial dependence plots to understand how important features affect predictions.
Watch for common pitfalls: correlated features can have deflated importance scores when permuted individually; extrapolation issues occur when permutation creates unrealistic feature combinations; computational cost grows with features and repeats. For highly correlated features, use grouped permutation or correlation analysis first.
Permutation importance is most valuable when you need model-agnostic feature rankings, when comparing models, or when tree-based importance seems unreliable. It provides actionable insights for feature selection, model debugging, and stakeholder communication. The performance drop maps directly onto the metric you care about: with accuracy as the scorer, an importance of 0.05 means accuracy falls by five percentage points, on average, when that feature's values are shuffled.
Use permutation importance as part of your standard model evaluation workflow. It takes minutes to compute and provides insights that raw model performance metrics miss. Combined with domain knowledge, it guides feature engineering decisions and helps build more interpretable, maintainable models.