How to Use Recursive Feature Elimination in Python
Key Insights
- RFE eliminates features iteratively by training models and removing the weakest performers, making it more robust than univariate selection methods that ignore feature interactions
- RFECV automatically determines the optimal number of features through cross-validation, eliminating the guesswork of manual feature count selection
- Always apply RFE within cross-validation folds to prevent data leakage—fitting RFE on your entire dataset before splitting will artificially inflate performance metrics
Introduction to Feature Selection and RFE
Feature selection is critical for building effective machine learning models. More features don’t always mean better predictions. High-dimensional datasets introduce the curse of dimensionality—as features increase, you need exponentially more data to maintain model performance. Extra features also increase training time, introduce noise, and make models harder to interpret.
Recursive Feature Elimination (RFE) addresses this by systematically identifying the most important features. Unlike filter methods that evaluate features independently using statistical tests, RFE uses a machine learning model to assess feature importance while considering feature interactions. It trains a model, ranks features by importance, eliminates the weakest ones, and repeats until reaching your target feature count.
This wrapper method is computationally expensive but typically delivers better results because it evaluates features in the context of your actual model, not through generic statistical measures.
Understanding the RFE Algorithm
RFE follows a straightforward recursive process:
1. Train your chosen estimator on all features
2. Extract feature importance scores (coefficients or feature importances)
3. Rank features based on these scores
4. Eliminate the lowest-ranked feature(s)
5. Repeat steps 1-4 until the desired number of features remains
The algorithm assumes your estimator provides feature importance—either through coef_ attributes (linear models) or feature_importances_ (tree-based models). This backward elimination strategy often outperforms forward selection because it starts with the full feature context.
Here’s a conceptual view of the process:
```python
# Pseudocode for RFE logic
def rfe(all_features, n_features_to_select):
    features = list(all_features)
    while len(features) > n_features_to_select:
        model.fit(X[features], y)
        importances = get_feature_importances(model)
        weakest_feature = features[argmin(importances)]
        features.remove(weakest_feature)
    return features
```
Each iteration refines your feature set, progressively removing noise while retaining signal.
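The pseudocode maps almost line-for-line onto a runnable sketch. This toy version (on a synthetic dataset, using the absolute values of a logistic regression's coef_ as the importance score) is for illustration only; in practice you would use scikit-learn's RFE class shown below:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def manual_rfe(X, y, n_features_to_select):
    """Toy backward elimination using |coef_| as the importance score."""
    features = list(range(X.shape[1]))
    model = LogisticRegression(max_iter=10000)
    while len(features) > n_features_to_select:
        model.fit(X[:, features], y)
        importances = np.abs(model.coef_[0])
        weakest = features[np.argmin(importances)]  # lowest |coefficient|
        features.remove(weakest)
    return features

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
print(manual_rfe(X, y, 3))  # indices of the 3 surviving columns
```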
Basic RFE Implementation with Scikit-learn
Let’s implement RFE on a real dataset. We’ll use the breast cancer dataset and logistic regression:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Initialize RFE with logistic regression
estimator = LogisticRegression(max_iter=10000, random_state=42)
rfe = RFE(estimator=estimator, n_features_to_select=10)

# Fit RFE
rfe.fit(X_train, y_train)

# Display selected features
selected_features = X.columns[rfe.support_]
print("Selected features:", selected_features.tolist())

# Show feature rankings (1 = selected, higher = eliminated earlier)
feature_ranking = pd.DataFrame({
    'feature': X.columns,
    'ranking': rfe.ranking_
}).sort_values('ranking')
print("\nFeature Rankings:")
print(feature_ranking.head(15))

# Evaluate performance
train_score = rfe.score(X_train, y_train)
test_score = rfe.score(X_test, y_test)
print(f"\nTrain accuracy: {train_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")
```
The support_ attribute returns a boolean mask of selected features, while ranking_ shows elimination order. Features ranked 1 were selected; higher numbers indicate earlier elimination.
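A fitted RFE object also acts as a transformer, so you can produce the reduced feature matrix directly rather than indexing columns by hand. A small self-contained sketch on the same dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y = data.data, data.target

rfe = RFE(LogisticRegression(max_iter=10000), n_features_to_select=10)
X_reduced = rfe.fit_transform(X, y)  # fit, then keep only selected columns

print(X.shape, '->', X_reduced.shape)        # (569, 30) -> (569, 10)
print(data.feature_names[rfe.support_][:3])  # names of surviving features
```

The same transform() call can then be applied to a test matrix so train and test sets stay column-aligned.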
This basic implementation requires you to specify n_features_to_select upfront. That’s where RFECV becomes valuable.
RFECV: RFE with Cross-Validation
RFECV extends RFE by using cross-validation to find the optimal feature count automatically. It performs RFE for different feature counts and selects the number that maximizes CV performance:
```python
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Initialize RFECV
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
rfecv = RFECV(
    estimator=estimator,
    step=1,              # Remove 1 feature per iteration
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

# Fit RFECV
rfecv.fit(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")

# Plot CV scores vs number of features
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
         rfecv.cv_results_['mean_test_score'])
plt.xlabel('Number of Features')
plt.ylabel('Cross-Validation Score')
plt.title('RFECV: Optimal Feature Count')
plt.grid(True)
plt.tight_layout()
plt.savefig('rfecv_scores.png', dpi=300, bbox_inches='tight')

# Compare with all features
full_model = RandomForestClassifier(n_estimators=100, random_state=42)
full_model.fit(X_train, y_train)
print(f"\nRFECV test accuracy: {rfecv.score(X_test, y_test):.3f}")
print(f"Full model test accuracy: {full_model.score(X_test, y_test):.3f}")
print(f"Features reduced from {X.shape[1]} to {rfecv.n_features_}")
```
The step parameter controls how many features to remove per iteration. Setting step=1 is thorough but slow; larger values speed up computation but might miss the optimal point. For datasets with hundreds of features, start with step=5 or step=10.
RFECV’s cv_results_ dictionary contains mean test scores for each feature count, letting you visualize the performance curve and identify diminishing returns.
RFE with Different Estimators
RFE’s effectiveness depends heavily on your estimator choice. Different models produce different feature rankings:
```python
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler

# Scale features (important for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define estimators
estimators = {
    'Logistic Regression': LogisticRegression(max_iter=10000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='linear', random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

results = {}
for name, estimator in estimators.items():
    # Use scaled data for SVM, original for tree-based
    X_tr = X_train_scaled if name == 'SVM' else X_train
    X_te = X_test_scaled if name == 'SVM' else X_test

    rfe = RFE(estimator=estimator, n_features_to_select=10)
    rfe.fit(X_tr, y_train)

    results[name] = {
        'features': X.columns[rfe.support_].tolist(),
        'test_score': rfe.score(X_te, y_test)
    }

    print(f"\n{name}:")
    print(f"Test Accuracy: {results[name]['test_score']:.3f}")
    print(f"Selected Features: {results[name]['features'][:5]}...")  # First 5

# Find consensus features
all_selected = [set(results[name]['features']) for name in results]
consensus = set.intersection(*all_selected)
print(f"\nConsensus features (selected by all models): {consensus}")
```
Linear models (Logistic Regression, SVM) and tree-based models (Random Forest, Gradient Boosting) often select different features. Linear models prefer features with strong individual correlations, while tree-based models can capture complex interactions.
For most applications, use Random Forest or Gradient Boosting with RFE. They handle non-linear relationships better and don’t require feature scaling. Reserve linear models for when interpretability is paramount.
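Estimators that expose neither coef_ nor feature_importances_ can still drive RFE through its importance_getter parameter, which accepts a dotted attribute path (or a callable). This is also how you point RFE at the final step of a pipeline estimator, as in this sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the estimator, so RFE must be told where to
# find the coefficients of the final pipeline step.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=10000))])

rfe = RFE(estimator=pipe, n_features_to_select=10,
          importance_getter='named_steps.clf.coef_')
rfe.fit(X, y)
print(rfe.support_.sum(), 'features selected')
```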
Real-World Application and Best Practices
Here’s a complete pipeline demonstrating proper RFE usage with cross-validation to prevent data leakage:
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Create pipeline - RFE inside ensures no data leakage
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rfe', RFE(estimator=RandomForestClassifier(random_state=42))),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Define parameter grid
param_grid = {
    'rfe__n_features_to_select': [5, 10, 15, 20],
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [5, 10, None]
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Evaluate on test set
y_pred = grid_search.predict(X_test)
print(f"\nTest accuracy: {grid_search.score(X_test, y_test):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Extract selected features from best model
best_rfe = grid_search.best_estimator_.named_steps['rfe']
selected_features = X.columns[best_rfe.support_]
print(f"\nFinal selected features ({len(selected_features)}):")
print(selected_features.tolist())
```
Critical best practices:
- Always use pipelines: Wrapping RFE in a pipeline ensures it’s refit during cross-validation, preventing data leakage
- Scale appropriately: Apply scaling before RFE for distance-based models (SVM, KNN) but not for tree-based models
- Start with RFECV: Let cross-validation determine the optimal feature count before manual tuning
- Monitor computation time: RFE is expensive—for datasets with 100+ features, consider increasing the step parameter or using feature importance thresholds first
- Validate feature stability: Run RFE multiple times with different random states to ensure selected features are stable, not artifacts of randomness
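The stability check from the last bullet can be sketched by repeating the selection over bootstrap resamples of the training data and counting how often each feature survives (bootstrap resampling and the run count here are illustrative choices):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

counts = Counter()
rng = np.random.default_rng(0)
n_runs = 3
for seed in range(n_runs):
    idx = rng.choice(len(y), size=len(y), replace=True)  # bootstrap resample
    rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=seed),
              n_features_to_select=10, step=5)  # step=5 keeps this quick
    rfe.fit(X[idx], y[idx])
    counts.update(names[rfe.support_])

# Features that survive (nearly) every run form the stable core
for feat, n in counts.most_common(5):
    print(f"{feat}: selected in {n}/{n_runs} runs")
```

Features that appear in only one or two runs are candidates to drop, since their selection depends on the particular sample rather than a consistent signal.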
RFE transforms feature selection from guesswork into a systematic, model-driven process. While computationally intensive, it delivers feature sets optimized for your specific model and data, leading to simpler, faster, and often more accurate predictions.