How to Perform Feature Selection in Python
Key Insights
- Filter methods like variance threshold and chi-square tests are computationally cheap and work well for initial feature reduction, but they ignore feature interactions and model-specific performance.
- Wrapper methods like Recursive Feature Elimination provide better accuracy by evaluating feature subsets with your actual model, but their cost grows rapidly with feature count because every candidate subset means retraining the model.
- Embedded methods like Lasso regression and tree-based feature importance offer the best balance—they’re faster than wrapper methods while considering feature interactions during model training.
Introduction to Feature Selection
Feature selection is the process of identifying and keeping only the most relevant features in your dataset while discarding redundant or irrelevant ones. It’s not just about reducing dimensionality—it’s about building better models.
The benefits are concrete: models train faster, generalize better to unseen data, and become easier to interpret. A model with 10 well-chosen features will often outperform one with 100 noisy features. Feature selection also reduces overfitting, which is especially important when training data is limited.
There are three main approaches. Filter methods use statistical measures to score features independently of any machine learning model. Wrapper methods evaluate feature subsets by actually training models and measuring performance. Embedded methods perform feature selection as part of the model training process itself. Each has distinct trade-offs between computational cost and effectiveness.
Preparing Your Dataset
Let’s work with a real dataset throughout this article. We’ll use scikit-learn’s breast cancer dataset—it’s sufficiently complex with 30 features and provides a realistic classification problem.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
print(f"Dataset shape: {X.shape}")
print(f"\nFirst few features:\n{X.head()}")
print(f"\nBasic statistics:\n{X.describe()}")
# Check for missing values
print(f"\nMissing values: {X.isnull().sum().sum()}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Scale features (important for many algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Understanding feature correlations is crucial before selection. Highly correlated features often provide redundant information:
# Correlation matrix
correlation_matrix = X.corr()
# Plot heatmap for subset of features
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
# Find highly correlated pairs
high_corr = np.where(np.abs(correlation_matrix) > 0.8)
high_corr_pairs = [(correlation_matrix.index[x], correlation_matrix.columns[y],
                    correlation_matrix.iloc[x, y])
                   for x, y in zip(*high_corr) if x < y]
print(f"\nHighly correlated pairs (>0.8):\n{high_corr_pairs[:5]}")
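Once correlated pairs are identified, a common next step is to drop one feature from each pair above the threshold. The sketch below keeps the first feature of each pair, which is a simplifying heuristic rather than the only option (domain knowledge may favor the other member of a pair); it is self-contained so the dataset is reloaded here:

```python
# Drop one feature from each highly correlated pair (>0.8).
# Heuristic: keep the first-seen feature, drop the later one.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

corr = X.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
X_reduced = X.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} features; {X_reduced.shape[1]} remain")
```

On this dataset, size-related features (radius, perimeter, area) are nearly collinear, so several of them get dropped.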
Filter Methods
Filter methods are your first line of defense. They’re fast, model-agnostic, and perfect for initial feature reduction on large datasets.
Variance Threshold removes features with low variance—features that don’t vary much probably won’t help your model distinguish between classes:
from sklearn.feature_selection import VarianceThreshold
# Remove features with variance below threshold.
# Note: apply this to the unscaled data. After StandardScaler, every
# feature has unit variance, so a threshold on scaled data removes nothing.
selector = VarianceThreshold(threshold=0.1)
X_high_variance = selector.fit_transform(X_train)
print(f"Original features: {X_train.shape[1]}")
print(f"After variance threshold: {X_high_variance.shape[1]}")
print(f"Removed features: {np.sum(~selector.get_support())}")
SelectKBest with chi-square test works well for classification with non-negative features:
from sklearn.feature_selection import SelectKBest, chi2, f_classif
# Chi-square requires non-negative features
X_train_positive = X_train - X_train.min() + 1
# Select top 10 features
selector = SelectKBest(chi2, k=10)
X_train_chi2 = selector.fit_transform(X_train_positive, y_train)
# Get selected feature names
selected_features = X.columns[selector.get_support()].tolist()
print(f"Top 10 features by chi-square: {selected_features}")
# Show scores
scores = pd.DataFrame({
    'feature': X.columns,
    'score': selector.scores_
}).sort_values('score', ascending=False)
print(scores.head(10))
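The `f_classif` score imported alongside `chi2` is worth a quick aside: the ANOVA F-test accepts negative values, so unlike chi-square it can run directly on standardized features with no shifting. A self-contained sketch (reloading the dataset rather than reusing the earlier split):

```python
# ANOVA F-test works on standardized (possibly negative) features,
# so no non-negativity shift is needed, unlike chi2.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

selector = SelectKBest(f_classif, k=10).fit(X_scaled, data.target)
top = data.feature_names[selector.get_support()]
print(f"Top 10 features by ANOVA F-test: {list(top)}")
```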
Mutual Information captures non-linear relationships better than correlation:
from sklearn.feature_selection import mutual_info_classif
# Calculate mutual information
mi_scores = mutual_info_classif(X_train_scaled, y_train, random_state=42)
# Create dataframe and sort
mi_df = pd.DataFrame({
    'feature': X.columns,
    'mi_score': mi_scores
}).sort_values('mi_score', ascending=False)
print(mi_df.head(10))
# Visualize
plt.figure(figsize=(10, 6))
plt.barh(mi_df['feature'][:15], mi_df['mi_score'][:15])
plt.xlabel('Mutual Information Score')
plt.title('Top 15 Features by Mutual Information')
plt.tight_layout()
plt.show()
Wrapper Methods
Wrapper methods are more sophisticated—they evaluate feature subsets by training actual models. Recursive Feature Elimination (RFE) is the most practical wrapper method.
RFE works by recursively training models, ranking features by importance, and eliminating the least important ones:
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# RFE with fixed number of features
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=rf_model, n_features_to_select=10)
X_train_rfe = rfe.fit_transform(X_train_scaled, y_train)
X_test_rfe = rfe.transform(X_test_scaled)
# Get selected features
selected_features_rfe = X.columns[rfe.get_support()].tolist()
print(f"Features selected by RFE: {selected_features_rfe}")
# Train and evaluate
rf_model.fit(X_train_rfe, y_train)
y_pred = rf_model.predict(X_test_rfe)
print(f"Accuracy with RFE features: {accuracy_score(y_test, y_pred):.4f}")
RFECV automatically finds the optimal number of features using cross-validation:
from sklearn.model_selection import StratifiedKFold
# RFE with cross-validation
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,
    cv=StratifiedKFold(5),
    scoring='accuracy',
    n_jobs=-1
)
rfecv.fit(X_train_scaled, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
# Plot number of features vs. cross-validation scores
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
         rfecv.cv_results_['mean_test_score'])
plt.xlabel('Number of Features')
plt.ylabel('Cross-Validation Score')
plt.title('RFECV: Optimal Feature Count')
plt.grid(True)
plt.show()
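After fitting, the `support_` mask recovers which features RFECV kept. The self-contained sketch below substitutes a lighter LogisticRegression estimator so it runs in seconds; this swap is an assumption for speed, not the article's RandomForest setup, and the selected features may differ accordingly:

```python
# Recover selected feature names from a fitted RFECV.
# A fast LogisticRegression stands in for the heavier RandomForest here.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=5000, random_state=42),
    step=1,
    cv=5,
)
rfecv.fit(X_scaled, data.target)
selected = data.feature_names[rfecv.support_]
print(f"{rfecv.n_features_} features kept: {list(selected)}")
```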
Embedded Methods
Embedded methods integrate feature selection into model training. They’re faster than wrapper methods while still considering feature interactions.
Tree-based feature importance is straightforward and effective:
# Train Random Forest and extract feature importance
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_scaled, y_train)
# Create feature importance dataframe
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print(feature_importance.head(10))
# Visualize
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:15],
         feature_importance['importance'][:15])
plt.xlabel('Importance')
plt.title('Top 15 Features by Random Forest Importance')
plt.tight_layout()
plt.show()
# Select features above threshold
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(rf, threshold='median')
sfm.fit(X_train_scaled, y_train)
X_train_important = sfm.transform(X_train_scaled)
print(f"Features selected: {X_train_important.shape[1]}")
L1 regularization (the penalty behind Lasso) drives irrelevant feature coefficients to exactly zero; for classification, we apply it through LogisticRegression:
from sklearn.linear_model import LogisticRegression
# Lasso regularization
lasso = LogisticRegression(penalty='l1', C=0.1, solver='liblinear',
                           random_state=42)
lasso.fit(X_train_scaled, y_train)
# Get coefficients
lasso_coef = pd.DataFrame({
    'feature': X.columns,
    'coefficient': np.abs(lasso.coef_[0])
}).sort_values('coefficient', ascending=False)
print(lasso_coef.head(10))
# Features with non-zero coefficients
non_zero_features = lasso_coef[lasso_coef['coefficient'] > 0]['feature'].tolist()
print(f"\nFeatures with non-zero coefficients: {len(non_zero_features)}")
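The regularization strength `C` directly controls how aggressive this selection is: smaller `C` means a stronger penalty and fewer surviving features. A self-contained sweep illustrates the effect (the specific `C` grid is just an example):

```python
# Sparsity of L1-regularized logistic regression as C varies:
# smaller C = stronger penalty = more coefficients driven to zero.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

counts = {}
for C in [0.01, 0.1, 1.0]:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                               random_state=42)
    model.fit(X_scaled, data.target)
    counts[C] = int(np.sum(model.coef_[0] != 0))
    print(f"C={C}: {counts[C]} non-zero coefficients")
```

This makes `C` a tuning knob for the accuracy/sparsity trade-off, best chosen by cross-validation.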
Comparing Methods and Best Practices
Different methods excel in different scenarios. Here’s a practical comparison:
import time
from sklearn.metrics import accuracy_score, precision_score, recall_score
# Fit each selector once, then transform train and test consistently.
# (VarianceThreshold goes on unscaled data: scaled features all have unit variance.)
vt = VarianceThreshold(threshold=0.1).fit(X_train)
skb = SelectKBest(f_classif, k=10).fit(X_train_scaled, y_train)
methods = {
    'All Features': (X_train_scaled, X_test_scaled),
    'Variance Threshold': (vt.transform(X_train), vt.transform(X_test)),
    'SelectKBest (k=10)': (skb.transform(X_train_scaled),
                           skb.transform(X_test_scaled)),
    'RFE (n=10)': (X_train_rfe, X_test_rfe),
    'Random Forest Importance': (X_train_important, sfm.transform(X_test_scaled))
}
results = []
for method_name, (X_tr, X_te) in methods.items():
    start_time = time.time()
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_tr, y_train)
    y_pred = model.predict(X_te)
    train_time = time.time() - start_time
    results.append({
        'Method': method_name,
        'Features': X_tr.shape[1],
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'Time (s)': train_time
    })
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
Best practices:
- Start with filter methods for initial exploration and reducing thousands of features to hundreds
- Use embedded methods for most production scenarios—they balance speed and accuracy
- Reserve wrapper methods for critical applications where you need maximum accuracy and can afford the computational cost
- Always validate your feature selection with cross-validation, not just train/test split
- Consider domain knowledge—sometimes a “less important” feature matters for interpretability or business rules
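The cross-validation advice above deserves a concrete sketch: wrapping the selector and model in a Pipeline means the selector is re-fit inside every fold, so no information from validation folds leaks into the selection step. A minimal self-contained example (the `k=10` choice here is illustrative):

```python
# Validate feature selection properly: put the selector inside a Pipeline
# so it is re-fit on each training fold, preventing selection leakage.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

data = load_breast_cancer()
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=10)),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])
scores = cross_val_score(pipe, data.data, data.target, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Selecting features on the full dataset before cross-validating would overstate performance; the Pipeline keeps the evaluation honest.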
Conclusion
Feature selection isn’t optional—it’s fundamental to building robust machine learning models. Start with filter methods to eliminate obvious noise, then apply embedded methods for your production pipeline. Use wrapper methods when accuracy is paramount and you have the computational budget.
The best approach depends on your constraints. Working with 10,000 features? Filter methods are essential. Need maximum accuracy on 50 features? Try RFECV. Building a production pipeline? Embedded methods with tree-based importance or Lasso offer the best balance.
Don’t treat feature selection as a one-time preprocessing step. Revisit it as you gather more data, and always validate your selections with proper cross-validation. The features that matter most can change as your dataset evolves.
For next steps, explore automated feature engineering with libraries like Featuretools, or build feature selection into your pipelines using scikit-learn’s Pipeline class for cleaner, more maintainable code.