How to Perform Feature Selection in R
Key Insights
- Filter methods like correlation analysis are fast and model-agnostic but ignore feature interactions, while wrapper methods like RFE are computationally expensive but account for model-specific performance
- Lasso regression (L1 regularization) automatically zeros out irrelevant features during training, making it ideal for high-dimensional datasets where interpretability matters
- Tree-based importance scores from random forests or XGBoost provide robust feature rankings that capture non-linear relationships, often outperforming univariate statistical tests
Introduction to Feature Selection
Feature selection is the process of identifying and retaining only the most relevant variables for your predictive model. It’s not just about improving accuracy—though that’s often a benefit. Feature selection reduces overfitting by eliminating noise, decreases training time by working with fewer dimensions, and produces more interpretable models that are easier to deploy and maintain.
There are three main categories of feature selection methods. Filter methods use statistical measures to score features independently of any machine learning model. Wrapper methods evaluate subsets of features by training models and measuring performance. Embedded methods perform feature selection as part of the model training process itself. Each approach has distinct trade-offs in terms of computational cost, model dependency, and effectiveness.
Filter Methods: Statistical Approaches
Filter methods are fast, scalable, and model-agnostic. They work by computing statistical relationships between features and the target variable, or between features themselves.
Correlation-Based Selection
Highly correlated features are redundant—they provide similar information. Removing one from each correlated pair reduces dimensionality without losing much predictive power.
library(caret)
# Load sample data
data(mtcars)
df <- mtcars
# Create correlation matrix
cor_matrix <- cor(df)
print(cor_matrix)
# Find highly correlated features (threshold = 0.85)
high_cor <- findCorrelation(cor_matrix, cutoff = 0.85)
print(paste("Features to remove:", paste(names(df)[high_cor], collapse = ", ")))
# Remove highly correlated features (guard against the case where none are flagged,
# since df[, -integer(0)] would drop every column)
if (length(high_cor) > 0) {
  df_filtered <- df[, -high_cor]
} else {
  df_filtered <- df
}
print(paste("Original features:", ncol(df), "| Filtered features:", ncol(df_filtered)))
The findCorrelation() function from caret identifies features to remove based on a correlation threshold. It uses a heuristic that removes the feature with the largest mean absolute correlation when a pair exceeds the threshold.
Variance Thresholding
Features with near-zero variance provide little discriminatory power. If a feature has the same value across most observations, it won’t help your model distinguish between classes or predict continuous outcomes. Note that variance is scale-dependent, so either standardize your features first or choose a threshold appropriate to each feature’s units.
# Calculate variance for each feature
variances <- apply(df, 2, var)
print(variances)
# Set variance threshold (e.g., 1.0)
variance_threshold <- 1.0
low_variance_features <- names(variances[variances < variance_threshold])
print(paste("Low variance features:", paste(low_variance_features, collapse = ", ")))
# Remove low variance features
df_high_var <- df[, variances >= variance_threshold]
For classification problems, you can also use caret::nearZeroVar(), which identifies features with both low variance and highly unbalanced distributions.
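As a quick illustration on the same mtcars data, nearZeroVar() with saveMetrics = TRUE reports, for every column, the frequency ratio of the two most common values, the percentage of unique values, and whether the column is flagged:

```r
library(caret)

data(mtcars)
# saveMetrics = TRUE returns a data frame with one row per column:
# freqRatio, percentUnique, zeroVar, and the nzv flag
nzv_metrics <- nearZeroVar(mtcars, saveMetrics = TRUE)
print(nzv_metrics)
# Columns flagged as near-zero variance (mtcars typically has none)
print(rownames(nzv_metrics)[nzv_metrics$nzv])
```

The freqCut and uniqueCut arguments control how aggressive the flagging is; the defaults (95/5 and 10) are a sensible starting point.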
Chi-Square Test for Categorical Features
When working with categorical features and a categorical target, chi-square tests measure independence between variables.
# For categorical data
# Convert a continuous variable to categorical for demonstration
df$mpg_category <- cut(df$mpg, breaks = 3, labels = c("Low", "Medium", "High"))
df$cyl_factor <- as.factor(df$cyl)
# Chi-square test
chi_result <- chisq.test(table(df$mpg_category, df$cyl_factor))
print(chi_result)
# p-value < 0.05 suggests the features are dependent (useful for prediction)
# Note: with small expected cell counts, as here, R warns that the chi-squared
# approximation may be inaccurate; simulate.p.value = TRUE is a common remedy
Wrapper Methods: Recursive Feature Elimination
Wrapper methods treat feature selection as a search problem. They evaluate different feature subsets by actually training models and measuring performance through cross-validation.
Recursive Feature Elimination with Caret
RFE starts with all features, fits a model, ranks the features, drops the weakest, and repeats, comparing cross-validated performance across subset sizes to find the best-performing subset.
library(caret)
library(randomForest)
# Prepare data
set.seed(123)
data(mtcars)
X <- mtcars[, -1] # Remove mpg (target)
y <- mtcars[, 1] # mpg as target
# Define control using random forest
control <- rfeControl(functions = rfFuncs,
                      method = "cv",
                      number = 5,
                      verbose = FALSE)
# Run RFE
results <- rfe(X, y,
               sizes = 1:10,
               rfeControl = control)
# Print results
print(results)
print(paste("Optimal features:", paste(predictors(results), collapse = ", ")))
# Plot results
plot(results, type = c("g", "o"))
RFE is computationally expensive because it trains multiple models, but it accounts for feature interactions and model-specific behavior.
Stepwise Selection with AIC
For linear models, stepwise selection using Akaike Information Criterion (AIC) provides a classical approach.
library(MASS)
# Fit full model
full_model <- lm(mpg ~ ., data = mtcars)
# Perform stepwise selection (both directions)
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
# Selected features
print(summary(step_model))
print(paste("Selected features:", paste(names(coef(step_model))[-1], collapse = ", ")))
The stepAIC() function balances model fit with complexity, penalizing models with too many parameters.
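To see that penalty at work, compare the AIC of the full model with the stepwise-selected one (AIC = 2k − 2·log-likelihood, so lower is better):

```r
library(MASS)

# Full model versus the stepwise-selected model on mtcars
full_model <- lm(mpg ~ ., data = mtcars)
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
# The selected model should have fewer terms and an AIC no worse than the full model
print(AIC(full_model))
print(AIC(step_model))
```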
Embedded Methods: Regularization Techniques
Embedded methods incorporate feature selection directly into the model training algorithm. Regularization techniques like Lasso add penalties that drive irrelevant feature coefficients to zero.
Lasso Regression with glmnet
Lasso (L1 regularization) is particularly effective for feature selection because it produces sparse models.
library(glmnet)
# Prepare data
set.seed(123)
X <- as.matrix(mtcars[, -1])
y <- mtcars[, 1]
# Fit Lasso model with cross-validation
cv_lasso <- cv.glmnet(X, y, alpha = 1) # alpha = 1 for Lasso
# Plot cross-validation results
plot(cv_lasso)
# Best lambda
best_lambda <- cv_lasso$lambda.min
print(paste("Best lambda:", best_lambda))
# Fit final model
lasso_model <- glmnet(X, y, alpha = 1, lambda = best_lambda)
# Extract coefficients
lasso_coef <- coef(lasso_model)
print(lasso_coef)
# Get non-zero features
selected_features <- rownames(lasso_coef)[lasso_coef[,1] != 0]
selected_features <- selected_features[selected_features != "(Intercept)"]
print(paste("Selected features:", paste(selected_features, collapse = ", ")))
The alpha parameter controls the type of regularization: alpha = 1 for Lasso (L1), alpha = 0 for Ridge (L2), and values between 0 and 1 for Elastic Net (combination of both).
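A brief sketch of the same workflow with the other alpha settings (0.5 here is just an illustrative mixing value, not a recommendation):

```r
library(glmnet)

set.seed(123)
X <- as.matrix(mtcars[, -1])
y <- mtcars[, 1]
# Elastic Net: alpha = 0.5 mixes the L1 and L2 penalties equally
cv_enet <- cv.glmnet(X, y, alpha = 0.5)
enet_coef <- coef(cv_enet, s = "lambda.min")
print(enet_coef)
# Ridge (alpha = 0) shrinks coefficients toward zero but never exactly to zero,
# so it regularizes without performing feature selection on its own
cv_ridge <- cv.glmnet(X, y, alpha = 0)
ridge_coef <- coef(cv_ridge, s = "lambda.min")
print(ridge_coef)
```

Elastic Net is often preferred over pure Lasso when features are correlated, because Lasso tends to pick one feature from a correlated group arbitrarily while Elastic Net keeps the group together.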
Tree-Based Feature Importance
Tree-based models naturally rank features based on how much they improve prediction accuracy or reduce impurity.
Random Forest Feature Importance
library(randomForest)
# Fit random forest
set.seed(123)
rf_model <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 500)
# Get importance scores
importance_scores <- importance(rf_model)
print(importance_scores)
# Visualize importance
varImpPlot(rf_model, main = "Feature Importance")
# Select top N features
top_n <- 5
top_features <- names(sort(importance_scores[, 1], decreasing = TRUE)[1:top_n])
print(paste("Top", top_n, "features:", paste(top_features, collapse = ", ")))
Random forests provide two importance measures: mean decrease in accuracy (permutation importance) and mean decrease in node impurity (Gini impurity for classification, residual sum of squares for regression). The permutation measure is generally more reliable, since impurity-based importance is biased toward features with many possible split points.
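The two measures can be requested individually through the type argument of importance() (permutation importance requires importance = TRUE at fit time):

```r
library(randomForest)

set.seed(123)
rf_model <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 500)
# type = 1: permutation importance (%IncMSE for this regression task)
print(importance(rf_model, type = 1))
# type = 2: impurity-based importance (IncNodePurity for regression)
print(importance(rf_model, type = 2))
```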
XGBoost Feature Importance
library(xgboost)
# Prepare data
X_matrix <- as.matrix(mtcars[, -1])
y_vector <- mtcars[, 1]
# Train XGBoost model
xgb_model <- xgboost(data = X_matrix,
                     label = y_vector,
                     nrounds = 100,
                     verbose = 0)
# Get importance
importance_matrix <- xgb.importance(model = xgb_model,
                                    feature_names = colnames(X_matrix))
print(importance_matrix)
# Plot importance
xgb.plot.importance(importance_matrix, top_n = 10)
XGBoost provides multiple importance metrics including gain, cover, and frequency. Gain measures the average improvement in the loss function contributed by splits on a feature, cover measures how many observations those splits affect, and frequency counts how often the feature is used to split.
Practical Comparison and Best Practices
Let’s compare multiple feature selection methods on the same dataset and evaluate their impact on model performance.
library(caret)
library(randomForest)
library(glmnet)
# Prepare data with train/test split
set.seed(123)
data(mtcars)
train_index <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
train_data <- mtcars[train_index, ]
test_data <- mtcars[-train_index, ]
# Function to evaluate model performance
evaluate_model <- function(features, train, test) {
  formula <- as.formula(paste("mpg ~", paste(features, collapse = " + ")))
  model <- lm(formula, data = train)
  predictions <- predict(model, newdata = test)
  rmse <- sqrt(mean((test$mpg - predictions)^2))
  return(rmse)
}
# Method 1: Correlation-based
cor_matrix <- cor(train_data[, -1])
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75)
cor_features <- names(train_data[, -1])
# Guard: an empty negative index would drop every feature
if (length(high_cor) > 0) cor_features <- cor_features[-high_cor]
cor_rmse <- evaluate_model(cor_features, train_data, test_data)
# Method 2: Lasso
X_train <- as.matrix(train_data[, -1])
y_train <- train_data[, 1]
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
lasso_coef <- coef(cv_lasso, s = "lambda.min")
lasso_features <- rownames(lasso_coef)[lasso_coef[,1] != 0]
lasso_features <- lasso_features[lasso_features != "(Intercept)"]
lasso_rmse <- evaluate_model(lasso_features, train_data, test_data)
# Method 3: Random Forest Importance
rf_model <- randomForest(mpg ~ ., data = train_data, importance = TRUE)
importance_scores <- importance(rf_model)[, 1]
rf_features <- names(sort(importance_scores, decreasing = TRUE)[1:5])
rf_rmse <- evaluate_model(rf_features, train_data, test_data)
# Compare results
results_df <- data.frame(
  Method = c("Correlation", "Lasso", "Random Forest"),
  Features = c(length(cor_features), length(lasso_features), length(rf_features)),
  RMSE = c(cor_rmse, lasso_rmse, rf_rmse)
)
print(results_df)
When to Use Each Method
Use filter methods when you need fast, exploratory feature selection or when working with extremely high-dimensional data where wrapper methods are prohibitively expensive. They’re also useful as a preprocessing step before applying more sophisticated methods.
Use wrapper methods when model performance is critical and you have sufficient computational resources. RFE works well with 20-100 features. Beyond that, consider filter methods first to reduce dimensionality.
Use embedded methods like Lasso when you need automatic feature selection during training, especially for linear models. Lasso is excellent for high-dimensional data and produces interpretable models.
Use tree-based importance when dealing with non-linear relationships and feature interactions. Random forests and XGBoost handle mixed data types well and don’t require feature scaling.
Recommended Workflow
Start with filter methods to remove obvious redundancies and low-variance features. Apply domain knowledge to eliminate features that don’t make logical sense. Then use embedded methods or tree-based importance for final feature selection. Always validate your selected features on a holdout test set to ensure they generalize well. Feature selection is not a one-time task—revisit it as you collect more data or when model performance degrades.
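That workflow can be sketched end to end on mtcars; the thresholds below (a 0.9 correlation cutoff, default nearZeroVar settings) are illustrative choices, not rules:

```r
library(caret)
library(glmnet)

set.seed(123)
data(mtcars)
X <- mtcars[, -1]
# Step 1: filter methods -- drop near-zero-variance and highly correlated features
nzv <- nearZeroVar(X)
if (length(nzv) > 0) X <- X[, -nzv]
high_cor <- findCorrelation(cor(X), cutoff = 0.9)
if (length(high_cor) > 0) X <- X[, -high_cor]
# Step 2: embedded method -- let Lasso pick the final subset
cv_fit <- cv.glmnet(as.matrix(X), mtcars$mpg, alpha = 1)
lasso_coef <- coef(cv_fit, s = "lambda.min")
selected <- setdiff(rownames(lasso_coef)[as.vector(lasso_coef) != 0], "(Intercept)")
print(selected)
```

In a real project you would also hold out a test set before any of these steps, so that the selection itself is validated on unseen data.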