How to Use Random Forest for Feature Selection in R
Key Insights
- Random Forest provides two robust feature importance metrics—Mean Decrease Accuracy and Mean Decrease Gini—that capture both predictive power and node purity contributions across hundreds of decision trees.
- Feature selection with Random Forest is non-parametric and handles non-linear relationships automatically, making it superior to correlation-based methods for complex datasets with interaction effects.
- Always validate your feature selection by comparing model performance on reduced versus full feature sets using cross-validation, as importance scores can be unstable on small datasets or with highly correlated predictors.
Introduction to Feature Selection with Random Forest
Feature selection is critical for building interpretable, efficient machine learning models. Too many features lead to overfitting, increased computational costs, and models that are difficult to explain to stakeholders. Random Forest excels at feature selection because it inherently evaluates feature importance while building an ensemble of decision trees.
Random Forest calculates two primary importance metrics. Mean Decrease Accuracy measures how much prediction accuracy drops when a feature is randomly permuted, breaking its relationship with the target variable. Features that cause large accuracy drops when scrambled are clearly important. Mean Decrease Gini (or Mean Decrease Impurity) measures how much each feature contributes to reducing node impurity across all trees in the forest. Features that consistently create pure splits are ranked higher.
These metrics work well because Random Forest builds hundreds of trees on bootstrapped samples with random feature subsets at each split. This randomization process provides robust importance estimates that account for feature interactions and non-linear relationships.
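To make the permutation idea behind Mean Decrease Accuracy concrete, here is a minimal base-R sketch on a toy dataset with a hand-written threshold "model". Everything here (the data, the `accuracy` helper, the `perm_importance` function) is illustrative, not part of the randomForest API:

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)                                   # informative feature
x2 <- rnorm(n)                                   # pure noise feature
y  <- factor(ifelse(x1 + rnorm(n, sd = 0.5) > 0, "M", "R"))
df <- data.frame(x1, x2, y)

# A trivial "model": predict M whenever x1 > 0
accuracy <- function(data) mean(ifelse(data$x1 > 0, "M", "R") == data$y)

perm_importance <- function(data, feature) {
  permuted <- data
  permuted[[feature]] <- sample(permuted[[feature]])  # break link to target
  accuracy(data) - accuracy(permuted)                 # drop in accuracy
}

perm_importance(df, "x1")  # large drop: x1 carries the signal
perm_importance(df, "x2")  # zero: the model never uses x2
```

Random Forest applies the same scramble-and-remeasure logic, but averages the accuracy drop over the out-of-bag samples of every tree.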
# Load required libraries
library(randomForest)
library(caret)
library(ggplot2)
library(dplyr)
# Set seed for reproducibility
set.seed(123)
Preparing Your Dataset
Let’s use a real-world dataset to demonstrate feature selection. We’ll work with the Sonar dataset from the mlbench package, which contains 60 features representing sonar signal measurements used to classify objects as rocks or mines.
library(mlbench)
data(Sonar)
# Initial exploration
dim(Sonar) # 208 observations, 61 variables (60 features + 1 target)
str(Sonar)
summary(Sonar)
# Check for missing values
sum(is.na(Sonar)) # Should be 0
# Check class distribution
table(Sonar$Class)
# Create train/test split (70/30)
train_index <- createDataPartition(Sonar$Class, p = 0.7, list = FALSE)
train_data <- Sonar[train_index, ]
test_data <- Sonar[-train_index, ]
# Verify split maintains class proportions
prop.table(table(train_data$Class))
prop.table(table(test_data$Class))
The Sonar dataset is clean with no missing values, but in real-world scenarios, you’d need to handle NAs through imputation or removal. The 70/30 split ensures we have enough data for training while reserving sufficient samples for unbiased performance evaluation.
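If your own data does contain missing values, median imputation is one simple baseline. The sketch below uses fabricated data since Sonar is complete; caret::preProcess and the mice package offer more principled alternatives:

```r
# Fabricated example with missing values (Sonar itself has none)
df_na <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40))

# Replace each NA with its column median
for (col in names(df_na)) {
  df_na[[col]][is.na(df_na[[col]])] <- median(df_na[[col]], na.rm = TRUE)
}
df_na  # a[2] becomes 3, b[3] becomes 20
```

Whatever method you choose, fit the imputation on the training set only and apply it to the test set, or the evaluation leaks information.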
Building the Random Forest Model
Random Forest has two key hyperparameters for feature selection purposes. ntree controls the number of trees in the forest—more trees provide more stable importance estimates but increase computation time. mtry specifies how many features are randomly sampled as candidates at each split. For classification, the default is sqrt(p) where p is the total number of features.
# Train Random Forest with default parameters
rf_model <- randomForest(
  Class ~ .,
  data = train_data,
  ntree = 500,
  mtry = floor(sqrt(ncol(train_data) - 1)),
  importance = TRUE,  # Critical: enables importance calculation
  proximity = FALSE
)
# View model summary
print(rf_model)
# Check OOB error rate
plot(rf_model, main = "Error Rate vs Number of Trees")
legend("topright", legend = colnames(rf_model$err.rate),
       col = 1:3, lty = 1:3)
The importance = TRUE parameter is essential—it tells Random Forest to calculate both importance metrics. The Out-of-Bag (OOB) error plot helps determine if 500 trees is sufficient; the error should stabilize, indicating convergence.
Extracting and Visualizing Feature Importance
Now we extract importance scores and create meaningful visualizations. The built-in varImpPlot() is quick but limited in customization.
# Extract importance scores
importance_scores <- importance(rf_model)
head(importance_scores)
# Built-in visualization
varImpPlot(rf_model, n.var = 20, main = "Top 20 Important Features")
# Create custom visualization with ggplot2
importance_df <- data.frame(
  Feature = rownames(importance_scores),
  MeanDecreaseAccuracy = importance_scores[, "MeanDecreaseAccuracy"],
  MeanDecreaseGini = importance_scores[, "MeanDecreaseGini"]
)
# Sort by MeanDecreaseAccuracy and select top 20
importance_df <- importance_df %>%
  arrange(desc(MeanDecreaseAccuracy)) %>%
  head(20)
# Create horizontal bar plot
ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseAccuracy),
                          y = MeanDecreaseAccuracy)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Features by Mean Decrease Accuracy",
       x = "Feature",
       y = "Mean Decrease Accuracy") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 8))
Mean Decrease Accuracy is generally more reliable for feature selection because it directly measures predictive contribution. Mean Decrease Gini can be biased toward features with many categories or high cardinality.
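For intuition on what "node impurity" means: the Gini impurity of a node is one minus the sum of squared class proportions, and a split's contribution is the parent's impurity minus the weighted impurity of its children. Mean Decrease Gini sums these contributions per feature across every split in every tree. A tiny base-R sketch with toy labels (not Sonar data):

```r
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions in the node
  1 - sum(p^2)
}

parent <- c("M", "M", "M", "R", "R", "R")  # 50/50 mix -> impurity 0.5
left   <- c("M", "M", "M")                 # pure node -> impurity 0
right  <- c("R", "R", "R")                 # pure node -> impurity 0

# Impurity decrease credited to the feature that made this (perfect) split
decrease <- gini(parent) -
  (length(left) / length(parent)) * gini(left) -
  (length(right) / length(parent)) * gini(right)
decrease  # 0.5, the maximum possible for a balanced binary node
```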
Selecting Features Based on Importance Thresholds
There are three common approaches for selecting features: taking the top N features, selecting features above a percentile threshold, or using an absolute importance cutoff.
# Approach 1: Select top N features
top_n <- 15
top_features <- importance_df$Feature[1:top_n]
# Approach 2: Select features above 75th percentile
all_importance <- data.frame(
  Feature = rownames(importance_scores),
  Importance = importance_scores[, "MeanDecreaseAccuracy"]
)
percentile_threshold <- quantile(all_importance$Importance, 0.75)
selected_features_percentile <- all_importance %>%
  filter(Importance >= percentile_threshold) %>%
  pull(Feature)
# Approach 3: Select features above absolute threshold
# Set threshold as proportion of max importance
threshold <- max(all_importance$Importance) * 0.3
selected_features_threshold <- all_importance %>%
  filter(Importance >= threshold) %>%
  pull(Feature)
# Create reduced datasets using the top-N features from Approach 1
train_reduced <- train_data[, c(as.character(top_features), "Class")]
test_reduced <- test_data[, c(as.character(top_features), "Class")]
# Display number of selected features
cat("Top N approach:", length(top_features), "features\n")
cat("Percentile approach:", length(selected_features_percentile), "features\n")
cat("Threshold approach:", length(selected_features_threshold), "features\n")
The choice of method depends on your goals. Fixed top-N is simple and predictable. Percentile-based adapts to your dataset’s importance distribution. Threshold-based gives you direct control over the minimum acceptable importance.
Validating Feature Selection
Feature selection is only valuable if it maintains or improves model performance while reducing complexity. Always validate by comparing models trained on full versus reduced feature sets.
# Train model on full feature set
rf_full <- randomForest(Class ~ ., data = train_data,
                        ntree = 500, importance = FALSE)
# Train model on reduced feature set
rf_reduced <- randomForest(Class ~ ., data = train_reduced,
                           ntree = 500, importance = FALSE)
# Predict on test set
pred_full <- predict(rf_full, test_data)
pred_reduced <- predict(rf_reduced, test_reduced)
# Compare performance
cm_full <- confusionMatrix(pred_full, test_data$Class)
cm_reduced <- confusionMatrix(pred_reduced, test_reduced$Class)
# Display results
cat("\nFull Model Performance:\n")
print(cm_full$overall['Accuracy'])
cat("\nReduced Model Performance (", ncol(train_reduced) - 1, " features):\n", sep = "")
print(cm_reduced$overall['Accuracy'])
# Cross-validation for more robust comparison
train_control <- trainControl(method = "cv", number = 10)
cv_full <- train(Class ~ ., data = train_data, method = "rf",
                 trControl = train_control, ntree = 500)
cv_reduced <- train(Class ~ ., data = train_reduced, method = "rf",
                    trControl = train_control, ntree = 500)
cat("\nCross-Validation Results:\n")
cat("Full model accuracy:", max(cv_full$results$Accuracy), "\n")
cat("Reduced model accuracy:", max(cv_reduced$results$Accuracy), "\n")
If the reduced model performs within 1-2% of the full model while using significantly fewer features, feature selection was successful. You’ve gained interpretability and efficiency without sacrificing predictive power.
Best Practices and Conclusion
Random Forest feature selection works best when you follow these guidelines. First, use cross-validation to ensure importance scores are stable—single train/test splits can give misleading results on small datasets. Second, be cautious with highly correlated features; Random Forest may arbitrarily favor one over another, so consider correlation analysis alongside importance scores. Third, combine domain knowledge with statistical importance—a feature with moderate importance but high business relevance might be worth keeping.
Random Forest feature selection outperforms filter methods like correlation analysis when features have non-linear relationships or interactions with the target. However, for datasets with thousands of features, consider using Random Forest after an initial filter step to reduce computational burden.
Here’s a wrapper function that encapsulates the entire pipeline:
rf_feature_selection <- function(data, target_var, n_features = 15,
                                 ntree = 500, cv_folds = 10) {
  # Build the model formula (named to avoid masking stats::formula)
  rf_formula <- as.formula(paste(target_var, "~ ."))
  # Train Random Forest with importance calculation enabled
  rf_model <- randomForest(rf_formula, data = data,
                           ntree = ntree, importance = TRUE)
  # Extract importance and rank features by Mean Decrease Accuracy
  importance_scores <- importance(rf_model)
  top_features <- rownames(importance_scores)[
    order(importance_scores[, "MeanDecreaseAccuracy"],
          decreasing = TRUE)[1:n_features]
  ]
  # Cross-validate a model restricted to the selected features
  reduced_data <- data[, c(top_features, target_var)]
  train_control <- trainControl(method = "cv", number = cv_folds)
  cv_model <- train(rf_formula, data = reduced_data, method = "rf",
                    trControl = train_control, ntree = ntree)
  # Return selected features, CV accuracy, and their importance scores
  list(
    selected_features = top_features,
    cv_accuracy = max(cv_model$results$Accuracy),
    importance_scores = importance_scores[top_features, ]
  )
}
# Usage example
results <- rf_feature_selection(train_data, "Class", n_features = 15)
print(results$selected_features)
This pipeline extends to regression as well; with randomForest, rank by the %IncMSE column instead of MeanDecreaseAccuracy. The key is balancing model performance, interpretability, and computational efficiency based on your specific requirements.