How to Implement Decision Trees in R
Key Insights
- Decision trees in R are best implemented using the rpart package, which provides robust classification and regression capabilities with built-in cross-validation
- Proper pruning using the complexity parameter (cp) is critical—unpruned trees will overfit your training data and perform poorly on new observations
- Always visualize your tree with rpart.plot before deployment; if you can’t explain the decision logic to stakeholders, your model is too complex
Introduction to Decision Trees
Decision trees are supervised learning algorithms that split data into branches based on feature values, creating a tree-like structure of decisions. They excel at both classification (predicting categories) and regression (predicting continuous values) tasks, and they’re particularly valuable when you need interpretable models that non-technical stakeholders can understand.
R provides excellent support for decision trees through several packages. The rpart (Recursive Partitioning and Regression Trees) package is the industry standard, implementing the CART algorithm with efficient pruning capabilities. The tree package offers an alternative implementation, while caret provides a unified interface for model training and evaluation. For this article, we’ll focus on rpart because of its robust implementation and extensive documentation.
Decision trees work well for mixed data types, handle non-linear relationships naturally, and require minimal data preprocessing. However, they’re prone to overfitting and can be unstable—small changes in training data can produce completely different trees. Proper validation and pruning are essential.
Setting Up Your Environment
First, install and load the necessary packages. You’ll need rpart for building trees, rpart.plot for visualization, and caret for model evaluation.
# Install packages (run once)
install.packages(c("rpart", "rpart.plot", "caret"))
# Load packages
library(rpart)
library(rpart.plot)
library(caret)
# Load a sample dataset
data(iris)
# Examine the structure
str(iris)
head(iris)
For this tutorial, we’ll use the iris dataset, which contains measurements of 150 iris flowers across three species. It’s a classic classification problem that demonstrates decision tree concepts clearly. In production environments, you’d load your own data using read.csv() or database connections.
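As a minimal sketch of that production scenario, loading your own CSV might look like this (the file name and the target column are hypothetical placeholders):

```r
# Hypothetical example: loading your own data instead of iris
# "my_data.csv" and the "target" column are placeholder names
my_data <- read.csv("my_data.csv", stringsAsFactors = TRUE)

# rpart expects a classification target to be a factor
my_data$target <- as.factor(my_data$target)

str(my_data)  # verify column types before modeling
```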
Building a Basic Classification Tree
Creating a decision tree with rpart follows a familiar R modeling syntax. The key is understanding the control parameters that govern tree growth.
# Create a basic decision tree
tree_model <- rpart(
  Species ~ .,            # Predict Species using all other variables
  data = iris,
  method = "class",       # Classification tree
  control = rpart.control(
    minsplit = 20,        # Minimum observations to attempt a split
    maxdepth = 10,        # Maximum tree depth
    cp = 0.01             # Complexity parameter
  )
)
# View the model summary
print(tree_model)
summary(tree_model)
The minsplit parameter prevents splitting nodes with too few observations, reducing overfitting. The maxdepth parameter limits tree depth—deeper trees capture more detail but risk overfitting. The complexity parameter (cp) is crucial: it sets the minimum improvement required for a split to be retained. Higher cp values produce simpler trees.
The model output shows the decision rules at each node. For example, you might see: “If Petal.Length < 2.45, classify as setosa.” These rules are exactly what make decision trees interpretable.
# Get variable importance
tree_model$variable.importance
Variable importance scores show which features contribute most to predictions. This information guides feature engineering and helps explain model behavior to stakeholders.
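A quick way to communicate those scores is a simple bar chart; this sketch assumes the tree_model object built above:

```r
# Sort importance scores and plot them horizontally for readability
imp <- sort(tree_model$variable.importance)
barplot(imp,
        horiz = TRUE,
        las = 1,                        # keep axis labels horizontal
        main = "Variable Importance",
        xlab = "Importance score")
```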
Visualizing the Decision Tree
Visualization transforms abstract splitting rules into intuitive diagrams. The rpart.plot package offers several styles.
# Basic visualization
rpart.plot(tree_model)
# More detailed plot with additional information
rpart.plot(
  tree_model,
  type = 4,               # Draw node labels
  extra = 101,            # Display number of observations and percentage
  fallen.leaves = TRUE,   # Align leaf nodes at bottom
  main = "Iris Classification Tree"
)
# Alternative: base R plotting
plot(tree_model, uniform = TRUE, margin = 0.1)
text(tree_model, use.n = TRUE, cex = 0.8)
The type and extra parameters control what information appears in the plot. I recommend type = 4 and extra = 101 for classification trees—they show class distributions and observation counts, making it easy to assess node purity and sample sizes.
When interpreting visualizations, look for:
- Short paths to leaves: Simpler decision rules are better
- Balanced splits: Nodes should split data relatively evenly
- Pure leaf nodes: All observations in a leaf should ideally belong to one class
If your tree has dozens of splits, it’s overfitting. Simplify it.
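One quick remedy is to re-prune with a larger complexity parameter; the cp value below (0.05) is an illustrative choice, not a universal default:

```r
# Raising the complexity threshold collapses weak splits
simpler_tree <- prune(tree_model, cp = 0.05)  # 0.05 is illustrative
rpart.plot(simpler_tree, main = "Simplified Tree")
```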
Evaluating Model Performance
Never evaluate a model on training data alone. Split your data into training and testing sets to assess generalization.
# Set seed for reproducibility
set.seed(123)
# Create training and testing sets (70/30 split)
train_indices <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]
# Train the model
tree_model <- rpart(Species ~ ., data = train_data, method = "class")
# Make predictions
predictions <- predict(tree_model, test_data, type = "class")
# Evaluate performance
conf_matrix <- confusionMatrix(predictions, test_data$Species)
print(conf_matrix)
# Extract key metrics
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass[, 'Precision']
recall <- conf_matrix$byClass[, 'Recall']
cat(sprintf("Accuracy: %.3f\n", accuracy))
The confusion matrix shows true positives, false positives, true negatives, and false negatives for each class. Accuracy alone can be misleading with imbalanced classes—always examine precision and recall. Precision measures how many predicted positives are actually positive; recall measures how many actual positives were correctly identified.
For multi-class problems like iris, you get precision and recall for each class. Pay attention to classes with low recall—your model is missing those cases.
Pruning and Optimization
Unpruned trees overfit. The rpart package performs 10-fold cross-validation automatically and stores results in the model object.
# View cross-validation results
printcp(tree_model)
# Plot cross-validation error vs. complexity parameter
plotcp(tree_model)
# Find optimal cp value (minimum xerror)
optimal_cp <- tree_model$cptable[which.min(tree_model$cptable[,"xerror"]), "CP"]
# Prune the tree
pruned_model <- prune(tree_model, cp = optimal_cp)
# Compare before and after
rpart.plot(tree_model, main = "Before Pruning")
rpart.plot(pruned_model, main = "After Pruning")
# Evaluate pruned model
pruned_predictions <- predict(pruned_model, test_data, type = "class")
confusionMatrix(pruned_predictions, test_data$Species)
The printcp() output shows the cross-validated error (xerror) for different cp values. The plotcp() visualization makes it easy to identify where error plateaus. A common rule of thumb is to choose the largest cp value whose cross-validated error is within one standard error (xstd) of the minimum—this gives you the simplest tree that performs nearly as well as the most complex one.
Pruning often improves test set performance even when it slightly reduces training accuracy. That’s the entire point—you’re trading training performance for better generalization.
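The one-standard-error selection can be sketched directly from the cptable (this assumes the tree_model object built earlier):

```r
cptab <- tree_model$cptable

# Minimum cross-validated error plus its standard error
min_row   <- which.min(cptab[, "xerror"])
threshold <- cptab[min_row, "xerror"] + cptab[min_row, "xstd"]

# cptable rows run from largest to smallest CP, so the first row
# within one SE of the minimum is the simplest acceptable tree
one_se_row <- which(cptab[, "xerror"] <= threshold)[1]
one_se_cp  <- cptab[one_se_row, "CP"]

pruned_1se <- prune(tree_model, cp = one_se_cp)
```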
Regression Trees
Decision trees aren’t limited to classification. For continuous target variables, use regression trees with method = "anova".
# Load mtcars dataset
data(mtcars)
# Create regression tree predicting mpg
regression_tree <- rpart(
  mpg ~ .,
  data = mtcars,
  method = "anova",
  control = rpart.control(cp = 0.01)
)
# Visualize
rpart.plot(regression_tree, main = "MPG Prediction Tree")
# Make predictions
predictions <- predict(regression_tree, mtcars)
# Calculate RMSE
rmse <- sqrt(mean((mtcars$mpg - predictions)^2))
cat(sprintf("RMSE: %.3f\n", rmse))
# Calculate R-squared
ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
ss_residual <- sum((mtcars$mpg - predictions)^2)
r_squared <- 1 - (ss_residual / ss_total)
cat(sprintf("R-squared: %.3f\n", r_squared))
Regression trees split based on minimizing squared error rather than maximizing class purity. Leaf nodes contain the mean value of observations in that node. Evaluate regression trees using RMSE (lower is better) and R-squared (higher is better). Note that the metrics above are computed on the training data for brevity; in practice, compute them on a holdout set, just as in the classification example.
The same pruning principles apply—use cross-validation to find the optimal complexity parameter and avoid overfitting.
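Applied to the regression tree above, that looks much the same as in the classification case (keep in mind that with a dataset as small as mtcars the cross-validated errors are noisy):

```r
# Inspect cross-validation results for the regression tree
printcp(regression_tree)

# Prune at the CP value with the lowest cross-validated error
best_cp <- regression_tree$cptable[
  which.min(regression_tree$cptable[, "xerror"]), "CP"
]
pruned_regression <- prune(regression_tree, cp = best_cp)
rpart.plot(pruned_regression, main = "Pruned MPG Tree")
```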
Practical Recommendations
Start with default parameters and prune aggressively. Most practitioners set cp too low and end up with overfit models. Begin with cp = 0.01 and increase if your tree is still too complex.
Always validate with holdout data or cross-validation. Training accuracy is meaningless for decision trees—they can achieve 100% training accuracy by memorizing every observation.
Use ensemble methods like random forests or gradient boosting for production systems. Single decision trees are excellent for exploration and explanation but rarely achieve the best predictive performance. The randomForest and xgboost packages build on decision tree concepts to create more powerful models.
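As a minimal sketch of that upgrade path, assuming the randomForest package is installed:

```r
# install.packages("randomForest")  # run once if needed
library(randomForest)

set.seed(123)
rf_model <- randomForest(Species ~ ., data = iris,
                         ntree = 500, importance = TRUE)

print(rf_model)       # includes out-of-bag (OOB) error estimate
importance(rf_model)  # variable importance aggregated across trees
```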
Document your decision rules. The interpretability of decision trees is their primary advantage—don’t waste it. Export the rules and include them in model documentation so domain experts can validate the logic.
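The rpart.plot package can render a fitted tree as plain-text rules suitable for documentation; the output file name below is a placeholder:

```r
# Convert the fitted tree into human-readable decision rules
rules <- rpart.plot::rpart.rules(tree_model, roundint = FALSE)
print(rules)

# Save for model documentation ("tree_rules.txt" is a placeholder name)
writeLines(capture.output(print(rules)), "tree_rules.txt")
```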
Decision trees in R are straightforward to implement, but mastery requires understanding the bias-variance tradeoff and proper validation techniques. Start simple, prune ruthlessly, and always validate on unseen data.