How to Use tidymodels in R

Key Insights

• tidymodels provides a unified interface for machine learning in R that eliminates the inconsistency of dealing with dozens of different package APIs, making your modeling code more maintainable and readable.

• The framework’s workflow objects combine preprocessing and model specification into a single reusable unit, drastically reducing code duplication when comparing multiple models or tuning hyperparameters.

• Unlike base R’s fragmented modeling ecosystem, tidymodels enforces consistent naming conventions and data structures across all operations, from splitting data to evaluating predictions.

Introduction to tidymodels

The tidymodels ecosystem solves a fundamental problem in R’s machine learning landscape: every package uses different conventions. Random forests through randomForest look nothing like gradient boosting through xgboost, which looks nothing like linear models through glm. This fragmentation makes production pipelines brittle and difficult to maintain.

tidymodels brings the same tidy principles that revolutionized data manipulation with dplyr to the entire modeling workflow. It’s a meta-package: loading it attaches a suite of packages (including dplyr and ggplot2), eight of which handle the core modeling tasks: rsample for data splitting, recipes for preprocessing, parsnip for model specification, workflows for pipeline management, tune for hyperparameter optimization, yardstick for metrics, broom for tidying model outputs, and dials for parameter grids.

# Install once
install.packages("tidymodels")

# Load the ecosystem
library(tidymodels)
library(palmerpenguins)  # Example dataset

# Check what's loaded
tidymodels_packages()

Data Splitting and Resampling

Proper data splitting prevents overfitting and gives honest performance estimates. The rsample package provides initial_split(), which supports stratified sampling through its strata argument; stratifying on a classification target keeps the class distributions of the train and test sets similar.

# Load and prepare data
data(penguins)
penguins_clean <- penguins %>% 
  drop_na()

# Create 75/25 train/test split
set.seed(123)
penguin_split <- initial_split(penguins_clean, 
                               prop = 0.75, 
                               strata = species)

penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

# Verify split proportions
nrow(penguin_train) / nrow(penguins_clean)

For model evaluation during training, cross-validation provides more robust performance estimates than a single validation set. The vfold_cv() function creates k-fold splits that you’ll use during hyperparameter tuning.

# Create 5-fold cross-validation
penguin_folds <- vfold_cv(penguin_train, v = 5, strata = species)

# Examine the structure
penguin_folds

This creates five different train/validation splits from your training data. Each fold holds out approximately 20% of the data for validation while training on the remaining 80%.
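You can confirm those proportions directly: rsample's analysis() and assessment() accessors return the training and held-out portions of any individual split.

```r
# Pull the first fold and inspect its two halves
first_fold <- penguin_folds$splits[[1]]

nrow(analysis(first_fold))    # training portion, ~80% of penguin_train
nrow(assessment(first_fold))  # held-out portion, ~20% of penguin_train
```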

Feature Engineering with recipes

The recipes package implements a preprocessing pipeline using a “recipe” metaphor. You specify transformation steps, then prep the recipe on training data to learn parameters (like means for normalization), and bake it to apply those transformations to new data.

# Define preprocessing recipe
penguin_recipe <- recipe(species ~ ., data = penguin_train) %>%
  # Remove columns we don't want as predictors
  step_rm(year, island) %>%
  # Impute missing numeric values with the median (guards against NAs in new data)
  step_impute_median(all_numeric_predictors()) %>%
  # Normalize numeric features to mean=0, sd=1
  step_normalize(all_numeric_predictors()) %>%
  # Convert categorical variables to dummy variables
  step_dummy(all_nominal_predictors()) %>%
  # Remove zero-variance predictors
  step_zv(all_predictors())

# Prep learns parameters from training data
penguin_prep <- prep(penguin_recipe)

# Bake applies transformations
penguin_baked <- bake(penguin_prep, new_data = NULL)  # NULL returns training data
head(penguin_baked)

The all_numeric_predictors() and all_nominal_predictors() selectors make recipes resilient to changes in your data structure. Add a new numeric column and it automatically gets normalized without code changes.
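You can also inspect what prep() learned: tidy() on a prepped recipe lists the steps, and tidy(..., number = ) returns a given step's learned values (the normalization step is third in the recipe above).

```r
# Overview of all steps and whether each has been trained
tidy(penguin_prep)

# Means and standard deviations learned by step_normalize()
tidy(penguin_prep, number = 3)
```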

Model Specification with parsnip

parsnip provides a unified interface to dozens of modeling packages. Instead of learning each package’s unique syntax, you specify models using consistent set_engine() and set_mode() functions.

# Random forest specification
rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# Logistic regression specification
# (glm handles binary outcomes; for the three penguin species
# you'd use multinom_reg() with the "nnet" engine instead)
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# XGBoost specification
xgb_spec <- boost_tree(trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

Notice the identical structure across model types. This consistency becomes invaluable when comparing multiple models—you change the specification, not the entire pipeline.

Building Workflows

Workflows bundle your recipe and model specification into a single object. This might seem like extra abstraction, but it prevents a common bug: forgetting to apply the same preprocessing to your test data that you applied to training data.

# Create workflow
penguin_wf <- workflow() %>%
  add_recipe(penguin_recipe) %>%
  add_model(rf_spec)

# Fit to training data
penguin_fit <- penguin_wf %>%
  fit(data = penguin_train)

# Make predictions on test set
predictions <- predict(penguin_fit, penguin_test) %>%
  bind_cols(penguin_test)

# View predictions
predictions %>%
  select(species, .pred_class)

The workflow automatically preps and bakes your recipe during fitting and prediction. You can’t accidentally skip preprocessing steps or apply them in the wrong order.

Model Tuning and Evaluation

Most models have hyperparameters that significantly impact performance. The tune package lets you mark parameters for tuning using tune(), then search the parameter space using cross-validation.

# Define model with tunable parameters
rf_tune_spec <- rand_forest(
  mtry = tune(),
  min_n = tune(),
  trees = 1000
) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Create tuning workflow
tune_wf <- workflow() %>%
  add_recipe(penguin_recipe) %>%
  add_model(rf_tune_spec)

# Define parameter grid
rf_grid <- grid_regular(
  mtry(range = c(2, 4)),
  min_n(range = c(2, 10)),
  levels = 5
)

# Tune using cross-validation
tune_results <- tune_grid(
  tune_wf,
  resamples = penguin_folds,
  grid = rf_grid,
  metrics = metric_set(accuracy, roc_auc)
)

# View best models
show_best(tune_results, metric = "accuracy")

# Select best parameters
best_params <- select_best(tune_results, metric = "accuracy")

The metrics themselves come from the yardstick package, and metric_set() bundles any combination of them. Common choices are accuracy, roc_auc, precision, and recall for classification, and rmse, rsq, and mae for regression.
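As a minimal illustration of the yardstick interface, here is a toy tibble (made-up column names, not the penguin data):

```r
# Toy predictions: three of the four match the truth
toy <- tibble(
  truth = factor(c("a", "b", "a", "b")),
  pred  = factor(c("a", "b", "b", "b"))
)

# Single metric: accuracy = 3/4
accuracy(toy, truth = truth, estimate = pred)

# Bundle several metrics into one callable set
cls_metrics <- metric_set(accuracy, precision, recall)
cls_metrics(toy, truth = truth, estimate = pred)
```

Every yardstick function takes the data frame first, then truth and estimate columns, and returns a tibble with .metric, .estimator, and .estimate columns, so results from different metrics stack cleanly.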

# Visualize tuning results
autoplot(tune_results)

Practical Example: End-to-End Pipeline

Here’s a complete regression pipeline predicting penguin body mass, demonstrating every concept together.

# Setup
library(tidymodels)
library(palmerpenguins)

data(penguins)
penguins_clean <- penguins %>% drop_na()

# Split data
set.seed(456)
mass_split <- initial_split(penguins_clean, prop = 0.75)
mass_train <- training(mass_split)
mass_test <- testing(mass_split)
mass_folds <- vfold_cv(mass_train, v = 5)

# Create recipe
mass_recipe <- recipe(body_mass_g ~ ., data = mass_train) %>%
  step_rm(year) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Define tunable model
xgb_spec <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Create workflow
mass_wf <- workflow() %>%
  add_recipe(mass_recipe) %>%
  add_model(xgb_spec)

# Tune hyperparameters
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  learn_rate(),
  size = 20
)

tune_results <- tune_grid(
  mass_wf,
  resamples = mass_folds,
  grid = xgb_grid,
  metrics = metric_set(rmse, rsq)
)

# Finalize workflow with best parameters
best_xgb <- select_best(tune_results, metric = "rmse")
final_wf <- finalize_workflow(mass_wf, best_xgb)

# Fit final model on all training data
final_fit <- final_wf %>%
  fit(data = mass_train)

# Evaluate on test set
test_predictions <- predict(final_fit, mass_test) %>%
  bind_cols(mass_test)

test_metrics <- test_predictions %>%
  metrics(truth = body_mass_g, estimate = .pred)

test_metrics

This pipeline is production-ready. The workflow object can be serialized with saveRDS() and loaded in production to make predictions on new data without retraining.
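The save/load round trip is plain base R. A sketch with a small stand-in model so it runs anywhere; in practice you would serialize final_fit itself:

```r
# Fit a small stand-in model
m <- lm(mpg ~ wt, data = mtcars)

# Serialize to disk, then reload as a production process would
path <- tempfile(fileext = ".rds")
saveRDS(m, path)
m2 <- readRDS(path)

# The reloaded object predicts identically
identical(predict(m, mtcars), predict(m2, mtcars))  # TRUE
```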

The tidymodels approach scales from simple linear regressions to complex ensemble models without changing your code structure. When you need to compare ten different model types, you swap specifications while keeping the same recipe and workflow structure. When stakeholders request different preprocessing, you modify the recipe without touching model code. This separation of concerns makes machine learning projects maintainable at scale.
