How to Implement CatBoost in Python

Key Insights

  • CatBoost handles categorical features natively without manual encoding, eliminating preprocessing steps that often introduce data leakage and saving significant development time
  • The library’s ordered boosting algorithm and built-in regularization make it naturally resistant to overfitting, often producing better results than XGBoost or LightGBM with default parameters
  • CatBoost’s GPU training is exceptionally fast and its model serialization is production-ready out of the box, making it ideal for both experimentation and deployment

Introduction to CatBoost

CatBoost is a gradient boosting library developed by Yandex that solves real problems other boosting frameworks gloss over. While XGBoost and LightGBM require you to encode categorical features manually—often leading to data leakage through target encoding—CatBoost handles categories natively using an efficient ordered target statistic approach.

The library shines in several scenarios: datasets with many categorical features, situations where you need fast iteration without extensive hyperparameter tuning, and production environments where model stability matters. CatBoost’s default parameters are remarkably well-tuned, meaning you’ll often get competitive results without the hyperparameter gymnastics required by other frameworks.

Choose CatBoost when you have categorical data, need robust out-of-the-box performance, or want GPU acceleration without complex setup. Stick with XGBoost for maximum customization or LightGBM when training speed on large datasets is paramount.

Installation and Setup

Getting started with CatBoost is straightforward. Install it via pip and you’re ready to build models.

pip install catboost

For GPU support (which can deliver significant speedups on large datasets), no separate package is required: the standard catboost wheel ships with GPU support built in. Ensure a compatible NVIDIA driver is installed, then select the GPU at training time with the task_type='GPU' parameter.

Here are the essential imports for most CatBoost workflows:

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, CatBoostRegressor, Pool, cv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score
import matplotlib.pyplot as plt

The Pool class is CatBoost’s data container that efficiently handles both numerical and categorical features. It’s optional but recommended for better performance and cleaner code.

Data Preparation and Categorical Feature Handling

CatBoost’s killer feature is native categorical support. Instead of one-hot encoding or target encoding, you simply tell CatBoost which columns are categorical.

# Load dataset
df = pd.read_csv('titanic.csv')

# Basic preprocessing
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Define features and target
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
y = df['Survived']

# Identify categorical features by column name or index
cat_features = ['Sex', 'Embarked', 'Pclass']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

The beauty here is that ‘Sex’ and ‘Embarked’ remain as strings. No label encoding, no one-hot encoding, no manual preprocessing. CatBoost handles it internally using a sophisticated algorithm that calculates target statistics while preventing overfitting.
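Under the hood, ordered target statistics encode each row using only the target values of rows that come before it in a permutation, so a row's own label never leaks into its encoding. The sketch below is a deliberately simplified single-permutation version of the idea; CatBoost itself uses multiple random permutations and priors, and `ordered_target_stat` is illustrative, not the library's API:

```python
def ordered_target_stat(categories, targets, prior=0.5, prior_weight=1.0):
    """Simplified ordered target statistic: each row is encoded using
    only the rows before it, so its own label never leaks in."""
    counts, sums, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        n = counts.get(cat, 0)
        s = sums.get(cat, 0.0)
        # Smoothed mean of the target over *previous* rows of this category
        encoded.append((s + prior * prior_weight) / (n + prior_weight))
        counts[cat] = n + 1
        sums[cat] = s + y
    return encoded

cats = ['a', 'a', 'b', 'a', 'b']
ys = [1, 0, 1, 1, 0]
print(ordered_target_stat(cats, ys))  # → [0.5, 0.75, 0.5, 0.5, 0.75]
```

Notice that the two 'b' rows get different encodings: the second one has seen the first one's label, but never its own.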

For better performance with large datasets, use the Pool class:

train_pool = Pool(
    data=X_train,
    label=y_train,
    cat_features=cat_features
)

test_pool = Pool(
    data=X_test,
    label=y_test,
    cat_features=cat_features
)

Training a Basic CatBoost Model

Training a CatBoost model is refreshingly simple. The library provides separate classes for classification and regression tasks.

For classification:

# Initialize classifier
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',
    verbose=100,  # Print metrics every 100 iterations
    random_seed=42
)

# Train model
model.fit(
    X_train, 
    y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    early_stopping_rounds=50,
    plot=True  # Interactive loss plot (renders in Jupyter notebooks)
)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")

For regression tasks:

# Load housing data
housing_df = pd.read_csv('housing.csv')
X_reg = housing_df.drop('price', axis=1)
y_reg = housing_df['price']

# Identify categorical features
cat_features_reg = ['neighborhood', 'property_type']

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Initialize regressor
reg_model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=8,
    loss_function='RMSE',
    verbose=False,
    random_seed=42
)

# Train
reg_model.fit(
    X_train_reg,
    y_train_reg,
    cat_features=cat_features_reg,
    eval_set=(X_test_reg, y_test_reg)
)

# Predictions
predictions = reg_model.predict(X_test_reg)
rmse = np.sqrt(mean_squared_error(y_test_reg, predictions))
print(f"RMSE: {rmse:.2f}")

Hyperparameter Tuning and Model Optimization

CatBoost’s default parameters are solid, but tuning can squeeze out additional performance. Focus on these key hyperparameters:

  • iterations: Number of boosting iterations (default: 1000)
  • learning_rate: Step size shrinkage (auto-selected by default, typically around 0.03)
  • depth: Tree depth (default: 6)
  • l2_leaf_reg: L2 regularization coefficient (default: 3.0)
  • border_count: Number of splits for numerical features (default: 254 on CPU, 128 on GPU)

CatBoost provides a built-in cross-validation function that works directly on Pool objects, so categorical features are handled correctly inside every fold:

from catboost import cv

# Prepare parameters
params = {
    'iterations': 1000,
    'learning_rate': 0.1,
    'depth': 6,
    'l2_leaf_reg': 3,
    'loss_function': 'Logloss',
    'random_seed': 42
}

# Create Pool with all data
cv_pool = Pool(
    data=X,
    label=y,
    cat_features=cat_features
)

# Run cross-validation
cv_results = cv(
    pool=cv_pool,
    params=params,
    fold_count=5,
    shuffle=True,
    partition_random_seed=42,
    plot=True,
    verbose=False
)

print(f"Best CV Score: {cv_results['test-Logloss-mean'].min():.4f}")

For systematic hyperparameter search:

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'l2_leaf_reg': [1, 3, 5, 7],
    'iterations': [500, 1000]
}

# Initialize model
base_model = CatBoostClassifier(
    cat_features=cat_features,
    verbose=False,
    random_seed=42
)

# Grid search
grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=param_grid,
    cv=3,
    scoring='roc_auc',
    n_jobs=1  # CatBoost is already multithreaded; n_jobs=-1 can oversubscribe CPUs
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best ROC-AUC: {grid_search.best_score_:.4f}")

Model Evaluation and Feature Importance

Proper evaluation goes beyond accuracy. Examine multiple metrics and understand feature contributions.

from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Classification report
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Feature importance
feature_importance = model.get_feature_importance()
feature_names = X_train.columns

# Create feature importance dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

print("\nTop 10 Features:")
print(importance_df.head(10))

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'][:10], importance_df['importance'][:10])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

CatBoost also supports SHAP values for more sophisticated interpretability:

# Get SHAP values
shap_values = model.get_feature_importance(
    data=test_pool,
    type='ShapValues'
)

# SHAP values include the expected value in the last column
print(f"SHAP values shape: {shap_values.shape}")

Saving and Deploying the Model

CatBoost models serialize efficiently and load quickly, making deployment straightforward.

# Save model in multiple formats
model.save_model('catboost_model.cbm')  # Native format
model.save_model('catboost_model.json', format='json')  # Human-readable

# Load model
loaded_model = CatBoostClassifier()
loaded_model.load_model('catboost_model.cbm')

# Verify loaded model works
test_predictions = loaded_model.predict(X_test)
print(f"Loaded model accuracy: {accuracy_score(y_test, test_predictions):.4f}")

# For production, create a prediction function
def predict_survival(passenger_data):
    """
    Predict survival probability for Titanic passengers.
    
    Args:
        passenger_data: dict or DataFrame with passenger features
    
    Returns:
        float: Survival probability
    """
    if isinstance(passenger_data, dict):
        passenger_data = pd.DataFrame([passenger_data])
    
    proba = loaded_model.predict_proba(passenger_data)[:, 1]
    return proba[0]

# Example usage
new_passenger = {
    'Pclass': 1,
    'Sex': 'female',
    'Age': 29,
    'SibSp': 0,
    'Parch': 0,
    'Fare': 100.0,
    'Embarked': 'S'
}

survival_prob = predict_survival(new_passenger)
print(f"Survival probability: {survival_prob:.2%}")

For production environments, consider wrapping your model in a REST API using Flask or FastAPI, containerizing with Docker, and monitoring prediction latency and data drift. CatBoost’s fast prediction times make it well-suited for real-time serving.

The library’s native categorical handling means your production pipeline is simpler—no need to maintain encoding schemes or worry about unseen categories. Just pass the raw data and let CatBoost handle it.
