Pandas - Select Columns by Data Type

• Use `select_dtypes()` to filter DataFrame columns by data type with include/exclude parameters, supporting both NumPy and pandas-specific types like 'number', 'object', and 'category'

Key Insights

• Use `select_dtypes()` to filter DataFrame columns by data type with include/exclude parameters, supporting both NumPy and pandas-specific types like 'number', 'object', and 'category'
• Combine programmatic column selection with conditional logic to build dynamic data pipelines that adapt to schema changes without hardcoded column names
• Leverage data type selection for preprocessing workflows like encoding categorical variables, scaling numeric features, or validating data quality across type-specific columns

Understanding DataFrame Data Types

Pandas assigns specific data types to each column based on the values it contains. These types include numeric types (int64, float64), text (object), categorical (category), datetime (datetime64), and boolean (bool). Selecting columns by data type becomes essential when applying type-specific transformations or validations.

import pandas as pd
import numpy as np

# Create a sample DataFrame with mixed types
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'username': ['alice', 'bob', 'charlie', 'david', 'eve'],
    'age': [25, 30, 35, 28, 42],
    'salary': [50000.0, 65000.0, 75000.0, 58000.0, 92000.0],
    'is_active': [True, True, False, True, False],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-03-10', 
                                    '2023-04-05', '2023-05-12']),
    'department': pd.Categorical(['Sales', 'Engineering', 'Sales', 
                                   'Marketing', 'Engineering'])
})

# Inspect data types
print(df.dtypes)

Output:

user_id                int64
username              object
age                    int64
salary               float64
is_active               bool
signup_date    datetime64[ns]
department           category
dtype: object

Basic Column Selection with select_dtypes()

The select_dtypes() method accepts include and exclude parameters that take data type specifications. You can pass single types, lists of types, or generic type groups.

# Select all numeric columns
numeric_cols = df.select_dtypes(include=['number'])
print(numeric_cols.columns.tolist())
# ['user_id', 'age', 'salary']

# Select integer columns only
int_cols = df.select_dtypes(include=['int64'])
print(int_cols.columns.tolist())
# ['user_id', 'age']

# Select float columns
float_cols = df.select_dtypes(include=['float64'])
print(float_cols.columns.tolist())
# ['salary']

# Select object (string) columns
object_cols = df.select_dtypes(include=['object'])
print(object_cols.columns.tolist())
# ['username']

# Select datetime columns
datetime_cols = df.select_dtypes(include=['datetime64'])
print(datetime_cols.columns.tolist())
# ['signup_date']

# Select boolean columns
bool_cols = df.select_dtypes(include=['bool'])
print(bool_cols.columns.tolist())
# ['is_active']

# Select categorical columns
cat_cols = df.select_dtypes(include=['category'])
print(cat_cols.columns.tolist())
# ['department']

Using Generic Type Groups

Pandas provides convenient generic type groups that match multiple related types simultaneously. The 'number' group includes all numeric types, while 'object' captures string data.

# Select all numeric types (int and float)
all_numeric = df.select_dtypes(include='number')
print(all_numeric.columns.tolist())
# ['user_id', 'age', 'salary']

# Multiple type groups
numeric_and_bool = df.select_dtypes(include=['number', 'bool'])
print(numeric_and_bool.columns.tolist())
# ['user_id', 'age', 'salary', 'is_active']

# Using NumPy generic types
np_numeric = df.select_dtypes(include=[np.number])
print(np_numeric.columns.tolist())
# ['user_id', 'age', 'salary']
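The same string aliases extend to pandas' nullable extension dtypes. A minimal sketch, assuming a frame that uses the 'Int64' and 'string' extension dtypes (illustrative, not part of the example DataFrame above):

```python
import pandas as pd

# Hypothetical frame using pandas' nullable extension dtypes:
# 'Int64' and 'string' are pandas extension types, distinct
# from NumPy's int64 and the generic object dtype
ext = pd.DataFrame({
    'count': pd.array([1, 2, None], dtype='Int64'),
    'label': pd.array(['a', 'b', None], dtype='string'),
})

# Extension dtypes are selected by their string alias
print(ext.select_dtypes(include='Int64').columns.tolist())
# ['count']
print(ext.select_dtypes(include='string').columns.tolist())
# ['label']
```

Note that 'string' columns are not matched by include='object', so pipelines that mix both dtypes should select on both aliases.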

Excluding Data Types

The exclude parameter filters out unwanted types. Combine include and exclude for precise control over column selection.

# Select all columns except object types
non_object = df.select_dtypes(exclude=['object'])
print(non_object.columns.tolist())
# ['user_id', 'age', 'salary', 'is_active', 'signup_date', 'department']

# Exclude multiple types
non_numeric_non_bool = df.select_dtypes(exclude=['number', 'bool'])
print(non_numeric_non_bool.columns.tolist())
# ['username', 'signup_date', 'department']

# Include numeric but exclude integers
floats_only = df.select_dtypes(include='number', exclude=['int'])
print(floats_only.columns.tolist())
# ['salary']
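One caveat worth knowing: select_dtypes() raises a ValueError when the same type appears in both include and exclude, and also when neither parameter is supplied. A quick standalone sketch (the frame here is illustrative):

```python
import pandas as pd

sample = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Overlapping include/exclude raises a ValueError
try:
    sample.select_dtypes(include='number', exclude='number')
except ValueError as err:
    print(f"ValueError: {err}")

# Calling with neither parameter also raises
try:
    sample.select_dtypes()
except ValueError as err:
    print(f"ValueError: {err}")
```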

Practical Application: Data Preprocessing Pipeline

Selecting columns by type streamlines preprocessing workflows. This example demonstrates encoding categorical variables, scaling numeric features, and handling different data types separately.

from sklearn.preprocessing import StandardScaler, LabelEncoder

# Separate columns by type for preprocessing
numeric_features = df.select_dtypes(include='number').columns.tolist()
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()
datetime_features = df.select_dtypes(include='datetime64').columns.tolist()

# Remove ID columns from scaling
numeric_features = [col for col in numeric_features if 'id' not in col.lower()]

# Scale numeric features
scaler = StandardScaler()
df_processed = df.copy()
df_processed[numeric_features] = scaler.fit_transform(df[numeric_features])

# Encode categorical features
for col in categorical_features:
    le = LabelEncoder()
    df_processed[col] = le.fit_transform(df[col].astype(str))

# Extract datetime features
for col in datetime_features:
    df_processed[f'{col}_year'] = df[col].dt.year
    df_processed[f'{col}_month'] = df[col].dt.month
    df_processed[f'{col}_day'] = df[col].dt.day

print(df_processed.head())

Dynamic Column Operations

Build functions that adapt to DataFrame schemas automatically by selecting columns based on their types rather than hardcoded names.

def summarize_by_type(dataframe):
    """Generate type-specific summaries for a DataFrame."""
    summary = {}
    
    # Numeric summary
    numeric_cols = dataframe.select_dtypes(include='number')
    if not numeric_cols.empty:
        summary['numeric'] = {
            'columns': numeric_cols.columns.tolist(),
            'mean': numeric_cols.mean().to_dict(),
            'median': numeric_cols.median().to_dict(),
            'std': numeric_cols.std().to_dict()
        }
    
    # Categorical summary
    cat_cols = dataframe.select_dtypes(include=['object', 'category'])
    if not cat_cols.empty:
        summary['categorical'] = {
            'columns': cat_cols.columns.tolist(),
            'unique_counts': cat_cols.nunique().to_dict(),
            'mode': cat_cols.mode().iloc[0].to_dict()
        }
    
    # DateTime summary
    dt_cols = dataframe.select_dtypes(include='datetime64')
    if not dt_cols.empty:
        summary['datetime'] = {
            'columns': dt_cols.columns.tolist(),
            'min': dt_cols.min().to_dict(),
            'max': dt_cols.max().to_dict()
        }
    
    return summary

summary = summarize_by_type(df)
print(f"Numeric columns: {summary['numeric']['columns']}")
print(f"Categorical columns: {summary['categorical']['columns']}")
print(f"DateTime range: {summary['datetime']['min']} to {summary['datetime']['max']}")

Data Validation by Type

Implement type-specific validation rules to ensure data quality across your DataFrame.

def validate_dataframe(dataframe):
    """Validate DataFrame columns based on their data types."""
    issues = []
    
    # Check numeric columns for outliers (simple IQR method)
    numeric_cols = dataframe.select_dtypes(include='number')
    for col in numeric_cols.columns:
        Q1 = dataframe[col].quantile(0.25)
        Q3 = dataframe[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = dataframe[(dataframe[col] < Q1 - 1.5 * IQR) | 
                            (dataframe[col] > Q3 + 1.5 * IQR)]
        if len(outliers) > 0:
            issues.append(f"{col}: {len(outliers)} outliers detected")
    
    # Check object columns for missing or empty strings
    object_cols = dataframe.select_dtypes(include='object')
    for col in object_cols.columns:
        empty_count = (dataframe[col].str.strip() == '').sum()
        if empty_count > 0:
            issues.append(f"{col}: {empty_count} empty strings")
    
    # Check datetime columns for future dates
    dt_cols = dataframe.select_dtypes(include='datetime64')
    for col in dt_cols.columns:
        future_dates = dataframe[dataframe[col] > pd.Timestamp.now()]
        if len(future_dates) > 0:
            issues.append(f"{col}: {len(future_dates)} future dates")
    
    return issues if issues else ["All validations passed"]

validation_results = validate_dataframe(df)
for result in validation_results:
    print(result)

Combining with Other Selection Methods

Mix select_dtypes() with other pandas selection techniques for complex filtering scenarios.

# Select numeric columns that contain specific keywords
numeric_cols = df.select_dtypes(include='number').columns
salary_related = [col for col in numeric_cols if 'salary' in col.lower() or 'income' in col.lower()]
print(salary_related)
# ['salary']

# Select non-numeric columns and apply string operations
text_cols = df.select_dtypes(exclude='number')
for col in text_cols.select_dtypes(include='object').columns:
    df[f'{col}_length'] = df[col].str.len()

# Chain with loc for row and column filtering
high_earners_numeric = df.loc[df['salary'] > 60000].select_dtypes(include='number')
print(high_earners_numeric)
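Dtype selection also composes with name-based methods like filter(). A small sketch with a standalone frame (the column names are hypothetical) that intersects a dtype filter with a regex on column names:

```python
import pandas as pd

# Standalone frame for illustration (hypothetical column names)
accounts = pd.DataFrame({
    'user_id': [1, 2],
    'age': [25, 30],
    'name': ['alice', 'bob'],
})

# Chain dtype selection with a regex name filter:
# numeric columns whose names end in '_id'
numeric_ids = accounts.select_dtypes(include='number').filter(regex='_id$')
print(numeric_ids.columns.tolist())
# ['user_id']
```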

Performance Considerations

When working with large DataFrames, selecting columns by type before applying operations reduces memory usage and improves performance.

# Inefficient: applying operation to entire DataFrame
# df_result = df.apply(some_expensive_function)

# Efficient: select only relevant columns first
numeric_subset = df.select_dtypes(include='number')
# Apply expensive operations only to numeric columns
numeric_result = numeric_subset.apply(lambda x: x ** 2)

# Merge back if needed
df_efficient = df.copy()
df_efficient[numeric_subset.columns] = numeric_result
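To see why this helps, you can compare footprints directly with memory_usage(deep=True); the exact byte counts vary by machine and pandas version, but the numeric subset is reliably much smaller when wide object columns are present:

```python
import numpy as np
import pandas as pd

# Wide object columns dominate memory; the numeric subset is far smaller
big = pd.DataFrame({
    'value': np.random.rand(100_000),
    'note': ['some long annotation text'] * 100_000,
})

full_bytes = big.memory_usage(deep=True).sum()
numeric_bytes = big.select_dtypes(include='number').memory_usage(deep=True).sum()
print(f"full frame: {full_bytes:,} bytes")
print(f"numeric subset: {numeric_bytes:,} bytes")
```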

The select_dtypes() method provides a robust foundation for building maintainable data pipelines. By selecting columns based on their inherent data types rather than explicit names, your code becomes more flexible and resilient to schema changes. This approach proves particularly valuable in production environments where DataFrame structures may evolve over time.
