Pandas - Select Columns by Data Type
Key Insights
• Use select_dtypes() to filter DataFrame columns by data type with include/exclude parameters, supporting both NumPy and pandas-specific types like 'number', 'object', and 'category'
• Combine programmatic column selection with conditional logic to build dynamic data pipelines that adapt to schema changes without hardcoded column names
• Leverage data type selection for preprocessing workflows like encoding categorical variables, scaling numeric features, or validating data quality across type-specific columns
Understanding DataFrame Data Types
Pandas assigns specific data types to each column based on the values it contains. These types include numeric types (int64, float64), text (object), categorical (category), datetime (datetime64), and boolean (bool). Selecting columns by data type becomes essential when applying type-specific transformations or validations.
import pandas as pd
import numpy as np
# Create a sample DataFrame with mixed types
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'username': ['alice', 'bob', 'charlie', 'david', 'eve'],
    'age': [25, 30, 35, 28, 42],
    'salary': [50000.0, 65000.0, 75000.0, 58000.0, 92000.0],
    'is_active': [True, True, False, True, False],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-03-10',
                                   '2023-04-05', '2023-05-12']),
    'department': pd.Categorical(['Sales', 'Engineering', 'Sales',
                                  'Marketing', 'Engineering'])
})
# Inspect data types
print(df.dtypes)
Output:
user_id int64
username object
age int64
salary float64
is_active bool
signup_date datetime64[ns]
department category
dtype: object
Basic Column Selection with select_dtypes()
The select_dtypes() method accepts include and exclude parameters that take data type specifications. You can pass single types, lists of types, or generic type groups.
# Select all numeric columns
numeric_cols = df.select_dtypes(include=['number'])
print(numeric_cols.columns.tolist())
# ['user_id', 'age', 'salary']
# Select integer columns only
int_cols = df.select_dtypes(include=['int64'])
print(int_cols.columns.tolist())
# ['user_id', 'age']
# Select float columns
float_cols = df.select_dtypes(include=['float64'])
print(float_cols.columns.tolist())
# ['salary']
# Select object (string) columns
object_cols = df.select_dtypes(include=['object'])
print(object_cols.columns.tolist())
# ['username']
# Select datetime columns
datetime_cols = df.select_dtypes(include=['datetime64'])
print(datetime_cols.columns.tolist())
# ['signup_date']
# Select boolean columns
bool_cols = df.select_dtypes(include=['bool'])
print(bool_cols.columns.tolist())
# ['is_active']
# Select categorical columns
cat_cols = df.select_dtypes(include=['category'])
print(cat_cols.columns.tolist())
# ['department']
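The seven calls above can be collapsed into one loop over a mapping of labels to dtype specifiers. This is a small convenience sketch, not a pandas feature; the label names are invented here:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2],
    'username': ['alice', 'bob'],
    'salary': [50000.0, 65000.0],
    'is_active': [True, False],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20']),
    'department': pd.Categorical(['Sales', 'Engineering']),
})

# Map a readable label to the dtype specifier select_dtypes() expects
groups = {
    'integer': 'int64',
    'float': 'float64',
    'text': 'object',
    'boolean': 'bool',
    'datetime': 'datetime64',
    'categorical': 'category',
}

by_type = {label: df.select_dtypes(include=[spec]).columns.tolist()
           for label, spec in groups.items()}
print(by_type)
```

Each label maps to the list of matching column names, which is handy when you want one inventory of the schema instead of seven separate calls.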
Using Generic Type Groups
Pandas provides convenient generic type groups that match multiple related types simultaneously. The 'number' group includes all numeric types, while 'object' captures string data.
# Select all numeric types (int and float)
all_numeric = df.select_dtypes(include='number')
print(all_numeric.columns.tolist())
# ['user_id', 'age', 'salary']
# Multiple type groups
numeric_and_bool = df.select_dtypes(include=['number', 'bool'])
print(numeric_and_bool.columns.tolist())
# ['user_id', 'age', 'salary', 'is_active']
# Using NumPy generic types
np_numeric = df.select_dtypes(include=[np.number])
print(np_numeric.columns.tolist())
# ['user_id', 'age', 'salary']
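The same groups extend to pandas' nullable extension dtypes ('Int64', 'string', 'boolean'), which are distinct from their NumPy counterparts. A quick sketch, assuming a reasonably recent pandas (the column names here are illustrative):

```python
import pandas as pd

ext = pd.DataFrame({
    'count': pd.array([1, None, 3], dtype='Int64'),      # nullable integer
    'name': pd.array(['a', None, 'c'], dtype='string'),  # string dtype, not object
    'plain': [1.0, 2.0, 3.0],                            # regular float64
})

# 'number' matches the nullable Int64 column as well as float64
print(ext.select_dtypes(include='number').columns.tolist())
# 'string' columns are NOT matched by include='object'...
print(ext.select_dtypes(include='object').columns.tolist())
# ...they need include='string'
print(ext.select_dtypes(include='string').columns.tolist())
```

The object/string distinction is an easy trap: code that selects text columns with include='object' silently skips columns stored with the newer 'string' dtype.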
Excluding Data Types
The exclude parameter filters out unwanted types. Combine include and exclude for precise control over column selection.
# Select all columns except object types
non_object = df.select_dtypes(exclude=['object'])
print(non_object.columns.tolist())
# ['user_id', 'age', 'salary', 'is_active', 'signup_date', 'department']
# Exclude multiple types
non_numeric_non_bool = df.select_dtypes(exclude=['number', 'bool'])
print(non_numeric_non_bool.columns.tolist())
# ['username', 'signup_date', 'department']
# Include numeric but exclude integers
floats_only = df.select_dtypes(include='number', exclude=['int'])
print(floats_only.columns.tolist())
# ['salary']
Practical Application: Data Preprocessing Pipeline
Selecting columns by type streamlines preprocessing workflows. This example demonstrates encoding categorical variables, scaling numeric features, and handling different data types separately.
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Separate columns by type for preprocessing
numeric_features = df.select_dtypes(include='number').columns.tolist()
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()
datetime_features = df.select_dtypes(include='datetime64').columns.tolist()
# Remove ID columns from scaling
numeric_features = [col for col in numeric_features if 'id' not in col.lower()]
# Scale numeric features
scaler = StandardScaler()
df_processed = df.copy()
df_processed[numeric_features] = scaler.fit_transform(df[numeric_features])
# Encode categorical features
for col in categorical_features:
    le = LabelEncoder()
    df_processed[col] = le.fit_transform(df[col].astype(str))
# Extract datetime features
for col in datetime_features:
    df_processed[f'{col}_year'] = df[col].dt.year
    df_processed[f'{col}_month'] = df[col].dt.month
    df_processed[f'{col}_day'] = df[col].dt.day
print(df_processed.head())
Dynamic Column Operations
Build functions that adapt to DataFrame schemas automatically by selecting columns based on their types rather than hardcoded names.
def summarize_by_type(dataframe):
    """Generate type-specific summaries for a DataFrame."""
    summary = {}
    # Numeric summary
    numeric_cols = dataframe.select_dtypes(include='number')
    if not numeric_cols.empty:
        summary['numeric'] = {
            'columns': numeric_cols.columns.tolist(),
            'mean': numeric_cols.mean().to_dict(),
            'median': numeric_cols.median().to_dict(),
            'std': numeric_cols.std().to_dict()
        }
    # Categorical summary
    cat_cols = dataframe.select_dtypes(include=['object', 'category'])
    if not cat_cols.empty:
        summary['categorical'] = {
            'columns': cat_cols.columns.tolist(),
            'unique_counts': cat_cols.nunique().to_dict(),
            'mode': cat_cols.mode().iloc[0].to_dict() if len(cat_cols) > 0 else {}
        }
    # DateTime summary
    dt_cols = dataframe.select_dtypes(include='datetime64')
    if not dt_cols.empty:
        summary['datetime'] = {
            'columns': dt_cols.columns.tolist(),
            'min': dt_cols.min().to_dict(),
            'max': dt_cols.max().to_dict()
        }
    return summary
summary = summarize_by_type(df)
print(f"Numeric columns: {summary['numeric']['columns']}")
print(f"Categorical columns: {summary['categorical']['columns']}")
print(f"DateTime range: {summary['datetime']['min']['signup_date']} to {summary['datetime']['max']['signup_date']}")
Data Validation by Type
Implement type-specific validation rules to ensure data quality across your DataFrame.
def validate_dataframe(dataframe):
    """Validate DataFrame columns based on their data types."""
    issues = []
    # Check numeric columns for outliers (simple IQR method)
    numeric_cols = dataframe.select_dtypes(include='number')
    for col in numeric_cols.columns:
        Q1 = dataframe[col].quantile(0.25)
        Q3 = dataframe[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = dataframe[(dataframe[col] < Q1 - 1.5 * IQR) |
                             (dataframe[col] > Q3 + 1.5 * IQR)]
        if len(outliers) > 0:
            issues.append(f"{col}: {len(outliers)} outliers detected")
    # Check object columns for missing or empty strings
    object_cols = dataframe.select_dtypes(include='object')
    for col in object_cols.columns:
        empty_count = (dataframe[col].str.strip() == '').sum()
        if empty_count > 0:
            issues.append(f"{col}: {empty_count} empty strings")
    # Check datetime columns for future dates
    dt_cols = dataframe.select_dtypes(include='datetime64')
    for col in dt_cols.columns:
        future_dates = dataframe[dataframe[col] > pd.Timestamp.now()]
        if len(future_dates) > 0:
            issues.append(f"{col}: {len(future_dates)} future dates")
    return issues if issues else ["All validations passed"]
validation_results = validate_dataframe(df)
for result in validation_results:
print(result)
Combining with Other Selection Methods
Mix select_dtypes() with other pandas selection techniques for complex filtering scenarios.
# Select numeric columns that contain specific keywords
numeric_cols = df.select_dtypes(include='number').columns
salary_related = [col for col in numeric_cols if 'salary' in col.lower() or 'income' in col.lower()]
print(salary_related)
# ['salary']
# Select non-numeric columns and apply string operations
text_cols = df.select_dtypes(exclude='number')
for col in text_cols.select_dtypes(include='object').columns:
    df[f'{col}_length'] = df[col].str.len()
# Chain with loc for row and column filtering
high_earners_numeric = df.loc[df['salary'] > 60000].select_dtypes(include='number')
print(high_earners_numeric)
Performance Considerations
When working with large DataFrames, selecting columns by type before applying operations reduces memory usage and improves performance.
# Inefficient: applying operation to entire DataFrame
# df_result = df.apply(some_expensive_function)
# Efficient: select only relevant columns first
numeric_subset = df.select_dtypes(include='number')
# Apply expensive operations only to numeric columns
numeric_result = numeric_subset.apply(lambda x: x ** 2)
# Merge back if needed
df_efficient = df.copy()
df_efficient[numeric_subset.columns] = numeric_result
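The memory side of this claim is easy to check with memory_usage(deep=True). An illustrative sketch (the column names and sizes are made up; exact byte counts will vary):

```python
import numpy as np
import pandas as pd

big = pd.DataFrame({
    'a': np.random.rand(100_000),
    'b': np.random.rand(100_000),
    'label': ['x'] * 100_000,   # object column not needed for math
})

# deep=True counts the actual string payloads in the object column
full_bytes = big.memory_usage(deep=True).sum()
numeric_bytes = big.select_dtypes(include='number').memory_usage(deep=True).sum()
print(f"full: {full_bytes:,} bytes, numeric only: {numeric_bytes:,} bytes")
```

Dropping the object column before a numeric operation avoids carrying its string storage through every intermediate result.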
The select_dtypes() method provides a robust foundation for building maintainable data pipelines. By selecting columns based on their inherent data types rather than explicit names, your code becomes more flexible and resilient to schema changes. This approach proves particularly valuable in production environments where DataFrame structures may evolve over time.
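To make the schema-resilience point concrete, here is a minimal sketch (the column names and the min-max scaling rule are invented for illustration) of one type-driven function running unchanged against two different schemas:

```python
import pandas as pd

def normalize_numeric(frame):
    """Min-max scale every numeric column, whatever it happens to be called."""
    out = frame.copy()
    for col in out.select_dtypes(include='number').columns:
        lo, hi = out[col].min(), out[col].max()
        if hi != lo:
            out[col] = (out[col] - lo) / (hi - lo)
    return out

# Two frames with completely different column names -- no code changes needed
v1 = pd.DataFrame({'age': [20, 30, 40], 'city': ['NY', 'LA', 'SF']})
v2 = pd.DataFrame({'score': [1.0, 2.0, 4.0], 'bonus': [10, 20, 30], 'team': ['a', 'b', 'c']})

print(normalize_numeric(v1))
print(normalize_numeric(v2))
```

Because the function asks "which columns are numeric?" instead of "where is the age column?", renaming or adding columns never requires touching it.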