Pandas - Get Data Types of Columns (dtypes)
Key Insights
• Pandas provides multiple methods to inspect column data types: df.dtypes for all columns, df['column'].dtype for individual columns, and df.select_dtypes() to filter columns by type
• Understanding data types is critical for memory optimization—switching from int64 to int32 or using categorical types can reduce DataFrame memory usage by 50% or more
• Type mismatches cause silent bugs in data pipelines; explicit type checking and conversion prevent issues like numeric operations on object-type columns or datetime comparison failures
Retrieving Data Types for All Columns
The dtypes attribute returns a Series containing the data type of each column in your DataFrame. This is the most common method for quick data type inspection.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'username': ['alice', 'bob', 'charlie', 'david', 'eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000.0, 65000.5, 72000.0, 58000.75, 69000.0],
    'is_active': [True, False, True, True, False],
    'signup_date': pd.date_range('2024-01-01', periods=5)
})
print(df.dtypes)
Output:
user_id int64
username object
age int64
salary float64
is_active bool
signup_date datetime64[ns]
dtype: object
The dtypes attribute returns a Series where the index contains column names and values contain the corresponding data types. This makes it easy to filter or inspect specific type information programmatically.
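Because the result is an ordinary Series, the usual Series operations apply directly; a short sketch of that programmatic pattern:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'username': ['alice', 'bob', 'charlie'],
    'salary': [50000.0, 65000.5, 72000.0],
})

# dtypes is a Series indexed by column name, so boolean masks,
# .index, and value_counts() all work on it.
int_cols = df.dtypes[df.dtypes == 'int64'].index.tolist()
print(int_cols)  # ['user_id']

# How many columns of each type
print(df.dtypes.value_counts())
```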
Checking Individual Column Types
For single column inspection, access the dtype attribute (singular, not plural) directly on the Series object.
print(df['username'].dtype) # object
print(df['salary'].dtype) # float64
print(df['signup_date'].dtype) # datetime64[ns]
# Type checking in conditional logic
if df['age'].dtype == 'int64':
    print("Age column is integer type")

# Using numpy dtype objects for comparison
if df['salary'].dtype == np.float64:
    print("Salary is float64")
This approach is essential when building data validation functions or conditional processing logic based on column types.
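One caveat for validation code: comparing against a literal string like 'int64' misses related widths (int32, int16) and extension types. The helpers in pandas.api.types check whole type families instead; a minimal sketch:

```python
import pandas as pd
from pandas.api import types as ptypes

df = pd.DataFrame({
    'age': [25, 30],
    'signup': pd.to_datetime(['2024-01-01', '2024-01-02']),
    'name': ['alice', 'bob'],
})
df['age'] = df['age'].astype('int32')  # not int64, but still numeric

print(ptypes.is_numeric_dtype(df['age']))           # True: any int/float width
print(ptypes.is_datetime64_any_dtype(df['signup'])) # True
print(ptypes.is_numeric_dtype(df['name']))          # False
print(df['age'].dtype == 'int64')                   # False: a string check misses int32
```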
Filtering Columns by Data Type
The select_dtypes() method filters columns based on their data types, returning a new DataFrame with only matching columns.
# Select only numeric columns
numeric_df = df.select_dtypes(include=[np.number])
print(numeric_df.columns.tolist())
# ['user_id', 'age', 'salary']
# Select only object (string) columns
string_df = df.select_dtypes(include=['object'])
print(string_df.columns.tolist())
# ['username']
# Select multiple types
mixed_df = df.select_dtypes(include=['int64', 'float64'])
print(mixed_df.columns.tolist())
# ['user_id', 'age', 'salary']
# Exclude specific types
non_numeric_df = df.select_dtypes(exclude=[np.number])
print(non_numeric_df.columns.tolist())
# ['username', 'is_active', 'signup_date']
This method is particularly useful for bulk operations on columns of the same type, such as applying transformations to all numeric columns or encoding all categorical variables.
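As one sketch of that bulk-operation pattern, the snippet below z-scores every numeric column in a single assignment while leaving the rest of the DataFrame untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'age': [25, 30, 35],
    'salary': [50000.0, 65000.5, 72000.0],
    'username': ['alice', 'bob', 'charlie'],
})

# Grab every numeric column once, then transform them all together
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

print(df.dtypes)  # numeric columns are now float64; 'username' is untouched
```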
Understanding Common Pandas Data Types
Pandas uses NumPy data types with some extensions. Here are the types you’ll encounter most frequently:
# Integer types
int_df = pd.DataFrame({
    'int8': np.array([1, 2, 3], dtype='int8'),
    'int16': np.array([1, 2, 3], dtype='int16'),
    'int32': np.array([1, 2, 3], dtype='int32'),
    'int64': np.array([1, 2, 3], dtype='int64')
})
print(int_df.dtypes)
# Float types
float_df = pd.DataFrame({
    'float32': np.array([1.1, 2.2, 3.3], dtype='float32'),
    'float64': np.array([1.1, 2.2, 3.3], dtype='float64')
})
print(float_df.dtypes)
# Object type (strings, mixed types)
obj_df = pd.DataFrame({
    'strings': ['a', 'b', 'c'],
    'mixed': [1, 'two', 3.0]  # Mixed types become object
})
print(obj_df.dtypes)
# Boolean type
bool_df = pd.DataFrame({
    'flags': [True, False, True]
})
print(bool_df.dtypes)
# Datetime types
datetime_df = pd.DataFrame({
    'dates': pd.date_range('2024-01-01', periods=3),
    'times': pd.to_timedelta(['1 days', '2 days', '3 days'])
})
print(datetime_df.dtypes)
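Beyond the NumPy-backed types above, pandas also ships nullable extension dtypes (Int64, boolean, string, category) that preserve their type in the presence of missing values. A NumPy int column, by contrast, is silently upcast to float64 the moment a missing value appears; a quick sketch of the difference:

```python
import pandas as pd

# NumPy-backed ints cannot hold NaN: the column is upcast to float64
numpy_ints = pd.Series([1, 2, None])
print(numpy_ints.dtype)  # float64

# The nullable Int64 dtype (capital I) keeps integer semantics and stores pd.NA
nullable_ints = pd.Series([1, 2, None], dtype='Int64')
print(nullable_ints.dtype)  # Int64
```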
Getting Detailed Type Information
The info() method provides a comprehensive overview including data types, non-null counts, and memory usage.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   user_id      5 non-null      int64
 1   username     5 non-null      object
 2   age          5 non-null      int64
 3   salary       5 non-null      float64
 4   is_active    5 non-null      bool
 5   signup_date  5 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 373.0+ bytes
For memory-focused analysis, use the memory_usage() method:
memory_bytes = df.memory_usage(deep=True)
print(memory_bytes)
print(f"\nTotal memory: {memory_bytes.sum() / 1024:.2f} KB")
The deep=True parameter accurately calculates memory for object types, which can contain variable-length data.
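The gap between the two modes is largest for object columns, where the default counts only the 8-byte pointers. A quick illustration (exact sizes vary by platform and pandas version):

```python
import pandas as pd

s = pd.Series(['a fairly long string'] * 1000)

shallow = s.memory_usage()        # index + one pointer per element
deep = s.memory_usage(deep=True)  # also counts the Python string objects

print(shallow, deep)  # deep is substantially larger for object data
```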
Converting and Optimizing Data Types
Type conversion is critical for performance and correctness. Use astype() for explicit conversions.
# Create a sample DataFrame with suboptimal types
data = pd.DataFrame({
    'small_int': [1, 2, 3, 4, 5],
    'category_col': ['red', 'blue', 'red', 'green', 'blue'],
    'numeric_string': ['100', '200', '300', '400', '500']
})
print("Before optimization:")
print(data.dtypes)
print(f"Memory: {data.memory_usage(deep=True).sum()} bytes\n")
# Optimize integer type
data['small_int'] = data['small_int'].astype('int8')
# Convert to categorical (huge savings for repeated values)
data['category_col'] = data['category_col'].astype('category')
# Convert string to numeric
data['numeric_string'] = data['numeric_string'].astype('int32')
print("After optimization:")
print(data.dtypes)
print(f"Memory: {data.memory_usage(deep=True).sum()} bytes")
For automatic type inference from strings, use pd.to_numeric(), pd.to_datetime(), or convert_dtypes():
# Automatic type conversion with error handling
df_mixed = pd.DataFrame({
    'numbers': ['1', '2', '3', 'invalid'],
    'dates': ['2024-01-01', '2024-01-02', '2024-01-03', 'invalid']
})
# Convert with error handling
df_mixed['numbers'] = pd.to_numeric(df_mixed['numbers'], errors='coerce')
df_mixed['dates'] = pd.to_datetime(df_mixed['dates'], errors='coerce')
print(df_mixed.dtypes)
print(df_mixed)
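convert_dtypes(), mentioned above, infers the best nullable extension dtype for each column in one call; a short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, None],         # inferred as float64 at construction
    'b': ['x', 'y', 'z'],      # object
    'c': [True, False, None],  # object, since None breaks plain bool
})
print(df.dtypes)

converted = df.convert_dtypes()
print(converted.dtypes)
# 'a' becomes nullable Int64, 'b' becomes string, 'c' becomes boolean
```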
Detecting Type Inconsistencies
Type inconsistencies often hide in object columns that should be numeric or datetime types.
def audit_object_columns(df):
    """Identify object columns that might be mistyped."""
    object_cols = df.select_dtypes(include=['object']).columns
    for col in object_cols:
        print(f"\n{col}:")
        print(f"  Unique values: {df[col].nunique()}")
        print(f"  Sample values: {df[col].head(3).tolist()}")

        # Try numeric conversion
        numeric_conversion = pd.to_numeric(df[col], errors='coerce')
        if numeric_conversion.notna().sum() > 0:
            print(f"  ⚠️ Could be numeric ({numeric_conversion.notna().sum()} valid values)")

        # Try datetime conversion
        datetime_conversion = pd.to_datetime(df[col], errors='coerce')
        if datetime_conversion.notna().sum() > 0:
            print(f"  ⚠️ Could be datetime ({datetime_conversion.notna().sum()} valid values)")
# Example usage
messy_df = pd.DataFrame({
    'id': ['1', '2', '3'],
    'price': ['10.50', '20.00', '15.75'],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'name': ['Alice', 'Bob', 'Charlie']
})
audit_object_columns(messy_df)
Practical Type Checking Patterns
Build robust data pipelines with explicit type validation:
def validate_dataframe_schema(df, expected_types):
    """Validate DataFrame against expected type schema."""
    mismatches = {}
    for col, expected_type in expected_types.items():
        if col not in df.columns:
            mismatches[col] = "Missing column"
        elif df[col].dtype != expected_type:
            mismatches[col] = f"Expected {expected_type}, got {df[col].dtype}"
    if mismatches:
        raise ValueError(f"Schema validation failed: {mismatches}")
    return True
# Define expected schema
schema = {
    'user_id': 'int64',
    'username': 'object',
    'age': 'int64',
    'salary': 'float64',
    'is_active': 'bool',
    'signup_date': 'datetime64[ns]'
}
try:
    validate_dataframe_schema(df, schema)
    print("Schema validation passed")
except ValueError as e:
    print(e)
Understanding and managing data types in Pandas is foundational for building reliable data processing pipelines. Use dtypes for inspection, select_dtypes() for filtering, and astype() for conversion. Always validate types explicitly in production code to catch errors early and optimize memory usage for large datasets.