Pandas - Get Data Types of Columns (dtypes)

• Pandas provides multiple methods to inspect column data types: `df.dtypes` for all columns, `df['column'].dtype` for individual columns, and `df.select_dtypes()` to filter columns by type

Key Insights

• Understanding data types is critical for memory optimization: switching from int64 to int32 or using categorical types can reduce DataFrame memory usage by 50% or more
• Type mismatches cause silent bugs in data pipelines; explicit type checking and conversion prevent issues such as numeric operations on object-type columns or datetime comparison failures

Retrieving Data Types for All Columns

The dtypes attribute returns a Series containing the data type of each column in your DataFrame. This is the most common method for quick data type inspection.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'username': ['alice', 'bob', 'charlie', 'david', 'eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000.0, 65000.5, 72000.0, 58000.75, 69000.0],
    'is_active': [True, False, True, True, False],
    'signup_date': pd.date_range('2024-01-01', periods=5)
})

print(df.dtypes)

Output:

user_id                 int64
username               object
age                     int64
salary                float64
is_active                bool
signup_date    datetime64[ns]
dtype: object

The dtypes attribute returns a Series where the index contains column names and values contain the corresponding data types. This makes it easy to filter or inspect specific type information programmatically.
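Because dtypes is an ordinary Series, you can slice and aggregate it like any other. A small sketch (the `users` DataFrame here is illustrative):

```python
import pandas as pd

users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'username': ['alice', 'bob', 'charlie'],
    'salary': [50000.0, 65000.5, 72000.0],
})

# dtypes is itself a Series (index = column names, values = dtypes),
# so ordinary Series operations apply to it.
int_cols = users.dtypes[users.dtypes == 'int64'].index.tolist()
print(int_cols)  # ['user_id']

# How many columns of each dtype?
print(users.dtypes.value_counts())
```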

Checking Individual Column Types

For single column inspection, access the dtype attribute (singular, not plural) directly on the Series object.

print(df['username'].dtype)  # object
print(df['salary'].dtype)    # float64
print(df['signup_date'].dtype)  # datetime64[ns]

# Type checking in conditional logic
if df['age'].dtype == 'int64':
    print("Age column is integer type")

# Using numpy dtype objects for comparison
if df['salary'].dtype == np.float64:
    print("Salary is float64")

This approach is essential when building data validation functions or conditional processing logic based on column types.
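For validation code, the helpers in pandas.api.types are usually more robust than comparing dtype strings, because they match whole families of types (int8 through int64, for example) rather than one exact name. A brief sketch:

```python
import pandas as pd
from pandas.api import types as ptypes

s_int = pd.Series([1, 2, 3])
s_str = pd.Series(['a', 'b'])
s_dt = pd.to_datetime(pd.Series(['2024-01-01', '2024-01-02']))

# These helpers match dtype families, so a check keeps working
# even if a column arrives as int32 instead of int64.
print(ptypes.is_numeric_dtype(s_int))        # True
print(ptypes.is_numeric_dtype(s_str))        # False
print(ptypes.is_datetime64_any_dtype(s_dt))  # True
```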

Filtering Columns by Data Type

The select_dtypes() method filters columns based on their data types, returning a new DataFrame with only matching columns.

# Select only numeric columns
numeric_df = df.select_dtypes(include=[np.number])
print(numeric_df.columns.tolist())
# ['user_id', 'age', 'salary']

# Select only object (string) columns
string_df = df.select_dtypes(include=['object'])
print(string_df.columns.tolist())
# ['username']

# Select multiple types
mixed_df = df.select_dtypes(include=['int64', 'float64'])
print(mixed_df.columns.tolist())
# ['user_id', 'age', 'salary']

# Exclude specific types
non_numeric_df = df.select_dtypes(exclude=[np.number])
print(non_numeric_df.columns.tolist())
# ['username', 'is_active', 'signup_date']

This method is particularly useful for bulk operations on columns of the same type, such as applying transformations to all numeric columns or encoding all categorical variables.
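As one such bulk operation, here is a minimal sketch that min-max scales every numeric column in a single assignment (the `metrics` DataFrame and the choice of scaling are illustrative):

```python
import pandas as pd
import numpy as np

metrics = pd.DataFrame({
    'age': [25, 30, 35],
    'salary': [50000.0, 65000.0, 80000.0],
    'name': ['a', 'b', 'c'],
})

# Min-max scale every numeric column in one pass;
# the non-numeric 'name' column is left untouched.
num_cols = metrics.select_dtypes(include=[np.number]).columns
metrics[num_cols] = (metrics[num_cols] - metrics[num_cols].min()) / (
    metrics[num_cols].max() - metrics[num_cols].min()
)
print(metrics)
```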

Understanding Common Pandas Data Types

Pandas uses NumPy data types with some extensions. Here are the types you’ll encounter most frequently:

# Integer types
int_df = pd.DataFrame({
    'int8': np.array([1, 2, 3], dtype='int8'),
    'int16': np.array([1, 2, 3], dtype='int16'),
    'int32': np.array([1, 2, 3], dtype='int32'),
    'int64': np.array([1, 2, 3], dtype='int64')
})
print(int_df.dtypes)

# Float types
float_df = pd.DataFrame({
    'float32': np.array([1.1, 2.2, 3.3], dtype='float32'),
    'float64': np.array([1.1, 2.2, 3.3], dtype='float64')
})
print(float_df.dtypes)

# Object type (strings, mixed types)
obj_df = pd.DataFrame({
    'strings': ['a', 'b', 'c'],
    'mixed': [1, 'two', 3.0]  # Mixed types become object
})
print(obj_df.dtypes)

# Boolean type
bool_df = pd.DataFrame({
    'flags': [True, False, True]
})
print(bool_df.dtypes)

# Datetime types
datetime_df = pd.DataFrame({
    'dates': pd.date_range('2024-01-01', periods=3),
    'times': pd.to_timedelta(['1 days', '2 days', '3 days'])
})
print(datetime_df.dtypes)
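One of the extensions worth knowing about is the family of nullable dtypes (Int64 with a capital I, boolean, string). NumPy-backed integer columns cannot hold missing values, so introducing one silently upcasts the column to float64; the nullable dtypes keep the original type and represent gaps as pd.NA. A quick sketch:

```python
import pandas as pd

# NumPy-backed ints cannot hold NaN: a missing value
# silently upcasts the column to float64.
np_backed = pd.Series([1, 2, None])
print(np_backed.dtype)  # float64

# The nullable extension dtype keeps the integer type
# and stores the missing value as pd.NA.
nullable = pd.Series([1, 2, None], dtype='Int64')
print(nullable.dtype)          # Int64
print(nullable.isna().sum())   # 1
```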

Getting Detailed Type Information

The info() method provides a comprehensive overview including data types, non-null counts, and memory usage.

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   user_id      5 non-null      int64         
 1   username     5 non-null      object        
 2   age          5 non-null      int64         
 3   salary       5 non-null      float64       
 4   is_active    5 non-null      bool          
 5   signup_date  5 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 373.0+ bytes

For memory-focused analysis, use the memory_usage() method:

memory_bytes = df.memory_usage(deep=True)
print(memory_bytes)
print(f"\nTotal memory: {memory_bytes.sum() / 1024:.2f} KB")

The deep=True parameter accurately calculates memory for object types, which can contain variable-length data.
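The difference matters for string-heavy data: without deep=True, pandas counts only the fixed-size object pointers, not the strings they point to. A small sketch (exact byte counts vary by platform and pandas version):

```python
import pandas as pd

names = pd.DataFrame({'name': ['alice', 'bob', 'a-much-longer-string'] * 100})

# deep=False counts only the 8-byte object pointers;
# deep=True also follows them and adds each string's payload.
shallow = names.memory_usage(deep=False)['name']
deep = names.memory_usage(deep=True)['name']
print(shallow, deep)
```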

Converting and Optimizing Data Types

Type conversion is critical for performance and correctness. Use astype() for explicit conversions.

# Create a sample DataFrame with suboptimal types
data = pd.DataFrame({
    'small_int': [1, 2, 3, 4, 5],
    'category_col': ['red', 'blue', 'red', 'green', 'blue'],
    'numeric_string': ['100', '200', '300', '400', '500']
})

print("Before optimization:")
print(data.dtypes)
print(f"Memory: {data.memory_usage(deep=True).sum()} bytes\n")

# Optimize integer type
data['small_int'] = data['small_int'].astype('int8')

# Convert to categorical (huge savings for repeated values)
data['category_col'] = data['category_col'].astype('category')

# Convert string to numeric
data['numeric_string'] = data['numeric_string'].astype('int32')

print("After optimization:")
print(data.dtypes)
print(f"Memory: {data.memory_usage(deep=True).sum()} bytes")

For automatic type inference from strings, use pd.to_numeric(), pd.to_datetime(), or convert_dtypes():

# Automatic type conversion with error handling
df_mixed = pd.DataFrame({
    'numbers': ['1', '2', '3', 'invalid'],
    'dates': ['2024-01-01', '2024-01-02', '2024-01-03', 'invalid']
})

# Convert with error handling
df_mixed['numbers'] = pd.to_numeric(df_mixed['numbers'], errors='coerce')
df_mixed['dates'] = pd.to_datetime(df_mixed['dates'], errors='coerce')

print(df_mixed.dtypes)
print(df_mixed)
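For the convert_dtypes() route mentioned above, a minimal sketch: it infers the best nullable extension dtype for each column in one call, which is handy after reading data from sources that deliver everything as objects.

```python
import pandas as pd

raw = pd.DataFrame({
    'a': [1, 2, None],        # float64 (NaN forces the upcast)
    'b': ['x', 'y', 'z'],     # object
    'c': [True, False, None], # object
})

# convert_dtypes() picks the best nullable dtype per column:
# Int64, string, and boolean here.
converted = raw.convert_dtypes()
print(converted.dtypes)
```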

Detecting Type Inconsistencies

Type inconsistencies often hide in object columns that should be numeric or datetime types.

def audit_object_columns(df):
    """Identify object columns that might be mistyped."""
    object_cols = df.select_dtypes(include=['object']).columns
    
    for col in object_cols:
        print(f"\n{col}:")
        print(f"  Unique values: {df[col].nunique()}")
        print(f"  Sample values: {df[col].head(3).tolist()}")
        
        # Try numeric conversion
        numeric_conversion = pd.to_numeric(df[col], errors='coerce')
        if numeric_conversion.notna().sum() > 0:
            print(f"  ⚠️  Could be numeric ({numeric_conversion.notna().sum()} valid values)")
        
        # Try datetime conversion
        datetime_conversion = pd.to_datetime(df[col], errors='coerce')
        if datetime_conversion.notna().sum() > 0:
            print(f"  ⚠️  Could be datetime ({datetime_conversion.notna().sum()} valid values)")

# Example usage
messy_df = pd.DataFrame({
    'id': ['1', '2', '3'],
    'price': ['10.50', '20.00', '15.75'],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'name': ['Alice', 'Bob', 'Charlie']
})

audit_object_columns(messy_df)

Practical Type Checking Patterns

Build robust data pipelines with explicit type validation:

def validate_dataframe_schema(df, expected_types):
    """Validate DataFrame against expected type schema."""
    mismatches = {}
    
    for col, expected_type in expected_types.items():
        if col not in df.columns:
            mismatches[col] = "Missing column"
        elif df[col].dtype != expected_type:
            mismatches[col] = f"Expected {expected_type}, got {df[col].dtype}"
    
    if mismatches:
        raise ValueError(f"Schema validation failed: {mismatches}")
    
    return True

# Define expected schema
schema = {
    'user_id': 'int64',
    'username': 'object',
    'age': 'int64',
    'salary': 'float64',
    'is_active': 'bool',
    'signup_date': 'datetime64[ns]'
}

try:
    validate_dataframe_schema(df, schema)
    print("Schema validation passed")
except ValueError as e:
    print(e)

Understanding and managing data types in Pandas is foundational for building reliable data processing pipelines. Use dtypes for inspection, select_dtypes() for filtering, and astype() for conversion. Always validate types explicitly in production code to catch errors early and optimize memory usage for large datasets.
