Pandas - Change Column Data Type (astype)
Key Insights
• The astype() method is the primary way to convert DataFrame column types in pandas, supporting conversions between numeric, string, categorical, and datetime types with explicit control over the transformation process.
• Type conversion failures can be handled gracefully using errors='ignore' or by preprocessing data with pd.to_numeric() and pd.to_datetime(), which offer more robust error handling options.
• Converting to categorical types can reduce memory usage by up to 90% for columns with repetitive values, while converting to appropriate numeric types (int8, int16) optimizes storage for large datasets.
Basic Type Conversion Syntax
The astype() method converts a pandas Series or DataFrame column to a specified data type. The basic syntax accepts a target type as a string or Python type object.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'price': ['10.5', '20.3', '15.7'],
    'quantity': ['100', '200', '150'],
    'product_id': [1, 2, 3]
})
# Convert single column
df['price'] = df['price'].astype(float)
# Convert using dtype string notation
df['quantity'] = df['quantity'].astype('int64')
# Convert using Python type
df['product_id'] = df['product_id'].astype(str)
print(df.dtypes)
# price float64
# quantity int64
# product_id object
Converting Multiple Columns Simultaneously
Use a dictionary to convert multiple columns in one operation, which is more efficient than chaining individual conversions.
df = pd.DataFrame({
    'user_id': ['1', '2', '3'],
    'age': ['25', '30', '35'],
    'salary': ['50000.5', '60000.75', '55000.25'],
    'active': [1, 0, 1]  # ints, not strings: bool('0') is True, so a string flag would convert to all True
})
# Convert multiple columns with dictionary
conversion_dict = {
    'user_id': 'int32',
    'age': 'int8',
    'salary': 'float32',
    'active': 'bool'
}
df = df.astype(conversion_dict)
print(df.dtypes)
print(df)
# user_id age salary active
# 0 1 25 50000.5 True
# 1 2 30 60000.8 False
# 2 3 35 55000.2 True
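When the columns to convert share a common pattern, you don't have to write the dictionary by hand. A minimal sketch (the DataFrame and column names here are illustrative) using select_dtypes to find all string columns and convert them in bulk:

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['1', '2', '3'],
    'b': ['4.5', '5.5', '6.5'],
    'c': [True, False, True],
})

# Select only object (string) columns, then convert each with pd.to_numeric,
# which infers int vs float per column
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].apply(pd.to_numeric)

print(df.dtypes)
```

This leaves non-string columns untouched, which is handy when a DataFrame mixes already-typed and string-typed data.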
Handling Conversion Errors
By default, astype() raises a ValueError when conversion fails. Use the errors parameter to control this behavior.
df = pd.DataFrame({
    'values': ['10', '20', 'invalid', '30']
})
# This will raise ValueError
# df['values'] = df['values'].astype(int)
# Ignore errors - keeps original dtype
df['values_ignored'] = df['values'].astype(int, errors='ignore')
# Better approach: use pd.to_numeric with coerce
df['values_numeric'] = pd.to_numeric(df['values'], errors='coerce')
print(df)
# values values_ignored values_numeric
# 0 10 10 10.0
# 1 20 20 20.0
# 2 invalid invalid NaN
# 3 30 30 30.0
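Before coercing, it is often worth finding out which rows would fail. One sketch of that pattern: coerce into a temporary Series, use its NaN mask to inspect the bad rows, then convert only the clean ones:

```python
import pandas as pd

df = pd.DataFrame({'values': ['10', '20', 'invalid', '30']})

# Coerce into a throwaway Series; failures become NaN
numeric = pd.to_numeric(df['values'], errors='coerce')

# The NaN mask points straight at the offending rows
bad_rows = df[numeric.isna()]
print(bad_rows)

# Convert only the rows that parsed cleanly - astype(int) is now safe
clean = df.loc[numeric.notna(), 'values'].astype(int)
```

This keeps the decision about what to do with bad values (drop, fix, log) explicit instead of silently turning them into NaN.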
Converting to Categorical Types
Categorical types dramatically reduce memory usage for columns with limited unique values. This is particularly effective for columns with repetitive string data.
df = pd.DataFrame({
    'status': ['pending'] * 10000 + ['completed'] * 10000 + ['failed'] * 5000
})
# Check memory before conversion
print(f"Original memory: {df.memory_usage(deep=True)['status'] / 1024:.2f} KB")
# Convert to categorical
df['status'] = df['status'].astype('category')
print(f"Categorical memory: {df.memory_usage(deep=True)['status'] / 1024:.2f} KB")
print(df['status'].dtype)
# Original memory: roughly 1.5 MB for the object strings (exact figure varies by platform and pandas version)
# Categorical memory: ~25 KB (one int8 code per row plus the three category labels)
# Access category codes
print(df['status'].cat.codes.head())
# 0 2
# 1 2
# 2 2
# 3 2
# 4 2
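By default, categories are unordered and sort alphabetically. When the values have a logical order, a CategoricalDtype with ordered=True enables meaningful comparisons and sorting. A small sketch with illustrative size labels:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the logical order explicitly; alphabetical order would put
# 'large' before 'medium' before 'small'
size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)

s = pd.Series(['medium', 'small', 'large', 'small']).astype(size_type)

print(s.min())       # minimum by category order, not alphabetically
print(s > 'small')   # elementwise comparison uses the declared order
```

Ordered categoricals also make .sort_values() respect the declared order, which matters for columns like priority levels or survey responses.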
Numeric Type Optimization
Choosing the appropriate numeric type reduces memory footprint. Use smaller integer types when value ranges permit.
df = pd.DataFrame({
    'tiny_values': [1, 2, 3, 4, 5],
    'medium_values': [100, 200, 300, 400, 500],
    'large_values': [1000000, 2000000, 3000000, 4000000, 5000000]
})
# Default int64 for all
print("Original dtypes:")
print(df.dtypes)
# Optimize based on value ranges
df['tiny_values'] = df['tiny_values'].astype('int8') # -128 to 127
df['medium_values'] = df['medium_values'].astype('int16') # -32,768 to 32,767
df['large_values'] = df['large_values'].astype('int32') # -2B to 2B
print("\nOptimized dtypes:")
print(df.dtypes)
# Check memory savings
original_memory = 5 * 8 * 3 # 5 rows * 8 bytes * 3 columns
optimized_memory = 5 * (1 + 2 + 4) # 5 rows * (1 + 2 + 4 bytes)
print(f"\nMemory reduction: {((original_memory - optimized_memory) / original_memory * 100):.1f}%")
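Picking types by hand requires knowing each column's value range in advance. pd.to_numeric with the downcast parameter automates this, choosing the smallest type that fits. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'tiny': [1, 2, 3, 4, 5],
    'large': [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000],
})

# downcast='integer' scans the values and picks the smallest signed
# integer type that can hold them
df['tiny'] = pd.to_numeric(df['tiny'], downcast='integer')
df['large'] = pd.to_numeric(df['large'], downcast='integer')

print(df.dtypes)
```

downcast also accepts 'unsigned' and 'float' for the corresponding families. Note that downcasting is only safe if future values will stay within the chosen type's range.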
DateTime Conversions
While astype() can convert to datetime, pd.to_datetime() provides more flexibility for parsing various date formats.
df = pd.DataFrame({
    'date_str': ['2024-01-15', '2024-02-20', '2024-03-25'],
    'timestamp': [1705276800, 1708387200, 1711324800]
})
# Convert string to datetime using astype
df['date_dt'] = df['date_str'].astype('datetime64[ns]')
# Convert Unix timestamp (requires pd.to_datetime)
df['timestamp_dt'] = pd.to_datetime(df['timestamp'], unit='s')
# Convert datetime back to string
df['date_formatted'] = df['date_dt'].astype(str)
print(df.dtypes)
print(df[['date_dt', 'timestamp_dt']])
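Where pd.to_datetime() really earns its keep is non-ISO formats and dirty data. A sketch combining an explicit format string with coercion (the day-first dates here are illustrative):

```python
import pandas as pd

s = pd.Series(['15/01/2024', '20/02/2024', 'not a date'])

# An explicit format avoids ambiguity (is 02/03 Feb 3 or Mar 2?) and is
# faster than letting pandas guess; errors='coerce' turns unparseable
# entries into NaT instead of raising
dates = pd.to_datetime(s, format='%d/%m/%Y', errors='coerce')

print(dates)
```

astype('datetime64[ns]') offers neither of these options, which is why it is best reserved for clean ISO-formatted strings.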
Nullable Integer Types
Pandas supports nullable integer types that can hold missing values (shown as pd.NA), unlike standard NumPy integers, which have no missing-value representation.
df = pd.DataFrame({
    'values': [1, 2, None, 4, 5]
})
# Standard int conversion fails with None
# df['values'] = df['values'].astype(int) # ValueError
# Use nullable integer type
df['values'] = df['values'].astype('Int64') # Capital 'I'
print(df)
print(df.dtypes)
# values
# 0 1
# 1 2
# 2 <NA>
# 3 4
# 4 5
# values Int64
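Rather than specifying 'Int64' column by column, convert_dtypes() can infer the best nullable extension dtype for every column at once. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'ints': [1, 2, None],
    'strs': ['a', 'b', None],
    'bools': [True, False, None],
})

# convert_dtypes() picks a nullable dtype per column: Int64 for the
# integers, string for the text, boolean for the flags - all of which
# can hold pd.NA without falling back to object or float64
converted = df.convert_dtypes()

print(converted.dtypes)
```

This is a convenient first step right after loading data, before any manual fine-tuning of individual columns.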
Boolean Conversions
Converting to boolean requires understanding how pandas interprets different values.
df = pd.DataFrame({
    'numeric': [0, 1, 2, -1],
    'strings': ['True', 'False', 'true', 'false'],
    'binary': ['0', '1', '0', '1']
})
# Numeric to bool: 0 is False, everything else is True
df['numeric_bool'] = df['numeric'].astype(bool)
# String to bool requires mapping
bool_map = {'True': True, 'true': True, 'False': False, 'false': False}
df['strings_bool'] = df['strings'].map(bool_map)
# Binary string to bool
df['binary_bool'] = df['binary'].astype(int).astype(bool)
print(df)
# numeric strings binary numeric_bool strings_bool binary_bool
# 0 0 True 0 False True False
# 1 1 False 1 True False True
# 2 2 true 0 True True False
# 3 -1 false 1 True False True
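One caveat with plain bool: like NumPy integers, it has no missing-value representation. The nullable 'boolean' dtype (capital-B BooleanDtype, requested with the lowercase string 'boolean') fills that gap. A sketch going through nullable integers first, since that path is unambiguous:

```python
import pandas as pd

# [1, 0, None] loads as float64 with NaN; route it through nullable
# Int64 first so the missing value is preserved as pd.NA
s = pd.Series([1, 0, None]).astype('Int64')

# Int64 -> boolean keeps 1 as True, 0 as False, and <NA> as <NA>
b = s.astype('boolean')

print(b)
```

With plain astype(bool) the same data would fail outright (NaN cannot be cast to a NumPy bool), so the nullable dtype is the right choice whenever flags can be missing.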
Performance Considerations
Type conversion can be expensive on large datasets. Consider these optimization strategies:
import time
# Create large DataFrame
df = pd.DataFrame({
    'col1': ['100'] * 1000000,
    'col2': ['200'] * 1000000,
    'col3': ['300'] * 1000000
})
# Method 1: Individual conversions
start = time.time()
df['col1'] = df['col1'].astype(int)
df['col2'] = df['col2'].astype(int)
df['col3'] = df['col3'].astype(int)
print(f"Individual conversions: {time.time() - start:.3f}s")
# Reset DataFrame
df = pd.DataFrame({
    'col1': ['100'] * 1000000,
    'col2': ['200'] * 1000000,
    'col3': ['300'] * 1000000
})
# Method 2: Batch conversion
start = time.time()
df = df.astype({'col1': int, 'col2': int, 'col3': int})
print(f"Batch conversion: {time.time() - start:.3f}s")
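The cheapest conversion is the one you never perform: when data comes from a file, dtypes can be declared at read time. A sketch using an in-memory CSV (the column names are illustrative):

```python
import io
import pandas as pd

csv_data = io.StringIO("user_id,score\n1,10.5\n2,20.25\n3,30.0\n")

# Declaring dtypes in read_csv skips the intermediate object/int64
# representation entirely - no separate astype() pass is needed, and
# peak memory stays lower for large files
df = pd.read_csv(csv_data, dtype={'user_id': 'int32', 'score': 'float32'})

print(df.dtypes)
```

The same dtype parameter is accepted by read_csv for real file paths, so this pattern scales directly to large on-disk datasets.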
The astype() method provides explicit control over data types in pandas DataFrames. Proper type selection improves both memory efficiency and computational performance, particularly when working with large datasets. Always validate your data before conversion and choose the most appropriate type for your use case.