Pandas - Convert Column to Integer
Key Insights
• Converting columns to integers in Pandas requires handling null values first, as standard int types cannot represent missing data—use Int64 (nullable integer) or fill/drop nulls before conversion
• The astype(), to_numeric(), and convert_dtypes() methods each serve different use cases: astype for direct conversion, to_numeric for error handling, and convert_dtypes for automatic type inference
• String columns containing numbers with decimals, commas, or other formatting require preprocessing with str.replace() or regex before integer conversion
Basic Integer Conversion with astype()
The astype() method provides the most straightforward approach for converting columns to integers when your data is clean and contains no null values.
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
    'price': [100.0, 200.0, 300.0, 400.0],
    'quantity': [5.0, 10.0, 15.0, 20.0]
})
print(df.dtypes)
# price float64
# quantity float64
# Convert single column to integer
df['price'] = df['price'].astype(int)
# Convert multiple columns
df[['price', 'quantity']] = df[['price', 'quantity']].astype(int)
print(df.dtypes)
# price int64
# quantity int64
The astype() method truncates decimal values rather than rounding them. If you need rounding behavior, apply round() first; note that round() follows NumPy's round-half-to-even rule, so 2.5 rounds to 2:
df = pd.DataFrame({'values': [1.7, 2.3, 3.9, 4.1]})
# Truncation (default behavior)
df['truncated'] = df['values'].astype(int)
# Rounding before conversion
df['rounded'] = df['values'].round().astype(int)
print(df)
# values truncated rounded
# 0 1.7 1 2
# 1 2.3 2 2
# 2 3.9 3 4
# 3 4.1 4 4
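One wrinkle worth knowing: the cast truncates toward zero, so negative values are cut toward zero rather than rounded down. A quick check using the same pattern:

```python
import pandas as pd

df = pd.DataFrame({'values': [-1.7, -2.3, 1.7, 2.3]})

# astype(int) truncates toward zero for both signs
df['truncated'] = df['values'].astype(int)        # [-1, -2, 1, 2]

# round() rounds half to even first, then the cast is exact
df['rounded'] = df['values'].round().astype(int)  # [-2, -2, 2, 2]
```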
Handling Null Values with Nullable Integer Types
Standard integer types in NumPy cannot represent missing values. When your column contains nulls, use Pandas’ nullable integer dtype (Int64, Int32, Int16, Int8).
df = pd.DataFrame({
    'sales': [100.0, 200.0, np.nan, 400.0],
    'returns': [5.0, np.nan, 10.0, 15.0]
})
# This will raise an error
# df['sales'] = df['sales'].astype(int)
# ValueError: Cannot convert non-finite values (NA or inf) to integer
# Use nullable integer type instead
df['sales'] = df['sales'].astype('Int64')
df['returns'] = df['returns'].astype('Int64')
print(df)
# sales returns
# 0 100 5
# 1 200 <NA>
# 2 <NA> 10
# 3 400 15
print(df.dtypes)
# sales Int64
# returns Int64
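The nullable dtype also survives arithmetic, with <NA> propagating through the result; a minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([100.0, 200.0, np.nan, 400.0]).astype('Int64')

# Arithmetic keeps the Int64 dtype and propagates <NA>
doubled = s * 2
print(doubled.dtype)         # Int64
print(doubled.isna().sum())  # 1
```

One caveat: libraries that expect plain NumPy int64 arrays may not accept the extension dtype, so convert back with astype(int) after filling nulls if needed.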
Alternative approaches include filling or dropping null values before conversion:
df = pd.DataFrame({
    'score': [85.0, 90.0, np.nan, 78.0, np.nan]
})
# Fill nulls with a default value
df['score_filled'] = df['score'].fillna(0).astype(int)
# Drop nulls before converting; note that assigning back to df realigns on the
# index, so the dropped positions reappear as NaN and the column stays float64
df['score_dropped'] = df['score'].dropna().astype(int)
# Forward fill
df['score_ffill'] = df['score'].ffill().astype(int)
print(df)
# score score_filled score_dropped score_ffill
# 0 85.0 85 85.0 85
# 1 90.0 90 90.0 90
# 2 NaN 0 NaN 90
# 3 78.0 78 78.0 78
# 4 NaN 0 NaN 78
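Because the dropna() assignment above realigns on the index (reintroducing NaN and a float dtype), dropping at the DataFrame level first is the way to get a genuinely integer column; a sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [85.0, 90.0, np.nan, 78.0, np.nan]})

# Drop the rows first, then convert: the column is a true integer column
clean = df.dropna(subset=['score']).astype({'score': int})
print(clean)
#    score
# 0     85
# 1     90
# 3     78
```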
Using pd.to_numeric() for Robust Conversion
The pd.to_numeric() function offers more control over error handling when converting strings or mixed-type columns to integers.
df = pd.DataFrame({
    'id': ['1', '2', '3', 'invalid', '5'],
    'amount': ['100', '200.5', '300', 'N/A', '400']
})
# Coerce errors to NaN
df['id_numeric'] = pd.to_numeric(df['id'], errors='coerce').astype('Int64')
# Ignore errors (keeps original values; deprecated since pandas 2.2)
df['amount_ignore'] = pd.to_numeric(df['amount'], errors='ignore')
# Raise errors (default)
try:
    df['amount_strict'] = pd.to_numeric(df['amount'], errors='raise')
except ValueError as e:
    print(f"Error: {e}")
print(df)
# id amount id_numeric amount_ignore
# 0 1 100 1 100
# 1 2 200.5 2 200.5
# 2 3 300 3 300
# 3 invalid N/A <NA> N/A
# 4 5 400 5 400
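Since errors='ignore' is deprecated in recent pandas versions, a forward-compatible way to keep unparseable originals is to coerce and then restore the failures; a sketch using where():

```python
import pandas as pd

s = pd.Series(['100', '200.5', '300', 'N/A', '400'])

# Parse what we can; unparseable entries become NaN
parsed = pd.to_numeric(s, errors='coerce')

# Keep parsed numbers, fall back to the original string elsewhere
kept = parsed.astype(object).where(parsed.notna(), s)
print(kept.tolist())  # [100.0, 200.5, 300.0, 'N/A', 400.0]
```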
Combine to_numeric() with downcast for memory optimization:
df = pd.DataFrame({
    'small_numbers': ['1', '2', '3', '4', '5']
})
# Default conversion to int64
df['default'] = pd.to_numeric(df['small_numbers'])
# Downcast to smallest possible integer type
df['optimized'] = pd.to_numeric(df['small_numbers'], downcast='integer')
print(df.dtypes)
# small_numbers object
# default int64
# optimized int8
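To see roughly what downcast saves, compare per-column memory before and after (a sketch; the default integer width can vary by platform):

```python
import pandas as pd

s = pd.Series(range(1000))                    # typically int64 (8 bytes/value)
small = pd.to_numeric(s, downcast='integer')  # values fit in int16 (2 bytes/value)

print(s.memory_usage(index=False))      # typically 8000
print(small.memory_usage(index=False))  # 2000
```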
Converting Formatted String Columns
Real-world data often contains formatting like currency symbols, commas, or percentage signs that must be removed before conversion.
df = pd.DataFrame({
    'price': ['$1,200', '$2,500.50', '$3,000', '$450.99'],
    'percentage': ['85%', '90.5%', '78%', '92.3%'],
    'quantity': ['1,000', '2,500', '3,750', '500']
})
# Remove currency symbols and commas
df['price_int'] = (df['price']
                   .str.replace('$', '', regex=False)
                   .str.replace(',', '', regex=False)
                   .astype(float)
                   .round()
                   .astype(int))
# Remove percentage signs
df['percentage_int'] = (df['percentage']
                        .str.replace('%', '', regex=False)
                        .astype(float)
                        .round()
                        .astype(int))
# Remove commas from quantity
df['quantity_int'] = (df['quantity']
                      .str.replace(',', '', regex=False)
                      .astype(int))
print(df)
#        price percentage quantity  price_int  percentage_int  quantity_int
# 0     $1,200        85%    1,000       1200              85          1000
# 1  $2,500.50      90.5%    2,500       2500              90          2500
# 2     $3,000        78%    3,750       3000              78          3750
# 3    $450.99      92.3%      500        451              92           500
For complex formatting, use regex patterns:
df = pd.DataFrame({
    'mixed': ['1,200.50', '(500)', '3000', '+750.25', 'N/A']
})
# Handle parentheses as negative numbers first, before they are stripped
df['cleaned'] = df['mixed'].str.replace(r'\(([\d,.]+)\)', r'-\1', regex=True)
# Then remove all characters except digits, minus sign, and decimal point
df['cleaned'] = df['cleaned'].str.replace(r'[^\d.-]', '', regex=True)
# Convert to numeric, coercing errors; round before the cast, because
# astype('Int64') refuses floats with fractional parts
df['numeric'] = pd.to_numeric(df['cleaned'], errors='coerce').round().astype('Int64')
print(df)
#       mixed  cleaned  numeric
# 0  1,200.50  1200.50     1200
# 1     (500)     -500     -500
# 2      3000     3000     3000
# 3   +750.25   750.25      750
# 4       N/A              <NA>
Converting Multiple Columns Efficiently
When converting multiple columns, pass a column-to-dtype dictionary to astype(), or select columns by dtype with select_dtypes().
df = pd.DataFrame({
    'col1': [1.5, 2.7, 3.9],
    'col2': [10.1, 20.3, 30.5],
    'col3': ['100', '200', '300'],
    'col4': ['A', 'B', 'C']  # Keep as string
})
# Convert specific columns using dictionary
convert_dict = {
    'col1': int,
    'col2': int,
    'col3': int
}
df = df.astype(convert_dict)
print(df.dtypes)
# col1 int64
# col2 int64
# col3 int64
# col4 object
Use select_dtypes() to convert all numeric columns:
df = pd.DataFrame({
    'float1': [1.5, 2.7, 3.9],
    'float2': [10.1, 20.3, 30.5],
    'text': ['A', 'B', 'C']
})
# Select and convert all float columns
float_cols = df.select_dtypes(include=['float64']).columns
df[float_cols] = df[float_cols].round().astype(int)
print(df.dtypes)
# float1 int64
# float2 int64
# text object
For columns with mixed requirements, use a loop with conditional logic:
df = pd.DataFrame({
    'id': [1.0, 2.0, np.nan, 4.0],
    'score': [85.5, 90.0, 78.3, np.nan],
    'count': [100.0, 200.0, 300.0, 400.0]
})
for col in df.columns:
    if df[col].isna().any():
        # Round, then use a nullable integer for columns with NaN;
        # astype('Int64') raises on floats with fractional parts
        df[col] = df[col].round().astype('Int64')
    else:
        # Use standard integer for complete columns
        df[col] = df[col].astype(int)
print(df.dtypes)
# id Int64
# score Int64
# count int64
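Finally, the convert_dtypes() method mentioned in the key insights infers nullable dtypes automatically, turning whole-number float columns into Int64 without any manual null handling; a minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'whole': [1.0, 2.0, np.nan],  # floats that are all whole numbers
    'frac': [1.5, 2.5, 3.5],      # genuinely fractional floats
    'text': ['a', 'b', 'c']
})

converted = df.convert_dtypes()
print(converted.dtypes)  # whole: Int64, frac: Float64, text: string dtype
```

This is convenient for a quick first pass, but it offers no control over rounding or error handling, so the explicit astype() and to_numeric() approaches above remain useful for messy data.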