Pandas - Convert Column to Integer

• Converting columns to integers in Pandas requires handling null values first, as standard int types cannot represent missing data—use Int64 (nullable integer) or fill/drop nulls before conversion

Key Insights

• Converting columns to integers in Pandas requires handling null values first, as standard int types cannot represent missing data—use Int64 (nullable integer) or fill/drop nulls before conversion • The astype(), to_numeric(), and convert_dtypes() methods each serve different use cases: astype for direct conversion, to_numeric for error handling, and convert_dtypes for automatic type inference • String columns containing numbers with decimals, commas, or other formatting require preprocessing with str.replace() or regex before integer conversion

Basic Integer Conversion with astype()

The astype() method provides the most straightforward approach for converting columns to integers when your data is clean and contains no null values.

import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'price': [100.0, 200.0, 300.0, 400.0],
    'quantity': [5.0, 10.0, 15.0, 20.0]
})

print(df.dtypes)
# price       float64
# quantity    float64

# Convert single column to integer
df['price'] = df['price'].astype(int)

# Convert multiple columns
df[['price', 'quantity']] = df[['price', 'quantity']].astype(int)

print(df.dtypes)
# price       int64
# quantity    int64

The astype() method truncates decimal values rather than rounding them. If you need rounding behavior, apply round() first:

df = pd.DataFrame({'values': [1.7, 2.3, 3.9, 4.1]})

# Truncation (default behavior)
df['truncated'] = df['values'].astype(int)

# Rounding before conversion
df['rounded'] = df['values'].round().astype(int)

print(df)
#    values  truncated  rounded
# 0     1.7          1        2
# 1     2.3          2        2
# 2     3.9          3        4
# 3     4.1          4        4

Handling Null Values with Nullable Integer Types

Standard integer types in NumPy cannot represent missing values. When your column contains nulls, use Pandas’ nullable integer dtype (Int64, Int32, Int16, Int8).

df = pd.DataFrame({
    'sales': [100.0, 200.0, np.nan, 400.0],
    'returns': [5.0, np.nan, 10.0, 15.0]
})

# This will raise an error
# df['sales'] = df['sales'].astype(int)
# ValueError: Cannot convert non-finite values (NA or inf) to integer

# Use nullable integer type instead
df['sales'] = df['sales'].astype('Int64')
df['returns'] = df['returns'].astype('Int64')

print(df)
#    sales  returns
# 0    100        5
# 1    200     <NA>
# 2   <NA>       10
# 3    400       15

print(df.dtypes)
# sales      Int64
# returns    Int64

Alternative approaches include filling or dropping null values before conversion:

df = pd.DataFrame({
    'score': [85.0, 90.0, np.nan, 78.0, np.nan]
})

# Fill nulls with a default value
df['score_filled'] = df['score'].fillna(0).astype(int)

# Drop rows with nulls
df['score_dropped'] = df['score'].dropna().astype(int)

# Forward fill
df['score_ffill'] = df['score'].ffill().astype(int)

print(df)
#    score  score_filled  score_dropped  score_ffill
# 0   85.0            85           85.0           85
# 1   90.0            90           90.0           90
# 2    NaN             0            NaN           90
# 3   78.0            78           78.0           78
# 4    NaN             0            NaN           78

Using pd.to_numeric() for Robust Conversion

The pd.to_numeric() function offers more control over error handling when converting strings or mixed-type columns to integers.

df = pd.DataFrame({
    'id': ['1', '2', '3', 'invalid', '5'],
    'amount': ['100', '200.5', '300', 'N/A', '400']
})

# Coerce errors to NaN
df['id_numeric'] = pd.to_numeric(df['id'], errors='coerce').astype('Int64')

# Ignore errors (keeps original values)
df['amount_ignore'] = pd.to_numeric(df['amount'], errors='ignore')

# Raise errors (default)
try:
    df['amount_strict'] = pd.to_numeric(df['amount'], errors='raise')
except ValueError as e:
    print(f"Error: {e}")

print(df)
#        id amount  id_numeric amount_ignore
# 0       1    100           1           100
# 1       2  200.5           2         200.5
# 2       3    300           3           300
# 3 invalid    N/A        <NA>           N/A
# 4       5    400           5           400

Combine to_numeric() with downcast for memory optimization:

df = pd.DataFrame({
    'small_numbers': ['1', '2', '3', '4', '5']
})

# Default conversion to int64
df['default'] = pd.to_numeric(df['small_numbers'])

# Downcast to smallest possible integer type
df['optimized'] = pd.to_numeric(df['small_numbers'], downcast='integer')

print(df.dtypes)
# small_numbers    object
# default           int64
# optimized          int8

Converting Formatted String Columns

Real-world data often contains formatting like currency symbols, commas, or percentage signs that must be removed before conversion.

df = pd.DataFrame({
    'price': ['$1,200', '$2,500.50', '$3,000', '$450.99'],
    'percentage': ['85%', '90.5%', '78%', '92.3%'],
    'quantity': ['1,000', '2,500', '3,750', '500']
})

# Remove currency symbols and commas
df['price_int'] = (df['price']
                   .str.replace('$', '', regex=False)
                   .str.replace(',', '', regex=False)
                   .astype(float)
                   .round()
                   .astype(int))

# Remove percentage signs
df['percentage_int'] = (df['percentage']
                        .str.replace('%', '', regex=False)
                        .astype(float)
                        .round()
                        .astype(int))

# Remove commas from quantity
df['quantity_int'] = (df['quantity']
                      .str.replace(',', '', regex=False)
                      .astype(int))

print(df)
#        price percentage quantity  price_int  percentage_int  quantity_int
# 0    $1,200        85%    1,000       1200              85          1000
# 2  $2,500.50      90.5%    2,500       2501              91          2,500

For complex formatting, use regex patterns:

df = pd.DataFrame({
    'mixed': ['1,200.50', '(500)', '3000', '+750.25', 'N/A']
})

# Remove all non-numeric characters except minus sign and decimal
df['cleaned'] = df['mixed'].str.replace(r'[^\d.-]', '', regex=True)

# Handle parentheses as negative numbers
df['cleaned'] = df['cleaned'].str.replace(r'\((\d+)\)', r'-\1', regex=True)

# Convert to numeric, coercing errors
df['numeric'] = pd.to_numeric(df['cleaned'], errors='coerce').astype('Int64')

print(df)
#       mixed  cleaned  numeric
# 0  1,200.50  1200.50     1201
# 1     (500)     -500     -500
# 2      3000     3000     3000
# 3   +750.25   750.25      750
# 4       N/A              <NA>

Converting Multiple Columns Efficiently

When converting multiple columns, apply transformations efficiently using dictionary mapping or apply methods.

df = pd.DataFrame({
    'col1': [1.5, 2.7, 3.9],
    'col2': [10.1, 20.3, 30.5],
    'col3': ['100', '200', '300'],
    'col4': ['A', 'B', 'C']  # Keep as string
})

# Convert specific columns using dictionary
convert_dict = {
    'col1': int,
    'col2': int,
    'col3': int
}

df = df.astype(convert_dict)

print(df.dtypes)
# col1    int64
# col2    int64
# col3    int64
# col4    object

Use select_dtypes() to convert all numeric columns:

df = pd.DataFrame({
    'float1': [1.5, 2.7, 3.9],
    'float2': [10.1, 20.3, 30.5],
    'text': ['A', 'B', 'C']
})

# Select and convert all float columns
float_cols = df.select_dtypes(include=['float64']).columns
df[float_cols] = df[float_cols].round().astype(int)

print(df.dtypes)
# float1    int64
# float2    int64
# text     object

For columns with mixed requirements, use a loop with conditional logic:

df = pd.DataFrame({
    'id': [1.0, 2.0, np.nan, 4.0],
    'score': [85.5, 90.0, 78.3, np.nan],
    'count': [100.0, 200.0, 300.0, 400.0]
})

for col in df.columns:
    if df[col].isna().any():
        # Use nullable integer for columns with NaN
        df[col] = df[col].astype('Int64')
    else:
        # Use standard integer for complete columns
        df[col] = df[col].astype(int)

print(df.dtypes)
# id       Int64
# score    Int64
# count    int64

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.