# How to Check Data Types in Pandas
## Key Insights

- Use `df.dtypes` for quick inspection and `df.info()` for memory-aware analysis, but rely on `pd.api.types` functions for programmatic validation in production code.
- Object dtype is Pandas’ catch-all type that often hides data quality issues—always investigate object columns for mixed types or incorrectly parsed data.
- Checking data types immediately after loading data prevents subtle bugs that surface much later in your analysis pipeline.
## Introduction
Data types in Pandas aren’t just metadata—they determine what operations you can perform, how much memory your DataFrame consumes, and whether your calculations produce correct results. A column that looks numeric might actually be stored as strings, silently breaking your aggregations. A datetime column parsed as objects will fail every time-series operation you throw at it.
The most common source of dtype problems? Loading data. When you read a CSV, Pandas infers types based on the values it sees. A single “N/A” string in an otherwise numeric column forces the entire column to object dtype. A date format Pandas doesn’t recognize becomes a string. These issues compound quickly in real-world datasets.
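To see this inference in action, here's a minimal sketch using an in-memory CSV (the column names and the `unknown` placeholder are invented for illustration):

```python
import io
import pandas as pd

# One unrecognized placeholder string drags the whole column to object dtype
raw = "order_id,amount\n1,19.99\n2,unknown\n3,5.00"
df_raw = pd.read_csv(io.StringIO(raw))
print(df_raw['amount'].dtype)  # object

# Declaring the placeholder via na_values lets pandas infer float64 again
df_clean = pd.read_csv(io.StringIO(raw), na_values=['unknown'])
print(df_clean['amount'].dtype)  # float64
```

Passing `na_values` (or cleaning the column after loading) restores the numeric dtype, and the bad cells become NaN you can handle explicitly.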
This article covers every method for inspecting and validating data types in Pandas, from quick interactive checks to robust programmatic validation you can build into data pipelines.
## Understanding Pandas Data Types
Pandas uses NumPy dtypes as its foundation but extends them with additional types optimized for tabular data. Here are the types you’ll encounter most often:
| Dtype | Description | Example Values |
|---|---|---|
| `int64` | 64-bit integer | 1, -5, 1000000 |
| `float64` | 64-bit floating point | 3.14, -0.001, NaN |
| `object` | Python objects (usually strings) | "hello", mixed types |
| `bool` | Boolean | True, False |
| `datetime64[ns]` | Timestamp with nanosecond precision | 2024-01-15 |
| `timedelta64[ns]` | Duration | 5 days, 3 hours |
| `category` | Categorical data | "red", "green", "blue" |
The object dtype deserves special attention. It’s Pandas’ fallback type that can hold any Python object. While flexible, it’s memory-inefficient and prevents vectorized operations. When you see object, investigate whether it should be a more specific type.
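To make the memory cost concrete, here's a small sketch comparing the same repeated labels stored as object versus category (exact byte counts vary by pandas version, so only the relative difference matters):

```python
import pandas as pd

# 300,000 rows drawn from just three repeated labels
colors = pd.Series(['red', 'green', 'blue'] * 100_000)

as_object = colors.memory_usage(deep=True)
as_category = colors.astype('category').memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
# The categorical version stores each label once plus small integer codes,
# so it comes out many times smaller here
```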
Let’s create a sample DataFrame we’ll use throughout this article:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user_id': [1001, 1002, 1003, 1004, 1005],
    'username': ['alice', 'bob', 'charlie', 'diana', 'eve'],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05', '2023-05-12']),
    'account_balance': [150.50, 200.00, 75.25, 500.00, 0.00],
    'is_premium': [True, False, False, True, False],
    'subscription_tier': pd.Categorical(['gold', 'free', 'free', 'platinum', 'free']),
    'last_login': ['2024-01-10', '2024-01-08', None, '2024-01-11', '2024-01-09'],  # Intentionally not parsed
    'referral_code': [None, 'REF100', None, 'REF200', 'REF150']
})
```
This DataFrame includes integers, strings, datetimes, floats, booleans, categoricals, and a couple of columns with issues we’ll detect later.
## Checking Data Types with dtypes and info()
The dtypes attribute returns a Series mapping column names to their data types. It’s the fastest way to see what you’re working with:
```python
print(df.dtypes)
```

Output:

```
user_id                      int64
username                    object
signup_date         datetime64[ns]
account_balance            float64
is_premium                    bool
subscription_tier         category
last_login                  object
referral_code               object
dtype: object
```
Notice that last_login is object rather than datetime64—that’s because we passed strings without parsing them. This is exactly the kind of issue dtype checking catches.
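The fix is a single `pd.to_datetime` call, sketched here on a standalone copy of the column so the sample DataFrame keeps its intentional flaw:

```python
import pandas as pd

# Re-create the unparsed column from the sample DataFrame above
logins = pd.Series(['2024-01-10', '2024-01-08', None, '2024-01-11', '2024-01-09'],
                   name='last_login')
print(logins.dtype)  # object

parsed = pd.to_datetime(logins)
print(parsed.dtype)          # datetime64[ns]
print(parsed.isna().sum())   # 1 -- the None became NaT
```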
For a single column, access the dtype attribute directly:
```python
print(df['signup_date'].dtype)      # datetime64[ns]
print(df['account_balance'].dtype)  # float64
```
The info() method provides a more comprehensive view, including non-null counts and memory usage:
```python
df.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   user_id            5 non-null      int64
 1   username           5 non-null      object
 2   signup_date        5 non-null      datetime64[ns]
 3   account_balance    5 non-null      float64
 4   is_premium         5 non-null      bool
 5   subscription_tier  5 non-null      category
 6   last_login         4 non-null      object
 7   referral_code      3 non-null      object
dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 449.0+ bytes
```
For large DataFrames, add memory_usage='deep' to get accurate memory consumption for object columns:
```python
df.info(memory_usage='deep')
```
This reveals the true memory cost of storing strings, which can be substantially higher than the default estimate suggests.
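A quick sketch of the difference (the exact byte counts depend on your platform, but deep counting is reliably much larger for string columns):

```python
import pandas as pd

words = pd.DataFrame({'word': ['antidisestablishmentarianism'] * 10_000})

shallow = words.memory_usage().sum()         # counts only the 8-byte object pointers
deep = words.memory_usage(deep=True).sum()   # counts the string payloads too

print(f"shallow: {shallow:,} bytes")
print(f"deep:    {deep:,} bytes")
```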
## Selecting Columns by Data Type
When you need to operate on all columns of a specific type, select_dtypes() filters your DataFrame accordingly:
```python
# Select all numeric columns
numeric_cols = df.select_dtypes(include=['number'])
print(numeric_cols.columns.tolist())
# ['user_id', 'account_balance']

# Select all object (string) columns
string_cols = df.select_dtypes(include=['object'])
print(string_cols.columns.tolist())
# ['username', 'last_login', 'referral_code']
```
You can include multiple types and use exclusion:
```python
# Include integers and floats explicitly
df.select_dtypes(include=['int64', 'float64'])

# Exclude object and category columns
df.select_dtypes(exclude=['object', 'category'])

# Combine include and exclude
df.select_dtypes(include=['number'], exclude=['int64'])  # Only floats
```
The include parameter accepts NumPy dtype names, Python types, or these convenient shortcuts:
- `'number'` — all numeric types
- `'datetime'` — datetime types
- `'timedelta'` — timedelta types
- `'category'` — categorical type
- `'bool'` — boolean type
A practical use case—applying string operations only to string columns:
```python
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.strip()  # .str methods skip NaN/None automatically
```
## Programmatic Type Checking
For validation logic in scripts and pipelines, the pd.api.types module provides functions that return boolean values:
```python
from pandas.api.types import (
    is_numeric_dtype,
    is_string_dtype,
    is_datetime64_any_dtype,
    is_categorical_dtype,
    is_bool_dtype,
    is_integer_dtype,
    is_float_dtype
)

# Check individual columns
print(is_numeric_dtype(df['account_balance']))        # True
print(is_datetime64_any_dtype(df['signup_date']))     # True
print(is_string_dtype(df['username']))                # True
print(is_categorical_dtype(df['subscription_tier']))  # True
```
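The same predicates compose into a quick whole-DataFrame type report. This sketch (with an invented sample frame) checks `is_bool_dtype` before `is_numeric_dtype`, because pandas considers boolean columns numeric:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_datetime64_any_dtype, is_bool_dtype

def type_report(df):
    """Classify every column using the pd.api.types predicates."""
    kinds = {}
    for col in df.columns:
        s = df[col]
        if is_bool_dtype(s):          # must come before the numeric check
            kinds[col] = 'bool'
        elif is_numeric_dtype(s):
            kinds[col] = 'numeric'
        elif is_datetime64_any_dtype(s):
            kinds[col] = 'datetime'
        elif isinstance(s.dtype, pd.CategoricalDtype):
            kinds[col] = 'category'
        else:
            kinds[col] = 'other'
    return kinds

sample = pd.DataFrame({
    'n': [1, 2],
    'when': pd.to_datetime(['2024-01-01', '2024-01-02']),
    'flag': [True, False],
    'label': ['a', 'b'],
})
print(type_report(sample))
# {'n': 'numeric', 'when': 'datetime', 'flag': 'bool', 'label': 'other'}
```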
One caveat: recent pandas versions (2.x) deprecate `is_categorical_dtype`; the forward-compatible check is `isinstance(series.dtype, pd.CategoricalDtype)`. These functions shine in validation code:
```python
def validate_dataframe(df):
    """Validate expected data types for user data."""
    errors = []
    if not is_integer_dtype(df['user_id']):
        errors.append("user_id must be integer type")
    if not is_datetime64_any_dtype(df['signup_date']):
        errors.append("signup_date must be datetime type")
    if not is_numeric_dtype(df['account_balance']):
        errors.append("account_balance must be numeric type")
    if not is_bool_dtype(df['is_premium']):
        errors.append("is_premium must be boolean type")
    if errors:
        raise ValueError(f"Validation failed: {'; '.join(errors)}")
    return True

# This passes
validate_dataframe(df)
```
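It's worth confirming the failure path too. A common way a column like `user_id` loses its integer dtype is a single missing value, which silently upcasts it to float64; the nullable `Int64` extension dtype is one way to repair that (a standalone sketch, not tied to the sample DataFrame above):

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# A user_id column that picked up a missing value is upcast to float64,
# exactly the kind of drift the validation function would flag
broken = pd.DataFrame({'user_id': [1001, 1002, None]})
print(broken['user_id'].dtype)  # float64

if not is_integer_dtype(broken['user_id']):
    print("validation would fail: user_id must be integer type")

# The nullable Int64 extension dtype keeps integers alongside missing values
repaired = broken['user_id'].astype('Int64')
print(repaired.dtype)  # Int64
```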
You can also build type-aware processing functions:
```python
def summarize_column(series):
    """Generate appropriate summary based on column type."""
    if is_numeric_dtype(series):
        return series.describe()
    elif is_datetime64_any_dtype(series):
        return pd.Series({
            'min': series.min(),
            'max': series.max(),
            'range': series.max() - series.min()
        })
    elif is_categorical_dtype(series):
        return series.value_counts()
    else:
        return pd.Series({
            'unique': series.nunique(),
            'most_common': series.mode().iloc[0] if not series.mode().empty else None
        })
```
## Common Data Type Issues and Detection
Real-world data rarely arrives clean. Here are the issues you’ll encounter most often and how to detect them.
**Mixed types in object columns** occur when a column contains values of different Python types:
```python
# Create a problematic column
df_messy = pd.DataFrame({
    'value': [1, 2, 'three', 4, None, 6.0]
})

# Detect mixed types
type_counts = df_messy['value'].apply(type).value_counts()
print(type_counts)
```

Output:

```
<class 'int'>         3
<class 'str'>         1
<class 'float'>       1
<class 'NoneType'>    1
dtype: int64
```
A helper function to check all object columns:
```python
def detect_mixed_types(df):
    """Find object columns with mixed Python types."""
    mixed = {}
    for col in df.select_dtypes(include=['object']).columns:
        types = df[col].dropna().apply(type).unique()
        if len(types) > 1:
            mixed[col] = [t.__name__ for t in types]
    return mixed

print(detect_mixed_types(df_messy))
# {'value': ['int', 'str', 'float']}
```
**Numeric data stored as strings** happens when CSVs contain formatting characters:
```python
df_currency = pd.DataFrame({
    'price': ['$100.00', '$250.50', '$75.25', 'N/A', '$300.00']
})
print(df_currency['price'].dtype)  # object

# Detect: try converting and see what fails
def check_numeric_convertibility(series):
    """Check if string series can be converted to numeric."""
    converted = pd.to_numeric(series, errors='coerce')
    failed = series[converted.isna() & series.notna()]
    return failed

print(check_numeric_convertibility(df_currency['price']))
```
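Once you know which values fail, the repair is usually a string cleanup followed by a coerced conversion; a sketch on the same currency data:

```python
import pandas as pd

prices = pd.Series(['$100.00', '$250.50', '$75.25', 'N/A', '$300.00'], name='price')

# Strip the currency symbol, then coerce; only the genuine 'N/A' becomes NaN
cleaned = pd.to_numeric(prices.str.replace('$', '', regex=False), errors='coerce')
print(cleaned.dtype)     # float64
print(cleaned.tolist())  # [100.0, 250.5, 75.25, nan, 300.0]
```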
**Datetime parsing failures** leave you with object columns instead of datetime64:
```python
def check_datetime_columns(df, date_columns):
    """Verify expected date columns are actually datetime type."""
    issues = {}
    for col in date_columns:
        if col not in df.columns:
            issues[col] = "Column not found"
        elif not is_datetime64_any_dtype(df[col]):
            issues[col] = f"Expected datetime, got {df[col].dtype}"
    return issues

# Check our sample DataFrame
print(check_datetime_columns(df, ['signup_date', 'last_login']))
# {'last_login': 'Expected datetime, got object'}
```
## Conclusion
Data type checking should be the first step after loading any dataset. Start with `df.dtypes` or `df.info()` for interactive exploration, use `select_dtypes()` when you need to operate on columns by type, and rely on `pd.api.types` functions for validation in production code.
Build type validation into your data pipelines early. A validation function that runs immediately after data loading catches dtype issues before they propagate through your analysis. The few minutes spent writing these checks saves hours of debugging mysterious calculation errors later.
The pattern I recommend: load data, check types, fix issues, validate, then proceed with analysis. Make this habitual, and you’ll eliminate an entire category of data bugs from your work.
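As a closing sketch, the whole pattern might look like this in a pipeline (the CSV layout and the `when` column are hypothetical):

```python
import io
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

def load_and_validate(csv_text):
    """Load -> check -> fix -> validate, then hand off for analysis."""
    df = pd.read_csv(io.StringIO(csv_text))      # 1. load
    if not is_datetime64_any_dtype(df['when']):  # 2. check
        df['when'] = pd.to_datetime(df['when'])  # 3. fix
    if not is_datetime64_any_dtype(df['when']):  # 4. validate
        raise ValueError("when must be datetime")
    return df                                    # 5. proceed

df_ok = load_and_validate("when,value\n2024-01-01,10\n2024-01-02,20")
print(df_ok.dtypes)
```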