How to Check DataFrame Info in Pandas

Key Insights

  • The df.info() method provides the most comprehensive single-call summary of your DataFrame, showing column types, non-null counts, and memory usage in one output
  • Understanding memory usage with memory_usage(deep=True) is critical for object columns, as the default calculation significantly underestimates actual memory consumption
  • Combining shape, dtypes, and info() gives you a complete picture of your data structure in seconds, making them essential first steps when exploring any new dataset

Introduction

Every data analysis project starts the same way: you load a dataset and immediately need to understand what you’re working with. How many rows? What columns exist? Are there missing values? What data types did pandas infer?

These aren’t idle questions. Misidentified data types cause silent bugs. Unexpected null values break calculations. Memory-hungry DataFrames crash your notebook. The difference between a smooth analysis and hours of debugging often comes down to those first few minutes of inspection.

Pandas provides several methods to examine DataFrame structure and metadata. Knowing which tool to reach for—and when—separates efficient data work from frustrating trial and error. This guide covers the essential inspection methods you’ll use daily.

Quick Overview with df.info()

The info() method is your first stop when exploring any DataFrame. It packs column names, data types, non-null counts, and memory usage into a single, readable output.

import pandas as pd
import numpy as np

# Create a sample DataFrame with mixed types
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'username': ['alice', 'bob', 'charlie', None, 'eve'],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05', '2023-05-12']),
    'account_balance': [150.50, 200.00, None, 75.25, 300.00],
    'is_premium': [True, False, True, False, True]
})

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   user_id          5 non-null      int64         
 1   username         4 non-null      object        
 2   signup_date      5 non-null      datetime64[ns]
 3   account_balance  4 non-null      float64       
 4   is_premium       5 non-null      bool          
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 253.0+ bytes

This output tells you everything essential: you have 5 rows, 5 columns, and two columns with missing values (username and account_balance show 4 non-null instead of 5). The data types are correctly inferred—datetime for dates, bool for the premium flag.
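
If you want those null counts as data rather than printed text, `isna().sum()` returns them as a Series you can act on programmatically. A small sketch using a trimmed version of the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'username': ['alice', 'bob', 'charlie', None, 'eve'],
    'account_balance': [150.50, 200.00, None, 75.25, 300.00],
})

# Count missing values per column
null_counts = df.isna().sum()
print(null_counts)

# Columns that contain at least one null
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)  # ['username', 'account_balance']
```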

For DataFrames with many columns, control the output verbosity:

# Show all columns regardless of DataFrame width
df.info(verbose=True)

# Truncate output for wide DataFrames
df.info(verbose=False)

# Skip the null count calculation (faster for large DataFrames)
df.info(show_counts=False)

Note that the older null_counts parameter was deprecated in pandas 1.2 and removed in 2.0; use show_counts instead.
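
By default, info() prints to stdout and returns None. If you want the summary as a string, e.g. for logging, the buf parameter accepts any writable buffer:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', None]})

buffer = io.StringIO()
df.info(buf=buffer)          # write the summary into the buffer
summary = buffer.getvalue()  # retrieve it as a plain string

print(summary)
```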

Checking Data Types with df.dtypes

When you need just the data types without the extra context, dtypes returns a clean Series mapping column names to their types.

print(df.dtypes)

Output:

user_id                     int64
username                   object
signup_date        datetime64[ns]
account_balance           float64
is_premium                   bool
dtype: object

This becomes particularly useful when validating types programmatically or planning type conversions:

# Check if a specific column is numeric (covers all int/float widths,
# unlike comparing against a hard-coded list of dtype names)
if pd.api.types.is_numeric_dtype(df['account_balance']):
    print("Column is numeric, safe for calculations")

# Find all object (string) columns
object_columns = df.select_dtypes(include=['object']).columns.tolist()
print(f"String columns: {object_columns}")

# Find all numeric columns
numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric columns: {numeric_columns}")

The select_dtypes() method pairs well with dtypes for filtering columns by type—essential when applying transformations to specific column categories.
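
One common programmatic use of dtypes is validating a DataFrame against an expected schema before processing. A minimal sketch (the expected mapping here is a hypothetical schema for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2],
    'username': ['alice', 'bob'],
    'account_balance': [150.5, 200.0],
})

# Hypothetical expected schema for this DataFrame
expected = {'user_id': 'int64', 'username': 'object', 'account_balance': 'float64'}

# Compare actual dtypes against the expected schema
mismatches = {
    col: (expected[col], str(dtype))
    for col, dtype in df.dtypes.items()
    if col in expected and str(dtype) != expected[col]
}

print(mismatches or "Schema OK")
```

Catching a mismatch here, before any calculations run, is far cheaper than tracing a silent type bug later.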

Inspecting Shape and Size

Three properties give you dimensional information, each with distinct use cases.

# Create a larger example
large_df = pd.DataFrame(np.random.randn(1000, 25))

# Shape returns (rows, columns) as a tuple
print(f"Shape: {large_df.shape}")
print(f"Rows: {large_df.shape[0]}, Columns: {large_df.shape[1]}")

# Size returns total element count (rows × columns)
print(f"Size: {large_df.size}")

# len() returns row count only
print(f"Length: {len(large_df)}")

Output:

Shape: (1000, 25)
Rows: 1000, Columns: 25
Size: 25000
Length: 1000

Use shape when you need both dimensions. Use len() when you only care about row count—it’s marginally faster and more readable in conditionals. Use size for total cell count, useful when estimating processing time or memory requirements.

# Practical example: validation before merge
def safe_merge(left_df, right_df, on):
    left_rows = len(left_df)
    result = left_df.merge(right_df, on=on)
    
    if len(result) > left_rows * 2:
        print(f"Warning: Merge expanded rows from {left_rows} to {len(result)}")
    
    return result
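
To see the guard fire, merge against a right table whose keys repeat. The key/val/other column names below are illustrative:

```python
import pandas as pd

# Same helper as above
def safe_merge(left_df, right_df, on):
    left_rows = len(left_df)
    result = left_df.merge(right_df, on=on)
    if len(result) > left_rows * 2:
        print(f"Warning: Merge expanded rows from {left_rows} to {len(result)}")
    return result

left = pd.DataFrame({'key': [1, 2], 'val': ['a', 'b']})
# Each key appears three times on the right, so the merge fans out
right = pd.DataFrame({'key': [1, 1, 1, 2, 2, 2], 'other': range(6)})

merged = safe_merge(left, right, on='key')
print(len(merged))  # 6 rows: each left row matched 3 right rows
```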

Memory Usage Analysis

Memory inspection becomes critical with large datasets. The default info() output shows memory usage, but the number can be misleading.

# DataFrame with string data
string_df = pd.DataFrame({
    'id': range(10000),
    'description': ['This is a sample description text'] * 10000,
    'category': ['Category A', 'Category B', 'Category C'] * 3333 + ['Category A']
})

# Default memory usage (underestimates object columns)
print("Default memory usage:")
print(string_df.memory_usage())
print(f"Total: {string_df.memory_usage().sum():,} bytes")

print("\n" + "="*50 + "\n")

# Deep memory usage (accurate for object columns)
print("Deep memory usage:")
print(string_df.memory_usage(deep=True))
print(f"Total: {string_df.memory_usage(deep=True).sum():,} bytes")

Output:

Default memory usage:
Index             128
id              80000
description     80000
category        80000
Total: 240,128 bytes

==================================================

Deep memory usage:
Index               128
id                80000
description     2680000
category         680000
Total: 3,440,128 bytes

The difference is stark. Default calculation only counts the pointer size for object columns (8 bytes each). Deep calculation traverses the actual string data. For this DataFrame, the true memory usage is 14 times higher than the default estimate.

Use this information to optimize memory:

# Convert repetitive strings to category type
string_df['category'] = string_df['category'].astype('category')

print("After category conversion:")
print(f"Total: {string_df.memory_usage(deep=True).sum():,} bytes")

The category conversion can reduce memory by 90% or more for columns with repeated values.
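
You can measure the savings directly by comparing deep memory usage before and after conversion. A sketch on a single column (exact byte counts vary by pandas and Python version):

```python
import pandas as pd

# A column with many repeated values benefits most from 'category'
s = pd.Series(['Category A', 'Category B', 'Category C'] * 3333 + ['Category A'])

before = s.memory_usage(deep=True)
after = s.astype('category').memory_usage(deep=True)

print(f"object:   {before:,} bytes")
print(f"category: {after:,} bytes")
print(f"savings:  {1 - after / before:.0%}")
```

The categorical version stores each unique string once plus a small integer code per row, which is why the gains grow with the ratio of rows to unique values.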

Column and Index Information

Direct access to column and index metadata enables programmatic DataFrame manipulation.

# Access column names
print(f"Columns: {df.columns.tolist()}")
print(f"Number of columns: {len(df.columns)}")

# Check if column exists
if 'user_id' in df.columns:
    print("user_id column found")

# Index information
print(f"\nIndex type: {type(df.index)}")
print(f"Index values: {df.index.tolist()}")
print(f"Index dtype: {df.index.dtype}")

For more complex index scenarios:

# DataFrame with custom index
indexed_df = df.set_index('user_id')

print(f"Index name: {indexed_df.index.name}")
print(f"Index is unique: {indexed_df.index.is_unique}")
print(f"Index has duplicates: {indexed_df.index.has_duplicates}")

# Check for duplicate rows across all columns
print(f"\nDuplicate rows: {df.duplicated().sum()}")

# Check for duplicates in specific column
print(f"Duplicate usernames: {df['username'].duplicated().sum()}")

Checking index uniqueness before operations like loc[] access or merges prevents subtle bugs that only surface with certain data combinations.
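
A minimal sketch of that pre-check, using set_index's verify_integrity option to fail fast on duplicate keys (the user_id/score data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 2, 3],   # note the duplicated id 2
    'score': [10, 20, 30, 40],
})

# Guard before using a column as the index
if not df['user_id'].is_unique:
    print("user_id has duplicates; loc[] lookups would return multiple rows")

# set_index(verify_integrity=True) raises ValueError instead of proceeding
try:
    df.set_index('user_id', verify_integrity=True)
except ValueError as e:
    print(f"Rejected: {e}")
```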

Summary

Mastering these inspection methods transforms how you approach new datasets. Start with info() for the complete picture, drill down with specific methods as needed.

Method/Property             Returns                 Best Use Case
df.info()                   Comprehensive summary   First look at any DataFrame
df.dtypes                   Series of column types  Type validation, conversion planning
df.shape                    Tuple (rows, cols)      Dimension checks, logging
len(df)                     Row count               Conditionals, iteration limits
df.size                     Total elements          Memory/time estimation
df.memory_usage(deep=True)  Memory per column       Optimization, large dataset handling
df.columns                  Column index            Iteration, existence checks
df.index                    Row index               Index type/uniqueness validation

Build these checks into your workflow. The few seconds spent inspecting data upfront saves hours of debugging downstream. When you encounter unexpected results, these same methods become your diagnostic toolkit—revealing the type mismatches, null values, and structural issues hiding in your data.
