Pandas - Check if DataFrame is Empty
Key Insights
• Use df.empty for the fastest boolean check, len(df) == 0 for explicit row counting, or df.shape[0] == 0 when you need dimensional information simultaneously.
• Empty DataFrames can have columns with no rows, no columns with no rows, or be completely empty—each scenario requires different handling strategies in production code.
• Combine emptiness checks with null value detection using df.isnull().all().all() to distinguish between truly empty DataFrames and those containing only NaN values.
Why Checking for Empty DataFrames Matters
Data pipelines fail when you assume DataFrames contain data. A database query might return no results, a filtered dataset could eliminate all rows, or an API might send back an empty response. Without proper emptiness validation, you’ll encounter cryptic errors downstream when attempting operations on non-existent data.
Production code needs defensive checks before performing aggregations, statistical calculations, or data transformations. The cost of a simple boolean check is negligible compared to debugging runtime failures in production systems.
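As a minimal sketch of such a guard (the `latest_event` helper and its column names are made up for illustration), a one-line check turns a would-be runtime failure into an explicit early return:

```python
import pandas as pd

def latest_event(df):
    # Hypothetical helper: return the row with the newest timestamp.
    # Without the guard, idxmax() on an empty column raises ValueError.
    if df.empty:
        return None
    return df.loc[df['timestamp'].idxmax()]

events = pd.DataFrame({'timestamp': pd.to_datetime([]), 'event': []})
print(latest_event(events))  # None instead of a ValueError
```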
The Standard Method: Using the empty Property
The empty property provides the most readable and Pythonic way to check if a DataFrame contains no data:
import pandas as pd
# Create an empty DataFrame
df_empty = pd.DataFrame()
print(f"Is empty: {df_empty.empty}") # True
# DataFrame with columns but no rows
df_cols_only = pd.DataFrame(columns=['A', 'B', 'C'])
print(f"Is empty: {df_cols_only.empty}") # True
# DataFrame with data
df_data = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(f"Is empty: {df_data.empty}") # False
The empty property returns True when either axis has length zero: a DataFrame with defined columns but no rows is empty, and so is one with index labels but no columns. This matches the intuitive definition of “empty” for most use cases.
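One edge worth knowing: because `empty` checks both axes, a DataFrame with index labels but no columns counts as empty even though `len(df)` is nonzero, so the two checks can disagree. A quick sketch:

```python
import pandas as pd

# Index labels but no columns: empty by the "any axis has length 0" rule
df_index_only = pd.DataFrame(index=[0, 1])
print(df_index_only.empty)      # True  (zero columns)
print(len(df_index_only) == 0)  # False (the index has two labels)
```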
Alternative Approaches: Length and Shape
Different scenarios call for different checking methods. When you need the actual row count or are already working with shape tuples, these alternatives provide more context:
import pandas as pd
df = pd.DataFrame({'X': [10, 20, 30]})
# Method 1: Check length
if len(df) == 0:
    print("No rows")
else:
    print(f"Contains {len(df)} rows")

# Method 2: Check shape
rows, cols = df.shape
if rows == 0:
    print("No rows")
else:
    print(f"Shape: {rows} rows × {cols} columns")

# Method 3: Direct shape indexing
if df.shape[0] == 0:
    print("No rows")
Use len(df) when you need the row count for subsequent logic. Use df.shape[0] when you’re also concerned with column count, as it provides both dimensions in a single attribute access.
Performance Comparison
For large-scale data processing, understanding performance characteristics matters:
import pandas as pd
import timeit
# Create test DataFrames of varying sizes
df_empty = pd.DataFrame()
df_small = pd.DataFrame({'A': range(100)})
df_large = pd.DataFrame({'A': range(1000000)})
def benchmark_method(df, method):
    if method == 'empty':
        return lambda: df.empty
    elif method == 'len':
        return lambda: len(df) == 0
    elif method == 'shape':
        return lambda: df.shape[0] == 0

# Run benchmarks
for size, df in [('empty', df_empty), ('small', df_small), ('large', df_large)]:
    print(f"\n{size.upper()} DataFrame:")
    for method in ['empty', 'len', 'shape']:
        elapsed = timeit.timeit(benchmark_method(df, method), number=100000)
        print(f"  {method:6s}: {elapsed:.4f} seconds")
All three methods execute in constant time, O(1), because they read cached metadata rather than scanning rows. Any differences between them are measured in nanoseconds and are negligible in practice, so choose based on readability rather than speed.
Handling Edge Cases: Columns Without Rows
DataFrames can exist in states that appear empty but retain structural information. This matters when preserving schema through pipeline stages:
import pandas as pd
# Empty DataFrame with defined schema
df_schema = pd.DataFrame(columns=['user_id', 'timestamp', 'event_type'])
df_schema = df_schema.astype({
    'user_id': 'int64',
    'timestamp': 'datetime64[ns]',
    'event_type': 'string'
})
print(f"Empty: {df_schema.empty}")  # True
print(f"Columns: {list(df_schema.columns)}")  # ['user_id', 'timestamp', 'event_type']
print(f"Dtypes:\n{df_schema.dtypes}")

# Append real data to the empty, typed frame
df_new_data = pd.DataFrame({
    'user_id': [101, 102],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-02']),
    'event_type': ['login', 'logout']
})

# Note: pandas 2.1+ deprecates concatenating empty (or all-NA) entries, and
# result dtypes come from the non-empty frames — verify dtypes afterwards
result = pd.concat([df_schema, df_new_data], ignore_index=True)
print(f"\nResult dtypes:\n{result.dtypes}")
This pattern is essential for maintaining type consistency in ETL pipelines where early stages might produce no results but later stages need predictable column types.
Distinguishing Empty from All-Null DataFrames
A DataFrame containing only null values is not technically empty but may require similar handling:
import pandas as pd
import numpy as np
# DataFrame with NaN values
df_nulls = pd.DataFrame({'A': [np.nan, np.nan], 'B': [None, None]})
print(f"Is empty: {df_nulls.empty}") # False
print(f"All null: {df_nulls.isnull().all().all()}") # True
print(f"Any data: {df_nulls.notna().any().any()}") # False
def is_effectively_empty(df):
    """Check if DataFrame is empty or contains only null values."""
    return df.empty or not df.notna().any().any()
# Test the function
print(f"\nEffectively empty (nulls): {is_effectively_empty(df_nulls)}") # True
print(f"Effectively empty (data): {is_effectively_empty(pd.DataFrame({'A': [1, 2]}))}") # False
print(f"Effectively empty (empty): {is_effectively_empty(pd.DataFrame())}") # True
This distinction prevents processing DataFrames that technically have rows but contain no usable information.
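Used as a guard, the combined check lets a pipeline skip frames that hold no usable values at all. A small sketch (the `summarize` function is hypothetical):

```python
import pandas as pd
import numpy as np

def is_effectively_empty(df):
    """True if df has no rows or every value is null."""
    return df.empty or not df.notna().any().any()

def summarize(df):
    # Skip frames with no usable values before doing real work
    if is_effectively_empty(df):
        return "skipped: no usable data"
    return f"{df.notna().sum().sum()} non-null values"

print(summarize(pd.DataFrame({'A': [np.nan, np.nan]})))  # skipped: no usable data
print(summarize(pd.DataFrame({'A': [1.0, np.nan]})))     # 1 non-null values
```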
Practical Pattern: Defensive Data Processing
Implement robust checks before expensive operations:
import pandas as pd
import numpy as np
def calculate_statistics(df):
    """Calculate statistics with proper empty checks."""
    if df.empty:
        return {
            'mean': None,
            'median': None,
            'std': None,
            'count': 0
        }

    # Check for numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) == 0:
        raise ValueError("No numeric columns found")

    return {
        'mean': df[numeric_cols].mean().to_dict(),
        'median': df[numeric_cols].median().to_dict(),
        'std': df[numeric_cols].std().to_dict(),
        'count': len(df)
    }

# Test with various inputs
test_cases = [
    pd.DataFrame(),                                  # Empty
    pd.DataFrame({'A': []}),                         # Columns but no rows
    pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}),  # Valid data
    pd.DataFrame({'A': ['x', 'y', 'z']})             # No numeric columns
]

for i, df in enumerate(test_cases):
    try:
        result = calculate_statistics(df)
        print(f"Test {i+1}: {result}")
    except ValueError as e:
        print(f"Test {i+1}: Error - {e}")
Conditional Pipeline Execution
Skip expensive operations when DataFrames are empty:
import pandas as pd
def process_data_pipeline(df):
    """Multi-stage pipeline with early exits."""
    print(f"Input: {len(df)} rows")

    # Stage 1: Filter
    df_filtered = df[df['value'] > 100]
    if df_filtered.empty:
        print("No data passed filter criteria")
        return pd.DataFrame(columns=df.columns)

    # Stage 2: Transform (expensive operation)
    print(f"Processing {len(df_filtered)} rows...")
    df_transformed = df_filtered.copy()
    df_transformed['processed'] = df_transformed['value'].apply(lambda x: x ** 2)

    # Stage 3: Aggregate
    if not df_transformed.empty:
        summary = df_transformed.groupby('category').agg({
            'value': 'mean',
            'processed': 'sum'
        })
        return summary
    return pd.DataFrame()
# Example usage
df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'A'],
    'value': [50, 150, 75, 200]
})
result = process_data_pipeline(df)
print(f"\nFinal result:\n{result}")
This pattern prevents wasted computation and provides clear logging for debugging pipeline failures where intermediate stages eliminate all data.