Pandas - Get Row Count of DataFrame

Key Insights

• Use len(df) for the fastest row count: it reads the underlying index length directly, without iteration
• The .shape[0] attribute provides the row count as part of the DataFrame's dimensions and is equally performant for large datasets
• Methods like .count() are designed for counting non-null values per column, not total rows, and will produce different results with missing data

Understanding DataFrame Row Count Methods

Pandas provides multiple approaches to determine the number of rows in a DataFrame. While several methods appear to return the same result, they differ in purpose, performance, and behavior with missing data. Understanding these distinctions ensures you choose the right approach for your use case.

import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset'],
    'price': [1200, 25, 75, 300, 150],
    'quantity': [5, 50, 30, 10, 20]
})

print(df)
    product  price  quantity
0    Laptop   1200         5
1     Mouse     25        50
2  Keyboard     75        30
3   Monitor    300        10
4   Headset    150        20

Using len() for Row Count

The len() function returns the number of rows by accessing the DataFrame’s index length. This is the most Pythonic and performant method.

row_count = len(df)
print(f"Number of rows: {row_count}")
# Output: Number of rows: 5

This method works consistently regardless of missing values:

# DataFrame with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})

print(f"Row count with missing values: {len(df_missing)}")
# Output: Row count with missing values: 5
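A related distinction worth knowing: `dropna()` removes any row containing a NaN, so combining it with `len()` counts only fully populated rows. A sketch using the same data:

```python
import pandas as pd
import numpy as np

df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})

# len() counts every row, NaN or not
print(len(df_missing))           # 5

# To count only complete rows, drop rows with any NaN first
print(len(df_missing.dropna()))  # 2
```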

Using the shape Attribute

The .shape attribute returns a tuple containing the dimensions of the DataFrame: (rows, columns). Access the first element for row count.

rows, cols = df.shape
print(f"Rows: {rows}, Columns: {cols}")
# Output: Rows: 5, Columns: 3

# Direct access to row count
row_count = df.shape[0]
print(f"Number of rows: {row_count}")
# Output: Number of rows: 5

This method is particularly useful when you need both dimensions:

def analyze_dataframe(df):
    rows, cols = df.shape
    total_cells = rows * cols
    return {
        'rows': rows,
        'columns': cols,
        'total_cells': total_cells
    }

stats = analyze_dataframe(df)
print(stats)
# Output: {'rows': 5, 'columns': 3, 'total_cells': 15}
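When only the total cell count matters, pandas also exposes a built-in `.size` attribute that equals rows times columns, so the multiplication above can be skipped. A minimal illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# .size is a built-in shortcut for rows * columns
print(df.size)                     # 6
print(df.shape[0] * df.shape[1])   # 6
```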

Why Not to Use count()

The .count() method counts non-null values per column, not total rows. This distinction is critical when working with datasets containing missing data.

# Demonstrating count() behavior
df_sparse = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})

print("count() results:")
print(df_sparse.count())
A    4
B    3
C    5
dtype: int64
print(f"\nActual row count: {len(df_sparse)}")
# Output: Actual row count: 5

The .count() method is designed for data quality assessment, not row counting:

def data_completeness(df):
    total_rows = len(df)
    completeness = {}
    
    for col in df.columns:
        non_null = df[col].count()
        completeness[col] = {
            'non_null': non_null,
            'null': total_rows - non_null,
            'completeness_pct': (non_null / total_rows) * 100
        }
    
    return completeness

stats = data_completeness(df_sparse)
for col, metrics in stats.items():
    print(f"{col}: {metrics['completeness_pct']:.1f}% complete")
A: 80.0% complete
B: 60.0% complete
C: 100.0% complete
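If you prefer a vectorized version of this completeness check, `isna().sum()` and `notna().mean()` compute the same per-column figures without an explicit loop. A sketch over the same sample data:

```python
import pandas as pd
import numpy as np

df_sparse = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})

# Null count per column (the complement of count())
null_counts = df_sparse.isna().sum()
print(null_counts)

# Completeness percentage per column in one vectorized expression
completeness_pct = df_sparse.notna().mean() * 100
print(completeness_pct)
```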

Performance Comparison

For large DataFrames, method choice impacts performance. Here’s a comparison:

import time

# Create large DataFrame
large_df = pd.DataFrame({
    f'col_{i}': np.random.randn(1000000) 
    for i in range(10)
})

# Benchmark len()
start = time.perf_counter()
for _ in range(1000):
    _ = len(large_df)
len_time = time.perf_counter() - start

# Benchmark shape
start = time.perf_counter()
for _ in range(1000):
    _ = large_df.shape[0]
shape_time = time.perf_counter() - start

print(f"len() time: {len_time:.4f}s")
print(f"shape[0] time: {shape_time:.4f}s")

Both methods are highly optimized and show negligible performance differences. Choose based on code clarity and context.
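Two further expressions, `len(df.index)` and `df.index.size`, read the same index length and return identical results; which one you reach for is again a matter of style. A quick sanity check:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5)})

# All four expressions return the same row count
print(len(df))         # 5
print(df.shape[0])     # 5
print(len(df.index))   # 5
print(df.index.size)   # 5
```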

Conditional Row Counting

For counting rows meeting specific conditions, use boolean indexing with len() or .sum():

df_sales = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset', 'Webcam'],
    'price': [1200, 25, 75, 300, 150, 80],
    'quantity': [5, 50, 30, 10, 20, 15]
})

# Count high-value products (price > 100)
high_value_count = len(df_sales[df_sales['price'] > 100])
print(f"High-value products: {high_value_count}")
# Output: High-value products: 3

# Alternative using sum() on boolean mask
high_value_count_alt = (df_sales['price'] > 100).sum()
print(f"High-value products (alt): {high_value_count_alt}")
# Output: High-value products (alt): 3

For multiple conditions:

# Products priced between 50 and 500 with quantity > 10
condition = (df_sales['price'] >= 50) & \
            (df_sales['price'] <= 500) & \
            (df_sales['quantity'] > 10)

count = condition.sum()
print(f"Products matching criteria: {count}")
# Output: Products matching criteria: 3
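For readability, the same multi-condition filter can also be written as a `query()` string. This sketch recreates `df_sales` so it runs standalone:

```python
import pandas as pd

df_sales = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset', 'Webcam'],
    'price': [1200, 25, 75, 300, 150, 80],
    'quantity': [5, 50, 30, 10, 20, 15]
})

# query() expresses the same filter as a single string
count = len(df_sales.query('price >= 50 and price <= 500 and quantity > 10'))
print(f"Products matching criteria: {count}")
# Output: Products matching criteria: 3
```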

Counting Rows by Group

Use .groupby() with .size() to count rows per group:

df_orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset', 'Mouse'],
    'amount': [1200, 25, 75, 300, 150, 25]
})

# Count orders per customer
orders_per_customer = df_orders.groupby('customer').size()
print(orders_per_customer)
customer
Alice      3
Bob        2
Charlie    1
dtype: int64

Convert to DataFrame for further analysis:

order_summary = df_orders.groupby('customer').size().reset_index(name='order_count')
print(order_summary)
  customer  order_count
0    Alice            3
1      Bob            2
2  Charlie            1
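When grouping by a single column, `value_counts()` produces the same per-group counts (sorted by frequency in descending order by default). A sketch on the same data:

```python
import pandas as pd

df_orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset', 'Mouse'],
    'amount': [1200, 25, 75, 300, 150, 25]
})

# Same counts as groupby().size(), sorted by count descending
counts = df_orders['customer'].value_counts()
print(counts)
```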

Handling Empty DataFrames

Both len() and .shape[0] handle empty DataFrames correctly:

empty_df = pd.DataFrame()

print(f"len(): {len(empty_df)}")  # Output: 0
print(f"shape[0]: {empty_df.shape[0]}")  # Output: 0

# Safe division example
def calculate_average_price(df):
    if len(df) == 0:
        return 0
    return df['price'].sum() / len(df)

# Test with empty DataFrame
empty_sales = pd.DataFrame(columns=['product', 'price'])
avg = calculate_average_price(empty_sales)
print(f"Average price: {avg}")  # Output: Average price: 0
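pandas also provides the `.empty` attribute, an idiomatic alternative to the `len(df) == 0` check:

```python
import pandas as pd

empty_df = pd.DataFrame(columns=['product', 'price'])

# .empty is True when the DataFrame has no rows (or no columns)
print(empty_df.empty)      # True
print(len(empty_df) == 0)  # True
```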

Best Practices

Use len(df) as your default choice for row counting—it’s clear, fast, and Pythonic. Reserve .shape[0] for situations where you need both dimensions. Avoid .count() for row counting entirely; it serves a different purpose in data quality analysis.

When building production pipelines, always validate row counts at critical stages to catch data loss issues early. Combine row counting with logging to track data flow through transformations.
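A minimal sketch of that pattern, using the standard library's logging module; the `log_row_count` helper and the stage names are hypothetical, not part of any pandas API:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def log_row_count(df, stage):
    """Log the row count at a named pipeline stage (hypothetical helper)."""
    logger.info("%s: %d rows", stage, len(df))
    return df

df = pd.DataFrame({'price': [1200, 25, 75, 300]})
df = log_row_count(df, "loaded")
df = log_row_count(df[df['price'] > 100], "filtered")
```

Because the helper returns the DataFrame unchanged, it can be dropped between any two transformation steps without altering the pipeline's result.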
