Pandas - Get Row Count of DataFrame
Key Insights
• Use len(df) for the fastest row count performance—it directly accesses the underlying index length without iteration
• The .shape[0] attribute provides row count as part of DataFrame dimensions and is equally performant for large datasets
• Methods like .count() are designed for non-null value counting per column, not total row counting, and will produce different results with missing data
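A compact example makes these distinctions concrete: with a NaN present, only count() reports something other than the true row count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, 6]})

print(len(df))          # 3 rows, NaN included
print(df.shape[0])      # 3 rows, NaN included
print(df['A'].count())  # 2, counts non-null values only
```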
Understanding DataFrame Row Count Methods
Pandas provides multiple approaches to determine the number of rows in a DataFrame. While several methods appear to return the same result, they differ in purpose, performance, and behavior with missing data. Understanding these distinctions ensures you choose the right approach for your use case.
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset'],
    'price': [1200, 25, 75, 300, 150],
    'quantity': [5, 50, 30, 10, 20]
})
print(df)
product price quantity
0 Laptop 1200 5
1 Mouse 25 50
2 Keyboard 75 30
3 Monitor 300 10
4 Headset 150 20
Using len() for Row Count
The len() function returns the number of rows by reading the length of the DataFrame’s index. It is the most Pythonic approach and as fast as any alternative.
row_count = len(df)
print(f"Number of rows: {row_count}")
# Output: Number of rows: 5
This method works consistently regardless of missing values:
# DataFrame with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})
print(f"Row count with missing values: {len(df_missing)}")
# Output: Row count with missing values: 5
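Because len(df) simply reads the index length, it is equivalent to asking the index directly, as the identity below illustrates:

```python
import pandas as pd

df = pd.DataFrame({'x': range(5)})

# All three expressions read the same underlying index length
assert len(df) == len(df.index) == df.index.size
print(len(df))  # 5
```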
Using the shape Attribute
The .shape attribute returns a tuple containing the dimensions of the DataFrame: (rows, columns). Access the first element for row count.
rows, cols = df.shape
print(f"Rows: {rows}, Columns: {cols}")
# Output: Rows: 5, Columns: 3
# Direct access to row count
row_count = df.shape[0]
print(f"Number of rows: {row_count}")
# Output: Number of rows: 5
This method is particularly useful when you need both dimensions:
def analyze_dataframe(df):
    rows, cols = df.shape
    total_cells = rows * cols
    return {
        'rows': rows,
        'columns': cols,
        'total_cells': total_cells
    }
stats = analyze_dataframe(df)
print(stats)
# Output: {'rows': 5, 'columns': 3, 'total_cells': 15}
Why Not to Use count()
The .count() method counts non-null values per column, not total rows. This distinction is critical when working with datasets containing missing data.
# Demonstrating count() behavior
df_sparse = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})
print("count() results:")
print(df_sparse.count())
A 4
B 3
C 5
dtype: int64
print(f"\nActual row count: {len(df_sparse)}")
# Output: Actual row count: 5
The .count() method is designed for data quality assessment, not row counting:
def data_completeness(df):
    total_rows = len(df)
    completeness = {}
    for col in df.columns:
        non_null = df[col].count()
        completeness[col] = {
            'non_null': non_null,
            'null': total_rows - non_null,
            'completeness_pct': (non_null / total_rows) * 100
        }
    return completeness
stats = data_completeness(df_sparse)
for col, metrics in stats.items():
    print(f"{col}: {metrics['completeness_pct']:.1f}% complete")
A: 80.0% complete
B: 60.0% complete
C: 100.0% complete
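The same per-column completeness figures can be computed without an explicit loop: taking the mean of the not-null boolean mask gives the completeness fraction per column.

```python
import numpy as np
import pandas as pd

df_sparse = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})

# Mean of a boolean mask = fraction of True values per column
pct = df_sparse.notna().mean() * 100
print(pct)  # A: 80.0, B: 60.0, C: 100.0
```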
Performance Comparison
For large DataFrames, method choice impacts performance. Here’s a comparison:
import time
# Create large DataFrame
large_df = pd.DataFrame({
    f'col_{i}': np.random.randn(1000000)
    for i in range(10)
})
# Benchmark len()
start = time.perf_counter()
for _ in range(1000):
    _ = len(large_df)
len_time = time.perf_counter() - start
# Benchmark shape
start = time.perf_counter()
for _ in range(1000):
    _ = large_df.shape[0]
shape_time = time.perf_counter() - start
print(f"len() time: {len_time:.4f}s")
print(f"shape[0] time: {shape_time:.4f}s")
Both methods are highly optimized and show negligible performance differences. Choose based on code clarity and context.
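For micro-benchmarks, the standard-library timeit module is more reliable than a hand-rolled loop because it handles timer selection and repetition for you. A minimal sketch:

```python
import timeit

import numpy as np
import pandas as pd

large_df = pd.DataFrame({'a': np.random.randn(100_000)})

len_time = timeit.timeit(lambda: len(large_df), number=10_000)
shape_time = timeit.timeit(lambda: large_df.shape[0], number=10_000)
print(f"len():    {len_time:.4f}s")
print(f"shape[0]: {shape_time:.4f}s")
```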
Conditional Row Counting
For counting rows meeting specific conditions, use boolean indexing with len() or .sum():
df_sales = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset', 'Webcam'],
    'price': [1200, 25, 75, 300, 150, 80],
    'quantity': [5, 50, 30, 10, 20, 15]
})
# Count high-value products (price > 100)
high_value_count = len(df_sales[df_sales['price'] > 100])
print(f"High-value products: {high_value_count}")
# Output: High-value products: 3
# Alternative using sum() on boolean mask
high_value_count_alt = (df_sales['price'] > 100).sum()
print(f"High-value products (alt): {high_value_count_alt}")
# Output: High-value products (alt): 3
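If you prefer expression strings over chained boolean masks, the same count can be written with .query(); a small sketch on a reduced version of the sales data:

```python
import pandas as pd

df_sales = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [1200, 25, 75]
})

# query() evaluates the condition string against column names
count = len(df_sales.query('price > 100'))
print(count)  # 1 (only the Laptop)
```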
For multiple conditions:
# Products priced between 50 and 500 with quantity > 10
condition = (df_sales['price'] >= 50) & \
            (df_sales['price'] <= 500) & \
            (df_sales['quantity'] > 10)
count = condition.sum()
print(f"Products matching criteria: {count}")
# Output: Products matching criteria: 3
Counting Rows by Group
Use .groupby() with .size() to count rows per group:
df_orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset', 'Mouse'],
    'amount': [1200, 25, 75, 300, 150, 25]
})
# Count orders per customer
orders_per_customer = df_orders.groupby('customer').size()
print(orders_per_customer)
customer
Alice 3
Bob 2
Charlie 1
dtype: int64
Convert to DataFrame for further analysis:
order_summary = df_orders.groupby('customer').size().reset_index(name='order_count')
print(order_summary)
customer order_count
0 Alice 3
1 Bob 2
2 Charlie 1
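When grouping by a single column, value_counts() produces the same per-group row counts, sorted by frequency:

```python
import pandas as pd

df_orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice']
})

counts = df_orders['customer'].value_counts()
print(counts)  # Alice 3, Bob 2, Charlie 1 (sorted by count)
```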
Handling Empty DataFrames
All row counting methods handle empty DataFrames correctly:
empty_df = pd.DataFrame()
print(f"len(): {len(empty_df)}") # Output: 0
print(f"shape[0]: {empty_df.shape[0]}") # Output: 0
# Safe division example
def calculate_average_price(df):
    if len(df) == 0:
        return 0
    return df['price'].sum() / len(df)
# Test with empty DataFrame
empty_sales = pd.DataFrame(columns=['product', 'price'])
avg = calculate_average_price(empty_sales)
print(f"Average price: {avg}") # Output: Average price: 0
Best Practices
Use len(df) as your default choice for row counting—it’s clear, fast, and Pythonic. Reserve .shape[0] for situations where you need both dimensions. Avoid .count() for row counting entirely; it serves a different purpose in data quality analysis.
When building production pipelines, always validate row counts at critical stages to catch data loss issues early. Combine row counting with logging to track data flow through transformations.
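As a minimal sketch of that practice (the step name and logger setup here are illustrative, not from the article), each transformation can log its before-and-after row counts:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def drop_missing_prices(df):
    """Illustrative pipeline step: drop rows lacking a price."""
    before = len(df)
    out = df.dropna(subset=['price'])
    # Row counts before and after reveal data loss immediately
    log.info("drop_missing_prices: %d -> %d rows (%d removed)",
             before, len(out), before - len(out))
    return out

clean = drop_missing_prices(pd.DataFrame({'product': ['A', 'B'],
                                          'price': [10.0, None]}))
print(len(clean))  # 1
```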