Pandas - Get Number of Rows and Columns


Key Insights

• Use the `.shape` attribute to get both dimensions simultaneously as a tuple (rows, columns), which is the most efficient method for DataFrames
• Access `.shape[0]` for the row count and `.shape[1]` for the column count when you need individual dimensions in your code logic
• Alternative methods like `len()`, `.size`, and `.count()` exist but serve different purposes; understanding when to use each prevents performance issues and logical errors

Getting DataFrame Dimensions with .shape

The .shape attribute is the primary method for retrieving DataFrame dimensions. It returns a tuple containing the number of rows and columns.

import pandas as pd

# Create sample DataFrame
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 350],
    'quantity': [5, 50, 30, 10]
})

# Get dimensions
dimensions = df.shape
print(f"Dimensions: {dimensions}")  # Output: (4, 3)
print(f"Type: {type(dimensions)}")  # Output: <class 'tuple'>

The .shape attribute is an O(1) operation—it doesn’t iterate through the DataFrame but retrieves stored metadata. This makes it the most performant option regardless of DataFrame size.
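A quick sanity check makes the metadata point concrete: .shape is simply the lengths of the index and columns objects, so it always agrees with them (a minimal sketch using the sample DataFrame from above):

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 350],
    'quantity': [5, 50, 30, 10]
})

# .shape is read from the index and columns metadata,
# so it matches their lengths exactly -- no data scan involved
assert df.shape == (len(df.index), len(df.columns))
print(df.shape)  # Output: (4, 3)
```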

Extracting Individual Dimensions

Access tuple elements directly when you need specific dimension values:

# Get row count
num_rows = df.shape[0]
print(f"Number of rows: {num_rows}")  # Output: 4

# Get column count
num_cols = df.shape[1]
print(f"Number of columns: {num_cols}")  # Output: 3

# Unpack both values
rows, cols = df.shape
print(f"DataFrame has {rows} rows and {cols} columns")

This approach is cleaner than calling separate methods and ensures both values come from the same snapshot of the DataFrame.

Using len() for Row Count

The len() function returns the number of rows, equivalent to .shape[0]:

# Both produce identical results
print(len(df))        # Output: 4
print(df.shape[0])    # Output: 4

# Practical example: conditional logic
if len(df) > 100:
    print("Large dataset - using chunked processing")
else:
    print("Small dataset - processing in memory")

While len() is slightly more readable in some contexts, .shape[0] is more explicit about working with DataFrame dimensions. Use len() when the code context makes it obvious you’re counting rows.
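One context where len() reads naturally is computing what fraction of rows survive a filter (a small illustrative sketch; the price threshold is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 350],
})

# len() on both the filtered and original frames reads like plain English
expensive = df[df['price'] > 100]
pct = len(expensive) / len(df) * 100
print(f"{len(expensive)} of {len(df)} rows ({pct:.0f}%) cost over $100")
# Output: 2 of 4 rows (50%) cost over $100
```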

Getting Column Count from Index

Access column count through the columns index:

# Multiple ways to get column count
print(len(df.columns))           # Output: 3
print(df.columns.size)           # Output: 3
print(df.shape[1])               # Output: 3

# Get column names simultaneously
col_names = df.columns.tolist()
col_count = len(col_names)
print(f"{col_count} columns: {col_names}")
# Output: 3 columns: ['product', 'price', 'quantity']

Understanding .size vs .shape

The .size attribute returns the total number of elements (rows × columns):

print(f"Shape: {df.shape}")      # Output: (4, 3)
print(f"Size: {df.size}")        # Output: 12

# Verify the relationship
rows, cols = df.shape
assert df.size == rows * cols    # True

# Practical use: memory estimation
bytes_per_element = 8  # approximate for numeric data
estimated_memory = df.size * bytes_per_element
print(f"Estimated memory: {estimated_memory} bytes")

Use .size when you need the total element count, not when you need dimensional information.
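For memory questions specifically, pandas can report exact per-column usage via DataFrame.memory_usage, which you can compare against the rough .size-based estimate (a sketch assuming all columns are 8-byte float64, where the two figures coincide):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1000, 3), columns=['a', 'b', 'c'])

# Rough estimate: element count times 8 bytes per float64
estimate = df.size * 8

# Exact measurement from pandas (excluding the index for a fair comparison)
actual = df.memory_usage(index=False).sum()

print(f"Estimate: {estimate} bytes, actual: {actual} bytes")
assert estimate == actual  # holds here because every column is float64
```

For mixed dtypes (strings, objects), `memory_usage(deep=True)` gives a truer picture than any element-count estimate.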

Handling Empty DataFrames

Empty DataFrames require careful handling:

# Create empty DataFrame
empty_df = pd.DataFrame()
print(f"Empty shape: {empty_df.shape}")  # Output: (0, 0)

# DataFrame with columns but no rows
df_no_rows = pd.DataFrame(columns=['A', 'B', 'C'])
print(f"No rows shape: {df_no_rows.shape}")  # Output: (0, 3)

# DataFrame with rows but no columns (rare)
df_no_cols = pd.DataFrame(index=[0, 1, 2])
print(f"No columns shape: {df_no_cols.shape}")  # Output: (3, 0)

# Safe checking
def process_dataframe(df):
    rows, cols = df.shape
    if rows == 0:
        print("Warning: DataFrame has no rows")
        return
    if cols == 0:
        print("Warning: DataFrame has no columns")
        return
    print(f"Processing {rows} rows × {cols} columns")

process_dataframe(empty_df)
process_dataframe(df_no_rows)

Working with MultiIndex DataFrames

MultiIndex DataFrames follow the same dimension rules:

# Create MultiIndex DataFrame
arrays = [
    ['A', 'A', 'B', 'B'],
    ['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df_multi = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

print(f"MultiIndex shape: {df_multi.shape}")  # Output: (4, 1)
print(f"Number of index levels: {df_multi.index.nlevels}")  # Output: 2

# Shape counts rows, not index levels
print(f"Rows: {len(df_multi)}")  # Output: 4

The .shape attribute counts total rows regardless of index structure.
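When you do want counts broken down by an index level rather than the flat total, groupby on that level gives per-key row counts (a short sketch reusing the MultiIndex frame from above):

```python
import pandas as pd

arrays = [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df_multi = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

# .shape gives the flat row total; grouping on a level
# gives the row count under each outer key
print(df_multi.shape)  # Output: (4, 1)
print(df_multi.groupby(level='first').size())
# Output:
# first
# A    2
# B    2
# dtype: int64
```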

Counting Non-Null Values

The .count() method returns non-null values per column, which differs from row count:

# DataFrame with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, None, 8],
    'C': [9, 10, 11, 12]
})

print(f"Shape: {df_missing.shape}")  # Output: (4, 3)
print("\nNon-null counts per column:")
print(df_missing.count())
# Output:
# A    3
# B    2
# C    4

# Total non-null values across entire DataFrame
total_non_null = df_missing.count().sum()
print(f"\nTotal non-null values: {total_non_null}")  # Output: 9

Performance Comparison

Here’s how different methods compare:

import numpy as np
import time

# Create large DataFrame
large_df = pd.DataFrame(np.random.randn(1000000, 50))

# Time different approaches
methods = {
    'shape[0]': lambda: large_df.shape[0],
    'len()': lambda: len(large_df),
    'count().max()': lambda: large_df.count().max()
}

for name, func in methods.items():
    start = time.perf_counter()
    for _ in range(1000):
        result = func()
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.4f}s")

# Typical output:
# shape[0]: 0.0003s
# len(): 0.0003s
# count().max(): 0.8500s

The .count() method iterates through data to find non-null values, making it significantly slower for dimension queries.
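A common practical question that falls between .shape and .count() is how many rows are fully populated. One way to answer it is a row-wise notna() check (a minimal sketch using the df_missing frame from above):

```python
import pandas as pd

df_missing = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, None, 8],
    'C': [9, 10, 11, 12]
})

# Total row count includes rows containing nulls
total_rows = df_missing.shape[0]

# Rows where every column is non-null
complete_rows = df_missing.notna().all(axis=1).sum()

print(f"{complete_rows} of {total_rows} rows are complete")
# Output: 2 of 4 rows are complete
```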

Practical Application: Data Validation

Combine dimension methods for robust data validation:

def validate_dataframe(df, expected_rows=None, expected_cols=None, 
                       min_rows=None, required_columns=None):
    """Validate DataFrame dimensions and structure."""
    rows, cols = df.shape
    
    # Check exact dimensions
    if expected_rows is not None and rows != expected_rows:
        raise ValueError(f"Expected {expected_rows} rows, got {rows}")
    
    if expected_cols is not None and cols != expected_cols:
        raise ValueError(f"Expected {expected_cols} columns, got {cols}")
    
    # Check minimum rows
    if min_rows is not None and rows < min_rows:
        raise ValueError(f"Need at least {min_rows} rows, got {rows}")
    
    # Check required columns exist
    if required_columns:
        missing = set(required_columns) - set(df.columns)
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
    
    return True

# Usage
try:
    validate_dataframe(df, min_rows=3, 
                      required_columns=['product', 'price'])
    print("Validation passed")
except ValueError as e:
    print(f"Validation failed: {e}")

This validation pattern ensures data meets expected dimensions before processing, preventing downstream errors in data pipelines.
