Pandas - Get First N Rows (head) and Last N Rows (tail)
Key Insights
• The head() and tail() methods provide efficient ways to preview DataFrames without loading entire datasets into memory, with head(n) returning the first n rows and tail(n) returning the last n rows
• Both methods accept negative integers to exclude rows from the opposite end, enabling flexible data slicing patterns like head(-5) to get all rows except the last 5
• These methods work seamlessly with method chaining and can be combined with indexing operations to extract specific subsets for validation, debugging, and exploratory data analysis
Basic Usage of head() and tail()
The head() and tail() methods are fundamental tools for quick DataFrame inspection. By default, both return 5 rows, but you can specify any number of rows as an argument.
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
    'product_id': range(1, 101),
    'revenue': np.random.randint(100, 1000, 100),
    'units_sold': np.random.randint(1, 50, 100),
    'category': np.random.choice(['Electronics', 'Clothing', 'Food'], 100)
})
# Get first 5 rows (default)
print(df.head())
# Get first 10 rows
print(df.head(10))
# Get last 5 rows (default)
print(df.tail())
# Get last 15 rows
print(df.tail(15))
These methods are particularly useful when working with large datasets where displaying the entire DataFrame would be impractical. They provide a quick snapshot without the computational overhead of rendering thousands of rows.
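As a quick sanity check on the defaults, the following sketch (using a small throwaway DataFrame, separate from the df above) also shows what happens when n exceeds the number of rows:

```python
import pandas as pd

demo = pd.DataFrame({'x': range(7)})

print(demo.head().shape)            # default n=5 -> (5, 1)
print(demo.tail(2)['x'].tolist())   # last two values -> [5, 6]
print(demo.head(100).shape)         # n beyond len(demo) returns all rows -> (7, 1)
```

Passing an n larger than the frame simply returns the whole frame, so neither method ever raises for an out-of-range argument.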
Using Negative Integers for Exclusion
Both methods support negative integers, which reverses their behavior by excluding rows from the opposite end. This feature enables precise data slicing without complex indexing.
# Get all rows except the last 10
first_90 = df.head(-10)
print(f"Shape after head(-10): {first_90.shape}")
# Get all rows except the first 20
last_80 = df.tail(-20)
print(f"Shape after tail(-20): {last_80.shape}")
# Practical example: Remove outliers from both ends
# Exclude first 5 and last 5 rows (assuming sorted data)
middle_section = df.head(-5).tail(-5)
print(f"Middle section shape: {middle_section.shape}")
This approach is cleaner than using iloc for many use cases, especially when you want to exclude a specific number of rows from one end of the DataFrame.
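The negative-integer forms map directly onto positional slicing. This short sketch (on its own small DataFrame) verifies the equivalences:

```python
import pandas as pd

data = pd.DataFrame({'v': range(20)})

# head(-n) is equivalent to iloc[:-n] (drop the last n rows)
print(data.head(-3).equals(data.iloc[:-3]))  # True
# tail(-n) is equivalent to iloc[n:] (drop the first n rows)
print(data.tail(-3).equals(data.iloc[3:]))   # True
```

The head/tail spelling states the intent ("all but the last 3") more directly than the slice does.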
Method Chaining with head() and tail()
These methods integrate seamlessly into pandas’ method chaining paradigm, allowing you to combine them with filtering, sorting, and aggregation operations.
# Chain with sorting to get top performers
top_revenue = (df
    .sort_values('revenue', ascending=False)
    .head(10)
    .reset_index(drop=True)
)
print("Top 10 products by revenue:")
print(top_revenue)
# Get bottom 5 performers in Electronics category
bottom_electronics = (df
    [df['category'] == 'Electronics']
    .sort_values('revenue')
    .head(5)
    [['product_id', 'revenue', 'units_sold']]
)
print("\nBottom 5 Electronics products:")
print(bottom_electronics)
# Complex chain: Filter, sort, and sample from both ends
analysis = (df
    [df['units_sold'] > 10]
    .sort_values('revenue', ascending=False)
    .assign(revenue_per_unit=lambda x: x['revenue'] / x['units_sold'])
)
print("\nTop 3 high-volume products:")
print(analysis.head(3))
print("\nBottom 3 high-volume products:")
print(analysis.tail(3))
Working with MultiIndex DataFrames
When dealing with MultiIndex DataFrames, head() and tail() remain purely positional: they return the first or last n rows in the current index order, regardless of how many index levels exist. They do not operate per group, so getting the first or last rows within each group requires groupby().
# Create MultiIndex DataFrame
multi_df = df.set_index(['category', 'product_id']).sort_index()
# head() and tail() are positional: rows are taken in current index order
print("First 8 entries (by category, then product_id):")
print(multi_df.head(8))
print("\nLast 8 entries:")
print(multi_df.tail(8))
# To get first/last n rows per group, use groupby
print("\nFirst 3 products per category:")
first_per_category = (multi_df
    .groupby(level='category')
    .head(3)
)
print(first_per_category)
print("\nLast 2 products per category:")
last_per_category = (multi_df
    .groupby(level='category')
    .tail(2)
)
print(last_per_category)
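To make the positional-versus-per-group distinction concrete, here is a sketch on a tiny MultiIndex frame (the index names and values are illustrative):

```python
import pandas as pd

tiny = pd.DataFrame(
    {'val': [1, 2, 3, 4, 5, 6]},
    index=pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]],
                                     names=['grp', 'item'])
)

# Positional: the first 4 rows, cutting across group 'a' into group 'b'
print(tiny.head(4))
# Per group: the first 2 rows of each group -> 4 rows total, 2 from each
print(tiny.groupby(level='grp').head(2))
```

The plain head(4) crosses the group boundary, while the groupby version respects it.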
Practical Applications in Data Validation
These methods are invaluable for data quality checks and validation workflows, especially when dealing with time-series or sequential data.
# Create time-series dataset
dates = pd.date_range('2024-01-01', periods=100, freq='D')
ts_df = pd.DataFrame({
    'date': dates,
    'temperature': np.random.normal(20, 5, 100),
    'humidity': np.random.normal(60, 10, 100)
})
# Check for data freshness
print("Most recent 7 days:")
print(ts_df.tail(7))
# Validate data completeness at both ends
def validate_data_range(df, date_col, expected_days=7):
    """Validate that the first and last n days have complete data"""
    first_week = df.head(expected_days)
    last_week = df.tail(expected_days)
    print(f"Coverage of {date_col}: {df[date_col].min()} to {df[date_col].max()}")
    print(f"First {expected_days} days - Missing values:")
    print(first_week.isnull().sum())
    print(f"\nLast {expected_days} days - Missing values:")
    print(last_week.isnull().sum())
    return first_week, last_week
first_week, last_week = validate_data_range(ts_df, 'date')
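Beyond missing-value counts, head() and tail() can anchor a simple continuity check. The helper below is a sketch (the check_continuity name is ours, not a pandas API) that verifies the previewed days are consecutive:

```python
import pandas as pd

def check_continuity(frame, date_col, n=7):
    """Return True if the first and last n rows each cover consecutive days."""
    for chunk in (frame.head(n), frame.tail(n)):
        # diff() yields the gap between adjacent dates; drop the leading NaT
        gaps = chunk[date_col].diff().dropna()
        if not (gaps == pd.Timedelta(days=1)).all():
            return False
    return True

dates = pd.date_range('2024-01-01', periods=30, freq='D')
daily = pd.DataFrame({'date': dates, 'value': range(30)})
print(check_continuity(daily, 'date'))  # True for a gap-free daily range
```

Dropping any row near either end makes the corresponding preview non-consecutive and the check fails.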
Performance Considerations and Memory Efficiency
Under the hood, head() and tail() are thin wrappers around positional slicing (df.iloc[:n] and df.iloc[-n:]), so they are inexpensive compared with rendering or copying the full DataFrame. Keep in mind, though, that pandas evaluates eagerly: calling head() requires the DataFrame to already be in memory. To preview a large file without loading all of it, constrain the read itself, for example with the nrows or chunksize parameters of read_csv.
# Efficient preview of large files without loading everything
def preview_large_file(filepath, n_rows=10):
    """Preview both ends of a large CSV"""
    # Read only the first n rows; this does not parse the rest of the file
    first_rows = pd.read_csv(filepath, nrows=n_rows)
    print(f"First {n_rows} rows:")
    print(first_rows)
    # There is no nrows equivalent for the end of a file, so this simple
    # version loads the full file and then takes tail()
    full_df = pd.read_csv(filepath)
    print(f"\nLast {n_rows} rows:")
    print(full_df.tail(n_rows))
    print(f"\nTotal rows: {len(full_df)}")
# Comparing memory usage (memory_usage reports the actual buffer sizes;
# sys.getsizeof undercounts because it misses the underlying arrays)
sample_small = df.head(10)
sample_large = df.head(50)
print(f"Memory usage - 10 rows: {sample_small.memory_usage(deep=True).sum()} bytes")
print(f"Memory usage - 50 rows: {sample_large.memory_usage(deep=True).sum()} bytes")
print(f"Memory usage - full DataFrame: {df.memory_usage(deep=True).sum()} bytes")
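When loading the whole file is not an option, one workable pattern (a sketch, not the only approach; the tail_of_csv name is ours) streams the CSV in chunks and keeps only the trailing chunks in a bounded deque:

```python
import io
from collections import deque

import pandas as pd

def tail_of_csv(filepath_or_buffer, n_rows=5, chunksize=1000):
    """Return the last n_rows of a CSV without holding the full file in memory."""
    # Two chunks always cover the last n_rows, assuming n_rows <= chunksize
    last_chunks = deque(maxlen=2)
    for chunk in pd.read_csv(filepath_or_buffer, chunksize=chunksize):
        last_chunks.append(chunk)
    return pd.concat(last_chunks).tail(n_rows)

# Demo on an in-memory buffer standing in for a large file
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10_000))
print(tail_of_csv(io.StringIO(csv_text), n_rows=3))
```

At any moment only the two most recent chunks are resident, so peak memory is bounded by roughly twice the chunk size rather than the file size.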
Combining with Other Selection Methods
While head() and tail() are powerful alone, combining them with iloc, loc, and boolean indexing creates sophisticated selection patterns.
# Get first 20 rows, then select specific columns
subset = df.head(20)[['product_id', 'revenue']]
# Get last 30 rows where revenue > 500
high_revenue_recent = df.tail(30)[df.tail(30)['revenue'] > 500]
# Alternative using query for cleaner syntax
high_revenue_recent_alt = df.tail(30).query('revenue > 500')
# Get specific rows from the head
every_other = df.head(20).iloc[::2]  # Every other row (positions 0, 2, 4, ...) of the first 20
print("Every other row from first 20:")
print(every_other)
# Practical: Compare the first and last 25 rows (positional quarters,
# not value-based statistical quartiles)
first_quartile = df.head(25)
last_quartile = df.tail(25)
comparison = pd.DataFrame({
    'metric': ['mean_revenue', 'mean_units', 'total_revenue'],
    'first_25': [
        first_quartile['revenue'].mean(),
        first_quartile['units_sold'].mean(),
        first_quartile['revenue'].sum()
    ],
    'last_25': [
        last_quartile['revenue'].mean(),
        last_quartile['units_sold'].mean(),
        last_quartile['revenue'].sum()
    ]
})
print("\nComparison of first vs last 25 rows:")
print(comparison)
These methods form the foundation of efficient DataFrame exploration. Use head() and tail() as your first step in any data analysis workflow to understand structure, identify patterns, and validate assumptions before performing more expensive operations on the full dataset.