Pandas - GroupBy and First/Last
Key Insights
• GroupBy operations with first() and last() retrieve boundary records per group, essential for time-series analysis, deduplication, and state tracking across categorical data
• Understanding the difference between first/last (first non-null value per column) and head/tail (position-based rows) prevents common mistakes when working with sorted versus unsorted data
• Combining first/last with multi-level indexing and custom aggregations enables sophisticated data transformations like before/after comparisons and change detection
Understanding First and Last in GroupBy Context
The first() and last() methods in pandas GroupBy operations return the first and last non-null values for each group. This differs from simple indexing—these methods skip NaN values and operate on each column independently.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'customer_id': ['A', 'A', 'A', 'B', 'B', 'B'],
'date': pd.date_range('2024-01-01', periods=6),
'amount': [100, np.nan, 150, 200, 250, np.nan],
'status': ['pending', 'approved', 'completed', 'pending', np.nan, 'approved']
})
# First non-null value per group
first_values = df.groupby('customer_id').first()
print(first_values)
Output:
date amount status
customer_id
A 2024-01-01 100.0 pending
B 2024-01-04 200.0 pending
Notice how first() returns the first non-null value for each column. For customer A, the amount is 100 (first row), but if that were null, it would return 150 from the third row.
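Because each column is handled independently, the aggregated row can mix values that came from different source rows. A minimal sketch (with hypothetical data) makes this concrete:

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({
    'customer_id': ['A', 'A'],
    'amount': [np.nan, 150],      # first non-null is in row 2
    'status': ['pending', None],  # first non-null is in row 1
})

# first() picks the first non-null value per column independently,
# so the single result row combines values from different source rows
result = demo.groupby('customer_id').first()
print(result)
```

The output row for customer A pairs `amount` 150.0 (row 2) with `status` 'pending' (row 1), a combination that never existed in the data. Keep this in mind when treating the result as "the first record."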
First vs Head: Critical Differences
Many developers confuse first() with head(). The distinction matters significantly in production code.
df_unsorted = pd.DataFrame({
'group': ['X', 'Y', 'X', 'Y', 'X'],
'timestamp': pd.to_datetime(['2024-01-05', '2024-01-03',
'2024-01-01', '2024-01-04',
'2024-01-02']),
'value': [50, 30, 10, 40, 20]
})
# head(1) returns first row as it appears
print("Using head(1):")
print(df_unsorted.groupby('group').head(1))
# first() returns each group's first non-null value per column (in row order)
print("\nUsing first():")
print(df_unsorted.groupby('group').first())
# Sort first, then get earliest timestamp
print("\nSorted, then first():")
df_sorted = df_unsorted.sort_values('timestamp')
print(df_sorted.groupby('group').first())
The head(1) approach returns the original rows in the order they appear (keeping their index), while first() aggregates each group down to a single row built from the first non-null value of each column, taken in row order. On unsorted data neither is chronologically meaningful, so for time-series work, always sort before using first() or last() to get accurate results.
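When you need the entire chronologically earliest row per group, rather than per-column first values, one common alternative is idxmin() on the timestamp column; a sketch using the same data as above (note it assumes unique minimum timestamps and no NaT values):

```python
import pandas as pd

df_unsorted = pd.DataFrame({
    'group': ['X', 'Y', 'X', 'Y', 'X'],
    'timestamp': pd.to_datetime(['2024-01-05', '2024-01-03',
                                 '2024-01-01', '2024-01-04',
                                 '2024-01-02']),
    'value': [50, 30, 10, 40, 20]
})

# idxmin() returns the index label of the minimum timestamp per group;
# .loc then pulls those complete rows, keeping all columns aligned
earliest_rows = df_unsorted.loc[df_unsorted.groupby('group')['timestamp'].idxmin()]
print(earliest_rows)
```

This avoids a full sort and guarantees that all columns in each result row come from the same source record.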
Practical Pattern: Time-Series Boundary Analysis
Analyzing the first and last states in time-series data reveals trends and changes over periods.
transactions = pd.DataFrame({
'account': ['ACC1', 'ACC1', 'ACC1', 'ACC2', 'ACC2', 'ACC2'],
'date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-01-31',
'2024-01-01', '2024-01-20', '2024-01-31']),
'balance': [1000, 1500, 1200, 5000, 4500, 4800],
'transaction_count': [0, 5, 12, 0, 8, 15]
})
# Sort by date to ensure chronological order
transactions = transactions.sort_values(['account', 'date'])
grouped = transactions.groupby('account')
# Get opening and closing balances
opening = grouped.first()[['balance', 'transaction_count']]
closing = grouped.last()[['balance', 'transaction_count']]
# Calculate changes
comparison = pd.DataFrame({
'opening_balance': opening['balance'],
'closing_balance': closing['balance'],
'balance_change': closing['balance'] - opening['balance'],
'total_transactions': closing['transaction_count']
})
print(comparison)
Output:
opening_balance closing_balance balance_change total_transactions
account
ACC1 1000 1200 200 12
ACC2 5000 4800 -300 15
Handling Multiple Columns with Different Logic
Real-world scenarios often require first/last on some columns while aggregating others differently.
sales_data = pd.DataFrame({
'region': ['North', 'North', 'North', 'South', 'South'],
'month': pd.date_range('2024-01-01', periods=5, freq='ME'),  # 'ME' = month end; older pandas used 'M'
'sales': [100, 150, 200, 300, 350],
'returns': [10, 5, 8, 15, 12],
'active_customers': [50, 55, 60, 80, 85]
})
sales_data = sales_data.sort_values(['region', 'month'])
summary = sales_data.groupby('region').agg({
'month': ['first', 'last'],
'sales': ['sum', 'mean'],
'returns': 'sum',
'active_customers': ['first', 'last']
})
summary.columns = ['_'.join(col).strip() for col in summary.columns.values]
summary['customer_growth'] = summary['active_customers_last'] - summary['active_customers_first']
print(summary)
This pattern combines boundary analysis (first/last) with aggregate statistics (sum/mean) to create comprehensive summaries.
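Flattening with '_'.join is one option; the MultiIndex columns can also be addressed directly with (column, aggregation) tuples before any renaming. A small sketch with simplified data:

```python
import pandas as pd

sales_data = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'sales': [100, 150, 300],
    'active_customers': [50, 60, 80],
})

summary = sales_data.groupby('region').agg({
    'sales': ['sum', 'mean'],
    'active_customers': ['first', 'last'],
})

# Before flattening, columns are a MultiIndex of (column, aggregation)
# tuples, which plain indexing accepts directly
growth = (summary[('active_customers', 'last')]
          - summary[('active_customers', 'first')])
print(growth)
```

Whether to flatten is mostly a readability choice; tuple access keeps the original column names intact for later selection.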
Working with Multi-Index and Hierarchical Groups
Complex datasets often require grouping by multiple dimensions while tracking first and last values.
inventory = pd.DataFrame({
'warehouse': ['W1', 'W1', 'W1', 'W2', 'W2', 'W2'] * 2,
'product': ['A', 'A', 'B', 'B', 'A', 'A'] * 2,
'date': pd.date_range('2024-01-01', periods=12, freq='D'),
'stock_level': [100, 95, 50, 48, 80, 75, 90, 85, 45, 42, 70, 65],
'reorder_point': [20, 20, 15, 15, 20, 20, 20, 20, 15, 15, 20, 20]
})
inventory = inventory.sort_values(['warehouse', 'product', 'date'])
# Group by multiple columns
multi_group = inventory.groupby(['warehouse', 'product'])
# Get first and last stock levels
stock_analysis = pd.DataFrame({
'initial_stock': multi_group['stock_level'].first(),
'final_stock': multi_group['stock_level'].last(),
'reorder_point': multi_group['reorder_point'].first(),
'days_tracked': multi_group.size()
})
stock_analysis['stock_change'] = stock_analysis['final_stock'] - stock_analysis['initial_stock']
stock_analysis['below_reorder'] = stock_analysis['final_stock'] < stock_analysis['reorder_point']
print(stock_analysis)
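The result of a multi-column groupby is indexed by a (warehouse, product) MultiIndex, which supports both point lookups and level slices. A minimal sketch with simplified data:

```python
import pandas as pd

inventory = pd.DataFrame({
    'warehouse': ['W1', 'W1', 'W2', 'W2'],
    'product':   ['A',  'B',  'A',  'B'],
    'stock':     [100,  50,   80,   90],
})

stock = inventory.groupby(['warehouse', 'product'])['stock'].first()

# .loc with a tuple addresses one (warehouse, product) combination;
# .xs slices one level across the whole MultiIndex
print(stock.loc[('W1', 'B')])           # a single value
print(stock.xs('A', level='product'))   # all warehouses, product A only
```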
Dealing with Null Values and Edge Cases
Understanding how first/last handle missing data prevents silent bugs in production pipelines.
sparse_data = pd.DataFrame({
'id': ['G1', 'G1', 'G1', 'G2', 'G2', 'G2'],
'metric_a': [np.nan, np.nan, 30, 10, np.nan, np.nan],
'metric_b': [5, np.nan, np.nan, np.nan, np.nan, 20],
'metric_c': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
})
result = sparse_data.groupby('id').first()
print("First non-null values:")
print(result)
# Check for completely null columns
print("\nNull check per group:")
print(sparse_data.groupby('id').agg(lambda s: s.isnull().all()))
# Alternative: a custom aggregation over the non-null values
def first_valid_value(series):
    valid = series.dropna()
    return valid.iloc[0] if len(valid) > 0 else np.nan
custom_first = sparse_data.groupby('id').agg({
'metric_a': first_valid_value,
'metric_b': first_valid_value,
'metric_c': first_valid_value
})
print("\nCustom first valid:")
print(custom_first)
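first() tells you the first non-null value but not where it occurred. When the position matters, say, the date on which a metric first appeared, Series.first_valid_index() can recover it; a sketch with a hypothetical date column added to the data above:

```python
import pandas as pd
import numpy as np

sparse_data = pd.DataFrame({
    'id': ['G1', 'G1', 'G1', 'G2', 'G2', 'G2'],
    'date': pd.date_range('2024-01-01', periods=6),
    'metric_a': [np.nan, np.nan, 30, 10, np.nan, np.nan],
})

# first_valid_index() gives the index label of the first non-null
# value in each group (or None for an all-null group)
first_idx = sparse_data.groupby('id')['metric_a'].apply(
    lambda s: s.first_valid_index())
# Map those labels back to the date column
first_dates = first_idx.map(
    lambda i: sparse_data.loc[i, 'date'] if i is not None else pd.NaT)
print(first_dates)
```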
Performance Considerations for Large Datasets
When working with millions of rows, the approach to first/last operations impacts performance significantly.
# Generate large dataset
large_df = pd.DataFrame({
'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000),
'timestamp': pd.date_range('2024-01-01', periods=1000000, freq='s'),  # lowercase 's'; uppercase 'S' is deprecated
'value': np.random.randn(1000000)
})
# Efficient approach: sort once, then group
large_df_sorted = large_df.sort_values(['category', 'timestamp'])
# Using first/last on pre-sorted data (%timeit is an IPython/Jupyter magic)
%timeit large_df_sorted.groupby('category').first()
# Less efficient: multiple sorts per aggregation
def get_boundaries(group):
    return pd.Series({
        'first_value': group.iloc[0]['value'],
        'last_value': group.iloc[-1]['value']
    })
%timeit large_df.groupby('category').apply(get_boundaries)
For optimal performance, sort your DataFrame once before grouping, use built-in methods over custom functions, and consider nth(0) and nth(-1) for position-based access when you don’t need null-skipping behavior.
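The null-handling difference between first() and nth(0) is easy to see on a small example; a sketch (note that nth's return shape changed in pandas 2.0 to rows with their original index rather than one row per group, but the NaN is kept either way):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [np.nan, 1.0, 2.0, 3.0],
})

# first() skips group A's leading NaN...
first_vals = df.groupby('group')['value'].first()  # A -> 1.0
# ...while nth(0) is purely positional and keeps the NaN
nth_vals = df.groupby('group')['value'].nth(0)
print(first_vals)
print(nth_vals)
```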
Combining First/Last with Transform
Transform operations allow you to broadcast first/last values back to the original DataFrame structure.
orders = pd.DataFrame({
'customer': ['C1', 'C1', 'C1', 'C2', 'C2'],
'order_date': pd.date_range('2024-01-01', periods=5),
'order_value': [100, 150, 120, 200, 180]
})
orders = orders.sort_values(['customer', 'order_date'])
# Add first and last order values as columns
orders['first_order_value'] = orders.groupby('customer')['order_value'].transform('first')
orders['last_order_value'] = orders.groupby('customer')['order_value'].transform('last')
orders['is_first_order'] = orders['order_value'] == orders['first_order_value']
orders['is_last_order'] = orders['order_value'] == orders['last_order_value']
print(orders)
This pattern enables row-level calculations based on group boundaries, useful for customer segmentation, cohort analysis, and sequential event processing.
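A common cohort-style application of this pattern is measuring elapsed time from each group's first event; a sketch with hypothetical order data:

```python
import pandas as pd

orders = pd.DataFrame({
    'customer': ['C1', 'C1', 'C2', 'C2'],
    'order_date': pd.to_datetime(['2024-01-01', '2024-01-10',
                                  '2024-01-04', '2024-01-05']),
    'order_value': [100, 150, 200, 180],
})
orders = orders.sort_values(['customer', 'order_date'])

# Broadcast each customer's first order date to every row,
# then compute days elapsed per order
first_date = orders.groupby('customer')['order_date'].transform('first')
orders['days_since_first'] = (orders['order_date'] - first_date).dt.days
print(orders)
```

Because transform returns a result aligned to the original index, the subtraction works row by row with no merging required.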