Pandas - GroupBy and First/Last
Key Insights
• GroupBy operations with first() and last() retrieve boundary records per group, essential for time-series analysis, deduplication, and state tracking across categorical data
• Understanding the difference between first/last (first non-null value per column) and head/tail (position-based rows) prevents common mistakes when working with sorted versus unsorted data
• Combining first/last with multi-level indexing and custom aggregations enables sophisticated data transformations like before/after comparisons and change detection
Understanding First and Last in GroupBy Context
The first() and last() methods in pandas GroupBy operations return the first and last non-null values for each group. This differs from simple indexing—these methods skip NaN values and operate on each column independently.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'customer_id': ['A', 'A', 'A', 'B', 'B', 'B'],
'date': pd.date_range('2024-01-01', periods=6),
'amount': [100, np.nan, 150, 200, 250, np.nan],
'status': ['pending', 'approved', 'completed', 'pending', np.nan, 'approved']
})
# First non-null value per group
first_values = df.groupby('customer_id').first()
print(first_values)
Output:
date amount status
customer_id
A 2024-01-01 100.0 pending
B 2024-01-04 200.0 pending
Notice how first() returns the first non-null value for each column. For customer A, the amount is 100 (first row), but if that were null, it would return 150 from the third row.
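Because each column is handled independently, the aggregated row can mix values that came from different source rows. A minimal sketch (with hypothetical data) makes this concrete:

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({
    'customer_id': ['A', 'A'],
    'amount': [np.nan, 150],      # first non-null is in row 2
    'status': ['pending', None],  # first non-null is in row 1
})

# first() picks the first non-null value per column independently,
# so the single result row combines values from different source rows
result = demo.groupby('customer_id').first()
print(result)
```

The output row for customer A pairs `amount` 150.0 (row 2) with `status` 'pending' (row 1), a combination that never existed in the data. Keep this in mind when treating the result as "the first record."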
First vs Head: Critical Differences
Many developers confuse first() with head(). The distinction matters significantly in production code.
df_unsorted = pd.DataFrame({
'group': ['X', 'Y', 'X', 'Y', 'X'],
'timestamp': pd.to_datetime(['2024-01-05', '2024-01-03',
'2024-01-01', '2024-01-04',
'2024-01-02']),
'value': [50, 30, 10, 40, 20]
})
# head(1) returns first row as it appears
print("Using head(1):")
print(df_unsorted.groupby('group').head(1))
# first() returns each group's first non-null value per column (in row order)
print("\nUsing first():")
print(df_unsorted.groupby('group').first())
# Sort first, then get earliest timestamp
print("\nSorted, then first():")
df_sorted = df_unsorted.sort_values('timestamp')
print(df_sorted.groupby('group').first())
The head(1) approach returns the original rows in the order they appear (keeping their index), while first() aggregates each group down to a single row built from the first non-null value of each column, taken in row order. On unsorted data neither is chronologically meaningful, so for time-series work, always sort before using first() or last() to get accurate results.
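When you need the entire chronologically earliest row per group, rather than per-column first values, one common alternative is idxmin() on the timestamp column; a sketch using the same data as above (note it assumes unique minimum timestamps and no NaT values):

```python
import pandas as pd

df_unsorted = pd.DataFrame({
    'group': ['X', 'Y', 'X', 'Y', 'X'],
    'timestamp': pd.to_datetime(['2024-01-05', '2024-01-03',
                                 '2024-01-01', '2024-01-04',
                                 '2024-01-02']),
    'value': [50, 30, 10, 40, 20]
})

# idxmin() returns the index label of the minimum timestamp per group;
# .loc then pulls those complete rows, keeping all columns aligned
earliest_rows = df_unsorted.loc[df_unsorted.groupby('group')['timestamp'].idxmin()]
print(earliest_rows)
```

This avoids a full sort and guarantees that all columns in each result row come from the same source record.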
Practical Pattern: Time-Series Boundary Analysis
Analyzing the first and last states in time-series data reveals trends and changes over periods.
transactions = pd.DataFrame({
'account': ['ACC1', 'ACC1', 'ACC1', 'ACC2', 'ACC2', 'ACC2'],
'date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-01-31',
'2024-01-01', '2024-01-20', '2024-01-31']),
'balance': [1000, 1500, 1200, 5000, 4500, 4800],
'transaction_count': [0, 5, 12, 0, 8, 15]
})
# Sort by date to ensure chronological order
transactions = transactions.sort_values(['account', 'date'])
grouped = transactions.groupby('account')
# Get opening and closing balances
opening = grouped.first()[['balance', 'transaction_count']]
closing = grouped.last()[['balance', 'transaction_count']]
# Calculate changes
comparison = pd.DataFrame({
'opening_balance': opening['balance'],
'closing_balance': closing['balance'],
'balance_change': closing['balance'] - opening['balance'],
'total_transactions': closing['transaction_count']
})
print(comparison)
Output:
opening_balance closing_balance balance_change total_transactions
account
ACC1 1000 1200 200 12
ACC2 5000 4800 -300 15
Handling Multiple Columns with Different Logic
Real-world scenarios often require first/last on some columns while aggregating others differently.
sales_data = pd.DataFrame({
'region': ['North', 'North', 'North', 'South', 'South'],
'month': pd.date_range('2024-01-01', periods=5, freq='ME'),  # 'ME' = month end; older pandas used 'M'
'sales': [100, 150, 200, 300, 350],
'returns': [10, 5, 8, 15, 12],
'active_customers': [50, 55, 60, 80, 85]
})
sales_data = sales_data.sort_values(['region', 'month'])
summary = sales_data.groupby('region').agg({
'month': ['first', 'last'],
'sales': ['sum', 'mean'],
'returns': 'sum',
'active_customers': ['first', 'last']
})
summary.columns = ['_'.join(col).strip() for col in summary.columns.values]
summary['customer_growth'] = summary['active_customers_last'] - summary['active_customers_first']
print(summary)
This pattern combines boundary analysis (first/last) with aggregate statistics (sum/mean) to create comprehensive summaries.
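Flattening with '_'.join is one option; the MultiIndex columns can also be addressed directly with (column, aggregation) tuples before any renaming. A small sketch with simplified data:

```python
import pandas as pd

sales_data = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'sales': [100, 150, 300],
    'active_customers': [50, 60, 80],
})

summary = sales_data.groupby('region').agg({
    'sales': ['sum', 'mean'],
    'active_customers': ['first', 'last'],
})

# Before flattening, columns are a MultiIndex of (column, aggregation)
# tuples, which plain indexing accepts directly
growth = (summary[('active_customers', 'last')]
          - summary[('active_customers', 'first')])
print(growth)
```

Whether to flatten is mostly a readability choice; tuple access keeps the original column names intact for later selection.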
Working with Multi-Index and Hierarchical Groups
Complex datasets often require grouping by multiple dimensions while tracking first and last values.
inventory = pd.DataFrame({
'warehouse': ['W1', 'W1', 'W1', 'W2', 'W2', 'W2'] * 2,
'product': ['A', 'A', 'B', 'B', 'A', 'A'] * 2,
'date': pd.date_range('2024-01-01', periods=12, freq='D'),
'stock_level': [100, 95, 50, 48, 80, 75, 90, 85, 45, 42, 70, 65],
'reorder_point': [20, 20, 15, 15, 20, 20, 20, 20, 15, 15, 20, 20]
})
inventory = inventory.sort_values(['warehouse', 'product', 'date'])
# Group by multiple columns
multi_group = inventory.groupby(['warehouse', 'product'])
# Get first and last stock levels
stock_analysis = pd.DataFrame({
'initial_stock': multi_group['stock_level'].first(),
'final_stock': multi_group['stock_level'].last(),
'reorder_point': multi_group['reorder_point'].first(),
'days_tracked': multi_group.size()
})
stock_analysis['stock_change'] = stock_analysis['final_stock'] - stock_analysis['initial_stock']
stock_analysis['below_reorder'] = stock_analysis['final_stock'] < stock_analysis['reorder_point']
print(stock_analysis)
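The result of a multi-column groupby is indexed by a (warehouse, product) MultiIndex, which supports both point lookups and level slices. A minimal sketch with simplified data:

```python
import pandas as pd

inventory = pd.DataFrame({
    'warehouse': ['W1', 'W1', 'W2', 'W2'],
    'product':   ['A',  'B',  'A',  'B'],
    'stock':     [100,  50,   80,   90],
})

stock = inventory.groupby(['warehouse', 'product'])['stock'].first()

# .loc with a tuple addresses one (warehouse, product) combination;
# .xs slices one level across the whole MultiIndex
print(stock.loc[('W1', 'B')])           # a single value
print(stock.xs('A', level='product'))   # all warehouses, product A only
```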
Dealing with Null Values and Edge Cases
Understanding how first/last handle missing data prevents silent bugs in production pipelines.
sparse_data = pd.DataFrame({
'id': ['G1', 'G1', 'G1', 'G2', 'G2', 'G2'],
'metric_a': [np.nan, np.nan, 30, 10, np.nan, np.nan],
'metric_b': [5, np.nan, np.nan, np.nan, np.nan, 20],
'metric_c': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
})
result = sparse_data.groupby('id').first()
print("First non-null values:")
print(result)
# Check for completely null columns
print("\nNull check per group:")
print(sparse_data.groupby('id').agg(lambda s: s.isnull().all()))
# Alternative: a custom aggregation over the non-null values
def first_valid_value(series):
    valid = series.dropna()
    return valid.iloc[0] if len(valid) > 0 else np.nan
custom_first = sparse_data.groupby('id').agg({
'metric_a': first_valid_value,
'metric_b': first_valid_value,
'metric_c': first_valid_value
})
print("\nCustom first valid:")
print(custom_first)
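first() tells you the first non-null value but not where it occurred. When the position matters, say, the date on which a metric first appeared, Series.first_valid_index() can recover it; a sketch with a hypothetical date column added to the data above:

```python
import pandas as pd
import numpy as np

sparse_data = pd.DataFrame({
    'id': ['G1', 'G1', 'G1', 'G2', 'G2', 'G2'],
    'date': pd.date_range('2024-01-01', periods=6),
    'metric_a': [np.nan, np.nan, 30, 10, np.nan, np.nan],
})

# first_valid_index() gives the index label of the first non-null
# value in each group (or None for an all-null group)
first_idx = sparse_data.groupby('id')['metric_a'].apply(
    lambda s: s.first_valid_index())
# Map those labels back to the date column
first_dates = first_idx.map(
    lambda i: sparse_data.loc[i, 'date'] if i is not None else pd.NaT)
print(first_dates)
```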
Performance Considerations for Large Datasets
When working with millions of rows, the approach to first/last operations impacts performance significantly.
# Generate large dataset
large_df = pd.DataFrame({
'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000),
'timestamp': pd.date_range('2024-01-01', periods=1000000, freq='s'),  # lowercase 's'; uppercase 'S' is deprecated
'value': np.random.randn(1000000)
})
# Efficient approach: sort once, then group
large_df_sorted = large_df.sort_values(['category', 'timestamp'])
# Using first/last on pre-sorted data (%timeit is an IPython/Jupyter magic)
%timeit large_df_sorted.groupby('category').first()
# Less efficient: multiple sorts per aggregation
def get_boundaries(group):
    return pd.Series({
        'first_value': group.iloc[0]['value'],
        'last_value': group.iloc[-1]['value']
    })
%timeit large_df.groupby('category').apply(get_boundaries)
For optimal performance, sort your DataFrame once before grouping, use built-in methods over custom functions, and consider nth(0) and nth(-1) for position-based access when you don’t need null-skipping behavior.
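The null-handling difference between first() and nth(0) is easy to see on a small example; a sketch (note that nth's return shape changed in pandas 2.0 to rows with their original index rather than one row per group, but the NaN is kept either way):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [np.nan, 1.0, 2.0, 3.0],
})

# first() skips group A's leading NaN...
first_vals = df.groupby('group')['value'].first()  # A -> 1.0
# ...while nth(0) is purely positional and keeps the NaN
nth_vals = df.groupby('group')['value'].nth(0)
print(first_vals)
print(nth_vals)
```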
Combining First/Last with Transform
Transform operations allow you to broadcast first/last values back to the original DataFrame structure.
orders = pd.DataFrame({
'customer': ['C1', 'C1', 'C1', 'C2', 'C2'],
'order_date': pd.date_range('2024-01-01', periods=5),
'order_value': [100, 150, 120, 200, 180]
})
orders = orders.sort_values(['customer', 'order_date'])
# Add first and last order values as columns
orders['first_order_value'] = orders.groupby('customer')['order_value'].transform('first')
orders['last_order_value'] = orders.groupby('customer')['order_value'].transform('last')
orders['is_first_order'] = orders['order_value'] == orders['first_order_value']
orders['is_last_order'] = orders['order_value'] == orders['last_order_value']
print(orders)
This pattern enables row-level calculations based on group boundaries, useful for customer segmentation, cohort analysis, and sequential event processing.
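A common cohort-style application of this pattern is measuring elapsed time from each group's first event; a sketch with hypothetical order data:

```python
import pandas as pd

orders = pd.DataFrame({
    'customer': ['C1', 'C1', 'C2', 'C2'],
    'order_date': pd.to_datetime(['2024-01-01', '2024-01-10',
                                  '2024-01-04', '2024-01-05']),
    'order_value': [100, 150, 200, 180],
})
orders = orders.sort_values(['customer', 'order_date'])

# Broadcast each customer's first order date to every row,
# then compute days elapsed per order
first_date = orders.groupby('customer')['order_date'].transform('first')
orders['days_since_first'] = (orders['order_date'] - first_date).dt.days
print(orders)
```

Because transform returns a result aligned to the original index, the subtraction works row by row with no merging required.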