Pandas - Sort by Multiple Columns
Key Insights
• Pandas provides multiple methods for multi-column sorting including sort_values() with column lists, custom sort orders per column, and performance optimizations for large datasets
• Understanding sort stability, handling missing values with na_position, and controlling ascending/descending order per column are critical for complex sorting operations
• Advanced techniques like custom key functions, categorical sorting, and in-place sorting can significantly improve both code clarity and execution performance
Basic Multi-Column Sorting
The sort_values() method accepts a list of column names to sort by multiple columns. Pandas sorts by the first column, then uses subsequent columns as tiebreakers.
import pandas as pd
import numpy as np
# Create sample dataset
df = pd.DataFrame({
'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'Sales'],
'employee': ['John', 'Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
'salary': [75000, 85000, 95000, 95000, 65000, 75000],
'years': [3, 5, 7, 4, 2, 3]
})
# Sort by department, then by salary
sorted_df = df.sort_values(['department', 'salary'])
print(sorted_df)
Output:
  department employee  salary  years
4         HR     Dave   65000      2
2         IT      Bob   95000      7
3         IT    Carol   95000      4
0      Sales     John   75000      3
5      Sales      Eve   75000      3
1      Sales    Alice   85000      5
The sorting happens left-to-right in the column list. Within each department, employees are ordered by salary.
Controlling Sort Direction Per Column
Use the ascending parameter with a list of boolean values to specify sort direction for each column independently.
# Sort department ascending, salary descending
sorted_df = df.sort_values(
['department', 'salary'],
ascending=[True, False]
)
print(sorted_df)
Output:
  department employee  salary  years
4         HR     Dave   65000      2
2         IT      Bob   95000      7
3         IT    Carol   95000      4
1      Sales    Alice   85000      5
0      Sales     John   75000      3
5      Sales      Eve   75000      3
This pattern is essential for ranking scenarios where you want primary grouping in one direction and secondary ordering in another.
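As a sketch of that ranking pattern (reusing the sample columns from above, assuming a "top earner per department" question): sort descending within each group, then keep the first n rows per group with groupby().head(n).

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'Sales'],
    'employee': ['John', 'Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'salary': [75000, 85000, 95000, 95000, 65000, 75000],
})

# Sort so the highest salary comes first within each department,
# then keep only the top earner per department.
top_paid = (
    df.sort_values(['department', 'salary'], ascending=[True, False])
      .groupby('department')
      .head(1)
)
print(top_paid)
```

Because the descending sort puts each department's highest salary first, head(1) after groupby returns one top earner per department.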
Handling Missing Values
The na_position parameter controls where NaN values appear in the sorted output: 'first' or 'last' (the default).
df_with_nan = pd.DataFrame({
'category': ['A', 'B', 'A', 'B', 'A'],
'value1': [10, np.nan, 30, 20, np.nan],
'value2': [100, 200, np.nan, 400, 500]
})
# NaN values last (default)
sorted_last = df_with_nan.sort_values(
['category', 'value1'],
na_position='last'
)
print("NaN last:\n", sorted_last)
# NaN values first
sorted_first = df_with_nan.sort_values(
['category', 'value1'],
na_position='first'
)
print("\nNaN first:\n", sorted_first)
Note that na_position applies to all columns. For column-specific NaN handling, use custom key functions or pre-processing.
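One pre-processing sketch (the helper column name v1_missing is illustrative): to make NaN in value1 sort first while value2 keeps the default NaN-last behaviour, sort on a boolean flag for missing value1, since False sorts before True.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value1': [10, np.nan, 30, 20, np.nan],
    'value2': [100, 200, np.nan, 400, 500],
})

# Goal: NaN in value1 sorts first; NaN in value2 sorts last (the default).
# notna() is False for missing value1, and False sorts before True,
# so flagged rows lead within each category.
result = (
    df.assign(v1_missing=df['value1'].notna())
      .sort_values(['category', 'v1_missing', 'value1', 'value2'])
      .drop(columns='v1_missing')
)
print(result)
```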
In-Place Sorting for Memory Efficiency
The inplace=True parameter sorts the original DataFrame in place instead of returning a sorted copy, which can reduce peak memory use.
# Create large dataset
large_df = pd.DataFrame({
'col1': np.random.choice(['A', 'B', 'C'], 100000),
'col2': np.random.randint(0, 1000, 100000),
'col3': np.random.randn(100000)
})
# In-place sorting
large_df.sort_values(['col1', 'col2'], inplace=True)
# Reset index after sorting
large_df.reset_index(drop=True, inplace=True)
In-place operations are crucial for memory-constrained environments but prevent method chaining and make code harder to debug.
Custom Sort Keys with Key Functions
The key parameter accepts a function to transform values before sorting, similar to Python’s built-in sorted().
df_custom = pd.DataFrame({
'name': ['alice', 'Bob', 'CHARLIE', 'dave'],
'score': [85, 92, 78, 95],
'grade': ['B', 'A', 'C', 'A']
})
# Case-insensitive sort on name, then by score
sorted_custom = df_custom.sort_values(
['name', 'score'],
key=lambda x: x.str.lower() if x.dtype == 'object' else x
)
print(sorted_custom)
The key function applies to all columns being sorted. For per-column transformations, create temporary columns:
# Different transformations per column
df_custom['name_lower'] = df_custom['name'].str.lower()
sorted_df = df_custom.sort_values(['name_lower', 'score']).drop('name_lower', axis=1)
Categorical Data Sorting
Categorical columns with defined orders enable custom sorting logic beyond alphabetical or numerical.
# Define custom category order
size_order = pd.CategoricalDtype(['Small', 'Medium', 'Large', 'XLarge'], ordered=True)
df_sizes = pd.DataFrame({
'product': ['Shirt', 'Pants', 'Hat', 'Shoes', 'Jacket'],
'size': ['Large', 'Small', 'Medium', 'XLarge', 'Small'],
'price': [29.99, 49.99, 19.99, 89.99, 99.99]
})
# Convert to categorical
df_sizes['size'] = df_sizes['size'].astype(size_order)
# Sort by size then price
sorted_sizes = df_sizes.sort_values(['size', 'price'])
print(sorted_sizes)
Output:
product size price
1 Pants Small 49.99
4 Jacket Small 99.99
2 Hat Medium 19.99
0 Shirt Large 29.99
3 Shoes XLarge 89.99
Categorical sorting is significantly faster than using custom key functions for repeated operations.
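A quick sketch of the equivalence between the two approaches (the rank mapping here is chosen for illustration): an ordered categorical sort produces the same order as a key function that maps each label to its rank.

```python
import pandas as pd

sizes = pd.Series(['Large', 'Small', 'XLarge', 'Medium'])
order = ['Small', 'Medium', 'Large', 'XLarge']

# Ordered categorical: the dtype itself carries the sort order.
as_cat = sizes.astype(pd.CategoricalDtype(order, ordered=True)).sort_values()

# Equivalent key function: map each label to its position in `order`.
rank = {s: i for i, s in enumerate(order)}
as_key = sizes.sort_values(key=lambda s: s.map(rank))

print(as_cat.tolist())  # both yield Small, Medium, Large, XLarge
```

With the categorical approach the mapping work is done once at conversion time, which is why it wins for repeated sorts.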
Sort Stability and Index Management
Multi-column sorts in pandas are stable, meaning rows with equal sort keys maintain their original relative order. (Single-column sorts are only guaranteed stable with kind='stable' or kind='mergesort'.)
df_stable = pd.DataFrame({
'group': ['A', 'A', 'B', 'B', 'A'],
'value': [1, 1, 2, 2, 1]
})
# Add original position
df_stable['original_index'] = df_stable.index
sorted_stable = df_stable.sort_values(['group', 'value'])
print(sorted_stable)
After sorting, the index often becomes non-sequential. Use reset_index() to create a new sequential index:
# Keep old index as column
sorted_with_old = sorted_stable.reset_index(drop=False)
# Discard old index
sorted_clean = sorted_stable.reset_index(drop=True)
The ignore_index=True parameter in sort_values() combines sorting and index reset:
sorted_df = df.sort_values(['col1', 'col2'], ignore_index=True)
Performance Optimization
For large datasets, consider these optimization strategies:
import time
# Create large dataset
n = 1000000
df_large = pd.DataFrame({
'cat1': np.random.choice(['A', 'B', 'C', 'D'], n),
'cat2': np.random.choice(range(100), n),
'value': np.random.randn(n)
})
# Method 1: Standard sort
start = time.time()
result1 = df_large.sort_values(['cat1', 'cat2'])
print(f"Standard sort: {time.time() - start:.3f}s")
# Method 2: Convert to categorical first
start = time.time()
df_large['cat1'] = df_large['cat1'].astype('category')
df_large['cat2'] = df_large['cat2'].astype('category')
result2 = df_large.sort_values(['cat1', 'cat2'])
print(f"Categorical sort: {time.time() - start:.3f}s")
# Method 3: kind selects the algorithm (only applied for single-column sorts)
start = time.time()
result3 = df_large.sort_values(['cat1', 'cat2'], kind='mergesort')
print(f"Mergesort: {time.time() - start:.3f}s")
The kind parameter accepts 'quicksort', 'mergesort', 'heapsort', or 'stable'. Note that kind is only applied when sorting on a single column or label; multi-column sorts like the one above always use a stable algorithm internally. For single-column sorts, the default 'quicksort' offers the best average performance, while 'stable' (mergesort) guarantees that ties keep their original order.
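A small sketch of where kind actually matters, on a single-column sort: a stable kind preserves the input order among rows with equal keys.

```python
import pandas as pd

df = pd.DataFrame({'key': [2, 1, 2, 1], 'tag': ['a', 'b', 'c', 'd']})

# kind is honoured for single-column sorts; 'stable' keeps rows
# with equal keys in their original relative order.
stable = df.sort_values('key', kind='stable')
print(stable['tag'].tolist())  # key=1 ties: b, d; key=2 ties: a, c
```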
Practical Example: Sales Data Analysis
# Real-world sales dataset
sales_df = pd.DataFrame({
'region': ['North', 'South', 'North', 'East', 'South', 'West'],
'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1'],
'revenue': [150000, 200000, 175000, 125000, 225000, 100000],
'units': [500, 650, 550, 400, 700, 350]
})
# Sort by quarter, then revenue descending
quarter_order = pd.CategoricalDtype(['Q1', 'Q2', 'Q3', 'Q4'], ordered=True)
sales_df['quarter'] = sales_df['quarter'].astype(quarter_order)
top_performers = sales_df.sort_values(
['quarter', 'revenue'],
ascending=[True, False],
ignore_index=True
)
print(top_performers)
This produces a ranked list of sales performance by quarter, with highest revenue first within each quarter—a common business reporting requirement.
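Extending the same report (a sketch, reusing the sample data above): because revenue is already sorted descending inside each quarter, the first row per quarter is that quarter's top region, which drop_duplicates() extracts directly.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'West'],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1'],
    'revenue': [150000, 200000, 175000, 125000, 225000, 100000],
})

# Quarter ascending, revenue descending within each quarter.
ranked = sales_df.sort_values(['quarter', 'revenue'], ascending=[True, False])

# drop_duplicates keeps the first occurrence of each quarter,
# i.e. the highest-revenue row per quarter.
leaders = ranked.drop_duplicates('quarter')
print(leaders[['quarter', 'region', 'revenue']])
```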