Pandas - Sort by Multiple Columns
Key Insights
• Pandas provides multiple methods for multi-column sorting including sort_values() with column lists, custom sort orders per column, and performance optimizations for large datasets
• Understanding sort stability, handling missing values with na_position, and controlling ascending/descending order per column are critical for complex sorting operations
• Advanced techniques like custom key functions, categorical sorting, and in-place sorting can significantly improve both code clarity and execution performance
Basic Multi-Column Sorting
The sort_values() method accepts a list of column names to sort by multiple columns. Pandas sorts by the first column, then uses subsequent columns as tiebreakers.
import pandas as pd
import numpy as np
# Create sample dataset
df = pd.DataFrame({
'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'Sales'],
'employee': ['John', 'Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
'salary': [75000, 85000, 95000, 95000, 65000, 75000],
'years': [3, 5, 7, 4, 2, 3]
})
# Sort by department, then by salary
sorted_df = df.sort_values(['department', 'salary'])
print(sorted_df)
Output:
  department employee  salary  years
4         HR     Dave   65000      2
2         IT      Bob   95000      7
3         IT    Carol   95000      4
0      Sales     John   75000      3
5      Sales      Eve   75000      3
1      Sales    Alice   85000      5
The sorting happens left-to-right in the column list. Within each department, employees are ordered by salary.
Controlling Sort Direction Per Column
Use the ascending parameter with a list of boolean values to specify sort direction for each column independently.
# Sort department ascending, salary descending
sorted_df = df.sort_values(
['department', 'salary'],
ascending=[True, False]
)
print(sorted_df)
Output:
  department employee  salary  years
4         HR     Dave   65000      2
2         IT      Bob   95000      7
3         IT    Carol   95000      4
1      Sales    Alice   85000      5
0      Sales     John   75000      3
5      Sales      Eve   75000      3
This pattern is essential for ranking scenarios where you want primary grouping in one direction and secondary ordering in another.
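As a sketch of that ranking pattern (reusing the sample columns from above, assuming a "top earner per department" question): sort descending within each group, then keep the first n rows per group with groupby().head(n).

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'Sales'],
    'employee': ['John', 'Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'salary': [75000, 85000, 95000, 95000, 65000, 75000],
})

# Sort so the highest salary comes first within each department,
# then keep only the top earner per department.
top_paid = (
    df.sort_values(['department', 'salary'], ascending=[True, False])
      .groupby('department')
      .head(1)
)
print(top_paid)
```

Because the descending sort puts each department's highest salary first, head(1) after groupby returns one top earner per department.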
Handling Missing Values
The na_position parameter controls where NaN values appear in the sorted output: 'first' or 'last' (the default).
df_with_nan = pd.DataFrame({
'category': ['A', 'B', 'A', 'B', 'A'],
'value1': [10, np.nan, 30, 20, np.nan],
'value2': [100, 200, np.nan, 400, 500]
})
# NaN values last (default)
sorted_last = df_with_nan.sort_values(
['category', 'value1'],
na_position='last'
)
print("NaN last:\n", sorted_last)
# NaN values first
sorted_first = df_with_nan.sort_values(
['category', 'value1'],
na_position='first'
)
print("\nNaN first:\n", sorted_first)
Note that na_position applies to all columns. For column-specific NaN handling, use custom key functions or pre-processing.
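One pre-processing sketch (the helper column name v1_missing is illustrative): to make NaN in value1 sort first while value2 keeps the default NaN-last behaviour, sort on a boolean flag for missing value1, since False sorts before True.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value1': [10, np.nan, 30, 20, np.nan],
    'value2': [100, 200, np.nan, 400, 500],
})

# Goal: NaN in value1 sorts first; NaN in value2 sorts last (the default).
# notna() is False for missing value1, and False sorts before True,
# so flagged rows lead within each category.
result = (
    df.assign(v1_missing=df['value1'].notna())
      .sort_values(['category', 'v1_missing', 'value1', 'value2'])
      .drop(columns='v1_missing')
)
print(result)
```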
In-Place Sorting for Memory Efficiency
The inplace=True parameter sorts the original DataFrame in place instead of returning a sorted copy, which can reduce peak memory use.
# Create large dataset
large_df = pd.DataFrame({
'col1': np.random.choice(['A', 'B', 'C'], 100000),
'col2': np.random.randint(0, 1000, 100000),
'col3': np.random.randn(100000)
})
# In-place sorting
large_df.sort_values(['col1', 'col2'], inplace=True)
# Reset index after sorting
large_df.reset_index(drop=True, inplace=True)
In-place operations are crucial for memory-constrained environments but prevent method chaining and make code harder to debug.
Custom Sort Keys with Key Functions
The key parameter accepts a function to transform values before sorting, similar to Python’s built-in sorted().
df_custom = pd.DataFrame({
'name': ['alice', 'Bob', 'CHARLIE', 'dave'],
'score': [85, 92, 78, 95],
'grade': ['B', 'A', 'C', 'A']
})
# Case-insensitive sort on name, then by score
sorted_custom = df_custom.sort_values(
['name', 'score'],
key=lambda x: x.str.lower() if x.dtype == 'object' else x
)
print(sorted_custom)
The key function applies to all columns being sorted. For per-column transformations, create temporary columns:
# Different transformations per column
df_custom['name_lower'] = df_custom['name'].str.lower()
sorted_df = df_custom.sort_values(['name_lower', 'score']).drop('name_lower', axis=1)
Categorical Data Sorting
Categorical columns with defined orders enable custom sorting logic beyond alphabetical or numerical.
# Define custom category order
size_order = pd.CategoricalDtype(['Small', 'Medium', 'Large', 'XLarge'], ordered=True)
df_sizes = pd.DataFrame({
'product': ['Shirt', 'Pants', 'Hat', 'Shoes', 'Jacket'],
'size': ['Large', 'Small', 'Medium', 'XLarge', 'Small'],
'price': [29.99, 49.99, 19.99, 89.99, 99.99]
})
# Convert to categorical
df_sizes['size'] = df_sizes['size'].astype(size_order)
# Sort by size then price
sorted_sizes = df_sizes.sort_values(['size', 'price'])
print(sorted_sizes)
Output:
product size price
1 Pants Small 49.99
4 Jacket Small 99.99
2 Hat Medium 19.99
0 Shirt Large 29.99
3 Shoes XLarge 89.99
Categorical sorting is significantly faster than using custom key functions for repeated operations.
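A quick sketch of the equivalence between the two approaches (the rank mapping here is chosen for illustration): an ordered categorical sort produces the same order as a key function that maps each label to its rank.

```python
import pandas as pd

sizes = pd.Series(['Large', 'Small', 'XLarge', 'Medium'])
order = ['Small', 'Medium', 'Large', 'XLarge']

# Ordered categorical: the dtype itself carries the sort order.
as_cat = sizes.astype(pd.CategoricalDtype(order, ordered=True)).sort_values()

# Equivalent key function: map each label to its position in `order`.
rank = {s: i for i, s in enumerate(order)}
as_key = sizes.sort_values(key=lambda s: s.map(rank))

print(as_cat.tolist())  # both yield Small, Medium, Large, XLarge
```

With the categorical approach the mapping work is done once at conversion time, which is why it wins for repeated sorts.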
Sort Stability and Index Management
Multi-column sorts in pandas are stable, meaning rows with equal sort keys maintain their original relative order. (Single-column sorts are only guaranteed stable with kind='stable' or kind='mergesort'.)
df_stable = pd.DataFrame({
'group': ['A', 'A', 'B', 'B', 'A'],
'value': [1, 1, 2, 2, 1]
})
# Add original position
df_stable['original_index'] = df_stable.index
sorted_stable = df_stable.sort_values(['group', 'value'])
print(sorted_stable)
After sorting, the index often becomes non-sequential. Use reset_index() to create a new sequential index:
# Keep old index as column
sorted_with_old = sorted_stable.reset_index(drop=False)
# Discard old index
sorted_clean = sorted_stable.reset_index(drop=True)
The ignore_index=True parameter in sort_values() combines sorting and index reset:
sorted_df = df.sort_values(['col1', 'col2'], ignore_index=True)
Performance Optimization
For large datasets, consider these optimization strategies:
import time
# Create large dataset
n = 1000000
df_large = pd.DataFrame({
'cat1': np.random.choice(['A', 'B', 'C', 'D'], n),
'cat2': np.random.choice(range(100), n),
'value': np.random.randn(n)
})
# Method 1: Standard sort
start = time.time()
result1 = df_large.sort_values(['cat1', 'cat2'])
print(f"Standard sort: {time.time() - start:.3f}s")
# Method 2: Convert to categorical first
start = time.time()
df_large['cat1'] = df_large['cat1'].astype('category')
df_large['cat2'] = df_large['cat2'].astype('category')
result2 = df_large.sort_values(['cat1', 'cat2'])
print(f"Categorical sort: {time.time() - start:.3f}s")
# Method 3: kind selects the algorithm (only applied for single-column sorts)
start = time.time()
result3 = df_large.sort_values(['cat1', 'cat2'], kind='mergesort')
print(f"Mergesort: {time.time() - start:.3f}s")
The kind parameter accepts 'quicksort', 'mergesort', 'heapsort', or 'stable'. Note that kind is only applied when sorting on a single column or label; multi-column sorts like the one above always use a stable algorithm internally. For single-column sorts, the default 'quicksort' offers the best average performance, while 'stable' (mergesort) guarantees that ties keep their original order.
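A small sketch of where kind actually matters, on a single-column sort: a stable kind preserves the input order among rows with equal keys.

```python
import pandas as pd

df = pd.DataFrame({'key': [2, 1, 2, 1], 'tag': ['a', 'b', 'c', 'd']})

# kind is honoured for single-column sorts; 'stable' keeps rows
# with equal keys in their original relative order.
stable = df.sort_values('key', kind='stable')
print(stable['tag'].tolist())  # key=1 ties: b, d; key=2 ties: a, c
```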
Practical Example: Sales Data Analysis
# Real-world sales dataset
sales_df = pd.DataFrame({
'region': ['North', 'South', 'North', 'East', 'South', 'West'],
'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1'],
'revenue': [150000, 200000, 175000, 125000, 225000, 100000],
'units': [500, 650, 550, 400, 700, 350]
})
# Sort by quarter, then revenue descending
quarter_order = pd.CategoricalDtype(['Q1', 'Q2', 'Q3', 'Q4'], ordered=True)
sales_df['quarter'] = sales_df['quarter'].astype(quarter_order)
top_performers = sales_df.sort_values(
['quarter', 'revenue'],
ascending=[True, False],
ignore_index=True
)
print(top_performers)
This produces a ranked list of sales performance by quarter, with highest revenue first within each quarter—a common business reporting requirement.
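Extending the same report (a sketch, reusing the sample data above): because revenue is already sorted descending inside each quarter, the first row per quarter is that quarter's top region, which drop_duplicates() extracts directly.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'West'],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1'],
    'revenue': [150000, 200000, 175000, 125000, 225000, 100000],
})

# Quarter ascending, revenue descending within each quarter.
ranked = sales_df.sort_values(['quarter', 'revenue'], ascending=[True, False])

# drop_duplicates keeps the first occurrence of each quarter,
# i.e. the highest-revenue row per quarter.
leaders = ranked.drop_duplicates('quarter')
print(leaders[['quarter', 'region', 'revenue']])
```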