How to Sort by Multiple Columns in Pandas
Key Insights
- Pass a list of column names to sort_values(by=[...]) to sort by multiple columns, with earlier columns taking priority over later ones in the sort hierarchy
- Control sort direction independently for each column by passing a list of booleans to the ascending parameter; this enables powerful mixed sorting like "category A-Z, then price high-to-low"
- Use ignore_index=True to get clean sequential indexing after sorting, which prevents confusing index values when you've filtered or will iterate over the result
Introduction
Sorting data by a single column is straightforward, but real-world analysis rarely stays that simple. You need to sort sales data by region first, then by revenue within each region. You need employee records ordered by department, then by seniority. You need transaction logs sorted by date, then by timestamp within each day.
Pandas handles multi-column sorting elegantly through the sort_values() method. Understanding its full capabilities—including mixed sort directions, NaN handling, and performance implications—separates competent data work from sloppy analysis. This article covers everything you need to sort DataFrames by multiple columns effectively.
Basic Multi-Column Sorting with sort_values()
The foundation of multi-column sorting is passing a list to the by parameter. Pandas sorts by the first column, then uses subsequent columns to break ties.
import pandas as pd
# Sample employee data
df = pd.DataFrame({
    'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'department': ['Engineering', 'Sales', 'Engineering', 'Sales', 'Engineering', 'Sales'],
    'salary': [95000, 72000, 88000, 85000, 92000, 68000]
})
# Sort by department, then by salary within each department
sorted_df = df.sort_values(by=['department', 'salary'])
print(sorted_df)
Output:
employee department salary
2 Charlie Engineering 88000
4 Eve Engineering 92000
0 Alice Engineering 95000
5 Frank Sales 68000
1 Bob Sales 72000
3 Diana Sales 85000
The sort groups all Engineering employees together, then all Sales employees. Within each department, employees appear in ascending salary order. The hierarchy matters: the first column in the list has the highest sorting priority.
You can sort by as many columns as needed:
# Sort by three columns
df.sort_values(by=['department', 'salary', 'employee'])
Controlling Sort Order per Column
The real power emerges when you need different sort directions for different columns. Pass a list of booleans to ascending that matches your column list.
# Product data
products = pd.DataFrame({
    'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics'],
    'product': ['Laptop', 'Jacket', 'Phone', 'Shoes', 'Tablet'],
    'price': [1200, 150, 800, 89, 450]
})
# Sort by category A-Z, then by price high-to-low
sorted_products = products.sort_values(
    by=['category', 'price'],
    ascending=[True, False]
)
print(sorted_products)
Output:
category product price
1 Clothing Jacket 150
3 Clothing Shoes 89
0 Electronics Laptop 1200
2 Electronics Phone 800
4 Electronics Tablet 450
Categories appear alphabetically (Clothing before Electronics), but within each category, products are sorted by price from highest to lowest. This pattern is extremely common in business reporting—you want logical groupings with the most important items (highest revenue, most recent, etc.) at the top of each group.
A common mistake is passing a single boolean when you need a list:
# Wrong: applies same direction to all columns
df.sort_values(by=['department', 'salary'], ascending=False)
# Right: different direction per column
df.sort_values(by=['department', 'salary'], ascending=[True, False])
Handling Missing Values
Real datasets contain missing values. By default, Pandas places NaN values at the end of sorted results regardless of sort direction. The na_position parameter gives you control.
# Data with missing values
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'North'],
    'revenue': [50000, None, 75000, 62000, None]
})
# Default behavior: NaN at end
print(sales.sort_values(by='revenue'))
Output:
region revenue
0 North 50000.0
3 South 62000.0
2 North 75000.0
1 South NaN
4 North NaN
To place NaN values first:
# NaN values at the beginning
print(sales.sort_values(by='revenue', na_position='first'))
Output:
region revenue
1 South NaN
4 North NaN
0 North 50000.0
3 South 62000.0
2 North 75000.0
This is useful when missing data requires attention—putting it first ensures analysts see it immediately rather than discovering it buried at the bottom of a report.
When sorting by multiple columns, na_position applies to all columns uniformly. If you need different NaN handling per column, you’ll need a more complex approach involving temporary columns or sequential sorts.
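The sequential-sort workaround relies on stable sorting: a stable sort preserves the relative order of tied rows, so you can sort by the secondary column first (with its own na_position), then by the primary column. This is a minimal sketch of that idea; the column names and data are illustrative, not from a real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', None, 'South'],
    'revenue': [50000, None, 75000, 62000]
})

# Step 1: sort by the secondary column, missing revenue first
step1 = df.sort_values(by='revenue', na_position='first')

# Step 2: sort by the primary column with a stable sort, missing region last.
# The stable sort preserves the revenue order within each region.
result = step1.sort_values(by='region', na_position='last', kind='stable')
print(result)
```

Each column gets its own na_position this way, at the cost of two sort passes instead of one.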
Sorting with Index Reset
After sorting, your DataFrame retains its original index values, which can look messy and cause confusion:
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 92, 78]
})
sorted_df = df.sort_values(by='score')
print(sorted_df)
Output:
name score
2 Charlie 78
0 Alice 85
1 Bob 92
The index values (2, 0, 1) reflect original positions, not current order. Use ignore_index=True for clean sequential indexing:
sorted_df = df.sort_values(by='score', ignore_index=True)
print(sorted_df)
Output:
name score
0 Charlie 78
1 Alice 85
2 Bob 92
When should you reset the index? Always do it when:
- You'll iterate over the DataFrame using iloc
- You're preparing data for export or display
- The original index has no meaningful information
Keep the original index when:
- You need to trace rows back to their source position
- The index contains meaningful data (dates, IDs)
- You’ll merge or join with other DataFrames using the index
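As a quick sketch of the first point, the retained index doubles as a map from each sorted row back to its original position (the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 92, 78]
})

# Without ignore_index, the index records where each row came from
ranked = df.sort_values(by='score', ascending=False)
print(ranked.index.tolist())  # [1, 0, 2]
```

Here the top-ranked row (Bob) came from original position 1, information that ignore_index=True would discard.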
In-Place Sorting vs. Returning a Copy
By default, sort_values() returns a new DataFrame, leaving the original unchanged. The inplace=True parameter modifies the DataFrame directly.
import time
import numpy as np
# Create a larger dataset for meaningful timing
large_df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
    'value': np.random.randn(1000000)
})
# Approach 1: Return a copy
start = time.time()
sorted_copy = large_df.sort_values(by=['category', 'value'])
copy_time = time.time() - start
# Approach 2: In-place modification
large_df_inplace = large_df.copy() # Fresh copy for fair comparison
start = time.time()
large_df_inplace.sort_values(by=['category', 'value'], inplace=True)
inplace_time = time.time() - start
print(f"Copy approach: {copy_time:.4f} seconds")
print(f"In-place approach: {inplace_time:.4f} seconds")
Typical output:
Copy approach: 0.2847 seconds
In-place approach: 0.2823 seconds
The performance difference is negligible in most cases. The Pandas development team has discouraged inplace=True because it rarely provides meaningful speedup and makes code harder to reason about.
My recommendation: avoid inplace=True. The copy approach enables method chaining and makes your code’s data flow explicit:
# Clean, chainable approach
result = (df
    .sort_values(by=['department', 'salary'], ascending=[True, False])
    .head(10)
    .reset_index(drop=True))
# Awkward in-place approach
df.sort_values(by=['department', 'salary'], ascending=[True, False], inplace=True)
df = df.head(10)
df.reset_index(drop=True, inplace=True)
Practical Use Case: Multi-Level Business Data
Let’s combine everything with a realistic scenario. You’re analyzing employee data and need to create a report sorted by department, then by job level (with senior roles first), then by hire date (most recent first within each level).
import pandas as pd
from datetime import date
# Employee dataset
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'department': ['Engineering', 'Engineering', 'Sales', 'Sales', 'Engineering', 'Sales', 'Engineering', 'Sales'],
    'level': ['Senior', 'Junior', 'Senior', 'Junior', 'Mid', 'Senior', 'Junior', 'Mid'],
    'hire_date': pd.to_datetime(['2019-03-15', '2022-07-01', '2020-01-10', '2023-02-28',
                                 '2021-06-15', '2018-11-20', '2023-01-05', '2022-09-12']),
    'salary': [125000, 75000, 95000, 62000, 98000, 105000, 72000, 78000]
})
# Define level order (Senior > Mid > Junior)
level_order = {'Senior': 0, 'Mid': 1, 'Junior': 2}
employees['level_rank'] = employees['level'].map(level_order)
# Sort by department (A-Z), level (Senior first), hire date (newest first)
report = (employees
    .sort_values(
        by=['department', 'level_rank', 'hire_date'],
        ascending=[True, True, False]
    )
    .drop(columns=['level_rank'])  # Remove helper column
    .reset_index(drop=True))
print(report)
Output:
name department level hire_date salary
0 Alice Engineering Senior 2019-03-15 125000
1 Eve Engineering Mid 2021-06-15 98000
2 Grace Engineering Junior 2023-01-05 72000
3 Bob Engineering Junior 2022-07-01 75000
4 Charlie Sales Senior 2020-01-10 95000
5 Frank Sales Senior 2018-11-20 105000
6 Henry Sales Mid 2022-09-12 78000
7 Diana Sales Junior 2023-02-28 62000
This pattern—creating a helper column for custom sort orders, then dropping it—is essential for categorical data that doesn’t sort alphabetically in the order you need.
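An alternative to the helper column, not used in the code above but worth knowing, is to encode the custom order directly in the column's dtype with an ordered Categorical. This is a minimal sketch with illustrative data:

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Bob', 'Alice', 'Eve'],
    'level': ['Junior', 'Senior', 'Mid']
})

# Declare the sort order in the dtype: listing Senior first means it
# sorts first in ascending order, with no helper column to drop later
employees['level'] = pd.Categorical(
    employees['level'],
    categories=['Senior', 'Mid', 'Junior'],
    ordered=True
)

print(employees.sort_values(by='level'))
```

The trade-off is that the column's dtype changes, which matters if downstream code expects plain strings; the helper-column approach leaves the original data untouched.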
Conclusion
Multi-column sorting in Pandas is straightforward once you understand the core mechanics: pass column names as a list to by, match that list with booleans in ascending for mixed sort directions, and use ignore_index=True for clean output. Handle edge cases with na_position, and skip inplace=True unless you have a specific reason to use it.
The combination of these techniques handles virtually any sorting requirement you’ll encounter in data analysis. Master them, and you’ll spend less time wrestling with data order and more time extracting insights.