Pandas - Sort DataFrame by Column (sort_values)
• The `sort_values()` method is the primary way to sort DataFrames by one or multiple columns, replacing the deprecated `sort()` and `sort_index()` methods for column-based sorting
Key Insights
• The sort_values() method is the primary way to sort DataFrames by one or multiple columns, replacing the deprecated sort() and sort_index() methods for column-based sorting
• Sorting can be customized with parameters for ascending/descending order, handling missing values, sorting algorithms, and in-place modifications
• Multi-column sorting follows a hierarchical order where subsequent columns act as tiebreakers, and you can specify different sort orders for each column
Basic Single Column Sorting
The sort_values() method sorts a DataFrame by the values in one or more columns. The most basic usage requires only the column name:
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'age': [25, 30, 22, 35, 28],
'salary': [50000, 60000, 45000, 75000, 55000],
'department': ['IT', 'HR', 'IT', 'Finance', 'HR']
})
# Sort by age (ascending by default)
sorted_df = df.sort_values('age')
print(sorted_df)
Output:
name age salary department
2 Charlie 22 45000 IT
0 Alice 25 50000 IT
4 Eve 28 55000 HR
1 Bob 30 60000 HR
3 David 35 75000 Finance
The original DataFrame remains unchanged. To sort in descending order, use the ascending parameter:
# Sort by salary (descending)
sorted_df = df.sort_values('salary', ascending=False)
print(sorted_df)
Output:
name age salary department
3 David 35 75000 Finance
1 Bob 30 60000 HR
4 Eve 28 55000 HR
0 Alice 25 50000 IT
2 Charlie 22 45000 IT
Multi-Column Sorting
Pass a list of column names to sort by multiple columns. The sorting is hierarchical—the first column is the primary sort key, the second breaks ties, and so on:
df_multi = pd.DataFrame({
'department': ['IT', 'HR', 'IT', 'HR', 'IT', 'Finance'],
'level': ['Senior', 'Junior', 'Senior', 'Senior', 'Junior', 'Senior'],
'salary': [80000, 45000, 75000, 55000, 50000, 90000]
})
# Sort by department, then by level
sorted_df = df_multi.sort_values(['department', 'level'])
print(sorted_df)
Output:
department level salary
5 Finance Senior 90000
1 HR Junior 45000
3 HR Senior 55000
4 IT Junior 50000
0 IT Senior 80000
2 IT Senior 75000
Specify different sort orders for each column by passing a list to ascending:
# Sort by department (ascending), then salary (descending)
sorted_df = df_multi.sort_values(
['department', 'salary'],
ascending=[True, False]
)
print(sorted_df)
Output:
department level salary
5 Finance Senior 90000
3 HR Senior 55000
1 HR Junior 45000
0 IT Senior 80000
2 IT Senior 75000
4 IT Junior 50000
Handling Missing Values
The na_position parameter controls where NaN values appear in the sorted result. Options are 'last' (default) or 'first':
df_na = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E'],
'price': [100, np.nan, 75, 120, np.nan],
'stock': [50, 30, np.nan, 80, 20]
})
# NaN values at the end (default)
sorted_last = df_na.sort_values('price')
print("NaN last:")
print(sorted_last)
# NaN values at the beginning
sorted_first = df_na.sort_values('price', na_position='first')
print("\nNaN first:")
print(sorted_first)
Output:
NaN last:
product price stock
2 C 75.0 NaN
0 A 100.0 50.0
3 D 120.0 80.0
1 B NaN 30.0
4 E NaN 20.0
NaN first:
product price stock
1 B NaN 30.0
4 E NaN 20.0
2 C 75.0 NaN
0 A 100.0 50.0
3 D 120.0 80.0
In-Place Sorting
The inplace parameter modifies the original DataFrame instead of returning a new one. This saves memory for large DataFrames:
df = pd.DataFrame({
'id': [3, 1, 4, 2],
'value': [30, 10, 40, 20]
})
print("Before:", df['id'].tolist())
# Modify the DataFrame in place
df.sort_values('id', inplace=True)
print("After:", df['id'].tolist())
# Output: Before: [3, 1, 4, 2]
# After: [1, 2, 3, 4]
Be cautious with inplace=True as it makes the operation irreversible without reloading the data.
Sorting with Custom Keys
The key parameter accepts a function to transform values before sorting, similar to Python’s built-in sorted():
df_names = pd.DataFrame({
'name': ['alice', 'BOB', 'Charlie', 'DAVID'],
'score': [85, 92, 78, 88]
})
# Case-insensitive sorting
sorted_df = df_names.sort_values('name', key=lambda x: x.str.lower())
print(sorted_df)
Output:
name score
0 alice 85
1 BOB 92
2 Charlie 78
3 DAVID 88
This is useful for custom sorting logic without modifying the original data:
df_dates = pd.DataFrame({
'event': ['A', 'B', 'C', 'D'],
'month': ['March', 'January', 'December', 'February']
})
# Sort by actual month order
month_order = {
'January': 1, 'February': 2, 'March': 3, 'April': 4,
'May': 5, 'June': 6, 'July': 7, 'August': 8,
'September': 9, 'October': 10, 'November': 11, 'December': 12
}
sorted_df = df_dates.sort_values('month', key=lambda x: x.map(month_order))
print(sorted_df)
Output:
event month
1 B January
3 D February
0 A March
2 C December
Sorting Algorithms
The kind parameter specifies the sorting algorithm. Options include 'quicksort' (default), 'mergesort', 'heapsort', and 'stable':
df_stable = pd.DataFrame({
'category': ['A', 'B', 'A', 'B', 'A'],
'value': [1, 1, 1, 1, 1],
'order': [1, 2, 3, 4, 5]
})
# Stable sort preserves original order for equal values
sorted_stable = df_stable.sort_values('value', kind='stable')
print(sorted_stable)
Mergesort and stable sorting guarantee that rows with equal sort keys maintain their relative order. This matters when sorting by multiple criteria sequentially or when the original order has meaning.
Sorting by Index
While sort_values() sorts by column values, use sort_index() for index-based sorting:
df_idx = pd.DataFrame({
'value': [10, 20, 30, 40]
}, index=[3, 1, 4, 2])
# Sort by index
sorted_by_idx = df_idx.sort_index()
print(sorted_by_idx)
Output:
value
1 20
2 40
3 10
4 30
Resetting Index After Sorting
Sorting changes row order but preserves the original index. Use reset_index() to create a sequential index:
df = pd.DataFrame({
'name': ['Charlie', 'Alice', 'Bob'],
'score': [85, 92, 78]
})
sorted_df = df.sort_values('name')
print("With original index:")
print(sorted_df)
sorted_reset = df.sort_values('name').reset_index(drop=True)
print("\nWith reset index:")
print(sorted_reset)
Output:
With original index:
name score
1 Alice 92
2 Bob 78
0 Charlie 85
With reset index:
name score
0 Alice 92
1 Bob 78
2 Charlie 85
The drop=True parameter discards the old index; without it, the old index becomes a new column.
Performance Considerations
For large DataFrames, sorting performance depends on data types and algorithms. Numerical columns sort faster than strings, and sorted or nearly-sorted data benefits from stable algorithms:
import time
# Create large DataFrame
large_df = pd.DataFrame({
'values': np.random.randint(0, 1000000, 1000000)
})
start = time.time()
sorted_quick = large_df.sort_values('values', kind='quicksort')
print(f"Quicksort: {time.time() - start:.3f}s")
start = time.time()
sorted_merge = large_df.sort_values('values', kind='mergesort')
print(f"Mergesort: {time.time() - start:.3f}s")
For memory efficiency with large datasets, use inplace=True or consider sorting only necessary columns before operations that benefit from sorted data, like merge operations or groupby with sorted keys.