How to Sort a DataFrame in Pandas
Sorting is one of the most frequent operations you'll perform during data analysis. Whether you're finding top performers, organizing time-series data chronologically, or simply making a DataFrame...
Key Insights
- Use
sort_values()for sorting by column data andsort_index()for sorting by row or column labels—knowing when to use each saves debugging time - Multi-column sorting with different sort orders per column requires passing lists to both
byandascendingparameters - Avoid
inplace=Truein modern pandas code; assignment (df = df.sort_values()) is clearer, equally performant, and plays better with method chaining
Sorting is one of the most frequent operations you’ll perform during data analysis. Whether you’re finding top performers, organizing time-series data chronologically, or simply making a DataFrame easier to read, pandas provides two primary methods: sort_values() for sorting by column data and sort_index() for sorting by labels. This article covers both methods comprehensively, including edge cases that trip up even experienced developers.
Sorting by Column Values with sort_values()
The sort_values() method is your go-to for sorting rows based on the data in one or more columns. At its simplest, you pass the column name to the by parameter.
import pandas as pd
# Create a sample sales DataFrame
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Sprocket', 'Gizmo', 'Doohickey'],
'revenue': [15000, 42000, 8500, 42000, 23000],
'units_sold': [150, 280, 425, 190, 310]
})
# Sort by revenue (ascending by default)
df_sorted = df.sort_values(by='revenue')
print(df_sorted)
Output:
product revenue units_sold
2 Sprocket 8500 425
0 Widget 15000 150
4 Doohickey 23000 310
1 Gadget 42000 280
3 Gizmo 42000 190
Notice that the original index values are preserved. This is intentional—it lets you trace rows back to their original positions.
To sort in descending order, set ascending=False:
# Sort by revenue, highest first
df_top_revenue = df.sort_values(by='revenue', ascending=False)
print(df_top_revenue)
Output:
product revenue units_sold
1 Gadget 42000 280
3 Gizmo 42000 190
4 Doohickey 23000 310
0 Widget 15000 150
2 Sprocket 8500 425
When values are tied (Gadget and Gizmo both have 42000 revenue), pandas preserves their relative order from the original DataFrame. This is called a “stable sort,” and it’s the default behavior.
Multi-Column Sorting
Real-world sorting often requires multiple criteria. You might want to sort products by category first, then by date within each category. Pass a list of column names to the by parameter.
df = pd.DataFrame({
'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics'],
'product': ['Phone', 'Shirt', 'Laptop', 'Pants', 'Tablet'],
'sale_date': pd.to_datetime(['2024-03-15', '2024-03-10', '2024-03-12', '2024-03-15', '2024-03-10']),
'revenue': [800, 45, 1200, 60, 450]
})
# Sort by category (ascending), then by sale_date (descending)
df_sorted = df.sort_values(
by=['category', 'sale_date'],
ascending=[True, False]
)
print(df_sorted)
Output:
category product sale_date revenue
1 Clothing Shirt 2024-03-10 45
3 Clothing Pants 2024-03-15 60
0 Electronics Phone 2024-03-15 800
2 Electronics Laptop 2024-03-12 1200
4 Electronics Tablet 2024-03-10 450
Wait—that output shows Clothing before Electronics because ‘C’ comes before ‘E’ alphabetically. Let me fix the example to show the intended behavior more clearly:
# Sort by category (ascending), then by date (descending within each category)
df_sorted = df.sort_values(
by=['category', 'sale_date'],
ascending=[True, False]
)
The key insight here is that the ascending parameter accepts a list of booleans that maps one-to-one with the by list. The first boolean controls the first column’s sort order, the second controls the second column, and so on. This gives you fine-grained control over complex sorting requirements.
Sorting by Index with sort_index()
While sort_values() sorts by data, sort_index() sorts by the DataFrame’s index labels. This is particularly useful for time-series data where dates are the index.
# Create a time-series DataFrame with an unordered DateTimeIndex
dates = pd.to_datetime(['2024-03-15', '2024-03-10', '2024-03-12', '2024-03-11', '2024-03-14'])
df = pd.DataFrame({
'temperature': [72, 65, 68, 66, 71],
'humidity': [45, 52, 48, 50, 46]
}, index=dates)
print("Before sorting:")
print(df)
# Sort by the DateTimeIndex
df_sorted = df.sort_index()
print("\nAfter sorting:")
print(df_sorted)
Output:
Before sorting:
temperature humidity
2024-03-15 72 45
2024-03-10 65 52
2024-03-12 68 48
2024-03-11 66 50
2024-03-14 71 46
After sorting:
temperature humidity
2024-03-10 65 52
2024-03-11 66 50
2024-03-12 68 48
2024-03-14 71 46
2024-03-15 72 45
You can also sort column labels by passing axis=1:
df = pd.DataFrame({
'zebra': [1, 2],
'apple': [3, 4],
'mango': [5, 6]
})
# Sort columns alphabetically
df_sorted = df.sort_index(axis=1)
print(df_sorted)
Output:
apple mango zebra
0 3 5 1
1 4 6 2
Handling Missing Values
Missing values (NaN) are inevitable in real datasets. By default, pandas places NaN values at the end of sorted results, regardless of ascending or descending order. The na_position parameter gives you control over this behavior.
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Sprocket', 'Gizmo', 'Doohickey'],
'revenue': [15000, None, 8500, 42000, None]
})
# Default: NaNs at the end
df_nan_last = df.sort_values(by='revenue')
print("NaNs at end (default):")
print(df_nan_last)
# NaNs at the beginning
df_nan_first = df.sort_values(by='revenue', na_position='first')
print("\nNaNs at beginning:")
print(df_nan_first)
Output:
NaNs at end (default):
product revenue
2 Sprocket 8500.0
0 Widget 15000.0
3 Gizmo 42000.0
1 Gadget NaN
4 Doohickey NaN
NaNs at beginning:
product revenue
1 Gadget NaN
4 Doohickey NaN
2 Sprocket 8500.0
0 Widget 15000.0
3 Gizmo 42000.0
This is particularly useful when you want to quickly identify rows with missing data. Sort with na_position='first' and the problem rows appear at the top.
In-Place Sorting and Performance Tips
You’ll see inplace=True in older pandas code and tutorials. While it works, I recommend avoiding it for several reasons.
# The inplace approach (not recommended)
df.sort_values(by='revenue', inplace=True)
# The assignment approach (recommended)
df = df.sort_values(by='revenue')
The assignment approach is clearer about what’s happening, works with method chaining, and contrary to popular belief, doesn’t create significant memory overhead—pandas optimizes this internally. The inplace parameter is being soft-deprecated in pandas, with core developers recommending assignment instead.
For resetting the index after sorting, use ignore_index=True:
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Sprocket'],
'revenue': [15000, 42000, 8500]
})
# Sort and reset index in one operation
df_sorted = df.sort_values(by='revenue', ignore_index=True)
print(df_sorted)
Output:
product revenue
0 Sprocket 8500
1 Widget 15000
2 Gadget 42000
This is more efficient than calling reset_index(drop=True) separately.
For large DataFrames, consider these performance tips:
- Sort once, not repeatedly. If you need sorted data multiple times, sort once and store the result.
- Use categorical dtypes. If sorting by a column with limited unique values, converting to categorical can speed up sorting significantly.
- Consider the
kindparameter. The default ‘quicksort’ is usually fastest, but ‘mergesort’ is stable and ‘heapsort’ uses less memory.
# For very large DataFrames with categorical data
df['category'] = df['category'].astype('category')
df_sorted = df.sort_values(by='category')
Conclusion
Sorting in pandas comes down to two methods: sort_values() for data-based sorting and sort_index() for label-based sorting. Master these with their key parameters, and you’ll handle any sorting requirement efficiently.
| Parameter | Method | Purpose |
|---|---|---|
by |
sort_values() |
Column(s) to sort by |
ascending |
Both | Sort order (True/False or list) |
na_position |
Both | Where to place NaN values (‘first’ or ’last’) |
ignore_index |
Both | Reset index after sorting |
axis |
Both | Sort rows (0) or columns (1) |
kind |
Both | Algorithm: ‘quicksort’, ‘mergesort’, ‘heapsort’ |
Use assignment over inplace=True, remember that ascending accepts lists for multi-column sorts, and leverage ignore_index when you don’t need to preserve original index values. These practices will make your pandas code cleaner and more maintainable.