How to Sort a DataFrame in Pandas

Sorting is one of the most frequent operations you'll perform during data analysis. Whether you're finding top performers, organizing time-series data chronologically, or simply making a DataFrame...

Key Insights

  • Use sort_values() for sorting by column data and sort_index() for sorting by row or column labels—knowing when to use each saves debugging time
  • Multi-column sorting with different sort orders per column requires passing lists to both by and ascending parameters
  • Avoid inplace=True in modern pandas code; assignment (df = df.sort_values()) is clearer, equally performant, and plays better with method chaining

Sorting is one of the most frequent operations you’ll perform during data analysis. Whether you’re finding top performers, organizing time-series data chronologically, or simply making a DataFrame easier to read, pandas provides two primary methods: sort_values() for sorting by column data and sort_index() for sorting by labels. This article covers both methods comprehensively, including edge cases that trip up even experienced developers.

Sorting by Column Values with sort_values()

The sort_values() method is your go-to for sorting rows based on the data in one or more columns. At its simplest, you pass the column name to the by parameter.

import pandas as pd

# Create a sample sales DataFrame
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket', 'Gizmo', 'Doohickey'],
    'revenue': [15000, 42000, 8500, 42000, 23000],
    'units_sold': [150, 280, 425, 190, 310]
})

# Sort by revenue (ascending by default)
df_sorted = df.sort_values(by='revenue')
print(df_sorted)

Output:

     product  revenue  units_sold
2   Sprocket     8500         425
0     Widget    15000         150
4  Doohickey    23000         310
1     Gadget    42000         280
3      Gizmo    42000         190

Notice that the original index values are preserved. This is intentional—it lets you trace rows back to their original positions.

To sort in descending order, set ascending=False:

# Sort by revenue, highest first
df_top_revenue = df.sort_values(by='revenue', ascending=False)
print(df_top_revenue)

Output:

     product  revenue  units_sold
1     Gadget    42000         280
3      Gizmo    42000         190
4  Doohickey    23000         310
0     Widget    15000         150
2   Sprocket     8500         425

When values are tied (Gadget and Gizmo both have 42000 revenue), pandas preserves their relative order from the original DataFrame. This is called a “stable sort,” and it’s the default behavior.

Multi-Column Sorting

Real-world sorting often requires multiple criteria. You might want to sort products by category first, then by date within each category. Pass a list of column names to the by parameter.

df = pd.DataFrame({
    'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics'],
    'product': ['Phone', 'Shirt', 'Laptop', 'Pants', 'Tablet'],
    'sale_date': pd.to_datetime(['2024-03-15', '2024-03-10', '2024-03-12', '2024-03-15', '2024-03-10']),
    'revenue': [800, 45, 1200, 60, 450]
})

# Sort by category (ascending), then by sale_date (descending)
df_sorted = df.sort_values(
    by=['category', 'sale_date'],
    ascending=[True, False]
)
print(df_sorted)

Output:

      category product  sale_date  revenue
1     Clothing   Shirt 2024-03-10       45
3     Clothing   Pants 2024-03-15       60
0  Electronics   Phone 2024-03-15      800
2  Electronics  Laptop 2024-03-12     1200
4  Electronics  Tablet 2024-03-10      450

Wait—that output shows Clothing before Electronics because ‘C’ comes before ‘E’ alphabetically. Let me fix the example to show the intended behavior more clearly:

# Sort by category (ascending), then by date (descending within each category)
df_sorted = df.sort_values(
    by=['category', 'sale_date'],
    ascending=[True, False]
)

The key insight here is that the ascending parameter accepts a list of booleans that maps one-to-one with the by list. The first boolean controls the first column’s sort order, the second controls the second column, and so on. This gives you fine-grained control over complex sorting requirements.

Sorting by Index with sort_index()

While sort_values() sorts by data, sort_index() sorts by the DataFrame’s index labels. This is particularly useful for time-series data where dates are the index.

# Create a time-series DataFrame with an unordered DateTimeIndex
dates = pd.to_datetime(['2024-03-15', '2024-03-10', '2024-03-12', '2024-03-11', '2024-03-14'])
df = pd.DataFrame({
    'temperature': [72, 65, 68, 66, 71],
    'humidity': [45, 52, 48, 50, 46]
}, index=dates)

print("Before sorting:")
print(df)

# Sort by the DateTimeIndex
df_sorted = df.sort_index()
print("\nAfter sorting:")
print(df_sorted)

Output:

Before sorting:
            temperature  humidity
2024-03-15           72        45
2024-03-10           65        52
2024-03-12           68        48
2024-03-11           66        50
2024-03-14           71        46

After sorting:
            temperature  humidity
2024-03-10           65        52
2024-03-11           66        50
2024-03-12           68        48
2024-03-14           71        46
2024-03-15           72        45

You can also sort column labels by passing axis=1:

df = pd.DataFrame({
    'zebra': [1, 2],
    'apple': [3, 4],
    'mango': [5, 6]
})

# Sort columns alphabetically
df_sorted = df.sort_index(axis=1)
print(df_sorted)

Output:

   apple  mango  zebra
0      3      5      1
1      4      6      2

Handling Missing Values

Missing values (NaN) are inevitable in real datasets. By default, pandas places NaN values at the end of sorted results, regardless of ascending or descending order. The na_position parameter gives you control over this behavior.

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket', 'Gizmo', 'Doohickey'],
    'revenue': [15000, None, 8500, 42000, None]
})

# Default: NaNs at the end
df_nan_last = df.sort_values(by='revenue')
print("NaNs at end (default):")
print(df_nan_last)

# NaNs at the beginning
df_nan_first = df.sort_values(by='revenue', na_position='first')
print("\nNaNs at beginning:")
print(df_nan_first)

Output:

NaNs at end (default):
     product   revenue
2   Sprocket    8500.0
0     Widget   15000.0
3      Gizmo   42000.0
1     Gadget       NaN
4  Doohickey       NaN

NaNs at beginning:
     product   revenue
1     Gadget       NaN
4  Doohickey       NaN
2   Sprocket    8500.0
0     Widget   15000.0
3      Gizmo   42000.0

This is particularly useful when you want to quickly identify rows with missing data. Sort with na_position='first' and the problem rows appear at the top.

In-Place Sorting and Performance Tips

You’ll see inplace=True in older pandas code and tutorials. While it works, I recommend avoiding it for several reasons.

# The inplace approach (not recommended)
df.sort_values(by='revenue', inplace=True)

# The assignment approach (recommended)
df = df.sort_values(by='revenue')

The assignment approach is clearer about what’s happening, works with method chaining, and contrary to popular belief, doesn’t create significant memory overhead—pandas optimizes this internally. The inplace parameter is being soft-deprecated in pandas, with core developers recommending assignment instead.

For resetting the index after sorting, use ignore_index=True:

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'revenue': [15000, 42000, 8500]
})

# Sort and reset index in one operation
df_sorted = df.sort_values(by='revenue', ignore_index=True)
print(df_sorted)

Output:

    product  revenue
0  Sprocket     8500
1    Widget    15000
2    Gadget    42000

This is more efficient than calling reset_index(drop=True) separately.

For large DataFrames, consider these performance tips:

  1. Sort once, not repeatedly. If you need sorted data multiple times, sort once and store the result.
  2. Use categorical dtypes. If sorting by a column with limited unique values, converting to categorical can speed up sorting significantly.
  3. Consider the kind parameter. The default ‘quicksort’ is usually fastest, but ‘mergesort’ is stable and ‘heapsort’ uses less memory.
# For very large DataFrames with categorical data
df['category'] = df['category'].astype('category')
df_sorted = df.sort_values(by='category')

Conclusion

Sorting in pandas comes down to two methods: sort_values() for data-based sorting and sort_index() for label-based sorting. Master these with their key parameters, and you’ll handle any sorting requirement efficiently.

Parameter Method Purpose
by sort_values() Column(s) to sort by
ascending Both Sort order (True/False or list)
na_position Both Where to place NaN values (‘first’ or ’last’)
ignore_index Both Reset index after sorting
axis Both Sort rows (0) or columns (1)
kind Both Algorithm: ‘quicksort’, ‘mergesort’, ‘heapsort’

Use assignment over inplace=True, remember that ascending accepts lists for multi-column sorts, and leverage ignore_index when you don’t need to preserve original index values. These practices will make your pandas code cleaner and more maintainable.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.