Pandas - Sort DataFrame by Column (sort_values)

• The `sort_values()` method is the primary way to sort DataFrames by one or multiple columns, replacing the deprecated `sort()` and `sort_index()` methods for column-based sorting

Key Insights

• The sort_values() method is the primary way to sort DataFrames by one or multiple columns, replacing the deprecated sort() and sort_index() methods for column-based sorting • Sorting can be customized with parameters for ascending/descending order, handling missing values, sorting algorithms, and in-place modifications • Multi-column sorting follows a hierarchical order where subsequent columns act as tiebreakers, and you can specify different sort orders for each column

Basic Single Column Sorting

The sort_values() method sorts a DataFrame by the values in one or more columns. The most basic usage requires only the column name:

import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 22, 35, 28],
    'salary': [50000, 60000, 45000, 75000, 55000],
    'department': ['IT', 'HR', 'IT', 'Finance', 'HR']
})

# Sort by age (ascending by default)
sorted_df = df.sort_values('age')
print(sorted_df)

Output:

      name  age  salary department
2  Charlie   22   45000         IT
0    Alice   25   50000         IT
4      Eve   28   55000         HR
1      Bob   30   60000         HR
3    David   35   75000    Finance

The original DataFrame remains unchanged. To sort in descending order, use the ascending parameter:

# Sort by salary (descending)
sorted_df = df.sort_values('salary', ascending=False)
print(sorted_df)

Output:

      name  age  salary department
3    David   35   75000    Finance
1      Bob   30   60000         HR
4      Eve   28   55000         HR
0    Alice   25   50000         IT
2  Charlie   22   45000         IT

Multi-Column Sorting

Pass a list of column names to sort by multiple columns. The sorting is hierarchical—the first column is the primary sort key, the second breaks ties, and so on:

df_multi = pd.DataFrame({
    'department': ['IT', 'HR', 'IT', 'HR', 'IT', 'Finance'],
    'level': ['Senior', 'Junior', 'Senior', 'Senior', 'Junior', 'Senior'],
    'salary': [80000, 45000, 75000, 55000, 50000, 90000]
})

# Sort by department, then by level
sorted_df = df_multi.sort_values(['department', 'level'])
print(sorted_df)

Output:

  department   level  salary
5    Finance  Senior   90000
1         HR  Junior   45000
3         HR  Senior   55000
4         IT  Junior   50000
0         IT  Senior   80000
2         IT  Senior   75000

Specify different sort orders for each column by passing a list to ascending:

# Sort by department (ascending), then salary (descending)
sorted_df = df_multi.sort_values(
    ['department', 'salary'],
    ascending=[True, False]
)
print(sorted_df)

Output:

  department   level  salary
5    Finance  Senior   90000
3         HR  Senior   55000
1         HR  Junior   45000
0         IT  Senior   80000
2         IT  Senior   75000
4         IT  Junior   50000

Handling Missing Values

The na_position parameter controls where NaN values appear in the sorted result. Options are 'last' (default) or 'first':

df_na = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [100, np.nan, 75, 120, np.nan],
    'stock': [50, 30, np.nan, 80, 20]
})

# NaN values at the end (default)
sorted_last = df_na.sort_values('price')
print("NaN last:")
print(sorted_last)

# NaN values at the beginning
sorted_first = df_na.sort_values('price', na_position='first')
print("\nNaN first:")
print(sorted_first)

Output:

NaN last:
  product  price  stock
2       C   75.0    NaN
0       A  100.0   50.0
3       D  120.0   80.0
1       B    NaN   30.0
4       E    NaN   20.0

NaN first:
  product  price  stock
1       B    NaN   30.0
4       E    NaN   20.0
2       C   75.0    NaN
0       A  100.0   50.0
3       D  120.0   80.0

In-Place Sorting

The inplace parameter modifies the original DataFrame instead of returning a new one. This saves memory for large DataFrames:

df = pd.DataFrame({
    'id': [3, 1, 4, 2],
    'value': [30, 10, 40, 20]
})

print("Before:", df['id'].tolist())

# Modify the DataFrame in place
df.sort_values('id', inplace=True)

print("After:", df['id'].tolist())
# Output: Before: [3, 1, 4, 2]
#         After: [1, 2, 3, 4]

Be cautious with inplace=True as it makes the operation irreversible without reloading the data.

Sorting with Custom Keys

The key parameter accepts a function to transform values before sorting, similar to Python’s built-in sorted():

df_names = pd.DataFrame({
    'name': ['alice', 'BOB', 'Charlie', 'DAVID'],
    'score': [85, 92, 78, 88]
})

# Case-insensitive sorting
sorted_df = df_names.sort_values('name', key=lambda x: x.str.lower())
print(sorted_df)

Output:

      name  score
0    alice     85
1      BOB     92
2  Charlie     78
3    DAVID     88

This is useful for custom sorting logic without modifying the original data:

df_dates = pd.DataFrame({
    'event': ['A', 'B', 'C', 'D'],
    'month': ['March', 'January', 'December', 'February']
})

# Sort by actual month order
month_order = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4,
    'May': 5, 'June': 6, 'July': 7, 'August': 8,
    'September': 9, 'October': 10, 'November': 11, 'December': 12
}

sorted_df = df_dates.sort_values('month', key=lambda x: x.map(month_order))
print(sorted_df)

Output:

  event     month
1     B   January
3     D  February
0     A     March
2     C  December

Sorting Algorithms

The kind parameter specifies the sorting algorithm. Options include 'quicksort' (default), 'mergesort', 'heapsort', and 'stable':

df_stable = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [1, 1, 1, 1, 1],
    'order': [1, 2, 3, 4, 5]
})

# Stable sort preserves original order for equal values
sorted_stable = df_stable.sort_values('value', kind='stable')
print(sorted_stable)

Mergesort and stable sorting guarantee that rows with equal sort keys maintain their relative order. This matters when sorting by multiple criteria sequentially or when the original order has meaning.

Sorting by Index

While sort_values() sorts by column values, use sort_index() for index-based sorting:

df_idx = pd.DataFrame({
    'value': [10, 20, 30, 40]
}, index=[3, 1, 4, 2])

# Sort by index
sorted_by_idx = df_idx.sort_index()
print(sorted_by_idx)

Output:

   value
1     20
2     40
3     10
4     30

Resetting Index After Sorting

Sorting changes row order but preserves the original index. Use reset_index() to create a sequential index:

df = pd.DataFrame({
    'name': ['Charlie', 'Alice', 'Bob'],
    'score': [85, 92, 78]
})

sorted_df = df.sort_values('name')
print("With original index:")
print(sorted_df)

sorted_reset = df.sort_values('name').reset_index(drop=True)
print("\nWith reset index:")
print(sorted_reset)

Output:

With original index:
      name  score
1    Alice     92
2      Bob     78
0  Charlie     85

With reset index:
      name  score
0    Alice     92
1      Bob     78
2  Charlie     85

The drop=True parameter discards the old index; without it, the old index becomes a new column.

Performance Considerations

For large DataFrames, sorting performance depends on data types and algorithms. Numerical columns sort faster than strings, and sorted or nearly-sorted data benefits from stable algorithms:

import time

# Create large DataFrame
large_df = pd.DataFrame({
    'values': np.random.randint(0, 1000000, 1000000)
})

start = time.time()
sorted_quick = large_df.sort_values('values', kind='quicksort')
print(f"Quicksort: {time.time() - start:.3f}s")

start = time.time()
sorted_merge = large_df.sort_values('values', kind='mergesort')
print(f"Mergesort: {time.time() - start:.3f}s")

For memory efficiency with large datasets, use inplace=True or consider sorting only necessary columns before operations that benefit from sorted data, like merge operations or groupby with sorted keys.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.