Pandas - Rank Values in Column
• Pandas provides multiple ranking methods (average, min, max, first, dense) that handle tied values differently, with the `rank()` method offering fine-grained control over ranking behavior
Key Insights
• Pandas provides multiple ranking methods (average, min, max, first, dense) that handle tied values differently, with the rank() method offering fine-grained control over ranking behavior
• You can rank data in ascending or descending order, apply rankings within groups using groupby(), and handle missing values explicitly through the na_option parameter
• Performance considerations matter when ranking large datasets—using appropriate dtypes and avoiding unnecessary copies can significantly improve execution speed
Basic Ranking with rank()
The rank() method assigns numerical ranks to values in a DataFrame column. By default, it uses the “average” method for handling ties and ranks in ascending order.
import pandas as pd
import numpy as np
# Create sample data
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'score': [85, 92, 78, 92, 88]
})
# Basic ranking
df['rank'] = df['score'].rank()
print(df)
Output:
name score rank
0 Alice 85 3.0
1 Bob 92 4.5
2 Charlie 78 1.0
3 David 92 4.5
4 Eve 88 2.0
Notice that Bob and David both scored 92, so they share ranks 4 and 5, averaging to 4.5. This is the default “average” method behavior.
Understanding Ranking Methods
Pandas offers five distinct methods for handling tied values. Each serves different analytical needs.
df = pd.DataFrame({
'value': [10, 20, 20, 30, 30, 30, 40]
})
# Apply different ranking methods
df['average'] = df['value'].rank(method='average')
df['min'] = df['value'].rank(method='min')
df['max'] = df['value'].rank(method='max')
df['first'] = df['value'].rank(method='first')
df['dense'] = df['value'].rank(method='dense')
print(df)
Output:
value average min max first dense
0 10 1.0 1.0 1.0 1.0 1.0
1 20 2.5 2.0 3.0 2.0 2.0
2 20 2.5 2.0 3.0 3.0 2.0
3 30 5.0 4.0 6.0 4.0 3.0
4 30 5.0 4.0 6.0 5.0 3.0
5 30 5.0 4.0 6.0 6.0 3.0
6 40 7.0 7.0 7.0 7.0 4.0
- average: Assigns the mean rank to tied values (default)
- min: All tied values receive the lowest rank in the group
- max: All tied values receive the highest rank in the group
- first: Ranks tied values in the order they appear
- dense: Like min, but no gaps in rank sequence
Ranking in Descending Order
For scenarios where higher values should receive better (lower) ranks, use ascending=False.
df = pd.DataFrame({
'player': ['Player A', 'Player B', 'Player C', 'Player D'],
'points': [450, 520, 380, 520]
})
# Rank by points (higher is better)
df['rank'] = df['points'].rank(ascending=False, method='min')
print(df)
Output:
player points rank
0 Player A 450 3.0
1 Player B 520 1.0
2 Player C 380 4.0
3 Player D 520 1.0
Both players with 520 points share rank 1 using the “min” method.
Handling Missing Values
The na_option parameter controls how NaN values are ranked.
df = pd.DataFrame({
'score': [85, np.nan, 92, 78, np.nan, 88]
})
# Different approaches to NaN handling
df['keep'] = df['score'].rank(na_option='keep')
df['top'] = df['score'].rank(na_option='top')
df['bottom'] = df['score'].rank(na_option='bottom')
print(df)
Output:
score keep top bottom
0 85.0 3.0 5.0 3.0
1 NaN NaN 1.5 NaN
2 92.0 4.0 6.0 4.0
3 78.0 1.0 3.0 1.0
4 NaN NaN 1.5 NaN
5 88.0 2.0 4.0 2.0
- keep: Leaves NaN values as NaN (default)
- top: Assigns NaN values the highest ranks
- bottom: Assigns NaN values the lowest ranks
Ranking Within Groups
Use groupby() combined with rank() to rank values within categorical groups.
df = pd.DataFrame({
'department': ['Sales', 'Sales', 'IT', 'IT', 'Sales', 'IT'],
'employee': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom', 'Emma'],
'revenue': [50000, 75000, 45000, 60000, 62000, 55000]
})
# Rank within each department
df['dept_rank'] = df.groupby('department')['revenue'].rank(ascending=False, method='min')
print(df)
Output:
department employee revenue dept_rank
0 Sales John 50000 3.0
1 Sales Sarah 75000 1.0
2 IT Mike 45000 3.0
3 IT Lisa 60000 1.0
4 Sales Tom 62000 2.0
5 IT Emma 55000 2.0
Each employee is ranked against others in their department only.
Percentile Ranking
Convert ranks to percentiles using the pct=True parameter, useful for understanding relative position.
df = pd.DataFrame({
'student': ['A', 'B', 'C', 'D', 'E'],
'test_score': [72, 85, 91, 78, 85]
})
df['percentile'] = df['test_score'].rank(pct=True)
df['percentile_100'] = df['percentile'] * 100
print(df)
Output:
student test_score percentile percentile_100
0 A 72 0.2 20.0
1 B 85 0.6 60.0
2 C 91 1.0 100.0
3 D 78 0.4 40.0
4 E 85 0.6 60.0
Ranking Multiple Columns
Apply ranking to multiple columns simultaneously using apply() or direct assignment.
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'sales': [1200, 1500, 900, 1500],
'profit': [300, 400, 250, 350]
})
# Rank multiple columns
df['sales_rank'] = df['sales'].rank(ascending=False, method='min')
df['profit_rank'] = df['profit'].rank(ascending=False, method='min')
# Calculate combined rank (lower is better)
df['combined_rank'] = df['sales_rank'] + df['profit_rank']
print(df)
Output:
product sales profit sales_rank profit_rank combined_rank
0 A 1200 300 3.0 3.0 6.0
1 B 1500 400 1.0 1.0 2.0
2 C 900 250 4.0 4.0 8.0
3 D 1500 350 1.0 2.0 3.0
Performance Optimization
When working with large datasets, ranking performance matters. Here are optimization techniques:
# Create large dataset
large_df = pd.DataFrame({
'value': np.random.randint(0, 1000, 1000000)
})
# Use appropriate dtype
large_df['value'] = large_df['value'].astype('int32')
# Efficient ranking with method selection
%timeit large_df['value'].rank(method='first') # Fastest for unique handling
%timeit large_df['value'].rank(method='average') # Slower due to averaging
# For grouped ranking on large data
large_df['category'] = np.random.choice(['A', 'B', 'C'], 1000000)
# Use transform for in-place group ranking
large_df['rank'] = large_df.groupby('category')['value'].transform(
lambda x: x.rank(method='first')
)
Practical Example: Competition Leaderboard
Combining multiple ranking concepts for a real-world leaderboard system:
df = pd.DataFrame({
'player_id': range(1, 11),
'kills': [15, 22, 22, 18, 30, 12, 30, 25, 19, 22],
'deaths': [8, 10, 9, 7, 12, 15, 8, 11, 10, 8],
'assists': [5, 8, 10, 6, 15, 3, 12, 9, 7, 9]
})
# Calculate KDA ratio
df['kda'] = (df['kills'] + df['assists']) / df['deaths'].replace(0, 1)
# Rank by KDA (higher is better)
df['rank'] = df['kda'].rank(ascending=False, method='min')
# Add percentile
df['percentile'] = df['kda'].rank(pct=True) * 100
# Sort by rank
df_sorted = df.sort_values('rank')
print(df_sorted[['player_id', 'kills', 'deaths', 'assists', 'kda', 'rank', 'percentile']])
This produces a complete leaderboard with ranks and percentiles, handling tied KDA ratios appropriately using the minimum rank method.