Pandas - Get N Largest/Smallest Values
Key Insights
• Pandas provides nlargest() and nsmallest() methods that outperform sorting-based approaches for finding top/bottom N values, especially on large datasets
• These methods support multi-column operations with customizable sort orders and handle edge cases like duplicates and NaN values intelligently
• For grouped operations, combining groupby() with nlargest()/nsmallest() enables efficient per-group ranking without manual iteration
Basic Usage with Series
The simplest use case involves finding the N largest or smallest values in a pandas Series:
import pandas as pd
import numpy as np
# Create sample data
data = pd.Series([45, 12, 78, 23, 89, 34, 67, 91, 56, 29])
# Get 3 largest values
largest = data.nlargest(3)
print(largest)
# Output:
# 7 91
# 4 89
# 2 78
# dtype: int64
# Get 3 smallest values
smallest = data.nsmallest(3)
print(smallest)
# Output:
# 1 12
# 3 23
# 9 29
# dtype: int64
The methods return a Series with the original indices preserved, which is crucial for tracking back to source data. This is more efficient than data.sort_values().tail(3) because it uses a partial sorting algorithm with O(n + k log k) complexity instead of O(n log n).
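Because the original index survives, it can be used to pull aligned values out of a companion structure. A minimal sketch (the `labels` Series here is an illustrative stand-in for any position-aligned data):

```python
import pandas as pd

data = pd.Series([45, 12, 78, 23, 89, 34, 67, 91, 56, 29])
labels = pd.Series(list("abcdefghij"))  # hypothetical companion data, aligned by position

top3 = data.nlargest(3)
# The result's index points back into the original data
print(top3.index.tolist())          # [7, 4, 2]
print(labels[top3.index].tolist())  # ['h', 'e', 'c']
```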
DataFrame Operations with Single Column
When working with DataFrames, specify which column to use for comparison:
df = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'price': [1200, 25, 75, 350, 89],
'sales': [150, 890, 420, 280, 310]
})
# Get 3 products with highest prices
top_prices = df.nlargest(3, 'price')
print(top_prices)
# product price sales
# 0 Laptop 1200 150
# 3 Monitor 350 280
# 4 Webcam 89 310
# Get 2 products with lowest sales
low_sales = df.nsmallest(2, 'sales')
print(low_sales)
# product price sales
# 0 Laptop 1200 150
# 3 Monitor 350 280
The entire row is returned for each match, making it easy to see all related attributes without additional joins or lookups.
Multi-Column Sorting
You can specify multiple columns for hierarchical sorting, similar to sort_values():
df = pd.DataFrame({
'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'department': ['Sales', 'IT', 'Sales', 'IT', 'Sales'],
'salary': [75000, 85000, 75000, 92000, 68000],
'experience': [5, 7, 3, 10, 2]
})
# Get top 3 by salary, then by experience if salary is equal
top_employees = df.nlargest(3, ['salary', 'experience'])
print(top_employees)
# employee department salary experience
# 3 David IT 92000 10
# 1 Bob IT 85000 7
# 0 Alice Sales 75000 5
The columns parameter accepts a list where priority decreases from left to right. Note that nlargest() ranks every listed column in descending order (and nsmallest() every column in ascending order); there is no per-column direction:
# nlargest() has no ascending parameter; all listed columns are ranked the same way
top_salary = df.nlargest(3, 'salary')
For truly mixed sort orders (ascending on one column, descending on another), sort_values() with a list for its ascending parameter is the better choice.
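As a sketch, a mixed-order top-3 (highest salary, breaking ties by lowest experience) could look like:

```python
import pandas as pd

df = pd.DataFrame({
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'salary': [75000, 85000, 75000, 92000, 68000],
    'experience': [5, 7, 3, 10, 2]
})

# Descending salary, ascending experience -- nlargest() cannot express this
top_mixed = (df.sort_values(['salary', 'experience'],
                            ascending=[False, True])
               .head(3))
print(top_mixed['employee'].tolist())  # ['David', 'Bob', 'Charlie']
```

With plain nlargest(3, ['salary', 'experience']) the 75000 tie would go to Alice (more experience); here the ascending tiebreak selects Charlie instead.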
Handling Duplicates
The keep parameter controls which duplicates to retain when values are tied:
df = pd.DataFrame({
'name': ['A', 'B', 'C', 'D', 'E', 'F'],
'score': [95, 87, 95, 82, 95, 90]
})
# Keep first occurrence of duplicates (default)
top_first = df.nlargest(3, 'score', keep='first')
print(top_first)
# name score
# 0 A 95
# 2 C 95
# 4 E 95
# Keep last occurrence
top_last = df.nlargest(3, 'score', keep='last')
print(top_last)
# name score
# 4 E 95
# 2 C 95
# 0 A 95
# Keep all duplicates (may return more than N rows)
top_all = df.nlargest(3, 'score', keep='all')
print(top_all)
# name score
# 0 A 95
# 2 C 95
# 4 E 95
Using keep='all' is particularly useful when you need to ensure no tied values are excluded, even if it means returning more than N results.
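In the example above exactly three rows tie at 95, so all three keep options happen to return the same rows; the difference only shows up when you request fewer rows than there are ties. A sketch with n=2 on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'score': [95, 87, 95, 82, 95, 90]
})

# Asking for 2 rows, but three rows tie at the top score
top_first = df.nlargest(2, 'score', keep='first')  # rows A and C
top_all = df.nlargest(2, 'score', keep='all')      # all three tied rows
print(len(top_first), len(top_all))  # 2 3
```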
NaN Handling
By default, NaN values are excluded from results:
df = pd.DataFrame({
'item': ['A', 'B', 'C', 'D', 'E'],
'value': [10, np.nan, 30, 25, np.nan]
})
largest = df.nlargest(3, 'value')
print(largest)
# item value
# 2 C 30.0
# 3 D 25.0
# 0 A 10.0
# NaN values are automatically filtered out
# Only non-NaN values are considered
If you need NaN values in results, filter or fill them before using these methods:
# Fill NaN with a specific value first
df_filled = df.fillna(0)
largest_with_filled = df_filled.nlargest(3, 'value')
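Be aware that filling with 0 makes formerly-missing rows the smallest values, which would distort a later nsmallest() call; dropping the incomplete rows instead keeps both directions consistent. A sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'item': ['A', 'B', 'C', 'D', 'E'],
    'value': [10, np.nan, 30, 25, np.nan]
})

# Drop rows with missing values before ranking in either direction
clean = df.dropna(subset=['value'])
print(clean.nsmallest(2, 'value')['item'].tolist())  # ['A', 'D']
```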
GroupBy Operations
Combining groupby() with nlargest()/nsmallest() enables per-group top-N queries:
df = pd.DataFrame({
'category': ['Electronics', 'Electronics', 'Clothing', 'Clothing',
'Electronics', 'Clothing', 'Electronics'],
'product': ['TV', 'Laptop', 'Shirt', 'Pants', 'Phone', 'Jacket', 'Tablet'],
'revenue': [15000, 25000, 3000, 4500, 18000, 5500, 12000]
})
# Get top 2 products by revenue in each category
top_per_category = df.groupby('category', group_keys=False).apply(
lambda x: x.nlargest(2, 'revenue')
)
print(top_per_category)
# category product revenue
# 5 Clothing Jacket 5500
# 3 Clothing Pants 4500
# 1 Electronics Laptop 25000
# 4 Electronics Phone 18000
A common alternative sorts the whole frame once, then takes the first rows of each group with head():
# Alternative approach
top_per_category_v2 = (df.sort_values('revenue', ascending=False)
.groupby('category', group_keys=False)
.head(2))
However, the apply() with nlargest() approach is more explicit about intent and avoids sorting the entire frame. (Note that pandas 2.2+ emits a DeprecationWarning when apply() operates on the grouping columns; passing include_groups=False or selecting columns explicitly silences it.)
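When only the ranked column matters, the grouped Series also exposes nlargest() directly, returning a Series with a (group, original row) MultiIndex and no apply() needed. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Electronics', 'Electronics', 'Clothing', 'Clothing',
                 'Electronics', 'Clothing', 'Electronics'],
    'product': ['TV', 'Laptop', 'Shirt', 'Pants', 'Phone', 'Jacket', 'Tablet'],
    'revenue': [15000, 25000, 3000, 4500, 18000, 5500, 12000]
})

# Top 2 revenues per category; index pairs each value with its category
# and its original row label
top2 = df.groupby('category')['revenue'].nlargest(2)
print(top2.tolist())  # [5500, 4500, 25000, 18000]
```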
Performance Considerations
For small N relative to dataset size, nlargest()/nsmallest() significantly outperform sorting:
import time
# Create large dataset
large_df = pd.DataFrame({
'value': np.random.randint(0, 1000000, 1000000)
})
# Method 1: nlargest
start = time.time()
result1 = large_df.nlargest(10, 'value')
time1 = time.time() - start
# Method 2: sort and head
start = time.time()
result2 = large_df.sort_values('value', ascending=False).head(10)
time2 = time.time() - start
print(f"nlargest: {time1:.4f}s")
print(f"sort+head: {time2:.4f}s")
# Typical output:
# nlargest: 0.0234s
# sort+head: 0.1567s
The performance advantage comes from using a heap-based selection algorithm that doesn’t need to sort the entire dataset. For N approaching the dataset size, sorting becomes competitive or faster.
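The same partial-selection idea is available in NumPy directly. A sketch using np.argpartition, which moves the k largest elements past a pivot in roughly O(n) time and then sorts only those k:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(0, 1_000_000, size=1_000_000)

k = 10
# Partition so the k largest values occupy the last k slots (in arbitrary order)...
idx = np.argpartition(values, -k)[-k:]
# ...then sort just those k values, descending
top_k = np.sort(values[idx])[::-1]
print(top_k[0] == values.max())  # True
```

This is essentially what pandas does under the hood, plus bookkeeping to preserve the original index.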
Practical Example: Sales Analysis
Here’s a real-world scenario combining multiple techniques:
sales_df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=100),
'store': np.random.choice(['Store A', 'Store B', 'Store C'], 100),
'category': np.random.choice(['Electronics', 'Clothing', 'Food'], 100),
'revenue': np.random.randint(1000, 50000, 100)
})
# Find top 5 revenue days overall
top_days = sales_df.nlargest(5, 'revenue')[['date', 'store', 'revenue']]
# Find top 3 performing days per store
top_per_store = (sales_df.groupby('store', group_keys=False)
.apply(lambda x: x.nlargest(3, 'revenue')))
# Find bottom 2 categories by total revenue
category_totals = sales_df.groupby('category')['revenue'].sum().reset_index()
worst_categories = category_totals.nsmallest(2, 'revenue')
print("Top Revenue Days:\n", top_days)
print("\nTop Days Per Store:\n", top_per_store)
print("\nWorst Performing Categories:\n", worst_categories)
This pattern of finding extremes within groups is common in business analytics, customer segmentation, and performance monitoring applications.