Pandas - Get N Largest/Smallest Values
Key Insights
• Pandas provides nlargest() and nsmallest() methods that outperform sorting-based approaches for finding top/bottom N values, especially on large datasets
• These methods support multi-column operations with customizable sort orders and handle edge cases like duplicates and NaN values intelligently
• For grouped operations, combining groupby() with nlargest()/nsmallest() enables efficient per-group ranking without manual iteration
Basic Usage with Series
The simplest use case involves finding the N largest or smallest values in a pandas Series:
import pandas as pd
import numpy as np
# Create sample data
data = pd.Series([45, 12, 78, 23, 89, 34, 67, 91, 56, 29])
# Get 3 largest values
largest = data.nlargest(3)
print(largest)
# Output:
# 7 91
# 4 89
# 2 78
# dtype: int64
# Get 3 smallest values
smallest = data.nsmallest(3)
print(smallest)
# Output:
# 1 12
# 3 23
# 9 29
# dtype: int64
The methods return a Series with the original indices preserved, which is crucial for tracking back to source data. This is more efficient than data.sort_values().tail(3) because it uses a partial sorting algorithm with O(n + k log k) complexity instead of O(n log n).
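Because the original index survives, it can be used to pull aligned values out of a companion structure. A minimal sketch (the `labels` Series here is an illustrative stand-in for any position-aligned data):

```python
import pandas as pd

data = pd.Series([45, 12, 78, 23, 89, 34, 67, 91, 56, 29])
labels = pd.Series(list("abcdefghij"))  # hypothetical companion data, aligned by position

top3 = data.nlargest(3)
# The result's index points back into the original data
print(top3.index.tolist())          # [7, 4, 2]
print(labels[top3.index].tolist())  # ['h', 'e', 'c']
```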
DataFrame Operations with Single Column
When working with DataFrames, specify which column to use for comparison:
df = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'price': [1200, 25, 75, 350, 89],
'sales': [150, 890, 420, 280, 310]
})
# Get 3 products with highest prices
top_prices = df.nlargest(3, 'price')
print(top_prices)
# product price sales
# 0 Laptop 1200 150
# 3 Monitor 350 280
# 4 Webcam 89 310
# Get 2 products with lowest sales
low_sales = df.nsmallest(2, 'sales')
print(low_sales)
# product price sales
# 0 Laptop 1200 150
# 3 Monitor 350 280
The entire row is returned for each match, making it easy to see all related attributes without additional joins or lookups.
Multi-Column Sorting
You can specify multiple columns for hierarchical sorting, similar to sort_values():
df = pd.DataFrame({
'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'department': ['Sales', 'IT', 'Sales', 'IT', 'Sales'],
'salary': [75000, 85000, 75000, 92000, 68000],
'experience': [5, 7, 3, 10, 2]
})
# Get top 3 by salary, then by experience if salary is equal
top_employees = df.nlargest(3, ['salary', 'experience'])
print(top_employees)
# employee department salary experience
# 3 David IT 92000 10
# 1 Bob IT 85000 7
# 0 Alice Sales 75000 5
The columns parameter accepts a list where priority decreases from left to right. Note that nlargest() ranks every listed column in descending order (and nsmallest() every column in ascending order); there is no per-column direction:
# nlargest() has no ascending parameter; all listed columns are ranked the same way
top_salary = df.nlargest(3, 'salary')
For truly mixed sort orders (ascending on one column, descending on another), sort_values() with a list for its ascending parameter is the better choice.
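As a sketch, a mixed-order top-3 (highest salary, breaking ties by lowest experience) could look like:

```python
import pandas as pd

df = pd.DataFrame({
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'salary': [75000, 85000, 75000, 92000, 68000],
    'experience': [5, 7, 3, 10, 2]
})

# Descending salary, ascending experience -- nlargest() cannot express this
top_mixed = (df.sort_values(['salary', 'experience'],
                            ascending=[False, True])
               .head(3))
print(top_mixed['employee'].tolist())  # ['David', 'Bob', 'Charlie']
```

With plain nlargest(3, ['salary', 'experience']) the 75000 tie would go to Alice (more experience); here the ascending tiebreak selects Charlie instead.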
Handling Duplicates
The keep parameter controls which duplicates to retain when values are tied:
df = pd.DataFrame({
'name': ['A', 'B', 'C', 'D', 'E', 'F'],
'score': [95, 87, 95, 82, 95, 90]
})
# Keep first occurrence of duplicates (default)
top_first = df.nlargest(3, 'score', keep='first')
print(top_first)
# name score
# 0 A 95
# 2 C 95
# 4 E 95
# Keep last occurrence
top_last = df.nlargest(3, 'score', keep='last')
print(top_last)
# name score
# 4 E 95
# 2 C 95
# 0 A 95
# Keep all duplicates (may return more than N rows)
top_all = df.nlargest(3, 'score', keep='all')
print(top_all)
# name score
# 0 A 95
# 2 C 95
# 4 E 95
Using keep='all' is particularly useful when you need to ensure no tied values are excluded, even if it means returning more than N results.
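In the example above exactly three rows tie at 95, so all three keep options happen to return the same rows; the difference only shows up when you request fewer rows than there are ties. A sketch with n=2 on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'score': [95, 87, 95, 82, 95, 90]
})

# Asking for 2 rows, but three rows tie at the top score
top_first = df.nlargest(2, 'score', keep='first')  # rows A and C
top_all = df.nlargest(2, 'score', keep='all')      # all three tied rows
print(len(top_first), len(top_all))  # 2 3
```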
NaN Handling
By default, NaN values are excluded from results:
df = pd.DataFrame({
'item': ['A', 'B', 'C', 'D', 'E'],
'value': [10, np.nan, 30, 25, np.nan]
})
largest = df.nlargest(3, 'value')
print(largest)
# item value
# 2 C 30.0
# 3 D 25.0
# 0 A 10.0
# NaN values are automatically filtered out
# Only non-NaN values are considered
If you need NaN values in results, filter or fill them before using these methods:
# Fill NaN with a specific value first
df_filled = df.fillna(0)
largest_with_filled = df_filled.nlargest(3, 'value')
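Be aware that filling with 0 makes formerly-missing rows the smallest values, which would distort a later nsmallest() call; dropping the incomplete rows instead keeps both directions consistent. A sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'item': ['A', 'B', 'C', 'D', 'E'],
    'value': [10, np.nan, 30, 25, np.nan]
})

# Drop rows with missing values before ranking in either direction
clean = df.dropna(subset=['value'])
print(clean.nsmallest(2, 'value')['item'].tolist())  # ['A', 'D']
```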
GroupBy Operations
Combining groupby() with nlargest()/nsmallest() enables per-group top-N queries:
df = pd.DataFrame({
'category': ['Electronics', 'Electronics', 'Clothing', 'Clothing',
'Electronics', 'Clothing', 'Electronics'],
'product': ['TV', 'Laptop', 'Shirt', 'Pants', 'Phone', 'Jacket', 'Tablet'],
'revenue': [15000, 25000, 3000, 4500, 18000, 5500, 12000]
})
# Get top 2 products by revenue in each category
top_per_category = df.groupby('category', group_keys=False).apply(
lambda x: x.nlargest(2, 'revenue')
)
print(top_per_category)
# category product revenue
# 5 Clothing Jacket 5500
# 3 Clothing Pants 4500
# 1 Electronics Laptop 25000
# 4 Electronics Phone 18000
A common alternative sorts the whole frame once, then takes the first rows of each group with head():
# Alternative approach
top_per_category_v2 = (df.sort_values('revenue', ascending=False)
.groupby('category', group_keys=False)
.head(2))
However, the apply() with nlargest() approach is more explicit about intent and avoids sorting the entire frame. (Note that pandas 2.2+ emits a DeprecationWarning when apply() operates on the grouping columns; passing include_groups=False or selecting columns explicitly silences it.)
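When only the ranked column matters, the grouped Series also exposes nlargest() directly, returning a Series with a (group, original row) MultiIndex and no apply() needed. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Electronics', 'Electronics', 'Clothing', 'Clothing',
                 'Electronics', 'Clothing', 'Electronics'],
    'product': ['TV', 'Laptop', 'Shirt', 'Pants', 'Phone', 'Jacket', 'Tablet'],
    'revenue': [15000, 25000, 3000, 4500, 18000, 5500, 12000]
})

# Top 2 revenues per category; index pairs each value with its category
# and its original row label
top2 = df.groupby('category')['revenue'].nlargest(2)
print(top2.tolist())  # [5500, 4500, 25000, 18000]
```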
Performance Considerations
For small N relative to dataset size, nlargest()/nsmallest() significantly outperform sorting:
import time
# Create large dataset
large_df = pd.DataFrame({
'value': np.random.randint(0, 1000000, 1000000)
})
# Method 1: nlargest
start = time.time()
result1 = large_df.nlargest(10, 'value')
time1 = time.time() - start
# Method 2: sort and head
start = time.time()
result2 = large_df.sort_values('value', ascending=False).head(10)
time2 = time.time() - start
print(f"nlargest: {time1:.4f}s")
print(f"sort+head: {time2:.4f}s")
# Typical output:
# nlargest: 0.0234s
# sort+head: 0.1567s
The performance advantage comes from using a heap-based selection algorithm that doesn’t need to sort the entire dataset. For N approaching the dataset size, sorting becomes competitive or faster.
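The same partial-selection idea is available in NumPy directly. A sketch using np.argpartition, which moves the k largest elements past a pivot in roughly O(n) time and then sorts only those k:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(0, 1_000_000, size=1_000_000)

k = 10
# Partition so the k largest values occupy the last k slots (in arbitrary order)...
idx = np.argpartition(values, -k)[-k:]
# ...then sort just those k values, descending
top_k = np.sort(values[idx])[::-1]
print(top_k[0] == values.max())  # True
```

This is essentially what pandas does under the hood, plus bookkeeping to preserve the original index.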
Practical Example: Sales Analysis
Here’s a real-world scenario combining multiple techniques:
sales_df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=100),
'store': np.random.choice(['Store A', 'Store B', 'Store C'], 100),
'category': np.random.choice(['Electronics', 'Clothing', 'Food'], 100),
'revenue': np.random.randint(1000, 50000, 100)
})
# Find top 5 revenue days overall
top_days = sales_df.nlargest(5, 'revenue')[['date', 'store', 'revenue']]
# Find top 3 performing days per store
top_per_store = (sales_df.groupby('store', group_keys=False)
.apply(lambda x: x.nlargest(3, 'revenue')))
# Find bottom 2 categories by total revenue
category_totals = sales_df.groupby('category')['revenue'].sum().reset_index()
worst_categories = category_totals.nsmallest(2, 'revenue')
print("Top Revenue Days:\n", top_days)
print("\nTop Days Per Store:\n", top_per_store)
print("\nWorst Performing Categories:\n", worst_categories)
This pattern of finding extremes within groups is common in business analytics, customer segmentation, and performance monitoring applications.