Pandas - Select Top N Rows by Column Value (nlargest)

Key Insights

  • The nlargest() method is typically the most efficient way to select top N rows by column value, outperforming sort_values() followed by head() on large datasets because it avoids sorting the entire frame
  • You can select top N rows across multiple columns simultaneously, with control over handling duplicate values through the keep parameter ('first', 'last', or 'all')
  • For grouped operations, combine groupby() with nlargest() to extract top N rows per category, essential for analyzing segmented data like top products per region or highest scores per team

Basic Usage of nlargest()

The nlargest() method returns the first N rows ordered by columns in descending order. The syntax is straightforward: specify the number of rows and the column to sort by.

import pandas as pd
import numpy as np

# Create sample sales data
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam', 'Headset'],
    'revenue': [1200, 25, 75, 350, 89, 120],
    'units_sold': [15, 200, 95, 42, 67, 88]
})

# Get top 3 products by revenue
top_3_revenue = df.nlargest(3, 'revenue')
print(top_3_revenue)

Output:

   product  revenue  units_sold
0   Laptop     1200          15
3  Monitor      350          42
5  Headset      120          88

The method preserves the original index by default, which helps trace back to source data. This differs from sorting where you might reset the index.
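A quick sketch of that behavior, reusing the sales frame from above:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam', 'Headset'],
    'revenue': [1200, 25, 75, 350, 89, 120],
})

top_3 = df.nlargest(3, 'revenue')

# The selected rows keep their original positions as labels
print(top_3.index.tolist())                        # [0, 3, 5]

# If a clean 0-based index is preferred, reset it explicitly
print(top_3.reset_index(drop=True).index.tolist())  # [0, 1, 2]
```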

Performance Comparison with sort_values()

For selecting top N rows, nlargest() uses a partial-selection algorithm that avoids a full sort, which makes it more efficient than sorting when N is small relative to the dataset size.

import time

# Create large dataset
large_df = pd.DataFrame({
    'id': range(1000000),
    'value': np.random.randint(1, 1000000, 1000000)
})

# Method 1: nlargest()
start = time.time()
result1 = large_df.nlargest(100, 'value')
time_nlargest = time.time() - start

# Method 2: sort_values() + head()
start = time.time()
result2 = large_df.sort_values('value', ascending=False).head(100)
time_sort = time.time() - start

print(f"nlargest(): {time_nlargest:.4f} seconds")
print(f"sort_values() + head(): {time_sort:.4f} seconds")
print(f"Speed improvement: {time_sort/time_nlargest:.2f}x")

Exact timings vary by machine, but nlargest() typically comes out 2-5x faster for small N values, and the performance gap widens with larger datasets.

Handling Multiple Columns

You can sort by multiple columns by passing a list. The method applies priority from left to right—first sorting by the first column, then using subsequent columns as tiebreakers.

# Sales data with ties
df = pd.DataFrame({
    'salesperson': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'total_sales': [50000, 75000, 75000, 45000, 80000, 75000],
    'deals_closed': [12, 18, 15, 10, 20, 22]
})

# Top 4 by total_sales, then by deals_closed for ties
top_performers = df.nlargest(4, ['total_sales', 'deals_closed'])
print(top_performers)

Output:

  salesperson  total_sales  deals_closed
4         Eve        80000            20
5       Frank        75000            22
1         Bob        75000            18
2     Charlie        75000            15

Frank ranks above Bob and Charlie because all three have 75000 in sales, but Frank has more deals closed.

Managing Duplicates with the keep Parameter

The keep parameter controls which duplicates to retain when multiple rows have identical values in the sorting column.

df = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D', 'E'],
    'score': [100, 95, 100, 88, 100]
})

# Keep first occurrence of duplicates (default)
top_first = df.nlargest(2, 'score', keep='first')
print("Keep first:\n", top_first)

# Keep last occurrence of duplicates
top_last = df.nlargest(2, 'score', keep='last')
print("\nKeep last:\n", top_last)

# Keep all duplicates (may return more than N rows)
top_all = df.nlargest(2, 'score', keep='all')
print("\nKeep all:\n", top_all)

Output:

Keep first:
   player  score
0      A    100
2      C    100

Keep last:
   player  score
4      E    100
2      C    100

Keep all:
   player  score
0      A    100
2      C    100
4      E    100

Using keep='all' returns all rows tied for the Nth position, which can exceed N rows—critical for fair ranking systems.

Top N Rows Per Group

Combining groupby() with nlargest() enables extracting top performers within each category. This pattern is essential for segmented analysis.

# Sales data across regions
df = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South', 
               'East', 'East', 'East'],
    'product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'revenue': [1200, 800, 1500, 900, 1100, 750, 1300, 950, 880]
})

# Top 2 products by revenue in each region
top_per_region = df.groupby('region').apply(
    lambda x: x.nlargest(2, 'revenue')
).reset_index(drop=True)

print(top_per_region)

Output:

  region product  revenue
0  North       C     1500
1  North       A     1200
2  South       B     1100
3  South       A      900
4   East       A     1300
5   East       B      950

The reset_index(drop=True) call removes the MultiIndex that groupby().apply() creates, producing a clean output DataFrame.
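If you prefer to avoid the intermediate MultiIndex altogether, groupby() accepts a group_keys=False argument that leaves the original row labels in place, so no reset is needed. A minimal sketch with a trimmed version of the regional data (note that recent pandas versions may emit a deprecation warning about apply() operating on the grouping column, but the result is the same):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'product': ['A', 'B', 'C', 'A', 'B', 'C'],
    'revenue': [1200, 800, 1500, 900, 1100, 750],
})

# group_keys=False skips adding the region level to the index,
# so the original row labels survive without a reset_index() call
top_per_region = df.groupby('region', group_keys=False).apply(
    lambda x: x.nlargest(2, 'revenue')
)
print(top_per_region)
```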

Alternative Approach with sort_values() for Groups

For more complex grouping scenarios, you might prefer sort_values() with groupby().head():

# Same result using sort_values
top_per_region_alt = (df.sort_values('revenue', ascending=False)
                       .groupby('region')
                       .head(2)
                       .reset_index(drop=True))

print(top_per_region_alt)

This approach is more readable when chaining multiple operations, though slightly less performant for simple top-N extraction.
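One case where the sort_values() route is a necessity rather than a preference: nlargest() only orders descending, so mixed sort directions (say, revenue descending with product name ascending as the tiebreaker) require sort_values(). A sketch using a trimmed version of the regional data from above:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'product': ['A', 'B', 'C', 'A', 'B', 'C'],
    'revenue': [1200, 800, 1500, 900, 1100, 750],
})

# Top 2 per region by revenue (descending), ties broken by product
# name (ascending) -- a mixed-direction sort nlargest() cannot express
top2 = (df.sort_values(['revenue', 'product'], ascending=[False, True])
          .groupby('region')
          .head(2)
          .reset_index(drop=True))
print(top2)
```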

Practical Example: Customer Segmentation

Here’s a real-world scenario analyzing customer purchase behavior:

# Customer transaction data
transactions = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105, 101, 102, 103, 106, 107],
    'purchase_date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'amount': [250, 180, 420, 95, 310, 175, 290, 380, 150, 275],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 
                 'Electronics', 'Clothing', 'Electronics', 'Clothing',
                 'Electronics', 'Clothing']
})

# Find top 3 transactions overall
top_transactions = transactions.nlargest(3, 'amount')
print("Top 3 transactions:\n", top_transactions)

# Find top 2 transactions per category
top_per_category = (transactions.groupby('category')
                   .apply(lambda x: x.nlargest(2, 'amount'))
                   .reset_index(drop=True))
print("\nTop 2 per category:\n", top_per_category)

# Find customers with highest single purchase
top_customers = (transactions.groupby('customer_id')['amount']
                .max()
                .nlargest(3))
print("\nTop 3 customers by max purchase:\n", top_customers)

This analysis identifies high-value transactions, category leaders, and top customers—actionable insights for marketing and inventory decisions.

Working with nsmallest()

The inverse operation nsmallest() follows identical syntax but returns the smallest values:

# Find products with lowest revenue
bottom_3 = df.nsmallest(3, 'revenue')
print(bottom_3)

# Useful for identifying underperformers or outliers
inventory = pd.DataFrame({
    'item': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'stock_level': [5, 150, 3, 89]
})

low_stock = inventory.nsmallest(2, 'stock_level')
print("\nLow stock alert:\n", low_stock)

This is particularly valuable for inventory management, quality control, and identifying items needing attention.

Edge Cases and Considerations

When working with nlargest(), watch for these scenarios:

# DataFrame with no rows (the column must exist, or nlargest() raises KeyError)
empty_df = pd.DataFrame({'val': []})
result = empty_df.nlargest(5, 'val')  # Returns an empty DataFrame

# N larger than DataFrame size
small_df = pd.DataFrame({'val': [1, 2, 3]})
result = small_df.nlargest(10, 'val')  # Returns all 3 rows

# NaN handling
df_with_nan = pd.DataFrame({'val': [5, np.nan, 3, 8, np.nan, 1]})
top_3 = df_with_nan.nlargest(3, 'val')  # NaN values excluded
print(top_3)

NaN values are automatically excluded from results, which prevents them from appearing in top-N selections. If you need to include or count NaN values, preprocess with fillna() first.

The nlargest() method is the go-to tool for efficient top-N selection in pandas, offering superior performance and cleaner syntax than alternatives for most use cases.
