How to GroupBy and Sum in Pandas

Key Insights

  • The groupby().sum() pattern follows Pandas’ split-apply-combine paradigm, letting you aggregate data by categories in a single line of code
  • Named aggregation with .agg() produces cleaner output with custom column names, making your analysis more readable and production-ready
  • Always consider whether to use dropna=False and .reset_index() to handle missing values and convert grouped results back to a standard DataFrame

Introduction

Grouping data by categories and calculating sums is one of the most common operations in data analysis. Whether you’re calculating total sales by region, summing expenses by department, or aggregating user activity by date, Pandas’ groupby() function combined with sum() handles these tasks efficiently.

The operation follows what Pandas calls the “split-apply-combine” pattern: split your data into groups based on some criteria, apply a function (in this case, sum) to each group, and combine the results into a new data structure. Understanding this pattern helps you reason about more complex aggregations as your analysis needs grow.

This article covers everything from basic single-column grouping to advanced named aggregation techniques. By the end, you’ll know which approach to use for different scenarios and how to handle common edge cases.

Basic GroupBy and Sum Syntax

The simplest form of groupby-sum involves grouping by one column and summing another. Here’s the core syntax:

import pandas as pd

# Sample sales data
df = pd.DataFrame({
    'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Food'],
    'sales': [1200, 450, 800, 320, 150]
})

# Group by category and sum sales
result = df.groupby('category')['sales'].sum()
print(result)

Output:

category
Clothing       770
Electronics   2000
Food           150
Name: sales, dtype: int64

The result is a Pandas Series with the group labels as the index. This is the split-apply-combine pattern in action: Pandas splits the DataFrame into three groups (Clothing, Electronics, Food), applies the sum function to the ‘sales’ column in each group, and combines everything into a single Series.

If you want the result as a DataFrame instead of a Series, wrap the column name in double brackets:

result = df.groupby('category')[['sales']].sum()

This returns a DataFrame with ‘category’ as the index and ‘sales’ as a column. The distinction matters when you’re chaining operations or need consistent data structures in your pipeline.
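To make the distinction concrete, here's a quick check of the return types, a minimal sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Electronics', 'Clothing', 'Electronics'],
    'sales': [1200, 450, 800]
})

# Single brackets select one column -> Series
s = df.groupby('category')['sales'].sum()

# Double brackets select a list of columns -> DataFrame
d = df.groupby('category')[['sales']].sum()

print(type(s).__name__)  # Series
print(type(d).__name__)  # DataFrame
```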

Grouping by Multiple Columns

Real-world analysis often requires grouping by multiple dimensions. For instance, you might want total revenue broken down by both region and product type:

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Widget'],
    'revenue': [1000, 1500, 800, 1200, 900, 750]
})

# Group by multiple columns
result = df.groupby(['region', 'product'])['revenue'].sum()
print(result)

Output:

region  product
North   Gadget     1500
        Widget     1900
South   Gadget     1200
        Widget     1550
Name: revenue, dtype: int64

The result has a MultiIndex, which can be powerful for hierarchical data but sometimes awkward to work with. You can access specific combinations using tuple indexing:

# Get revenue for North region, Widget product
north_widget = result[('North', 'Widget')]  # Returns 1900
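If you need more than one combination at a time, Series.xs() pulls out a whole cross-section of the MultiIndex. A short sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Widget'],
    'revenue': [1000, 1500, 800, 1200, 900, 750]
})
result = df.groupby(['region', 'product'])['revenue'].sum()

# Cross-section: revenue for every product in the North region
north = result.xs('North', level='region')
print(north)
```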

For a flatter structure that’s easier to manipulate, use reset_index():

flat_result = df.groupby(['region', 'product'])['revenue'].sum().reset_index()

This gives you a regular DataFrame with ‘region’, ‘product’, and ‘revenue’ as columns—often more convenient for downstream processing or exporting to CSV.

Summing Multiple Columns

When you need to sum several numeric columns at once, pass a list of column names:

df = pd.DataFrame({
    'department': ['Sales', 'Engineering', 'Sales', 'Engineering', 'Marketing'],
    'sales': [50000, 0, 45000, 0, 15000],
    'expenses': [12000, 85000, 10000, 92000, 25000],
    'profit': [38000, -85000, 35000, -92000, -10000]
})

# Sum multiple columns
result = df.groupby('department')[['sales', 'expenses', 'profit']].sum()
print(result)

Output:

             sales  expenses  profit
department                          
Engineering      0    177000 -177000
Marketing    15000     25000  -10000
Sales        95000     22000   73000

If you want to sum all numeric columns without listing them explicitly, omit the column selection:

# Sum all numeric columns
result = df.groupby('department').sum(numeric_only=True)

The numeric_only=True parameter tells Pandas to sum only columns with numeric dtypes, skipping string or datetime columns. The default changed in pandas 2.0, so older code that relied on non-numeric columns being dropped silently may now raise errors or warnings; passing the parameter explicitly keeps the behavior predictable across versions.
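Here's a quick demonstration with a hypothetical non-numeric 'manager' column added to the data:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Engineering', 'Sales'],
    'manager': ['Ann', 'Bo', 'Cy'],  # non-numeric column (illustrative)
    'sales': [50000, 0, 45000],
    'expenses': [12000, 85000, 10000]
})

# numeric_only=True drops the string 'manager' column from the result
result = df.groupby('department').sum(numeric_only=True)
print(result.columns.tolist())  # ['sales', 'expenses']
```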

Using Named Aggregation

The .agg() method with named aggregation is the modern, recommended approach for production code. It lets you specify custom names for your output columns:

df = pd.DataFrame({
    'store': ['Downtown', 'Mall', 'Downtown', 'Mall', 'Airport'],
    'sales': [2500, 3200, 2800, 2900, 4100],
    'returns': [150, 200, 100, 180, 90]
})

# Named aggregation
result = df.groupby('store').agg(
    total_sales=('sales', 'sum'),
    total_returns=('returns', 'sum'),
    net_sales=('sales', 'sum')  # placeholder: named aggregation can't combine two columns; we'll fix this below
)
print(result)

Output:

          total_sales  total_returns  net_sales
store                                          
Airport          4100             90       4100
Downtown         5300            250       5300
Mall             6100            380       6100

Named aggregation uses the syntax new_column_name=('source_column', 'aggregation_function'). This approach has several advantages: clearer intent, self-documenting code, and output columns that don’t require renaming afterward.

You can also mix different aggregation functions:

result = df.groupby('store').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    transaction_count=('sales', 'count')
)

For calculated fields like net sales (sales minus returns), you’ll need to compute them after aggregation:

result = df.groupby('store').agg(
    total_sales=('sales', 'sum'),
    total_returns=('returns', 'sum')
)
result['net_sales'] = result['total_sales'] - result['total_returns']
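If you prefer to keep everything in one expression, .assign() adds the derived column at the end of the same method chain; this is equivalent and purely a matter of style:

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['Downtown', 'Mall', 'Downtown', 'Mall', 'Airport'],
    'sales': [2500, 3200, 2800, 2900, 4100],
    'returns': [150, 200, 100, 180, 90]
})

# Aggregate, then derive net_sales in one chained expression
result = (
    df.groupby('store')
      .agg(total_sales=('sales', 'sum'),
           total_returns=('returns', 'sum'))
      .assign(net_sales=lambda g: g['total_sales'] - g['total_returns'])
)
print(result)
```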

Handling Common Issues

Dealing with NaN Values

By default, groupby() excludes rows where the grouping column contains NaN:

df = pd.DataFrame({
    'category': ['A', 'B', None, 'A', 'B'],
    'value': [10, 20, 30, 40, 50]
})

# Default behavior: NaN group is excluded
result = df.groupby('category')['value'].sum()
print(result)

Output:

category
A    50
B    70
Name: value, dtype: int64

The row with value 30 is ignored. To include NaN as a group, use dropna=False:

result = df.groupby('category', dropna=False)['value'].sum()
print(result)

Output:

category
A      50
B      70
NaN    30
Name: value, dtype: int64

Resetting the Index

Grouped results use the grouping columns as the index. This is often inconvenient for further processing:

# Convert index back to columns
result = df.groupby('category')['value'].sum().reset_index()

Alternatively, use as_index=False during grouping:

result = df.groupby('category', as_index=False)['value'].sum()

Both approaches produce a DataFrame with ‘category’ and ‘value’ as regular columns.
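A quick sanity check that both forms produce the same DataFrame (the sample data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'],
    'value': [10, 20, 40, 50]
})

# reset_index() after aggregating vs. as_index=False up front
a = df.groupby('category')['value'].sum().reset_index()
b = df.groupby('category', as_index=False)['value'].sum()

print(a.equals(b))  # True
```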

Sorting Results

Results are sorted by the grouping column(s) by default. To sort by the aggregated values instead:

result = df.groupby('category')['value'].sum().sort_values(ascending=False)

For large datasets where sorting isn’t needed, disable it for a performance boost:

result = df.groupby('category', sort=False)['value'].sum()

Conclusion

Pandas offers multiple ways to group and sum data, each suited to different situations:

  • Basic groupby().sum(): Quick exploration and simple aggregations
  • Multiple column grouping: Hierarchical analysis with groupby(['col1', 'col2'])
  • Named aggregation: Production code where clarity and custom column names matter

Start with the simplest approach that meets your needs. For one-off analysis, basic syntax works fine. For code that others will read or that feeds into dashboards and reports, named aggregation pays dividends in maintainability.

These same patterns apply to other aggregation functions—replace sum with mean, count, min, max, or std as needed. You can also combine multiple functions in a single .agg() call to build comprehensive summary tables efficiently.
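As a final sketch, here's a small summary table built with several functions in a single .agg() call (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [10, 20, 30, 40, 50]
})

# One pass over the groups, four summary statistics
summary = df.groupby('category').agg(
    total=('value', 'sum'),
    average=('value', 'mean'),
    n=('value', 'count'),
    largest=('value', 'max')
)
print(summary)
```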
