How to GroupBy and Sum in Pandas
Key Insights
- The groupby().sum() pattern follows Pandas' split-apply-combine paradigm, letting you aggregate data by categories in a single line of code
- Named aggregation with .agg() produces cleaner output with custom column names, making your analysis more readable and production-ready
- Always consider whether to use dropna=False and .reset_index() to handle missing values and convert grouped results back to a standard DataFrame
Introduction
Grouping data by categories and calculating sums is one of the most common operations in data analysis. Whether you’re calculating total sales by region, summing expenses by department, or aggregating user activity by date, Pandas’ groupby() function combined with sum() handles these tasks efficiently.
The operation follows what Pandas calls the “split-apply-combine” pattern: split your data into groups based on some criteria, apply a function (in this case, sum) to each group, and combine the results into a new data structure. Understanding this pattern helps you reason about more complex aggregations as your analysis needs grow.
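To make the pattern concrete, here is a small sketch (using a hypothetical toy DataFrame) that performs each of the three steps by hand and checks that the result matches groupby().sum():

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'B', 'A'],
    'points': [3, 5, 7]
})

# Split: iterate over the groups that groupby creates
totals = {}
for team, group in df.groupby('team'):
    # Apply: sum each group's 'points' column
    totals[team] = group['points'].sum()

# Combine: collect the per-group results into one Series
manual = pd.Series(totals, name='points')
auto = df.groupby('team')['points'].sum()
print(dict(manual) == dict(auto))  # True
```

In practice you never write the loop yourself, but it is exactly what groupby is doing on your behalf.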
This article covers everything from basic single-column grouping to advanced named aggregation techniques. By the end, you’ll know which approach to use for different scenarios and how to handle common edge cases.
Basic GroupBy and Sum Syntax
The simplest form of groupby-sum involves grouping by one column and summing another. Here’s the core syntax:
import pandas as pd
# Sample sales data
df = pd.DataFrame({
'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Food'],
'sales': [1200, 450, 800, 320, 150]
})
# Group by category and sum sales
result = df.groupby('category')['sales'].sum()
print(result)
Output:
category
Clothing 770
Electronics 2000
Food 150
Name: sales, dtype: int64
The result is a Pandas Series with the group labels as the index. This is the split-apply-combine pattern in action: Pandas splits the DataFrame into three groups (Clothing, Electronics, Food), applies the sum function to the ‘sales’ column in each group, and combines everything into a single Series.
If you want the result as a DataFrame instead of a Series, wrap the column name in double brackets:
result = df.groupby('category')[['sales']].sum()
This returns a DataFrame with ‘category’ as the index and ‘sales’ as a column. The distinction matters when you’re chaining operations or need consistent data structures in your pipeline.
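You can verify the distinction directly by comparing the types of the two results (reusing the sample data above):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Food'],
    'sales': [1200, 450, 800, 320, 150]
})

# Single brackets return a Series; double brackets return a DataFrame
as_series = df.groupby('category')['sales'].sum()
as_frame = df.groupby('category')[['sales']].sum()

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```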
Grouping by Multiple Columns
Real-world analysis often requires grouping by multiple dimensions. For instance, you might want total revenue broken down by both region and product type:
df = pd.DataFrame({
'region': ['North', 'North', 'South', 'South', 'North', 'South'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Widget'],
'revenue': [1000, 1500, 800, 1200, 900, 750]
})
# Group by multiple columns
result = df.groupby(['region', 'product'])['revenue'].sum()
print(result)
Output:
region product
North Gadget 1500
Widget 1900
South Gadget 1200
Widget 1550
Name: revenue, dtype: int64
The result has a MultiIndex, which can be powerful for hierarchical data but sometimes awkward to work with. You can access specific combinations using tuple indexing:
# Get revenue for North region, Widget product
north_widget = result[('North', 'Widget')] # Returns 1900
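Beyond single tuples, you can pull an entire level out of the MultiIndex with .xs(), for example every product within one region (reusing the df above):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Widget'],
    'revenue': [1000, 1500, 800, 1200, 900, 750]
})

result = df.groupby(['region', 'product'])['revenue'].sum()

# Cross-section: all rows where the 'region' level equals 'North'
north = result.xs('North', level='region')
print(north)
```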
For a flatter structure that’s easier to manipulate, use reset_index():
flat_result = df.groupby(['region', 'product'])['revenue'].sum().reset_index()
This gives you a regular DataFrame with ‘region’, ‘product’, and ‘revenue’ as columns—often more convenient for downstream processing or exporting to CSV.
Summing Multiple Columns
When you need to sum several numeric columns at once, pass a list of column names:
df = pd.DataFrame({
'department': ['Sales', 'Engineering', 'Sales', 'Engineering', 'Marketing'],
'sales': [50000, 0, 45000, 0, 15000],
'expenses': [12000, 85000, 10000, 92000, 25000],
'profit': [38000, -85000, 35000, -92000, -10000]
})
# Sum multiple columns
result = df.groupby('department')[['sales', 'expenses', 'profit']].sum()
print(result)
Output:
sales expenses profit
department
Engineering 0 177000 -177000
Marketing 15000 25000 -10000
Sales 95000 22000 73000
If you want to sum all numeric columns without listing them explicitly, omit the column selection:
# Sum all numeric columns
result = df.groupby('department').sum(numeric_only=True)
The numeric_only=True parameter ensures Pandas only attempts to sum columns with numeric dtypes, avoiding errors from string or datetime columns. The default behavior changed in Pandas 2.0: non-numeric columns are no longer silently dropped, so older code that omits the parameter may now raise a TypeError or emit warnings.
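For instance, with a non-numeric column in the frame (a sketch with hypothetical data), numeric_only=True simply drops it from the result:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Engineering', 'Sales'],
    'manager': ['Ana', 'Bo', 'Cy'],  # non-numeric column
    'sales': [50000, 0, 45000]
})

# Restrict the sum to numeric columns; 'manager' is excluded
result = df.groupby('department').sum(numeric_only=True)
print(result.columns.tolist())  # ['sales']
```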
Using Named Aggregation
The .agg() method with named aggregation is the modern, recommended approach for production code. It lets you specify custom names for your output columns:
df = pd.DataFrame({
'store': ['Downtown', 'Mall', 'Downtown', 'Mall', 'Airport'],
'sales': [2500, 3200, 2800, 2900, 4100],
'returns': [150, 200, 100, 180, 90]
})
# Named aggregation
result = df.groupby('store').agg(
total_sales=('sales', 'sum'),
total_returns=('returns', 'sum')
)
print(result)
Output:
total_sales total_returns
store
Airport 4100 90
Downtown 5300 250
Mall 6100 380
Named aggregation uses the syntax new_column_name=('source_column', 'aggregation_function'). This approach has several advantages: clearer intent, self-documenting code, and output columns that don’t require renaming afterward.
You can also mix different aggregation functions:
result = df.groupby('store').agg(
total_sales=('sales', 'sum'),
avg_sales=('sales', 'mean'),
transaction_count=('sales', 'count')
)
For calculated fields like net sales (sales minus returns), you’ll need to compute them after aggregation:
result = df.groupby('store').agg(
total_sales=('sales', 'sum'),
total_returns=('returns', 'sum')
)
result['net_sales'] = result['total_sales'] - result['total_returns']
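If you prefer a single method chain, the same two-step computation can be written with .assign():

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['Downtown', 'Mall', 'Downtown', 'Mall', 'Airport'],
    'sales': [2500, 3200, 2800, 2900, 4100],
    'returns': [150, 200, 100, 180, 90]
})

result = (
    df.groupby('store')
      .agg(total_sales=('sales', 'sum'),
           total_returns=('returns', 'sum'))
      # Derive net sales inside the chain
      .assign(net_sales=lambda g: g['total_sales'] - g['total_returns'])
)
print(result)
```

The lambda receives the aggregated frame, so the derived column is computed from the group totals, not the raw rows.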
Handling Common Issues
Dealing with NaN Values
By default, groupby() excludes rows where the grouping column contains NaN:
df = pd.DataFrame({
'category': ['A', 'B', None, 'A', 'B'],
'value': [10, 20, 30, 40, 50]
})
# Default behavior: NaN group is excluded
result = df.groupby('category')['value'].sum()
print(result)
Output:
category
A 50
B 70
Name: value, dtype: int64
The row with value 30 is ignored. To include NaN as a group, use dropna=False:
result = df.groupby('category', dropna=False)['value'].sum()
print(result)
Output:
category
A 50
B 70
NaN 30
Name: value, dtype: int64
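Another common option is to replace missing group keys with an explicit label before grouping, which keeps the output free of NaN index entries (a sketch using fillna):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', None, 'A', 'B'],
    'value': [10, 20, 30, 40, 50]
})

# Label missing categories explicitly instead of keeping a NaN group
labeled = df.fillna({'category': 'Unknown'})
result = labeled.groupby('category')['value'].sum()
print(result)
```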
Resetting the Index
Grouped results use the grouping columns as the index. This is often inconvenient for further processing:
# Convert index back to columns
result = df.groupby('category')['value'].sum().reset_index()
Alternatively, use as_index=False during grouping:
result = df.groupby('category', as_index=False)['value'].sum()
Both approaches produce a DataFrame with ‘category’ and ‘value’ as regular columns.
Sorting Results
Results are sorted by the grouping column(s) by default. To sort by the aggregated values instead:
result = df.groupby('category')['value'].sum().sort_values(ascending=False)
For large datasets where sorting isn’t needed, disable it for a performance boost:
result = df.groupby('category', sort=False)['value'].sum()
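A related shortcut for "top N groups" is .nlargest(), which sorts and truncates in one step (shown here with hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'A', 'B', 'C'],
    'value': [10, 20, 5, 40, 50, 5]
})

# The two largest group totals, highest first
top2 = df.groupby('category')['value'].sum().nlargest(2)
print(top2)
```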
Conclusion
Pandas offers multiple ways to group and sum data, each suited to different situations:
- Basic groupby().sum(): Quick exploration and simple aggregations
- Multiple column grouping: Hierarchical analysis with groupby(['col1', 'col2'])
- Named aggregation: Production code where clarity and custom column names matter
Start with the simplest approach that meets your needs. For one-off analysis, basic syntax works fine. For code that others will read or that feeds into dashboards and reports, named aggregation pays dividends in maintainability.
These same patterns apply to other aggregation functions—replace sum with mean, count, min, max, or std as needed. You can also combine multiple functions in a single .agg() call to build comprehensive summary tables efficiently.