How to Create a Swarm Plot in Seaborn
Key Insights
- Swarm plots prevent point overlap by adjusting positions along the categorical axis, making them ideal for small to medium datasets where you need to see every individual observation
- Layer swarm plots over violin or box plots to combine statistical summaries with raw data points, giving viewers both distribution shape and actual values in one visualization
- Swarm plots become impractical above ~500 points per category due to performance issues and visual clutter—use strip plots with jitter for larger datasets instead
Introduction to Swarm Plots
Swarm plots display individual data points for categorical data while automatically adjusting their positions to prevent overlap. Unlike strip plots where points can pile on top of each other, or box plots that hide individual observations, swarm plots arrange each point in a way that resembles a swarm of bees—hence the name.
Use swarm plots when you have categorical data with a moderate number of observations per category and you want to show both the distribution and every individual data point. As a rule of thumb, they work best with roughly 10 to 500 points per category. Below that range, a simple strip plot works fine. Above it, the computational cost and visual density make them impractical.
The key advantage over box plots or violin plots is transparency. You see every single observation, making outliers, gaps, and clustering patterns immediately obvious. The advantage over strip plots is readability—no overlapping points means no hidden data.
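To make the contrast with strip plots concrete, here is a small sketch that draws the same values both ways. The data frame `demo` and its column names are made up for illustration, and the `Agg` backend line is only there so the snippet runs without a display; drop it when working interactively.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: 40 values per group, rounded so that many values tie
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "group": np.repeat(["A", "B"], 40),
    "value": rng.normal(50, 5, 80).round(),
})

fig, (ax_strip, ax_swarm) = plt.subplots(1, 2, figsize=(9, 4), sharey=True)

# jitter=False stacks tied values directly on top of each other
sns.stripplot(data=demo, x="group", y="value", jitter=False, ax=ax_strip)
ax_strip.set_title("Strip plot (no jitter)")

# swarmplot shifts points sideways until they no longer overlap
sns.swarmplot(data=demo, x="group", y="value", ax=ax_swarm)
ax_swarm.set_title("Swarm plot")

fig.tight_layout()
fig.canvas.draw()  # make sure the swarm positions have been computed
```

In the left panel every group collapses onto a single vertical line, so ties are invisible; in the right panel the same 80 points fan out around each category.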
Basic Swarm Plot Setup
Install Seaborn if you haven’t already:
pip install seaborn matplotlib pandas
Here’s the minimal code to create a swarm plot:
import seaborn as sns
import matplotlib.pyplot as plt
# Load sample data (fetched from seaborn's online example-data repository on first use)
tips = sns.load_dataset('tips')
# Create basic swarm plot
plt.figure(figsize=(8, 6))
sns.swarmplot(data=tips, x='day', y='total_bill')
plt.title('Total Bill by Day of Week')
plt.ylabel('Total Bill ($)')
plt.xlabel('Day')
plt.tight_layout()
plt.show()
This creates a swarm plot showing the distribution of total bills across different days of the week. Each point represents one dining party’s bill, and Seaborn automatically positions points to avoid overlap while keeping them centered over their respective categories.
The basic syntax is straightforward: specify your data source, the categorical variable for the x-axis, and the continuous variable for the y-axis. Seaborn handles the complex positioning algorithm behind the scenes.
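The positioning that Seaborn handles for you can be pictured with a toy version. The sketch below is not Seaborn's implementation: it is a simple greedy placement, written under the assumption that x and y share one unit, and the function name `simple_swarm_offsets` is hypothetical. It only illustrates the idea of nudging each point sideways until it stops colliding with points already placed.

```python
import numpy as np

def simple_swarm_offsets(y, diameter):
    """Greedy sideways placement: returns one x offset per value in y.

    Illustrative only; this is not seaborn's implementation.
    """
    y = np.asarray(y, dtype=float)
    x = np.zeros_like(y)
    step = diameter / 4   # granularity of the sideways search
    placed = []           # indices of points already positioned

    # Place points from the lowest value upward
    for i in np.argsort(y, kind="stable"):
        k = 0
        while True:
            # Try the center first, then k steps to the right and to the left
            candidates = (0.0,) if k == 0 else (k * step, -k * step)
            free = [
                c for c in candidates
                if all(
                    (c - x[j]) ** 2 + (y[i] - y[j]) ** 2 >= diameter ** 2 - 1e-12
                    for j in placed
                )
            ]
            if free:
                x[i] = free[0]
                break
            k += 1
        placed.append(i)
    return x

rng = np.random.default_rng(1)
values = rng.normal(0, 1, 60)
offsets = simple_swarm_offsets(values, diameter=0.2)
```

Every point keeps its original value; only the sideways offset changes. The pairwise checks are also why the cost grows quickly with the number of points in this toy version.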
Customizing Swarm Plots
Swarm plots offer extensive customization options. You can control colors, point sizes, category order, and orientation:
# Custom swarm plot with styling
plt.figure(figsize=(10, 6))
# Define custom order for days
day_order = ['Thur', 'Fri', 'Sat', 'Sun']
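sns.swarmplot(
    data=tips,
    x='day',
    y='total_bill',
    hue='day',       # seaborn 0.13+ deprecates palette without hue
    legend=False,    # the x-axis already labels the days (needs seaborn 0.12+)
    order=day_order,
    palette='Set2',
    size=6,
    alpha=0.7
)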
plt.title('Total Bill Distribution by Day (Ordered)', fontsize=14, fontweight='bold')
plt.ylabel('Total Bill ($)', fontsize=12)
plt.xlabel('Day of Week', fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
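The order parameter controls the category sequence, which matters for chronological or logical ordering. The palette parameter accepts a Seaborn palette name, a list of colors, or a dictionary mapping categories to colors. Since seaborn 0.13, using palette without hue is deprecated, so assign the categorical variable to hue as well and pass legend=False. The size parameter sets the marker size in points, while alpha controls transparency.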
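As an illustration of the dictionary form of palette, the sketch below builds one color per day from a named palette and then overrides two entries. The variable name `day_palette` and the choice of highlight color are arbitrary.

```python
import seaborn as sns

day_order = ["Thur", "Fri", "Sat", "Sun"]

# color_palette returns a list of RGB tuples; take one color per category
colors = sns.color_palette("Set2", n_colors=len(day_order))
day_palette = dict(zip(day_order, colors))

# Highlight the weekend by overriding two entries with a named color
day_palette["Sat"] = "tab:red"
day_palette["Sun"] = "tab:red"
```

Passing `palette=day_palette` together with `hue='day'` to `sns.swarmplot` should then color each day according to the mapping.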
For horizontal orientation, simply swap x and y:
plt.figure(figsize=(8, 10))
sns.swarmplot(data=tips, x='total_bill', y='day', hue='day', legend=False, order=day_order, palette='muted')
plt.xlabel('Total Bill ($)')
plt.ylabel('Day of Week')
plt.tight_layout()
plt.show()
Horizontal layouts work better when category names are long or when you have many categories.
Adding Hue for Multi-dimensional Analysis
The hue parameter adds a third dimension by coloring points based on another categorical variable:
plt.figure(figsize=(10, 6))
sns.swarmplot(
    data=tips,
    x='day',
    y='total_bill',
    hue='time',
    order=day_order,
    palette='coolwarm',
    dodge=True
)
plt.title('Total Bill by Day and Time Period', fontsize=14, fontweight='bold')
plt.ylabel('Total Bill ($)', fontsize=12)
plt.xlabel('Day of Week', fontsize=12)
plt.legend(title='Time', loc='upper right', frameon=True)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
The dodge=True parameter is crucial here: it separates the hue levels horizontally within each main category. With the default dodge=False, points from different hue levels share a single swarm and intermingle, which makes the groups harder to compare.
This approach reveals patterns across three dimensions simultaneously. In the tips dataset, you can see how lunch versus dinner bills differ across days of the week, with each individual transaction visible.
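To see what dodge changes, the following sketch draws the same made-up data twice, once per setting. The frame `demo` and its columns are invented stand-ins for the tips data so the snippet runs without a download, and the `Agg` backend line can be dropped for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two days, two meal times, 15 bills per combination
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "day": np.repeat(["Sat", "Sun"], 30),
    "time": np.tile(["Lunch", "Dinner"], 30),
    "bill": rng.gamma(shape=4, scale=5, size=60),
})

fig, (ax_mixed, ax_dodged) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# dodge=False: both hue levels share one swarm per day
sns.swarmplot(data=demo, x="day", y="bill", hue="time", dodge=False, ax=ax_mixed)
ax_mixed.set_title("dodge=False")

# dodge=True: each hue level gets its own swarm, side by side
sns.swarmplot(data=demo, x="day", y="bill", hue="time", dodge=True, ax=ax_dodged)
ax_dodged.set_title("dodge=True")

fig.tight_layout()
fig.canvas.draw()
```

Both panels contain the same 60 points and the same legend; only the horizontal grouping differs.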
Combining Swarm Plots with Other Plot Types
Layering swarm plots over violin or box plots creates powerful hybrid visualizations that show both statistical summaries and individual observations:
plt.figure(figsize=(12, 6))
# Create violin plot first (background layer)
sns.violinplot(
    data=tips,
    x='day',
    y='total_bill',
    order=day_order,
    palette='muted',
    alpha=0.4,
    inner=None
)
# Overlay swarm plot
sns.swarmplot(
    data=tips,
    x='day',
    y='total_bill',
    order=day_order,
    color='black',
    size=3,
    alpha=0.6
)
plt.title('Bill Distribution: Violin + Swarm Overlay', fontsize=14, fontweight='bold')
plt.ylabel('Total Bill ($)', fontsize=12)
plt.xlabel('Day of Week', fontsize=12)
plt.tight_layout()
plt.show()
The violin plot shows a smoothed (kernel) density estimate, while the swarm plot shows the actual data points. Setting inner=None on the violin plot removes its internal markers since the swarm plot provides that detail.
You can also combine with box plots:
plt.figure(figsize=(12, 6))
sns.boxplot(
    data=tips,
    x='day',
    y='total_bill',
    order=day_order,
    palette='Set3',
    width=0.5
)
sns.swarmplot(
    data=tips,
    x='day',
    y='total_bill',
    order=day_order,
    color='black',
    size=3,
    alpha=0.5
)
plt.title('Bill Distribution: Box + Swarm Overlay', fontsize=14, fontweight='bold')
plt.ylabel('Total Bill ($)', fontsize=12)
plt.xlabel('Day of Week', fontsize=12)
plt.tight_layout()
plt.show()
This combination gives you quartiles, median, and outlier detection from the box plot, plus visibility into every data point from the swarm plot.
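The numbers behind the box layer can be checked directly with pandas. The helper below is a sketch, with the hypothetical name `box_summary`, that computes per-category quartiles and the conventional 1.5 × IQR fences, then counts the points falling outside them. It assumes the default whisker rule; a box plot drawn with a different whis setting would flag different points.

```python
import pandas as pd

def box_summary(df, category, value):
    """Per-category quartiles, 1.5*IQR fences, and outlier counts."""
    grouped = df.groupby(category)[value]
    summary = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    summary.columns = ["q1", "median", "q3"]
    iqr = summary["q3"] - summary["q1"]
    summary["lower_fence"] = summary["q1"] - 1.5 * iqr
    summary["upper_fence"] = summary["q3"] + 1.5 * iqr
    # Look up each row's fences by its category, then count the points outside
    low = df[category].map(summary["lower_fence"])
    high = df[category].map(summary["upper_fence"])
    outside = (df[value] < low) | (df[value] > high)
    summary["n_outliers"] = outside.groupby(df[category]).sum()
    return summary

# Hypothetical data with one obvious outlier on Saturday
demo = pd.DataFrame({
    "day": ["Sat"] * 5 + ["Sun"] * 5,
    "bill": [10, 12, 14, 16, 60, 20, 21, 22, 23, 24],
})
summary = box_summary(demo, "day", "bill")
print(summary)
```

In the overlay, the points counted in `n_outliers` are the swarm dots that sit beyond the whiskers.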
Handling Large Datasets and Performance Considerations
Swarm plots use a complex algorithm to prevent overlap, which becomes computationally expensive with large datasets. Seaborn also warns you when too many points in a category cannot be placed without overlapping:
import numpy as np
import pandas as pd
# Create datasets of different sizes
small_data = pd.DataFrame({
    'category': np.repeat(['A', 'B', 'C'], 100),
    'value': np.random.randn(300)
})
large_data = pd.DataFrame({
    'category': np.repeat(['A', 'B', 'C'], 1000),
    'value': np.random.randn(3000)
})
# Small dataset - works well
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.swarmplot(data=small_data, x='category', y='value', ax=axes[0])
axes[0].set_title('Swarm Plot: 100 points per category')
# Large dataset - performance issues and warnings
sns.swarmplot(data=large_data, x='category', y='value', warn_thresh=0.5, ax=axes[1])
axes[1].set_title('Swarm Plot: 1000 points per category (Not Recommended)')
plt.tight_layout()
plt.show()
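The warn_thresh parameter sets the proportion of points that must fail to find a non-overlapping position before Seaborn issues a warning. The default is 0.05; a value of 0.5 means the warning appears only when more than 50% of the points in a swarm cannot be placed.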
For large datasets, use strip plots with jitter instead:
plt.figure(figsize=(8, 6))
sns.stripplot(data=large_data, x='category', y='value', alpha=0.4, jitter=True)
plt.title('Strip Plot with Jitter: Better for Large Datasets')
plt.tight_layout()
plt.show()
Strip plots with jitter are much faster because they skip the overlap-avoidance step, so they handle thousands of points per category comfortably.
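The choice between the two plot types can be automated with a small helper. The sketch below uses the roughly 500 points per category ceiling from this article as its default; the function and constant names are hypothetical, and the threshold is a rule of thumb rather than a limit enforced by Seaborn.

```python
import numpy as np
import pandas as pd

SWARM_MAX_PER_CATEGORY = 500  # rule-of-thumb ceiling taken from this article

def choose_point_plot(df, category, max_per_category=SWARM_MAX_PER_CATEGORY):
    """Return 'swarmplot' or 'stripplot' based on the largest category."""
    largest = df[category].value_counts().max()
    return "swarmplot" if largest <= max_per_category else "stripplot"

# Hypothetical frames mirroring the small and large examples above
small = pd.DataFrame({"category": np.repeat(["A", "B", "C"], 100)})
large = pd.DataFrame({"category": np.repeat(["A", "B", "C"], 1000)})

small_choice = choose_point_plot(small, "category")
large_choice = choose_point_plot(large, "category")
```

Both return values are names of real Seaborn functions, so `getattr(sns, small_choice)(data=..., x=..., y=...)` is one way to dispatch on the result.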
Real-World Use Case
Let’s analyze student test scores across different teaching methods:
# Create realistic test score data
np.random.seed(42)
teaching_methods = ['Traditional', 'Flipped', 'Hybrid']
n_students = 40
data = []
for method in teaching_methods:
    if method == 'Traditional':
        scores = np.random.normal(72, 12, n_students)
    elif method == 'Flipped':
        scores = np.random.normal(78, 10, n_students)
    else:  # Hybrid
        scores = np.random.normal(82, 8, n_students)
    for score in scores:
        data.append({
            'teaching_method': method,
            'test_score': np.clip(score, 0, 100),
            'student_type': np.random.choice(['Full-time', 'Part-time'])
        })
scores_df = pd.DataFrame(data)
# Create publication-ready visualization
plt.figure(figsize=(12, 7))
sns.violinplot(
    data=scores_df,
    x='teaching_method',
    y='test_score',
    hue='student_type',
    palette='Set2',
    alpha=0.3,
    inner=None,
    split=False
)
sns.swarmplot(
    data=scores_df,
    x='teaching_method',
    y='test_score',
    hue='student_type',
    palette='Set2',
    dodge=True,
    size=4,
    alpha=0.8
)
plt.title('Student Test Scores by Teaching Method and Enrollment Status',
          fontsize=15, fontweight='bold', pad=20)
plt.ylabel('Test Score (0-100)', fontsize=12)
plt.xlabel('Teaching Method', fontsize=12)
plt.axhline(y=70, color='red', linestyle='--', alpha=0.5, label='Passing Threshold')
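# Both layers register the same hue labels, so keep one handle per label
handles, labels = plt.gca().get_legend_handles_labels()
unique = dict(zip(labels, handles))
plt.legend(list(unique.values()), list(unique.keys()), title='Student Type', loc='lower right', frameon=True)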
plt.grid(axis='y', alpha=0.2)
plt.ylim(40, 105)
# Add mean annotations
for i, method in enumerate(teaching_methods):
    method_mean = scores_df[scores_df['teaching_method'] == method]['test_score'].mean()
    plt.text(i, 102, f'μ={method_mean:.1f}', ha='center', fontsize=10, fontweight='bold')
plt.tight_layout()
plt.show()
This visualization combines violin plots for distribution shape, swarm plots for individual observations, a reference line for the passing threshold, and mean annotations. Because the scores are simulated, the pattern is there by construction: the Hybrid group was generated with the highest mean and the smallest spread, and the plot makes that visible alongside each individual score and the split between full-time and part-time students.
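A numeric summary is a useful companion to the figure. The sketch below regenerates comparable simulated scores and tabulates count, mean, and standard deviation per teaching method. It uses NumPy's `default_rng` and vectorized draws, so the individual values will not match the ones produced by the loop above; only the generating parameters are the same.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# (mean, standard deviation) used to simulate each teaching method
params = {"Traditional": (72, 12), "Flipped": (78, 10), "Hybrid": (82, 8)}
n_students = 40

frames = [
    pd.DataFrame({
        "teaching_method": method,
        "test_score": np.clip(rng.normal(mean, sd, n_students), 0, 100),
    })
    for method, (mean, sd) in params.items()
]
scores = pd.concat(frames, ignore_index=True)

summary = (
    scores.groupby("teaching_method")["test_score"]
    .agg(["count", "mean", "std"])
    .loc[list(params)]  # keep the order used in the plot
)
print(summary.round(1))
```

With only 40 students per group, the sample means and standard deviations will drift somewhat from the generating parameters, which is one more reason to show the individual points next to any summary.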
Swarm plots excel in scenarios like this where stakeholders need to see both the big picture and the individual data points. With real data, every dot would represent an actual student, making the analysis tangible and the conclusions defensible.