How to Create a Scatter Plot in Seaborn

Key Insights

  • Seaborn’s scatterplot() function provides a high-level interface for creating scatter plots with automatic handling of categorical variables, legends, and color palettes, which makes it more convenient than matplotlib’s basic scatter for exploratory data analysis.
  • Adding semantic dimensions through hue, size, and style parameters transforms basic scatter plots into multi-dimensional visualizations that can reveal patterns across 4-5 variables simultaneously without cluttering the plot.
  • For production-quality visualizations, combine sns.scatterplot() with custom color palettes, appropriate alpha transparency (0.6-0.7 for overlapping points), and sns.set_theme() to match your organization’s style guidelines.

Introduction & Setup

Scatter plots are fundamental for understanding relationships between continuous variables. Seaborn elevates scatter plot creation beyond matplotlib’s basic functionality by providing intelligent defaults, automatic legend generation, and seamless integration with pandas DataFrames. Use scatter plots when you need to identify correlations, detect outliers, or visualize clustering patterns in your data.

Seaborn excels at exploratory data analysis because it handles the tedious aspects of visualization—color mapping, legend creation, and aesthetic styling—allowing you to focus on understanding your data rather than fighting with plot configuration.

Install Seaborn and related dependencies:

pip install seaborn pandas matplotlib numpy

Import the necessary libraries:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

Basic Scatter Plot

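The sns.scatterplot() function is usually called with three things: a data source, an x-axis variable, and a y-axis variable. Seaborn’s built-in datasets provide convenient starting points for learning. Note that sns.load_dataset() downloads the data on first use, so it needs an internet connection.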

# Load the tips dataset
tips = sns.load_dataset('tips')

# Create a basic scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=tips, x='total_bill', y='tip')
plt.title('Restaurant Tips vs Total Bill')
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.tight_layout()
plt.show()

This creates a straightforward scatter plot showing the relationship between bill amount and tip. The positive correlation is immediately visible—larger bills generally receive larger tips. Notice how Seaborn automatically handles axis labels from the DataFrame column names, though you should override these with more descriptive labels for clarity.

The data parameter accepts any pandas DataFrame, making it trivial to visualize your own datasets. This DataFrame-first approach eliminates the need to extract individual arrays like you would with matplotlib’s plt.scatter().
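
As a minimal sketch of the DataFrame-first approach, the snippet below draws the same points two ways: once by handing arrays to matplotlib, and once by naming columns for Seaborn. The DataFrame and its column names (hours_studied, exam_score) are hypothetical and generated only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data for illustration only; substitute your own DataFrame.
rng = np.random.default_rng(42)
study = pd.DataFrame({'hours_studied': rng.uniform(0, 10, 50)})
study['exam_score'] = 50 + 4 * study['hours_studied'] + rng.normal(0, 5, 50)

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(12, 5))

# matplotlib: pass each column as an array and label the axes yourself
ax_mpl.scatter(study['hours_studied'], study['exam_score'])
ax_mpl.set_xlabel('hours_studied')
ax_mpl.set_ylabel('exam_score')
ax_mpl.set_title('plt.scatter with arrays')

# Seaborn: name the columns; axis labels are taken from the column names
sns.scatterplot(data=study, x='hours_studied', y='exam_score', ax=ax_sns)
ax_sns.set_title('sns.scatterplot with data=')

fig.tight_layout()
```

Call plt.show() afterwards to display the figure. Both panels show identical points; the difference is only in how much bookkeeping the caller does.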

Customizing with Hue, Size, and Style

Scatter plots become powerful analytical tools when you encode additional variables through visual properties. Seaborn’s semantic mapping parameters—hue, size, and style—let you represent multiple dimensions simultaneously.

# Create a multi-dimensional scatter plot
plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=tips,
    x='total_bill',
    y='tip',
    hue='day',           # Color by day of week
    size='size',         # Point size by party size
    style='time',        # Marker shape by lunch/dinner
    sizes=(50, 400),     # Control size range
    alpha=0.7            # Transparency for overlapping points
)
plt.title('Restaurant Tips Analysis: Multi-dimensional View')
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

This single plot now conveys five dimensions of information: total bill (x-axis), tip amount (y-axis), day of week (color), party size (point size), and meal time (marker shape). The sizes parameter controls the range of point sizes—adjust these values based on your data’s distribution to ensure smaller points remain visible while larger points don’t dominate.
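
To see how the sizes range changes readability, here is a small sketch that draws the same data with a narrow and a wide range. The parties DataFrame and its party_size column are made up for illustration; with a numeric size variable and a (min, max) tuple, Seaborn maps the smallest data value to the first number and the largest to the second.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical restaurant-style data, generated for illustration only
rng = np.random.default_rng(1)
parties = pd.DataFrame({
    'total_bill': rng.uniform(5, 50, 60),
    'party_size': rng.integers(1, 7, 60),   # integers 1 through 6
})
parties['tip'] = 0.15 * parties['total_bill'] + rng.normal(0, 1, 60)

fig, (ax_narrow, ax_wide) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# Narrow range: party sizes are hard to tell apart
sns.scatterplot(data=parties, x='total_bill', y='tip',
                size='party_size', sizes=(20, 60), ax=ax_narrow)
ax_narrow.set_title('sizes=(20, 60)')

# Wide range: differences are obvious, but large markers may hide neighbors
sns.scatterplot(data=parties, x='total_bill', y='tip',
                size='party_size', sizes=(50, 400), ax=ax_wide)
ax_wide.set_title('sizes=(50, 400)')

fig.tight_layout()
```

The numbers are marker areas in points squared, the same unit as matplotlib’s s argument, so doubling a value does not double the marker’s diameter.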

The alpha parameter (0-1 range) controls transparency. Values between 0.6 and 0.7 work well for datasets with moderate overlap, allowing you to see density patterns where points cluster.
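
The effect of alpha is easiest to judge side by side. The sketch below uses two hypothetical, deliberately overlapping clusters (the cloud DataFrame is synthetic) and draws them fully opaque and then at alpha=0.6.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Two synthetic clusters that overlap on purpose (illustration only)
rng = np.random.default_rng(7)
cloud = pd.DataFrame({
    'x': np.concatenate([rng.normal(0, 1, 400), rng.normal(1.5, 1, 400)]),
    'y': np.concatenate([rng.normal(0, 1, 400), rng.normal(1.5, 1, 400)]),
    'group': ['a'] * 400 + ['b'] * 400,
})

fig, (ax_opaque, ax_translucent) = plt.subplots(1, 2, figsize=(12, 5))

# Opaque markers: later points cover earlier ones, hiding density
sns.scatterplot(data=cloud, x='x', y='y', hue='group',
                alpha=1.0, ax=ax_opaque)
ax_opaque.set_title('alpha=1.0')

# Translucent markers: stacked points appear darker, so dense areas stand out
sns.scatterplot(data=cloud, x='x', y='y', hue='group',
                alpha=0.6, ax=ax_translucent)
ax_translucent.set_title('alpha=0.6')

fig.tight_layout()
```

For much denser data than this, a lower alpha or a smaller marker size may be needed; treat 0.6 as a starting point rather than a rule.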

Styling and Aesthetics

Production-quality visualizations require thoughtful styling. Seaborn provides themes, color palettes, and fine-grained control over visual elements.

# Set the overall theme
sns.set_theme(style='whitegrid', context='notebook')

# Create a styled scatter plot with custom palette
plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=tips,
    x='total_bill',
    y='tip',
    hue='smoker',
    palette={'Yes': '#e74c3c', 'No': '#3498db'},  # Custom colors
    s=100,                # Fixed marker size
    alpha=0.6,
    edgecolor='white',    # White border around points
    linewidth=0.5
)

plt.title('Tips by Smoking Section', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Total Bill ($)', fontsize=12, fontweight='bold')
plt.ylabel('Tip ($)', fontsize=12, fontweight='bold')
plt.legend(title='Smoker', title_fontsize=11, fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Seaborn’s built-in themes include darkgrid, whitegrid, dark, white, and ticks. The whitegrid style works well for presentations, while white or ticks suit publications. The context parameter (paper, notebook, talk, poster) automatically scales elements for different viewing contexts.
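
One way to compare the five styles without changing global settings is the sns.axes_style() context manager, which applies a style only to axes created inside the with block. The sketch below uses a small synthetic DataFrame (demo is a made-up name) and also prints how the base font size differs between two contexts.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic points, for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({'x': rng.normal(size=80), 'y': rng.normal(size=80)})

styles = ['darkgrid', 'whitegrid', 'dark', 'white', 'ticks']
fig = plt.figure(figsize=(15, 3))
for i, style in enumerate(styles, start=1):
    # The style is picked up when the axes is created, not when it is drawn on
    with sns.axes_style(style):
        ax = fig.add_subplot(1, len(styles), i)
    sns.scatterplot(data=demo, x='x', y='y', ax=ax)
    ax.set_title(style)
fig.tight_layout()

# plotting_context() returns the rc parameters a context would apply
paper = sns.plotting_context('paper')
talk = sns.plotting_context('talk')
print(paper['font.size'], talk['font.size'])
```

Once you have picked a combination, sns.set_theme(style=..., context=...) applies it to every subsequent figure.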

Custom color palettes let you match corporate branding or improve accessibility. Use colorblind-friendly palettes like 'colorblind' or validate your choices with tools like Color Oracle.
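
As a brief sketch, the named 'colorblind' palette can be inspected with sns.color_palette() and passed straight to scatterplot(). The readings DataFrame and its group labels are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Inspect the first three colors of the built-in colorblind-friendly palette
cb_colors = sns.color_palette('colorblind', n_colors=3)
print(cb_colors.as_hex())

# Synthetic data with three categories, for illustration only
rng = np.random.default_rng(11)
readings = pd.DataFrame({
    'x': rng.normal(size=90),
    'y': rng.normal(size=90),
    'group': np.repeat(['A', 'B', 'C'], 30),
})

fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=readings, x='x', y='y', hue='group',
                palette='colorblind', ax=ax)
fig.tight_layout()
```

Pairing hue with style on the same column (style='group') adds a second, color-independent cue, which can help when a figure is printed in grayscale.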

The edgecolor parameter adds definition to points, preventing them from bleeding together visually. This subtle touch significantly improves readability, especially when points overlap.

Advanced Features

Seaborn provides specialized functions for common scatter plot enhancements. The regplot() function adds regression lines, while relplot() creates faceted plots for comparing across categories.

# Scatter plot with regression line
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left plot: regression with confidence interval
sns.regplot(
    data=tips,
    x='total_bill',
    y='tip',
    ax=axes[0],
    scatter_kws={'alpha': 0.5, 's': 50},
    line_kws={'color': 'red', 'linewidth': 2}
)
axes[0].set_title('Tips vs Bill with Regression Line')
axes[0].set_xlabel('Total Bill ($)')
axes[0].set_ylabel('Tip ($)')

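# Right plot: separate regressions by category.
# lmplot() is figure-level and always creates its own figure, so to draw
# into axes[1] call the axes-level regplot() once per group instead.
for smoker_status, group in tips.groupby('smoker', observed=True):
    sns.regplot(
        data=group,
        x='total_bill',
        y='tip',
        ax=axes[1],
        label=f'Smoker: {smoker_status}',
        scatter_kws={'alpha': 0.5}
    )
axes[1].set_title('Regression by Smoking Status')
axes[1].set_xlabel('Total Bill ($)')
axes[1].set_ylabel('Tip ($)')
axes[1].legend()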
plt.tight_layout()
plt.show()

The regplot() function fits a linear regression and displays a 95% confidence interval by default. Set ci=None to remove the interval, or change the confidence level with, for example, ci=99.
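
The two ci settings can be compared directly. This sketch uses a synthetic bills DataFrame (a made-up stand-in, so it runs without downloading anything) and passes seed so the bootstrapped band is reproducible.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic bill/tip data, for illustration only
rng = np.random.default_rng(3)
bills = pd.DataFrame({'total_bill': rng.uniform(5, 50, 100)})
bills['tip'] = 1 + 0.15 * bills['total_bill'] + rng.normal(0, 1, 100)

fig, (ax_none, ax_99) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# ci=None: regression line only, no shaded band
sns.regplot(data=bills, x='total_bill', y='tip', ci=None, ax=ax_none)
ax_none.set_title('ci=None')

# ci=99: a wider band than the default 95% interval
sns.regplot(data=bills, x='total_bill', y='tip', ci=99, seed=0, ax=ax_99)
ax_99.set_title('ci=99')

fig.tight_layout()
```

The band is estimated by bootstrapping, so on large datasets ci=None also makes the plot noticeably faster to draw.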

For faceted plots across multiple categories:

# Create faceted scatter plots
g = sns.relplot(
    data=tips,
    x='total_bill',
    y='tip',
    hue='smoker',
    col='time',          # Separate plot for lunch/dinner
    row='sex',           # Separate rows for male/female
    height=4,
    aspect=1.2,
    alpha=0.6
)
g.set_axis_labels('Total Bill ($)', 'Tip ($)')
g.set_titles('{row_name} - {col_name}')
plt.tight_layout()
plt.show()

The relplot() function creates a FacetGrid, automatically handling subplot layout and shared axes. This approach reveals patterns that might be obscured in a single crowded plot.
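
Because relplot() returns the FacetGrid, individual panels remain accessible after plotting. The following sketch, built on a synthetic visits DataFrame (hypothetical, with the same column names as the tips example), loops over the grid’s axes to add a shared reference line at the overall mean tip.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data, for illustration only
rng = np.random.default_rng(5)
n = 200
visits = pd.DataFrame({
    'total_bill': rng.uniform(5, 50, n),
    'time': rng.choice(['Lunch', 'Dinner'], n),
    'sex': rng.choice(['Male', 'Female'], n),
})
visits['tip'] = 0.15 * visits['total_bill'] + rng.normal(0, 1, n)

g = sns.relplot(data=visits, x='total_bill', y='tip',
                col='time', row='sex', height=3, aspect=1.2)

# g.axes is a 2D NumPy array of matplotlib Axes, indexed [row, col]
print(g.axes.shape)

# Draw the same reference line on every facet
mean_tip = visits['tip'].mean()
for ax in g.axes.flat:
    ax.axhline(mean_tip, color='gray', linestyle='--', linewidth=1)
```

A common reference line like this makes it easier to judge whether one facet sits systematically above or below the others.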

Real-World Example

Let’s analyze a realistic scenario: examining the relationship between house size and price, with neighborhood and bedroom count as additional encodings.

# Create synthetic housing data
np.random.seed(42)
n_samples = 500

housing_data = pd.DataFrame({
    'square_feet': np.random.normal(2000, 500, n_samples),
    'bedrooms': np.random.choice([2, 3, 4, 5], n_samples),
    'neighborhood': np.random.choice(['Downtown', 'Suburbs', 'Rural'], n_samples),
    'age_years': np.random.uniform(0, 50, n_samples)
})

# Derive price from size plus noise so the two are correlated:
# $150 per square foot on a $100,000 base, with $50,000 of random spread
housing_data['price'] = (
    housing_data['square_feet'] * 150 +
    np.random.normal(0, 50000, n_samples) +
    100000
)

# Create comprehensive scatter plot
sns.set_theme(style='whitegrid', context='talk')
plt.figure(figsize=(14, 8))

scatter = sns.scatterplot(
    data=housing_data,
    x='square_feet',
    y='price',
    hue='neighborhood',
    size='bedrooms',
    palette='viridis',
    sizes=(100, 500),
    alpha=0.6,
    edgecolor='white',
    linewidth=0.5
)

plt.title('Housing Prices by Size and Location', 
          fontsize=18, fontweight='bold', pad=20)
plt.xlabel('Square Feet', fontsize=14, fontweight='bold')
plt.ylabel('Price ($)', fontsize=14, fontweight='bold')

# Format y-axis as currency
ax = plt.gca()
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

plt.legend(title='Details', bbox_to_anchor=(1.05, 1), 
           loc='upper left', frameon=True, shadow=True)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate and display correlation
correlation = housing_data['square_feet'].corr(housing_data['price'])
print(f'Correlation between size and price: {correlation:.3f}')

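This example demonstrates production-ready visualization: custom formatting for currency values, appropriate transparency, and clear labeling. The plot immediately reveals the strong positive correlation between house size and price, which the printed correlation coefficient confirms numerically. Color distinguishes neighborhoods and point size indicates bedroom count. Because both are assigned at random in this synthetic data, neither shows a pattern here, but with real data these encodings are where such patterns would surface.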

When working with real data, always validate your visualizations by checking for outliers, verifying correlations numerically, and ensuring your visual encodings accurately represent the underlying data relationships. Seaborn’s scatter plots provide the foundation—your domain knowledge and attention to detail create insights.
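
As one possible sketch of that validation step, the snippet below flags outliers with the common 1.5 × IQR rule and compares the Pearson correlation before and after excluding them. The homes DataFrame, the injected bad record, and the iqr_outliers helper are all hypothetical; the IQR rule is just one convention, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Synthetic housing data with one deliberately bad record (illustration only)
rng = np.random.default_rng(42)
sqft = rng.normal(2000, 500, 200)
price = sqft * 150 + 100000 + rng.normal(0, 50000, 200)
homes = pd.DataFrame({'square_feet': sqft, 'price': price})
homes.loc[0, 'price'] = 5_000_000   # simulated data-entry error


def iqr_outliers(series, k=1.5):
    """Return a boolean mask of values more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


mask = iqr_outliers(homes['price'])
print(f'Flagged {mask.sum()} of {len(homes)} rows as price outliers')

# Compare the correlation with and without the flagged rows
r_all = homes['square_feet'].corr(homes['price'])
clean = homes[~mask]
r_clean = clean['square_feet'].corr(clean['price'])
print(f'Pearson r, all rows: {r_all:.3f}')
print(f'Pearson r, outliers excluded: {r_clean:.3f}')
```

A large gap between the two coefficients is a signal to inspect the flagged rows before trusting either the plot or the number. Whether to drop, correct, or keep them is a domain decision.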
