How to Create a Scatter Plot in Matplotlib

Key Insights

Scatter plots reveal correlations and patterns between two continuous variables that other chart types miss. Use them when you need to see whether variables move together or to identify outliers in your data.
The plt.scatter() function gives you granular control over every visual element (size, color, transparency, markers), while plt.plot() with markers works better for connected data points.
Encode additional dimensions through size or color when you have multivariate data. A well-designed scatter plot can show four or five dimensions at once: x, y, color, size, and marker shape.

Introduction to Scatter Plots and Matplotlib

Scatter plots are the workhorse visualization for exploring relationships between two continuous variables. Unlike line charts that imply continuity or bar charts that compare categories, scatter plots show you the raw distribution of data points in two-dimensional space. This makes them invaluable for correlation analysis, outlier detection, and pattern recognition.

Matplotlib is Python’s foundational plotting library, and while newer libraries like Plotly and Seaborn offer sleeker defaults, Matplotlib gives you complete control over every visual element. You’ll find it underlying many other Python visualization tools, including Seaborn and the pandas plotting API.

Use scatter plots when you need to answer questions like: “Does advertising spend correlate with revenue?” or “Are there distinct clusters in my customer data?” Don’t use them for time series with many points (line charts work better) or when you’re comparing totals across categories (use bar charts instead).

Basic Scatter Plot Creation

The plt.scatter() function requires just two arguments: x-coordinates and y-coordinates as array-like objects. Here’s the minimal implementation:

import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(42)
x = np.random.randn(50)
y = 2 * x + np.random.randn(50) * 0.5

# Create scatter plot
plt.scatter(x, y)
plt.show()

This creates a basic scatter plot with 50 points. The data shows a positive correlation: as x increases, y tends to increase. The np.random.seed(42) call makes the random data reproducible from run to run.

The difference between plt.scatter() and plt.plot(x, y, 'o') matters: scatter() allows individual customization of each point’s size and color, while plot() treats all markers identically. Use scatter() when you need that granularity.
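To make the difference concrete, here is a minimal side-by-side sketch. The size formula is an arbitrary choice for illustration, not a recommendation: the left panel draws every marker the same way with plot(), while the right panel passes per-point arrays to scatter().

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
x = np.random.randn(50)
y = 2 * x + np.random.randn(50) * 0.5

fig, (ax_plot, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# plot(): a single marker style shared by every point
lines = ax_plot.plot(x, y, 'o', markersize=6, color='tab:blue')
ax_plot.set_title('plot() with markers')

# scatter(): size and color can be arrays with one entry per point
sizes = 20 + 80 * np.abs(y)   # illustrative only: larger |y| gives a larger marker
points = ax_scatter.scatter(x, y, s=sizes, c=y, cmap='viridis')
ax_scatter.set_title('scatter() with per-point size and color')

fig.tight_layout()
plt.show()
```

If every marker looks the same, either call produces an equivalent picture; the per-point arrays are what scatter() adds.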

Customizing Scatter Plot Appearance

Raw scatter plots rarely communicate effectively. You need to adjust visual properties to highlight patterns and make your plot readable.

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5

# Generate varying sizes and colors
sizes = np.random.randint(20, 200, 100)
colors = np.random.rand(100)

# Create customized scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(x, y, 
           s=sizes,           # Point sizes
           c=colors,          # Point colors
           alpha=0.6,         # Transparency
           cmap='viridis',    # Color map
           edgecolors='black', # Edge color
           linewidth=0.5)     # Edge width

plt.show()

Key parameters:

s: Size in points squared (scalar or array for varying sizes)
c: Color (single color, array of colors, or values to map)
alpha: Transparency from 0 (invisible) to 1 (opaque)
cmap: Color map name when c is numeric
edgecolors: Border color around markers
marker: Marker shape (‘o’, ‘s’, ‘^’, ‘D’, etc.)

Transparency (alpha=0.6) is crucial when points overlap—it reveals density patterns that solid markers hide.
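The marker parameter is listed above but not used in the example. As a rough sketch, the snippet below draws three groups with different shapes. The group names and offsets are made up for illustration. Because scatter() takes one marker style per call, each group gets its own call.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

# Hypothetical group names mapped to marker shapes
markers = {'Group A': 'o', 'Group B': 's', 'Group C': '^'}

plt.figure(figsize=(8, 5))
collections = {}
for offset, (name, marker) in enumerate(markers.items()):
    # Shift each group along x so the shapes are easy to tell apart
    x = np.random.randn(40) + 2 * offset
    y = np.random.randn(40)
    collections[name] = plt.scatter(x, y, marker=marker, s=60, alpha=0.6,
                                    edgecolors='black', linewidth=0.5,
                                    label=name)

legend = plt.legend()
plt.show()
```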

Adding Labels, Titles, and Legends

Professional visualizations require context. Always label your axes and add titles that explain what the viewer should understand.

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

# Create two data series
x1 = np.random.randn(50)
y1 = 2 * x1 + np.random.randn(50) * 0.5

x2 = np.random.randn(50) + 2
y2 = -1.5 * x2 + np.random.randn(50) * 0.5

# Create plot with full labeling
plt.figure(figsize=(10, 6))
plt.scatter(x1, y1, alpha=0.6, s=100, c='blue', 
           edgecolors='black', linewidth=0.5, label='Positive Correlation')
plt.scatter(x2, y2, alpha=0.6, s=100, c='red', 
           edgecolors='black', linewidth=0.5, label='Negative Correlation')

plt.xlabel('Independent Variable (X)', fontsize=12, fontweight='bold')
plt.ylabel('Dependent Variable (Y)', fontsize=12, fontweight='bold')
plt.title('Correlation Analysis: Two Distinct Patterns', 
         fontsize=14, fontweight='bold', pad=20)
plt.legend(loc='upper left', frameon=True, shadow=True)
plt.grid(True, alpha=0.3, linestyle='--')

plt.tight_layout()
plt.show()

The tight_layout() function adjusts the padding so that axis labels and titles are not cut off; call it just before plt.show() or plt.savefig(). The label parameter combined with plt.legend() creates the legend automatically. Grid lines (plt.grid()) help viewers read values off the axes, but keep them subtle with a low alpha.

Advanced Techniques

Real-world data often has more than two dimensions. Encode additional variables through color or size to create information-dense visualizations.

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

# Simulate data with three dimensions
n_points = 200
x = np.random.randn(n_points) * 10
y = 2 * x + np.random.randn(n_points) * 15
temperature = 20 + 0.5 * x + np.random.randn(n_points) * 5  # Third dimension

# Create scatter plot with colorbar
plt.figure(figsize=(12, 7))
scatter = plt.scatter(x, y, 
                     c=temperature,      # Color represents temperature
                     s=100, 
                     alpha=0.6,
                     cmap='coolwarm',    # Blue=cold, Red=hot
                     edgecolors='black',
                     linewidth=0.5)

# Add colorbar
cbar = plt.colorbar(scatter)
cbar.set_label('Temperature (°C)', rotation=270, labelpad=20, 
              fontsize=11, fontweight='bold')

plt.xlabel('Variable X', fontsize=12, fontweight='bold')
plt.ylabel('Variable Y', fontsize=12, fontweight='bold')
plt.title('Three-Dimensional Data Visualization', 
         fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

For logarithmic scales (common with exponential data), use plt.xscale('log') or plt.yscale('log') after creating the scatter plot. This is essential for data spanning multiple orders of magnitude. Log axes can only display strictly positive values, so filter or shift zeros and negatives first.
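As a minimal sketch of log scaling, the example below uses simulated positive data with a power-law relationship. The data-generating formula is an assumption chosen only so the values span several orders of magnitude. It uses ax.set_xscale() and ax.set_yscale(), the Axes-level equivalents of plt.xscale() and plt.yscale().

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

# Simulated positive data spanning five orders of magnitude (1 to 100,000)
x = 10 ** np.random.uniform(0, 5, 200)
# Power-law relationship with multiplicative noise, so y stays positive
y = 3 * x ** 0.8 * np.random.lognormal(0, 0.3, 200)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y, alpha=0.6, s=40, edgecolors='black', linewidth=0.5)

# Switch both axes to log scale after plotting
ax.set_xscale('log')
ax.set_yscale('log')

ax.set_xlabel('X (log scale)')
ax.set_ylabel('Y (log scale)')
ax.grid(True, which='both', alpha=0.3)

fig.tight_layout()
plt.show()
```

On linear axes most of these points would be squeezed into the lower-left corner. On log-log axes the power law shows up as a roughly straight band.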

Real-World Application

Let’s analyze a practical scenario: the relationship between advertising spend and sales revenue across different product categories.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulate realistic business data
np.random.seed(42)
n_products = 150

data = pd.DataFrame({
    'ad_spend': np.random.exponential(scale=5000, size=n_products),
    'revenue': np.random.exponential(scale=20000, size=n_products),
    'category': np.random.choice(['Electronics', 'Clothing', 'Home'], n_products)
})

# Add correlation between ad spend and revenue
data['revenue'] = (data['revenue'] + 3.5 * data['ad_spend']
                   + np.random.randn(n_products) * 5000)

# Drop rows where the added noise pushed a value to zero or below
data = data[(data['ad_spend'] > 0) & (data['revenue'] > 0)]

# Create category-based visualization
fig, ax = plt.subplots(figsize=(12, 7))

categories = data['category'].unique()
colors = {'Electronics': '#FF6B6B', 'Clothing': '#4ECDC4', 'Home': '#45B7D1'}

for category in categories:
    subset = data[data['category'] == category]
    ax.scatter(subset['ad_spend'], subset['revenue'],
              label=category,
              alpha=0.6,
              s=100,
              c=colors[category],
              edgecolors='black',
              linewidth=0.5)

ax.set_xlabel('Advertising Spend ($)', fontsize=12, fontweight='bold')
ax.set_ylabel('Revenue ($)', fontsize=12, fontweight='bold')
ax.set_title('ROI Analysis: Advertising Spend vs. Revenue by Category', 
            fontsize=14, fontweight='bold', pad=20)
ax.legend(title='Product Category', frameon=True, shadow=True)
ax.grid(True, alpha=0.3, linestyle='--')

# Format axis labels as currency
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

plt.tight_layout()
plt.show()

# Calculate correlation
correlation = data['ad_spend'].corr(data['revenue'])
print(f"Correlation coefficient: {correlation:.3f}")

This example demonstrates several best practices: using meaningful colors, formatting axis labels for readability, and calculating correlation coefficients to quantify relationships. The currency formatting makes the plot immediately interpretable for business stakeholders.

Conclusion and Best Practices

Effective scatter plots require thoughtful design decisions. Here are the rules I follow:

Avoid overplotting: When you have thousands of points, reduce alpha to 0.1-0.3 or use hexbin plots instead. Overlapping solid markers hide density patterns.
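A sketch of both remedies on the same simulated data follows. The point count and gridsize are arbitrary choices for illustration. The left panel lowers alpha; the right panel uses hexbin(), which colors each hexagonal bin by the number of points it contains.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
n_points = 20000
x = np.random.randn(n_points)
y = 2 * x + np.random.randn(n_points)

fig, (ax_alpha, ax_hex) = plt.subplots(1, 2, figsize=(12, 5))

# Option 1: very low alpha, so dense regions build up to darker color
ax_alpha.scatter(x, y, s=5, alpha=0.1, color='tab:blue', edgecolors='none')
ax_alpha.set_title('scatter with alpha=0.1')

# Option 2: hexagonal binning; mincnt=1 leaves empty bins blank
hb = ax_hex.hexbin(x, y, gridsize=40, cmap='viridis', mincnt=1)
fig.colorbar(hb, ax=ax_hex, label='Points per bin')
ax_hex.set_title('hexbin with gridsize=40')

fig.tight_layout()
plt.show()
```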

Choose meaningful colors: Use diverging colormaps (like ‘coolwarm’) for data with a meaningful center point, sequential colormaps (like ‘viridis’) for continuous data, and distinct colors for categories.
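For a diverging colormap to be meaningful, its neutral midpoint has to land on the data's center value. One simple way, sketched below under the assumption that zero is the meaningful center, is to pass symmetric vmin and vmax limits. The "change" variable is hypothetical data invented for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
x = np.random.randn(150)
y = np.random.randn(150)
# Hypothetical signed quantity (a change from baseline) with zero as its center
change = 3 * x + np.random.randn(150) + 2

# Symmetric limits put zero exactly at the neutral midpoint of the colormap
limit = np.abs(change).max()

plt.figure(figsize=(8, 5))
sc = plt.scatter(x, y, c=change, cmap='coolwarm',
                 vmin=-limit, vmax=limit,
                 s=60, edgecolors='black', linewidth=0.5)
plt.colorbar(sc, label='Change from baseline')
plt.show()
```

Without the explicit limits, Matplotlib scales the colormap to the data's own minimum and maximum, so the neutral color would fall at the middle of the observed range rather than at zero.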

Scale appropriately: If your data spans orders of magnitude, use logarithmic scales. If one variable has a much larger range, consider normalizing or using separate y-axes.

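Limit point size variation: When encoding data through size, keep the ratio between the largest and smallest marker under roughly 10:1. Beyond that, small points become hard to see and large ones cover their neighbors. Remember that s sets the marker area, not its diameter.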
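One way to enforce a bounded size range is to rescale the raw values before passing them to s. The helper below, scale_sizes, is a hypothetical function written for this sketch and is not part of Matplotlib. Its default bounds of 25 and 200 are an arbitrary choice that gives an 8:1 area ratio, and "population" is made-up data.

```python
import matplotlib.pyplot as plt
import numpy as np

def scale_sizes(values, min_size=25, max_size=200):
    """Linearly map values onto marker areas in [min_size, max_size]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values equal: use one mid-range size for every point
        return np.full(values.shape, (min_size + max_size) / 2)
    return min_size + (values - values.min()) / span * (max_size - min_size)

np.random.seed(42)
x = np.random.randn(60)
y = 2 * x + np.random.randn(60) * 0.5
population = np.random.exponential(scale=50000, size=60)  # hypothetical variable

sizes = scale_sizes(population)

plt.figure(figsize=(8, 5))
plt.scatter(x, y, s=sizes, alpha=0.6, edgecolors='black', linewidth=0.5)
plt.show()
```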

Add reference lines: Include trend lines, target thresholds, or confidence intervals when they add insight. Use plt.axhline(), plt.axvline(), or plt.plot() for regression lines.
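As a minimal sketch, the example below overlays a least-squares line computed with np.polyfit, plus horizontal and vertical reference lines. Using the mean of y and x = 0 as reference positions is an arbitrary choice for illustration; in practice these would be a target or threshold from your domain.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5

# Least-squares straight line; for deg=1, polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
x_line = np.linspace(x.min(), x.max(), 100)

plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.6, s=60, edgecolors='black', linewidth=0.5,
            label='Data')
plt.plot(x_line, slope * x_line + intercept, color='red', linewidth=2,
         label=f'Fit: y = {slope:.2f}x + {intercept:.2f}')

# Reference lines: mean of y (horizontal) and x = 0 (vertical)
plt.axhline(y.mean(), color='gray', linestyle='--', linewidth=1,
            label='Mean of y')
plt.axvline(0, color='gray', linestyle=':', linewidth=1)

plt.legend()
plt.tight_layout()
plt.show()
```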

Matplotlib’s scatter plots give you the control needed for publication-quality visualizations. Master these techniques and you’ll effectively communicate complex relationships in your data.