How to Create a Sankey Diagram in Plotly

Key Insights

Sankey diagrams excel at visualizing multi-step flows where traditional bar or line charts fall short—use them for conversion funnels, budget allocations, or any process where quantities split and merge across stages
Plotly’s Sankey implementation requires just three arrays (source indices, target indices, and values) but offers extensive customization through node colors, link colors, and interactive hover templates
The most common pitfall is index misalignment between your node list and link references—always validate that source/target integers map correctly to your node array positions

Understanding Sankey Diagrams and When to Use Them

Sankey diagrams visualize flows between entities, with arrow width proportional to flow magnitude. Unlike traditional flowcharts that show process logic, Sankey diagrams quantify how much of something moves from one state to another.

Use Sankey diagrams when you need to show:

Multi-stage conversions: Website visitors → signups → paid customers
Resource allocation: Budget distribution across departments and projects
Energy or material flows: Manufacturing inputs to outputs with waste streams
Migration patterns: User movement between app sections or geographic regions

Don’t use Sankey diagrams for simple two-variable comparisons (use bar charts), time series data (use line charts), or when you have more than 15-20 nodes (readability suffers). Sankey diagrams shine when you have 3-5 stages with multiple paths at each stage.

Setting Up Your Environment

Install Plotly and ensure you have a recent version that supports the full Sankey feature set:

pip install "plotly>=5.0.0"

Your basic imports should include:

import plotly
import plotly.graph_objects as go
import plotly.io as pio
import pandas as pd
import numpy as np

# Check version
print(f"Plotly version: {plotly.__version__}")

# Set default renderer for your environment
# Use 'browser' for scripts, 'notebook' for Jupyter
pio.renderers.default = 'browser'

Basic Sankey Structure: Nodes and Links

Every Sankey diagram has two components: nodes (the boxes/entities) and links (the flows between them). The data structure is deceptively simple but requires careful indexing.

Here’s a minimal example showing data flowing from A to B to C:

import plotly.graph_objects as go

# Define nodes
node_labels = ['A', 'B', 'C']

# Define links using node indices
link_sources = [0, 1]  # A→B, B→C
link_targets = [1, 2]  # A→B, B→C
link_values = [100, 75]  # 100 from A to B, 75 from B to C

fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color='black', width=0.5),
        label=node_labels
    ),
    link=dict(
        source=link_sources,
        target=link_targets,
        value=link_values
    )
)])

fig.update_layout(title_text="Basic Sankey: A→B→C", font_size=12)
fig.show()

Critical concept: Source and target arrays use zero-based indices referencing the node_labels list. Source [0, 1] means the first link starts from node 0 (A) and the second from node 1 (B).
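One way to avoid counting indices by hand is to describe each flow by label and derive the integers from the labels. The sketch below assumes a hypothetical helper, build_sankey_arrays, which is not part of Plotly; it returns the label list plus the three arrays that go.Sankey expects.

```python
def build_sankey_arrays(labeled_flows):
    """Convert (source_label, target_label, value) tuples into Sankey arrays.

    Hypothetical helper for illustration; not part of Plotly.
    """
    labels = []  # node labels in first-seen order
    index = {}   # label -> position in `labels`

    def node_id(label):
        # Register a label the first time it appears, then reuse its index
        if label not in index:
            index[label] = len(labels)
            labels.append(label)
        return index[label]

    sources, targets, values = [], [], []
    for src_label, tgt_label, value in labeled_flows:
        sources.append(node_id(src_label))
        targets.append(node_id(tgt_label))
        values.append(value)
    return labels, sources, targets, values


labels, sources, targets, values = build_sankey_arrays([
    ("A", "B", 100),
    ("B", "C", 75),
])
print(labels, sources, targets, values)
```

The four results can be passed to go.Sankey as node label, link source, link target, and link value, exactly as in the minimal example.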

Building a Real-World Example: Website Traffic Flow

Let’s create a realistic Sankey showing how users navigate a website from landing pages through content to conversions:

import plotly.graph_objects as go

# Define all nodes in our flow
nodes = [
    'Google Search',      # 0
    'Social Media',       # 1
    'Direct Traffic',     # 2
    'Homepage',          # 3
    'Blog',              # 4
    'Pricing Page',      # 5
    'Signup',            # 6
    'Exit'               # 7
]

# Define flows as (source_index, target_index, value) tuples
flows = [
    # Traffic sources to landing pages
    (0, 3, 5000),   # Google → Homepage
    (0, 4, 3000),   # Google → Blog
    (1, 3, 2000),   # Social → Homepage
    (1, 4, 1500),   # Social → Blog
    (2, 3, 1000),   # Direct → Homepage
    
    # Landing pages to conversion points
    (3, 5, 3500),   # Homepage → Pricing
    (3, 7, 4500),   # Homepage → Exit
    (4, 5, 2000),   # Blog → Pricing
    (4, 7, 2500),   # Blog → Exit
    
    # Pricing to final outcome
    (5, 6, 2200),   # Pricing → Signup
    (5, 7, 3300),   # Pricing → Exit
]

# Unpack flows into separate arrays
sources = [f[0] for f in flows]
targets = [f[1] for f in flows]
values = [f[2] for f in flows]

fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=20,
        thickness=25,
        line=dict(color='white', width=2),
        label=nodes,
    ),
    link=dict(
        source=sources,
        target=targets,
        value=values,
    )
)])

fig.update_layout(
    title_text="Website Traffic Flow Analysis",
    font_size=14,
    height=600,
    width=1200
)

fig.show()

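This diagram shows that Homepage receives the most traffic (8,000 visits) and produces the largest exit flow (4,500), while Homepage and Blog send a similar share of their visitors on to the Pricing page (about 44% each). Of the 5,500 visitors who reach Pricing, 40% sign up.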
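To check a reading of the diagram against the numbers, the share of each node's outgoing traffic can be computed directly from the flows list. This is a minimal sketch that repeats the data from the example; share is a hypothetical helper name, not a Plotly function.

```python
from collections import defaultdict

nodes = ['Google Search', 'Social Media', 'Direct Traffic',
         'Homepage', 'Blog', 'Pricing Page', 'Signup', 'Exit']

flows = [
    (0, 3, 5000), (0, 4, 3000), (1, 3, 2000), (1, 4, 1500), (2, 3, 1000),
    (3, 5, 3500), (3, 7, 4500), (4, 5, 2000), (4, 7, 2500),
    (5, 6, 2200), (5, 7, 3300),
]

# Total volume leaving each node
outflow = defaultdict(int)
for src, _tgt, val in flows:
    outflow[src] += val


def share(src, tgt):
    """Fraction of everything leaving node `src` that goes to node `tgt`."""
    moved = sum(v for s, t, v in flows if s == src and t == tgt)
    return moved / outflow[src]


# Homepage -> Pricing, Blog -> Pricing, Pricing -> Signup
for src, tgt in [(3, 5), (4, 5), (5, 6)]:
    print(f"{nodes[src]} -> {nodes[tgt]}: {share(src, tgt):.1%}")
```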

Customization and Styling

Raw Sankey diagrams work but lack visual hierarchy. Add colors, custom hover information, and styling to make insights pop:

import plotly.graph_objects as go

nodes = ['Google Search', 'Social Media', 'Direct Traffic', 
         'Homepage', 'Blog', 'Pricing Page', 'Signup', 'Exit']

sources = [0, 0, 1, 1, 2, 3, 3, 4, 4, 5, 5]
targets = [3, 4, 3, 4, 3, 5, 7, 5, 7, 6, 7]
values = [5000, 3000, 2000, 1500, 1000, 3500, 4500, 2000, 2500, 2200, 3300]

# Define node colors by category
node_colors = [
    '#2E86AB',  # Google - blue
    '#A23B72',  # Social - purple
    '#F18F01',  # Direct - orange
    '#C73E1D',  # Homepage - red
    '#C73E1D',  # Blog - red
    '#6A994E',  # Pricing - green
    '#06A77D',  # Signup - teal (success)
    '#D62828',  # Exit - dark red (loss)
]

# Color links based on whether they lead to conversion or exit
link_colors = []
for src, tgt in zip(sources, targets):
    if tgt == 6:  # Signup node
        link_colors.append('rgba(6, 167, 125, 0.4)')
    elif tgt == 7:  # Exit node
        link_colors.append('rgba(214, 40, 40, 0.2)')
    else:
        link_colors.append('rgba(100, 100, 100, 0.3)')

fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=20,
        thickness=25,
        line=dict(color='white', width=2),
        label=nodes,
        color=node_colors,
        customdata=[f"Node {i}" for i in range(len(nodes))],
        hovertemplate='%{label}<br />Total: %{value:,.0f}<extra></extra>',
    ),
    link=dict(
        source=sources,
        target=targets,
        value=values,
        color=link_colors,
        customdata=[f"{values[i]:,.0f} users" for i in range(len(values))],
        hovertemplate='%{source.label} → %{target.label}<br />%{customdata}<extra></extra>',
    )
)])

fig.update_layout(
    title_text="Website Traffic Flow with Color Coding",
    font=dict(size=14, family='Arial'),
    height=700,
    width=1400,
    paper_bgcolor='#F5F5F5'  # Sankey has no cartesian plot area, so color the paper
)

fig.show()

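The hovertemplate parameter controls what appears when users hover over nodes or links. Use the %{label}, %{value}, and %{customdata} placeholders for dynamic content; link templates can also reference %{source.label} and %{target.label}, and a trailing <extra></extra> hides the secondary hover box.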
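Because customdata is just a list with one entry per link, it can carry derived figures as well as raw counts. The sketch below builds hover strings that include each link's share of its source node's outgoing traffic; the variable name link_customdata and the wording of the string are illustrative choices.

```python
sources = [0, 0, 1, 1, 2, 3, 3, 4, 4, 5, 5]
targets = [3, 4, 3, 4, 3, 5, 7, 5, 7, 6, 7]
values = [5000, 3000, 2000, 1500, 1000, 3500, 4500, 2000, 2500, 2200, 3300]

# Total volume leaving each source node
source_totals = {}
for src, val in zip(sources, values):
    source_totals[src] = source_totals.get(src, 0) + val

# One hover string per link: absolute count plus share of the source's outflow
link_customdata = [
    f"{val:,.0f} users ({val / source_totals[src]:.1%} of source)"
    for src, val in zip(sources, values)
]

print(link_customdata[0])
```

Passing link_customdata as the link's customdata, with the same '%{customdata}' placeholder in the link hovertemplate, shows the share on hover.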

Advanced Techniques

For complex flows, implement conditional coloring based on performance thresholds:

import plotly.graph_objects as go

# Reuses sources, targets, values, and fig from the previous example

# Pick a link color from its share of the source node's outgoing flow
def get_link_color(source_idx, target_idx, value, total_from_source):
    conversion_rate = value / total_from_source
    
    if target_idx == 6:  # Leads to signup
        if conversion_rate > 0.4:
            return 'rgba(6, 167, 125, 0.6)'  # Strong green
        else:
            return 'rgba(6, 167, 125, 0.3)'  # Weak green
    elif target_idx == 7:  # Leads to exit
        if conversion_rate > 0.5:
            return 'rgba(214, 40, 40, 0.6)'  # Strong red (problem!)
        else:
            return 'rgba(214, 40, 40, 0.2)'  # Acceptable loss
    else:
        return 'rgba(100, 100, 100, 0.3)'

# Calculate totals per source for conversion rate
source_totals = {}
for src, val in zip(sources, values):
    source_totals[src] = source_totals.get(src, 0) + val

# Compute conditional colors for every link
link_colors_advanced = [
    get_link_color(src, tgt, val, source_totals[src])
    for src, tgt, val in zip(sources, targets, values)
]

# Apply the new colors to the existing figure
fig.update_traces(link_color=link_colors_advanced, selector=dict(type='sankey'))

# Export options
fig.write_html("sankey_diagram.html")  # Interactive HTML
fig.write_image("sankey_diagram.png", width=1400, height=700)  # Static image

For static image export, install kaleido: pip install kaleido

Common Pitfalls and Best Practices

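Index misalignment is the #1 error. Always validate:

# Validation check
max_source = max(sources)
max_target = max(targets)
num_nodes = len(nodes)

assert len(sources) == len(targets) == len(values), "Link arrays differ in length"
assert min(sources) >= 0 and min(targets) >= 0, "Negative node index found"
assert max_source < num_nodes, f"Source index {max_source} is out of range for {num_nodes} nodes"
assert max_target < num_nodes, f"Target index {max_target} is out of range for {num_nodes} nodes"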

Node overlap occurs with complex flows. Increase the pad parameter or reduce thickness:

node=dict(pad=30, thickness=15)  # More spacing, thinner nodes

Large diagrams become slow to render and hard to read; as a rough guide, expect trouble beyond about 50 nodes or 200 links. For large datasets, aggregate smaller flows into an “Other” category or create multiple diagrams by subsystem.
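The “Other” aggregation can be sketched as follows. collapse_small_flows is a hypothetical helper, not a Plotly function, and the budget figures are made-up illustration data. It works on labeled flows so that no indices need renumbering, and it assumes the small targets are terminal nodes, since anything flowing onward from them is not rerouted.

```python
def collapse_small_flows(labeled_flows, min_value, other_label="Other"):
    """Merge links below `min_value` into one link per source to `other_label`.

    Hypothetical helper for illustration; not part of Plotly.
    """
    kept = []
    merged = {}  # source label -> summed value of its small links
    for src, tgt, val in labeled_flows:
        if val >= min_value:
            kept.append((src, tgt, val))
        else:
            merged[src] = merged.get(src, 0) + val
    kept.extend((src, other_label, total) for src, total in merged.items())
    return kept


# Illustrative data only
budget_flows = [
    ("Budget", "Engineering", 600),
    ("Budget", "Sales", 300),
    ("Budget", "Travel", 20),
    ("Budget", "Snacks", 5),
]
collapsed = collapse_small_flows(budget_flows, min_value=50)

# Derive the node list and index arrays from the remaining labels
labels = list(dict.fromkeys(name for s, t, _ in collapsed for name in (s, t)))
index = {label: i for i, label in enumerate(labels)}
sources = [index[s] for s, _, _ in collapsed]
targets = [index[t] for _, t, _ in collapsed]
values = [v for _, _, v in collapsed]

print(labels, sources, targets, values)
```

Merging per source keeps each source node's total outflow unchanged, so the diagram's proportions stay honest while the number of links drops.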

Data validation should check for negative values and ensure conservation of flow (what goes in equals what comes out):

# Check flow conservation for each intermediate node
for i in range(len(nodes)):
    inflow = sum(val for src, tgt, val in zip(sources, targets, values) if tgt == i)
    outflow = sum(val for src, tgt, val in zip(sources, targets, values) if src == i)
    if inflow > 0 and outflow > 0:  # Intermediate node
        if abs(inflow - outflow) > 0.01:
            print(f"Warning: Node {nodes[i]} has imbalanced flow (in: {inflow}, out: {outflow})")
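The loop above covers flow conservation but not the negative-value check mentioned earlier. A minimal sketch of that check follows; find_invalid_values is a hypothetical helper name, and treating zero as invalid is an assumption made here because a zero-width link carries no information (relax it if zeros are meaningful in your data).

```python
import math


def find_invalid_values(values):
    """Return (position, value) pairs that are not positive, finite numbers."""
    bad = []
    for i, val in enumerate(values):
        is_number = isinstance(val, (int, float)) and not isinstance(val, bool)
        if not is_number or not math.isfinite(val) or val <= 0:
            bad.append((i, val))
    return bad


# Illustrative input with three problems: a negative, a zero, and a NaN
sample_values = [5000, 3000, -20, 0, float("nan")]
problems = find_invalid_values(sample_values)

for position, value in problems:
    print(f"Link {position} has an invalid value: {value!r}")
```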

Sankey diagrams in Plotly transform complex flow data into intuitive visualizations. Start with a clean data structure, validate your indices, and layer on styling to highlight the story in your data. The interactive nature of Plotly’s implementation, with hover details and draggable nodes, makes these diagrams particularly effective for exploratory analysis and stakeholder presentations.