Tarjan's Algorithm: Strongly Connected Components

Key Insights

  • Tarjan’s algorithm finds all strongly connected components in a single DFS pass using discovery times and low-link values, achieving O(V + E) time complexity
  • The low-link value tracks the smallest discovery time reachable from a node’s subtree, and when it equals the node’s discovery time, that node is an SCC root
  • Unlike Kosaraju’s two-pass approach, Tarjan’s algorithm uses a stack to track nodes in the current DFS path, making it more cache-friendly and often faster in practice

Introduction to Strongly Connected Components

A strongly connected component (SCC) is a maximal subgraph where every vertex can reach every other vertex through directed edges. “Maximal” means you can’t add another vertex without breaking this property.

SCCs appear everywhere in software systems. Package managers use them to detect circular dependencies—if packages A, B, and C form an SCC, you’ve got a dependency cycle that needs breaking. Compilers identify SCCs in control flow graphs to optimize loops. Social networks cluster users into tightly-knit communities. Circuit designers analyze feedback loops in digital systems.

For small graphs, naive approaches work fine. But production systems deal with dependency graphs containing millions of nodes. You need an algorithm that scales linearly with graph size. Tarjan’s algorithm delivers exactly that: O(V + E) time complexity in a single depth-first traversal.

Prerequisites: Graph Representation and DFS

Before diving into Tarjan’s algorithm, let’s establish our foundation. We’ll represent directed graphs using adjacency lists—the standard choice for sparse graphs.

from collections import defaultdict
from typing import List, Set

class DirectedGraph:
    def __init__(self):
        self.adj: dict[int, List[int]] = defaultdict(list)
        self.vertices: Set[int] = set()
    
    def add_edge(self, u: int, v: int) -> None:
        self.adj[u].append(v)
        self.vertices.add(u)
        self.vertices.add(v)
    
    def dfs_iterative(self, start: int) -> List[int]:
        """Basic iterative DFS for reference."""
        visited = set()
        result = []
        stack = [start]
        
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            result.append(node)
            
            # Add neighbors in reverse for consistent ordering
            for neighbor in reversed(self.adj[node]):
                if neighbor not in visited:
                    stack.append(neighbor)
        
        return result
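
To sanity-check the traversal order, here is the same loop run standalone on a small made-up graph (the edge list is purely illustrative):

```python
from collections import defaultdict

# A tiny example graph: 0 -> 1, 0 -> 2, 1 -> 3
adj = defaultdict(list)
for u, v in [(0, 1), (0, 2), (1, 3)]:
    adj[u].append(v)

visited, order, stack = set(), [], [0]
while stack:
    node = stack.pop()
    if node in visited:
        continue
    visited.add(node)
    order.append(node)
    # Reversed push so the first-listed neighbor is explored first
    for neighbor in reversed(adj[node]):
        if neighbor not in visited:
            stack.append(neighbor)

print(order)  # [0, 1, 3, 2]: the 0 -> 1 branch is fully explored before 0 -> 2
```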

DFS assigns each node two timestamps: discovery time (when first visited) and finish time (when all descendants are processed). Tarjan’s algorithm exploits the discovery time to identify SCC boundaries.
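
To make the two timestamps concrete, here is a small sketch (the `dfs_times` helper is illustrative, not part of the article's code):

```python
def dfs_times(adj, start):
    """Record discovery and finish timestamps for each node reachable from start."""
    disc, fin, clock = {}, {}, [0]

    def visit(v):
        disc[v] = clock[0]          # discovery: first time we see v
        clock[0] += 1
        for w in adj.get(v, ()):
            if w not in disc:
                visit(w)
        fin[v] = clock[0]           # finish: all descendants fully processed
        clock[0] += 1

    visit(start)
    return disc, fin

disc, fin = dfs_times({0: [1, 2]}, 0)
print(disc)  # {0: 0, 1: 1, 2: 3}
print(fin)   # {1: 2, 2: 4, 0: 5}
```

Note that a node always finishes after all of its descendants, so ancestors carry larger finish times than anything in their subtree.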

Tarjan’s algorithm tracks two values for each node:

Discovery time (disc): A counter incremented each time we visit a new node. If we visit nodes in order A → B → C, they get discovery times 0, 1, 2.

Low-link value (low): The smallest discovery time reachable from this node through its DFS subtree, including back edges to ancestors still on the stack.

The key insight: when a node’s low-link value equals its discovery time, that node is the root of an SCC. It can’t reach any node discovered earlier, meaning it and its descendants form a closed component.

class TarjanState:
    """Tracks algorithm state during traversal."""
    
    def __init__(self):
        self.disc: dict[int, int] = {}      # Discovery times
        self.low: dict[int, int] = {}       # Low-link values
        self.on_stack: dict[int, bool] = {} # Currently on DFS stack
        self.stack: List[int] = []          # The DFS stack
        self.time: int = 0                  # Global timestamp counter
        self.sccs: List[List[int]] = []     # Discovered SCCs
    
    def initialize_node(self, node: int) -> None:
        """Set initial values when discovering a node."""
        self.disc[node] = self.time
        self.low[node] = self.time
        self.time += 1
        self.stack.append(node)
        self.on_stack[node] = True
    
    def update_low(self, node: int, neighbor: int) -> None:
        """Update low-link for an edge to an already-visited neighbor."""
        if self.on_stack.get(neighbor, False):
            # Back or cross edge to a node on the current DFS path
            self.low[node] = min(self.low[node], self.disc[neighbor])
        # Edges into already-completed SCCs are deliberately ignored:
        # merging their low-links here would glue separate SCCs together
    
    def absorb_child(self, node: int, child: int) -> None:
        """After returning from a tree-edge child, absorb its low-link."""
        self.low[node] = min(self.low[node], self.low[child])

The on_stack check is crucial. We only consider back edges to nodes currently in our DFS path—not to nodes in already-completed SCCs.
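
For intuition before the full iterative version, here is the classic recursive formulation as a compact sketch (function and variable names are mine). It does exactly the bookkeeping described above: tree edges absorb the child's low-link after recursion, on-stack edges use the neighbor's discovery time, and edges into completed SCCs are skipped:

```python
def tarjan_recursive(adj, vertices):
    """Tarjan's SCCs, recursive form. Recursion depth equals DFS depth."""
    disc, low = {}, {}
    stack, on_stack = [], set()
    sccs, clock = [], [0]

    def dfs(v):
        disc[v] = low[v] = clock[0]
        clock[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj.get(v, ()):
            if w not in disc:
                dfs(w)                         # tree edge: recurse first...
                low[v] = min(low[v], low[w])   # ...then absorb the child's low-link
            elif w in on_stack:
                low[v] = min(low[v], disc[w])  # edge back into the current DFS path
            # else: w sits in a completed SCC and contributes nothing
        if low[v] == disc[v]:                  # v is an SCC root: pop its component
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in vertices:
        if v not in disc:
            dfs(v)
    return sccs

print(tarjan_recursive({0: [1], 1: [2], 2: [0, 3]}, [0, 1, 2, 3]))  # [[3], [2, 1, 0]]
```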

Tarjan’s Algorithm: Step-by-Step Walkthrough

Here’s the complete implementation. I’m using an iterative approach to avoid stack overflow on deep graphs:

from collections import defaultdict
from typing import List, Tuple
from enum import Enum, auto

class NodeState(Enum):
    UNVISITED = auto()
    PROCESSING = auto()
    DONE = auto()

def tarjan_scc(graph: DirectedGraph) -> List[List[int]]:
    """
    Find all strongly connected components using Tarjan's algorithm.
    Returns list of SCCs, each SCC is a list of node IDs.
    """
    disc: dict[int, int] = {}
    low: dict[int, int] = {}
    on_stack: dict[int, bool] = defaultdict(bool)
    stack: List[int] = []
    time_counter: List[int] = [0]  # Mutable container for closure
    sccs: List[List[int]] = []
    state: dict[int, NodeState] = defaultdict(lambda: NodeState.UNVISITED)
    
    def strongconnect(start: int) -> None:
        # Iterative DFS using an explicit call stack.
        # Each frame is (node, neighbor_index): the index of the next
        # neighbor to examine when this frame is resumed.
        call_stack: List[Tuple[int, int]] = [(start, 0)]
        
        while call_stack:
            node, neighbor_idx = call_stack.pop()
            neighbors = graph.adj[node]
            
            if neighbor_idx == 0 and state[node] == NodeState.UNVISITED:
                # First visit: initialize node
                disc[node] = time_counter[0]
                low[node] = time_counter[0]
                time_counter[0] += 1
                stack.append(node)
                on_stack[node] = True
                state[node] = NodeState.PROCESSING
            
            # Process neighbors starting from neighbor_idx
            found_unvisited = False
            for i in range(neighbor_idx, len(neighbors)):
                neighbor = neighbors[i]
                
                if state[neighbor] == NodeState.UNVISITED:
                    # Push current node back with next index, then recurse
                    call_stack.append((node, i + 1))
                    call_stack.append((neighbor, 0))
                    found_unvisited = True
                    break
                elif on_stack[neighbor]:
                    # Back edge to node in current SCC
                    low[node] = min(low[node], disc[neighbor])
            
            if found_unvisited:
                continue
            
            # All neighbors processed - update parent's low-link
            if call_stack:
                parent, _ = call_stack[-1]
                low[parent] = min(low[parent], low[node])
            
            # Check if node is SCC root
            if low[node] == disc[node]:
                scc = []
                while True:
                    w = stack.pop()
                    on_stack[w] = False
                    state[w] = NodeState.DONE
                    scc.append(w)
                    if w == node:
                        break
                sccs.append(scc)
    
    # Run from each unvisited vertex
    for v in graph.vertices:
        if state[v] == NodeState.UNVISITED:
            strongconnect(v)
    
    return sccs

The algorithm maintains a stack of nodes in the current DFS path. When we find an SCC root (low-link equals discovery time), we pop nodes from the stack until we reach the root—all those nodes belong to the same SCC.

Visual Example: Tracing the Algorithm

Consider this graph with 6 nodes: two cycles connected by a bridge edge 1 → 3.

0 → 1 → 2 → 0      (cycle: 0, 1, 2)
    1 → 3          (bridge into the second cycle)
    3 → 4 → 5 → 3  (cycle: 3, 4, 5)

Let’s trace the algorithm:

Step | Action           | Stack          | disc           | low            | Notes
-----|------------------|----------------|----------------|----------------|------------------------------
1    | Visit 0          | [0]            | {0:0}          | {0:0}          | Start DFS
2    | Visit 1          | [0,1]          | {0:0,1:1}      | {0:0,1:1}      | Follow edge 0→1
3    | Visit 2          | [0,1,2]        | {0:0,1:1,2:2}  | {0:0,1:1,2:2}  | Follow edge 1→2
4    | Back edge 2→0    | [0,1,2]        | same           | {0:0,1:1,2:0}  | low[2] = min(2, disc[0]) = 0
5    | Return to 1      | [0,1,2]        | same           | {0:0,1:0,2:0}  | low[1] = min(1, low[2]) = 0
6    | Visit 3          | [0,1,2,3]      | +{3:3}         | +{3:3}         | Follow edge 1→3
7    | Visit 4          | [0,1,2,3,4]    | +{4:4}         | +{4:4}         | Continue DFS
8    | Visit 5          | [0,1,2,3,4,5]  | +{5:5}         | +{5:5}         | Continue DFS
9    | Back edge 5→3    | same           | same           | {5:3}          | low[5] = min(5, disc[3]) = 3
10   | Return, pop SCC  | [0,1,2]        | same           | {4:3,3:3}      | low[3] == disc[3], SCC: [5,4,3]
11   | Return to 1      | [0,1,2]        | same           | same           | low[1] = min(0, low[3]) = 0, unchanged
12   | Return to 0      | [0,1,2]        | same           | same           | low[0] = min(0, low[1]) = 0
13   | Pop SCC          | []             | same           | same           | low[0] == disc[0], SCC: [2,1,0]

Final SCCs: [[5,4,3], [2,1,0]]

Complexity Analysis and Comparison with Kosaraju’s

Tarjan’s Algorithm:

  • Time: O(V + E) — single DFS pass
  • Space: O(V) — stack, disc, low arrays

Kosaraju’s Algorithm:

  • Time: O(V + E) — but requires two DFS passes
  • Space: O(V + E) — needs to store the transposed graph

Both achieve linear time, but Tarjan’s has practical advantages:

  1. Single pass: Better cache locality, fewer memory accesses
  2. No graph transposition: Kosaraju’s must reverse all edges, which costs O(E) time and space
  3. Online processing: Tarjan’s can output SCCs as they’re found

Choose Kosaraju’s when you already have the transposed graph or when code simplicity matters more than performance. Choose Tarjan’s for production systems where every microsecond counts.
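
To make the comparison concrete, here is a minimal Kosaraju sketch (function names are mine, not from a library); the explicit transposition step is the extra O(V + E) work Tarjan's avoids:

```python
from collections import defaultdict

def kosaraju_scc(adj, vertices):
    """Kosaraju's two-pass SCC algorithm - a minimal sketch for comparison."""
    # Pass 1: DFS on the original graph, recording post-order finish times
    visited, order = set(), []

    def dfs_forward(v):
        visited.add(v)
        for w in adj.get(v, ()):
            if w not in visited:
                dfs_forward(w)
        order.append(v)  # v finishes after all of its descendants

    for v in vertices:
        if v not in visited:
            dfs_forward(v)

    # Transpose the graph: reverse every edge
    radj = defaultdict(list)
    for u, neighbors in adj.items():
        for w in neighbors:
            radj[w].append(u)

    # Pass 2: DFS on the transpose in decreasing finish order;
    # each tree found is one strongly connected component
    visited.clear()
    sccs = []

    def dfs_reverse(v, component):
        visited.add(v)
        component.append(v)
        for w in radj.get(v, ()):
            if w not in visited:
                dfs_reverse(w, component)

    for v in reversed(order):
        if v not in visited:
            component = []
            dfs_reverse(v, component)
            sccs.append(component)
    return sccs
```

On the six-node graph from the walkthrough, this produces the same two components, {0, 1, 2} and {3, 4, 5}, after two full traversals plus the edge-reversal pass.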

Practical Applications

Here’s a real-world example: detecting circular dependencies in a module system.

def find_circular_dependencies(modules: dict[str, List[str]]) -> List[List[str]]:
    """
    Given a dict of module -> [dependencies], find circular dependency groups.
    
    Example:
        modules = {
            "auth": ["database", "utils"],
            "database": ["config"],
            "api": ["auth", "database"],
            "config": ["utils"],
            "utils": ["config"],  # Circular: utils -> config -> utils
        }
    """
    # Build graph with string-to-int mapping
    name_to_id: dict[str, int] = {}
    id_to_name: dict[int, str] = {}
    
    for i, name in enumerate(modules.keys()):
        name_to_id[name] = i
        id_to_name[i] = name
    
    graph = DirectedGraph()
    for module, deps in modules.items():
        for dep in deps:
            if dep in name_to_id:  # Only internal dependencies
                graph.add_edge(name_to_id[module], name_to_id[dep])
    
    sccs = tarjan_scc(graph)
    
    # Filter to SCCs with more than one node (actual cycles)
    cycles = [
        [id_to_name[node] for node in scc]
        for scc in sccs
        if len(scc) > 1
    ]
    
    return cycles

# Usage
modules = {
    "auth": ["database", "utils"],
    "database": ["config"],
    "api": ["auth", "database"],
    "config": ["utils"],
    "utils": ["config"],
}

cycles = find_circular_dependencies(modules)
print(f"Circular dependencies found: {cycles}")
# Output: Circular dependencies found: [['utils', 'config']]
# (order of groups, and of names within a group, depends on traversal order)

Beyond dependency analysis, Tarjan’s algorithm powers:

  • 2-SAT solvers: Each SCC in the implication graph represents equivalent literals
  • Compiler optimization: Identifying loops for strength reduction and invariant code motion
  • Deadlock detection: Cycles in resource allocation graphs indicate potential deadlocks

Tarjan’s algorithm is one of those foundational tools every systems programmer should internalize. It’s elegant, efficient, and solves problems you’ll encounter repeatedly throughout your career.
