Python - Garbage Collection and Memory Management
Key Insights
• Python uses reference counting as its primary garbage collection mechanism, supplemented by a generational garbage collector to handle circular references that reference counting alone cannot resolve.
• Understanding memory management patterns helps identify memory leaks, particularly in long-running applications where circular references, unclosed resources, or unintentional object retention can accumulate.
• The gc module provides direct control over garbage collection behavior, enabling profiling, debugging, and performance optimization in memory-intensive applications.
Reference Counting Fundamentals
Python manages memory primarily through reference counting. Every object maintains a count of references pointing to it. When the count reaches zero, Python immediately deallocates the memory.
import sys
# Create an object and check its reference count
x = [1, 2, 3]
print(sys.getrefcount(x)) # Returns 2 (x + temporary reference in getrefcount)
# Add another reference
y = x
print(sys.getrefcount(x)) # Returns 3
# Remove a reference
del y
print(sys.getrefcount(x)) # Returns 2
Reference counting provides deterministic, immediate cleanup for most objects: when the last reference is deleted or goes out of scope, Python reclaims the memory at once. This works efficiently for acyclic object graphs.
class Resource:
    def __init__(self, name):
        self.name = name
        print(f"Resource {self.name} created")
    def __del__(self):
        print(f"Resource {self.name} destroyed")

def create_resource():
    r = Resource("temp")
    # r is destroyed immediately when the function exits

create_resource()
print("Function completed")
# Output:
# Resource temp created
# Resource temp destroyed
# Function completed
The Circular Reference Problem
Reference counting fails with circular references. When objects reference each other, their reference counts never drop to zero, so reference counting alone would leak the cycle.
import gc
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
    def __del__(self):
        print(f"Node {self.value} deleted")
# Disable automatic garbage collection to demonstrate the issue
gc.disable()
# Create a circular reference
node1 = Node(1)
node2 = Node(2)
node1.next = node2
node2.next = node1
# Delete references
del node1
del node2
print("References deleted, but the cycle keeps both nodes alive")
# gc.garbage holds only uncollectable objects, so it is empty here
print(f"Uncollectable objects: {len(gc.garbage)}")
# Manually trigger collection
collected = gc.collect()
print(f"Collected {collected} objects")
gc.enable()
The generational garbage collector solves this by periodically scanning for unreachable object cycles. It organizes objects into three generations based on survival time, focusing collection efforts on younger generations where most objects die quickly.
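The promotion behavior can be observed directly. This small sketch uses gc.get_stats() and a targeted gc.collect(0) call to show a generation-0 pass being counted (exact numbers depend on interpreter state):

```python
import gc

# Each entry in gc.get_stats() is a per-generation dict with
# 'collections', 'collected', and 'uncollectable' counters
before = gc.get_stats()

# gc.collect(0) scans only generation 0; survivors are promoted to generation 1
gc.collect(0)

after = gc.get_stats()
print("Gen 0 collections:", before[0]["collections"], "->", after[0]["collections"])
```

Passing a generation number to gc.collect() is what lets you pay only for a young-generation scan instead of a full collection.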
Controlling Garbage Collection
The gc module provides fine-grained control over collection behavior. Understanding these controls helps optimize performance in specific scenarios.
import gc
# Get current collection thresholds
print(gc.get_threshold()) # Commonly (700, 10, 10); exact defaults vary across CPython versions
# Adjust thresholds (generation0, generation1, generation2)
gc.set_threshold(1000, 15, 15)
# Disable/enable automatic collection
gc.disable()
# ... perform memory-intensive operations
gc.enable()
# Manually trigger collection
collected = gc.collect()
print(f"Collected {collected} objects")
# Get collection statistics
stats = gc.get_stats()
for i, stat in enumerate(stats):
    print(f"Generation {i}: {stat}")
For performance-critical sections, temporarily disabling garbage collection can reduce overhead:
import gc
import time
def process_large_dataset(data):
    gc_enabled = gc.isenabled()
    gc.disable()
    try:
        # Process data without GC interruptions
        result = [item * 2 for item in data]
        return result
    finally:
        if gc_enabled:
            gc.enable()
            gc.collect()  # Clean up after processing
# Benchmark
data = list(range(1000000))
start = time.time()
result1 = process_large_dataset(data)
print(f"With GC control: {time.time() - start:.4f}s")
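To see what disabling the collector actually buys, time the same allocation-heavy workload both ways. This is a rough sketch, not a rigorous benchmark; numbers vary by machine and the effect only shows up for workloads that churn many container objects:

```python
import gc
import time

def churn(n):
    # Many short-lived containers: the workload the generation-0 collector scans
    return [{"id": i} for i in range(n)]

gc.enable()
start = time.perf_counter()
churn(500_000)
with_gc = time.perf_counter() - start

gc.disable()
start = time.perf_counter()
churn(500_000)
without_gc = time.perf_counter() - start
gc.enable()
gc.collect()

print(f"GC enabled:  {with_gc:.4f}s")
print(f"GC disabled: {without_gc:.4f}s")
```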
Detecting Memory Leaks
Memory leaks in Python typically involve circular references, unclosed resources, or unintended object retention in containers. The gc module helps identify these issues.
import gc
import weakref
# Track all instances of a class
class TrackedObject:
    instances = []
    def __init__(self, name):
        self.name = name
        # Weak references avoid keeping instances alive just for tracking
        TrackedObject.instances.append(weakref.ref(self))
    @classmethod
    def get_live_instances(cls):
        # Drop dead references before reporting
        cls.instances = [ref for ref in cls.instances if ref() is not None]
        return [ref() for ref in cls.instances]
# Create and destroy objects
obj1 = TrackedObject("first")
obj2 = TrackedObject("second")
del obj1
print(f"Live instances: {len(TrackedObject.get_live_instances())}")
# Find all objects of a specific type
def find_objects_of_type(obj_type):
    return [obj for obj in gc.get_objects() if isinstance(obj, obj_type)]
# Warning: This is expensive for large applications
tracked_objects = find_objects_of_type(TrackedObject)
print(f"Found {len(tracked_objects)} TrackedObject instances")
Use gc.get_referrers() to trace what’s keeping an object alive:
import gc
class LeakyContainer:
    cache = []
    def __init__(self, data):
        self.data = data
        LeakyContainer.cache.append(self)  # Leak: never removed
obj = LeakyContainer("test data")
del obj
# Object still exists due to class-level cache
remaining = [o for o in gc.get_objects() if isinstance(o, LeakyContainer)]
if remaining:
    obj = remaining[0]
    referrers = gc.get_referrers(obj)
    print(f"Object kept alive by: {referrers}")
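One common fix for this pattern is to hold registry entries weakly, so registration does not pin the objects. A sketch using a hypothetical WeakSet-based registry:

```python
import gc
import weakref

class Container:
    # WeakSet membership does not keep instances alive
    registry = weakref.WeakSet()
    def __init__(self, data):
        self.data = data
        Container.registry.add(self)

obj = Container("test data")
print(len(Container.registry))  # 1

del obj  # refcounting frees the object; the WeakSet entry disappears with it
gc.collect()
print(len(Container.registry))  # 0
```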
Memory Profiling with tracemalloc
Python’s tracemalloc module provides detailed memory allocation tracking, essential for identifying memory hotspots.
import tracemalloc
import linecache
def display_top_memory_allocations(snapshot, key_type='lineno', limit=10):
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)
    print(f"Top {limit} memory allocations:")
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        print(f"#{index}: {frame.filename}:{frame.lineno}: {stat.size / 1024:.1f} KiB")
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print(f"    {line}")
# Start tracking
tracemalloc.start()
# Simulate memory allocations
data_structures = []
for i in range(1000):
    data_structures.append([x for x in range(1000)])
# Take a snapshot
snapshot = tracemalloc.take_snapshot()
display_top_memory_allocations(snapshot)
# Compare snapshots to find memory growth
snapshot1 = tracemalloc.take_snapshot()
# ... more operations
more_data = [list(range(10000)) for _ in range(100)]
snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
print("\nMemory allocation differences:")
for stat in top_stats[:5]:
    print(f"{stat.size_diff / 1024:.1f} KiB: {stat}")
tracemalloc.stop()
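By default tracemalloc records only one frame per allocation. Passing a frame count to tracemalloc.start() captures deeper tracebacks, which helps when many allocations funnel through a shared helper. A minimal sketch:

```python
import tracemalloc

tracemalloc.start(10)  # keep up to 10 frames per allocation

def helper():
    return [bytes(1_000) for _ in range(100)]

def caller():
    return helper()

payload = caller()
snapshot = tracemalloc.take_snapshot()

# Grouping by 'traceback' distinguishes call paths that end at the same line
for stat in snapshot.statistics('traceback')[:1]:
    print(f"{stat.count} blocks, {stat.size / 1024:.1f} KiB")
    for line in stat.traceback.format():
        print(line)

tracemalloc.stop()
```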
Context Managers and Resource Management
Proper resource management prevents memory leaks. Context managers ensure cleanup even when exceptions occur.
import weakref
from contextlib import contextmanager
class ResourcePool:
    def __init__(self):
        self._resources = weakref.WeakSet()
    def acquire(self):
        resource = Resource()
        self._resources.add(resource)
        return resource
    def active_count(self):
        return len(self._resources)

class Resource:
    def __init__(self):
        self.data = bytearray(1024 * 1024)  # 1 MB
    def close(self):
        self.data = None

@contextmanager
def managed_resource(pool):
    resource = pool.acquire()
    try:
        yield resource
    finally:
        resource.close()
# Usage
pool = ResourcePool()
with managed_resource(pool) as res:
    pass  # Use the resource
del res  # Drop the last strong reference so the WeakSet can release it
print(f"Active resources: {pool.active_count()}")  # 0 once the resource is collected
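The same cleanup guarantee can be had without contextlib by implementing the context-manager protocol directly. This hypothetical ManagedBuffer sketch mirrors the helper above:

```python
class ManagedBuffer:
    """Class-based context manager: cleanup runs in __exit__, even on exceptions."""
    def __init__(self, size=1024 * 1024):
        self.data = bytearray(size)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.data = None  # release the buffer
        return False      # don't suppress exceptions

with ManagedBuffer() as buf:
    buf.data[0] = 1

print(buf.data)  # None: released on exit
```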
Weak References for Cache Implementation
Weak references allow caching without preventing garbage collection, ideal for memory-sensitive applications.
import weakref
class ExpensiveObject:
    def __init__(self, data):
        self.data = data
        self.computed = self._expensive_computation()
    def _expensive_computation(self):
        return sum(self.data)

class SmartCache:
    def __init__(self):
        self._cache = weakref.WeakValueDictionary()
    def get_or_create(self, key, data):
        if key in self._cache:
            print(f"Cache hit for {key}")
            return self._cache[key]
        print(f"Cache miss for {key}")
        obj = ExpensiveObject(data)
        self._cache[key] = obj
        return obj
# Demonstrate weak reference caching
cache = SmartCache()
obj1 = cache.get_or_create("key1", [1, 2, 3, 4, 5])
obj2 = cache.get_or_create("key1", [1, 2, 3, 4, 5]) # Cache hit
assert obj1 is obj2
del obj1, obj2
# Objects can now be garbage collected
obj3 = cache.get_or_create("key1", [1, 2, 3, 4, 5]) # Cache miss
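For cleanup callbacks, weakref.finalize is often a safer alternative to __del__: it holds only a weak reference, never resurrects the object, and runs reliably even at interpreter shutdown. A small sketch:

```python
import weakref

class Connection:
    def __init__(self, name):
        self.name = name

def on_collect(name):
    print(f"Releasing connection {name}")

conn = Connection("db")
# The callback runs when conn is collected; finalize does not keep conn alive
finalizer = weakref.finalize(conn, on_collect, conn.name)
print(finalizer.alive)  # True

del conn  # refcount drops to zero; in CPython the callback fires immediately
print(finalizer.alive)  # False
```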
Understanding Python’s memory management enables building efficient, leak-free applications. Use reference counting for immediate cleanup, leverage the garbage collector for complex object graphs, and employ profiling tools to identify and resolve memory issues in production systems.