Python Hypothesis: Property-Based Testing

Key Insights

  • Property-based testing with Hypothesis generates hundreds of random inputs automatically, finding edge cases that hand-written examples miss—including empty lists, Unicode strings, and boundary values you’d never think to test.
  • The real skill isn’t learning the API; it’s learning to think in properties: invariants that hold true regardless of input, round-trip operations, and relationships between functions.
  • Hypothesis’s shrinking algorithm is its killer feature—when a test fails, it automatically reduces the failing input to the minimal reproducible case, turning a 500-element list failure into a 2-element one.

The Problem with Example-Based Testing

Every developer writes tests like this:

def test_sort_numbers():
    assert sort([3, 1, 2]) == [1, 2, 3]
    assert sort([]) == []
    assert sort([1]) == [1]

You pick a few representative inputs, assert the expected outputs, and move on. This approach has a fundamental flaw: you’re testing the cases you thought of. The bugs that ship to production are the ones you didn’t think of.

What about negative numbers? Duplicates? Very large lists? Lists with None values? Unicode strings if the sort is generic? You could add more examples, but you’re playing whack-a-mole with edge cases.
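To make the gap concrete, here's a contrived, hypothetical implementation that passes all three examples above yet is badly broken:

```python
def sort(lst):
    # Broken on purpose: converting to a set silently drops duplicates
    return sorted(set(lst))

# All three hand-picked examples pass...
assert sort([3, 1, 2]) == [1, 2, 3]
assert sort([]) == []
assert sort([1]) == [1]

# ...but an input nobody wrote down fails: an element vanishes
print(sort([2, 2, 1]))  # [1, 2]
```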

Property-based testing flips the approach. Instead of specifying inputs and outputs, you specify properties that should hold true for any valid input. The testing framework generates hundreds of random inputs and verifies your properties hold for all of them.

Getting Started with Hypothesis

Install Hypothesis alongside pytest:

pip install hypothesis pytest

Here’s your first property-based test:

from hypothesis import given
from hypothesis import strategies as st

def reverse(lst):
    return lst[::-1]

@given(st.lists(st.integers()))
def test_reverse_twice_returns_original(lst):
    assert reverse(reverse(lst)) == lst

The @given decorator tells Hypothesis to generate test data. st.lists(st.integers()) is a strategy that produces lists of integers. Hypothesis will run this test with hundreds of different lists: empty lists, single-element lists, lists with negative numbers, lists with duplicates, very long lists.

Run it with pytest as usual:

pytest test_example.py -v

The anatomy is straightforward: @given takes one or more strategies, and your test function receives the generated values as arguments. The test body contains your assertions about properties that should always hold.
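When @given receives several strategies, each one supplies a separate argument — a minimal sketch:

```python
from hypothesis import given
from hypothesis import strategies as st

@given(st.integers(), st.integers())
def test_addition_is_commutative(a, b):
    assert a + b == b + a
```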

Understanding Strategies

Strategies are recipes for generating data. Hypothesis provides built-in strategies for all common types:

from hypothesis import strategies as st

# Primitives
st.integers()                    # Any integer
st.integers(min_value=0, max_value=100)  # Bounded
st.floats(allow_nan=False)       # Floats without NaN
st.text()                        # Unicode strings
st.text(min_size=1, max_size=50) # Bounded length
st.booleans()                    # True or False

# Collections
st.lists(st.integers())          # List of integers
st.lists(st.text(), min_size=1)  # Non-empty list of strings
st.dictionaries(st.text(), st.integers())  # Dict[str, int]
st.tuples(st.integers(), st.text())  # Tuple[int, str]

Compose strategies for complex data:

from hypothesis import given
from hypothesis import strategies as st

@st.composite
def user_data(draw):
    letters_and_digits = st.characters(whitelist_categories=("L", "N"))
    return {
        "email": draw(st.emails()),
        "username": draw(st.text(min_size=3, max_size=20, alphabet=letters_and_digits)),
        "age": draw(st.integers(min_value=13, max_value=120)),
    }

def validate_user(data):
    if not data["email"] or "@" not in data["email"]:
        raise ValueError("Invalid email")
    if len(data["username"]) < 3:
        raise ValueError("Username too short")
    if not 13 <= data["age"] <= 120:
        raise ValueError("Invalid age")
    return True

@given(user_data())
def test_valid_user_data_passes_validation(data):
    assert validate_user(data) is True

Use one_of() for union types and sampled_from() for enums:

st.one_of(st.integers(), st.text())  # Either int or str
st.sampled_from(["pending", "active", "cancelled"])  # One of these values
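Existing strategies can also be adapted with .map() and .filter(), both part of the strategy API. Prefer .map() where possible, since .filter() discards non-matching draws:

```python
from hypothesis import strategies as st

evens = st.integers().map(lambda x: x * 2)                   # transform each draw
multiples_of_7 = st.integers().filter(lambda x: x % 7 == 0)  # reject other draws

# .example() draws a single value -- handy for exploring a strategy
# interactively, but not meant for use inside tests
print(evens.example())
```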

Writing Effective Properties

The hardest part of property-based testing is identifying properties. Here are the patterns that work:

Round-trip properties: Encode then decode returns the original.

import json
from hypothesis import given
from hypothesis import strategies as st

# Recursive strategy for JSON-compatible data
json_primitives = st.one_of(
    st.none(),
    st.booleans(),
    st.integers(),
    st.floats(allow_nan=False, allow_infinity=False),
    st.text(),
)

json_data = st.recursive(
    json_primitives,
    lambda children: st.one_of(
        st.lists(children),
        st.dictionaries(st.text(), children),
    ),
    max_leaves=10,
)

@given(json_data)
def test_json_roundtrip(data):
    encoded = json.dumps(data)
    decoded = json.loads(encoded)
    assert decoded == data

Invariants: Properties that must always hold after an operation.

@given(st.lists(st.integers()))
def test_sort_preserves_length(lst):
    sorted_lst = sorted(lst)
    assert len(sorted_lst) == len(lst)

from collections import Counter

@given(st.lists(st.integers()))
def test_sort_preserves_elements(lst):
    # Compare as multisets; re-sorting both sides would reuse the
    # operation under test and pass trivially
    assert Counter(sorted(lst)) == Counter(lst)

@given(st.lists(st.integers()))
def test_sort_is_ordered(lst):
    sorted_lst = sorted(lst)
    for i in range(len(sorted_lst) - 1):
        assert sorted_lst[i] <= sorted_lst[i + 1]
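Idempotence is a related invariant worth checking: applying the operation a second time should change nothing.

```python
from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_sort_is_idempotent(lst):
    assert sorted(sorted(lst)) == sorted(lst)
```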

Oracle comparisons: Compare against a known-correct (but possibly slow) implementation.

def fast_median(lst):
    # Your optimized implementation
    sorted_lst = sorted(lst)
    n = len(sorted_lst)
    if n % 2 == 1:
        return sorted_lst[n // 2]
    return (sorted_lst[n // 2 - 1] + sorted_lst[n // 2]) / 2

def slow_median(lst):
    # Obviously correct reference implementation
    import statistics
    return statistics.median(lst)

@given(st.lists(st.integers(), min_size=1))
def test_fast_median_matches_stdlib(lst):
    assert fast_median(lst) == slow_median(lst)

Hard to compute, easy to verify: Check the result is correct without reimplementing the function.

from math import prod
from hypothesis import given, strategies as st

def prime_factors(n):
    # Stand-in trial-division implementation so the example runs
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

@given(st.integers(min_value=2, max_value=1000))
def test_prime_factors(n):
    # Easy to verify: the product of the factors equals n
    assert prod(prime_factors(n)) == n
    # Each factor being prime can be checked separately

Shrinking and Debugging

When Hypothesis finds a failing input, it doesn’t just report it—it shrinks it to the minimal failing case. This is invaluable for debugging.

from hypothesis import given
from hypothesis import strategies as st

def buggy_sort(lst):
    # Bug: fails when list has duplicates
    if len(lst) != len(set(lst)):
        return lst  # Oops, forgot to sort
    return sorted(lst)

@given(st.lists(st.integers()))
def test_buggy_sort_is_ordered(lst):
    result = buggy_sort(lst)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]

Hypothesis might initially find a failure with [47, -892, 47, 103, 47], but it will shrink this to [1, 0, 0]—the simplest case that still contains both a duplicate and an out-of-order pair.

The output shows you the minimal failing example:

Falsifying example: test_buggy_sort_is_ordered(lst=[1, 0, 0])

Once you fix the bug, add the failing case as a regression test using @example. (Hypothesis also replays recent failures automatically from its local example database, but an explicit @example is portable and survives a cleared cache.)

from hypothesis import given, example
from hypothesis import strategies as st

@given(st.lists(st.integers()))
@example([1, 0, 0])  # Regression test for the duplicate bug
def test_sort_is_ordered(lst):
    result = sort(lst)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]

Advanced Techniques

Stateful testing verifies that a sequence of operations leaves your system in a valid state:

from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, precondition, invariant

class Stack:
    def __init__(self):
        self._items = []
    
    def push(self, item):
        self._items.append(item)
    
    def pop(self):
        if not self._items:
            raise IndexError("Pop from empty stack")
        return self._items.pop()
    
    def peek(self):
        if not self._items:
            raise IndexError("Peek at empty stack")
        return self._items[-1]
    
    def is_empty(self):
        return len(self._items) == 0

class StackMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.stack = Stack()
        self.model = []  # Reference model
    
    @rule(item=st.integers())
    def push(self, item):
        self.stack.push(item)
        self.model.append(item)
    
    @precondition(lambda self: len(self.model) > 0)
    @rule()
    def pop(self):
        actual = self.stack.pop()
        expected = self.model.pop()
        assert actual == expected
    
    @invariant()
    def size_matches(self):
        assert len(self.stack._items) == len(self.model)

TestStack = StackMachine.TestCase

Settings and profiles let you configure Hypothesis for different environments:

from hypothesis import settings, Verbosity

# In conftest.py or settings file
settings.register_profile("ci", max_examples=1000, deadline=None)
settings.register_profile("dev", max_examples=10)
settings.register_profile("debug", max_examples=10, verbosity=Verbosity.verbose)

# Run with: pytest --hypothesis-profile=ci
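Profiles apply globally (select one with settings.load_profile("dev") or the --hypothesis-profile flag shown above); to override settings for a single test, stack @settings with @given — a sketch:

```python
from hypothesis import given, settings
from hypothesis import strategies as st

@settings(max_examples=50, deadline=None)  # overrides the active profile
@given(st.lists(st.integers()))
def test_with_local_settings(lst):
    assert len(sorted(lst)) == len(lst)
```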

Integration and Best Practices

Combine Hypothesis with pytest fixtures for setup. One caveat: a function-scoped fixture is set up once per test, not once per generated example, and Hypothesis flags this combination with a health check—prefer a broader fixture scope, or reset state inside the test, when examples must be independent:

import pytest
from hypothesis import given
from hypothesis import strategies as st

@pytest.fixture
def database_connection():
    conn = create_connection()  # create_connection/create_user are placeholders
    yield conn
    conn.close()

@given(st.text(min_size=1))
def test_user_creation(database_connection, username):
    user = create_user(database_connection, username)
    assert user.username == username

When to use property-based testing:

  • Pure functions with clear mathematical properties
  • Serialization/deserialization
  • Data structure implementations
  • Parsers and validators
  • Any code where you can define “correct behavior” without specifying exact outputs

When to stick with example-based testing:

  • Integration tests with external services
  • Tests where setup cost is high
  • Behavior that’s genuinely example-specific (specific business rules)

Performance tip: property-based tests are slower than example-based tests. Run fewer examples during development (max_examples=10) and more in CI (max_examples=1000)—exactly what the profiles above configure.

Property-based testing won’t replace all your tests, but it will find bugs you’d never catch otherwise. Start with one function, write one property, and let Hypothesis show you what you missed.
