How to Concatenate DataFrames in Polars

Key Insights

  • Polars provides four concatenation strategies (vertical, horizontal, diagonal, align) that handle different schema scenarios, unlike pandas’ single concat function with confusing axis parameters.
  • The diagonal and align modes solve the common pain point of combining DataFrames with mismatched columns without manual preprocessing.
  • Lazy concatenation with scan_* functions enables memory-efficient processing of datasets larger than RAM by deferring computation until necessary.

Introduction

DataFrame concatenation is one of those operations you’ll perform constantly in data engineering work. Whether you’re combining daily log files, merging results from parallel processing, or assembling data from multiple sources, you need a reliable way to stack or join DataFrames together.

Polars handles concatenation differently than pandas, and frankly, it does it better. Instead of a single function with an axis parameter that nobody can remember (is axis=0 rows or columns?), Polars uses explicit how parameters that clearly describe the operation. The performance difference is also substantial—Polars’ concatenation is built on Apache Arrow’s columnar format, making it significantly faster for large datasets.

Let’s walk through each concatenation method and when to use it.

Vertical Concatenation with pl.concat()

Vertical concatenation stacks DataFrames on top of each other, adding rows. This is the default behavior and the most common use case—think combining monthly sales data or appending new records to existing datasets.

import polars as pl

# Create two DataFrames with identical schemas
df1 = pl.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78]
})

df2 = pl.DataFrame({
    "user_id": [4, 5, 6],
    "name": ["Diana", "Eve", "Frank"],
    "score": [91, 88, 95]
})

# Vertical concatenation (default)
combined = pl.concat([df1, df2])
print(combined)

Output:

shape: (6, 3)
┌─────────┬─────────┬───────┐
│ user_id ┆ name    ┆ score │
│ ---     ┆ ---     ┆ ---   │
│ i64     ┆ str     ┆ i64   │
╞═════════╪═════════╪═══════╡
│ 1       ┆ Alice   ┆ 85    │
│ 2       ┆ Bob     ┆ 92    │
│ 3       ┆ Charlie ┆ 78    │
│ 4       ┆ Diana   ┆ 91    │
│ 5       ┆ Eve     ┆ 88    │
│ 6       ┆ Frank   ┆ 95    │
└─────────┴─────────┴───────┘

The how="vertical" parameter is implicit here. Polars requires that all DataFrames have exactly the same columns in the same order with matching data types. If they don’t match, you’ll get an error—which is actually a feature, not a bug. It prevents silent data corruption that can happen when pandas tries to be too clever.

You can concatenate any number of DataFrames by passing them as a list:

monthly_data = [january_df, february_df, march_df, april_df]
quarterly_report = pl.concat(monthly_data)

Horizontal Concatenation

Horizontal concatenation combines DataFrames side-by-side, adding columns. Use this when you have related data in separate DataFrames that you want to merge without a join key.

# User demographics
demographics = pl.DataFrame({
    "age": [28, 34, 45],
    "city": ["NYC", "LA", "Chicago"]
})

# User activity metrics
activity = pl.DataFrame({
    "logins": [150, 89, 203],
    "purchases": [12, 8, 25]
})

# Horizontal concatenation
user_profiles = pl.concat([demographics, activity], how="horizontal")
print(user_profiles)

Output:

shape: (3, 4)
┌─────┬─────────┬────────┬───────────┐
│ age ┆ city    ┆ logins ┆ purchases │
│ --- ┆ ---     ┆ ---    ┆ ---       │
│ i64 ┆ str     ┆ i64    ┆ i64       │
╞═════╪═════════╪════════╪═══════════╡
│ 28  ┆ NYC     ┆ 150    ┆ 12        │
│ 34  ┆ LA      ┆ 89     ┆ 8         │
│ 45  ┆ Chicago ┆ 203    ┆ 25        │
└─────┴─────────┴────────┴───────────┘

Critical warning: horizontal concatenation aligns rows by position, not by any key. Row 0 from the first DataFrame pairs with row 0 from the second. If the DataFrames have different heights, the behavior depends on your Polars version: older releases raise an error, while recent ones pad the shorter frames with nulls. Either way, if your rows aren’t already aligned by position, you need a proper join operation instead.

Diagonal Concatenation

Diagonal concatenation is where Polars really shines. When your DataFrames have different columns, how="diagonal" combines them by filling missing values with nulls. This is incredibly useful when dealing with data from different sources that have evolved over time or have optional fields.

# Old schema (legacy data)
legacy_data = pl.DataFrame({
    "id": [1, 2],
    "name": ["Product A", "Product B"],
    "price": [29.99, 49.99]
})

# New schema (includes category)
new_data = pl.DataFrame({
    "id": [3, 4],
    "name": ["Product C", "Product D"],
    "price": [39.99, 59.99],
    "category": ["Electronics", "Home"]
})

# Diagonal concatenation handles the mismatch
all_products = pl.concat([legacy_data, new_data], how="diagonal")
print(all_products)

Output:

shape: (4, 4)
┌─────┬───────────┬───────┬─────────────┐
│ id  ┆ name      ┆ price ┆ category    │
│ --- ┆ ---       ┆ ---   ┆ ---         │
│ i64 ┆ str       ┆ f64   ┆ str         │
╞═════╪═══════════╪═══════╪═════════════╡
│ 1   ┆ Product A ┆ 29.99 ┆ null        │
│ 2   ┆ Product B ┆ 49.99 ┆ null        │
│ 3   ┆ Product C ┆ 39.99 ┆ Electronics │
│ 4   ┆ Product D ┆ 59.99 ┆ Home        │
└─────┴───────────┴───────┴─────────────┘

The legacy rows get null for the category column they didn’t have. This behavior makes diagonal concatenation perfect for ETL pipelines where schemas evolve over time.

Handling Schema Mismatches

Real-world data is messy. You’ll encounter DataFrames with columns in different orders, different data types for the same column, or partially overlapping schemas. For mismatched column sets and orders, Polars provides how="align": it detects the columns the frames have in common, aligns (and sorts) rows on those shared columns with full-join semantics, and fills the remaining cells with nulls. Type mismatches still need an explicit cast, shown later in this section.

# DataFrame with columns in different order and subset of columns
df_a = pl.DataFrame({
    "z": [1, 2],
    "x": ["a", "b"],
    "y": [1.0, 2.0]
})

df_b = pl.DataFrame({
    "x": ["c", "d"],
    "y": [3.0, 4.0],
    "w": [True, False]
})

# Align joins on the shared columns (x, y) and fills the rest with nulls
aligned = pl.concat([df_a, df_b], how="align")
print(aligned)

Output:

shape: (4, 4)
┌───────┬─────┬─────┬──────┐
│ w     ┆ x   ┆ y   ┆ z    │
│ ---   ┆ --- ┆ --- ┆ ---  │
│ bool  ┆ str ┆ f64 ┆ i64  │
╞═══════╪═════╪═════╪══════╡
│ null  ┆ a   ┆ 1.0 ┆ 1    │
│ null  ┆ b   ┆ 2.0 ┆ 2    │
│ true  ┆ c   ┆ 3.0 ┆ null │
│ false ┆ d   ┆ 4.0 ┆ null │
└───────┴─────┴─────┴──────┘

When you have type mismatches, you’ll need to cast columns before concatenation:

# Type mismatch scenario
df_int = pl.DataFrame({"value": [1, 2, 3]})  # i64
df_float = pl.DataFrame({"value": [4.5, 5.5, 6.5]})  # f64

# Cast to common type before concat
df_int_casted = df_int.cast({"value": pl.Float64})
combined = pl.concat([df_int_casted, df_float])

My recommendation: establish a schema contract early in your pipeline and validate incoming data against it. Catching type mismatches explicitly is better than silent coercion.

Performance Considerations

Concatenation performance matters when you’re dealing with hundreds of files or billions of rows. Here are the key optimizations to understand.

The rechunk parameter controls whether Polars consolidates memory after concatenation. With rechunk=True, the result is copied into a single contiguous memory block, which speeds up subsequent operations but costs extra memory and time during the concat. With rechunk=False, the inputs’ chunks are kept as-is. The default has varied across Polars releases (recent versions default to False), so pass it explicitly when it matters:

# For memory-constrained environments
combined = pl.concat(dataframes, rechunk=False)

# For maximum query performance after concat
combined = pl.concat(dataframes, rechunk=True)

The real performance win comes from lazy concatenation. When working with files on disk, use scan_* functions to build a lazy query plan:

import polars as pl
from pathlib import Path

# Lazy concatenation of multiple Parquet files
parquet_files = list(Path("data/").glob("*.parquet"))

# This doesn't load any data yet
lazy_frames = [pl.scan_parquet(f) for f in parquet_files]
combined_lazy = pl.concat(lazy_frames)

# Add transformations before collecting
result = (
    combined_lazy
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .collect()  # Only now does processing happen
)

This approach has two massive advantages. First, Polars can push down predicates and projections to the file scan level, reading only the columns and rows you need. Second, it enables processing datasets larger than available RAM by streaming chunks through the query plan.

For the common case of concatenating all Parquet files in a directory, Polars provides a shortcut:

# Single-line alternative using glob pattern
df = pl.scan_parquet("data/*.parquet").collect()

Conclusion

Polars’ concatenation API is more explicit and powerful than pandas’ approach. Here’s a quick reference for choosing the right method:

Scenario                              Method
Same schema, stack rows               how="vertical" (default)
Same row count, add columns           how="horizontal"
Different columns, fill nulls         how="diagonal"
Different column order, auto-align    how="align"
Large files on disk                   scan_* + lazy concat

Start with vertical for simple cases, reach for diagonal when schemas don’t match, and always prefer lazy evaluation when working with files. The explicit nature of these options means fewer surprises in production pipelines—and that’s exactly what you want when processing critical data.
