How to Create a DataFrame in Polars

Key Insights

  • Polars offers multiple DataFrame creation methods, but dictionaries with column-oriented data provide the best performance and clearest intent for most use cases.
  • Always specify schemas explicitly when working with empty DataFrames or when type inference might produce unexpected results—this prevents silent bugs and improves memory efficiency.
  • Use lazy evaluation (scan_* functions) for large files to let Polars optimize your query plan before loading data into memory.

Introduction to Polars DataFrames

Polars has emerged as a serious alternative to pandas for DataFrame operations in Python. Built in Rust with a focus on performance, Polars consistently outperforms pandas on benchmarks—often by 10-50x for large datasets. But speed isn’t the only reason to consider it. Polars enforces stricter typing, provides a more consistent API, and handles memory more efficiently.

This article covers the practical ways to create DataFrames in Polars. Whether you’re migrating from pandas or starting fresh, you’ll learn the patterns that work best for different scenarios. I’ll focus on the approaches you’ll actually use in production code, not every obscure method in the API.

Installation and Setup

Getting Polars installed takes seconds:

pip install polars

For conda users:

conda install -c conda-forge polars

If you need additional functionality like timezone support or integration with cloud storage, install the full package:

pip install 'polars[all]'

Your imports are straightforward:

import polars as pl

That’s it. Polars has no required runtime dependencies. This simplicity extends to the rest of the library—you won’t find yourself hunting through submodules for basic functionality.

Creating DataFrames from Dictionaries

The dictionary approach is the bread and butter of DataFrame creation in Polars. You pass a dictionary where keys become column names and values are lists (or other iterables) of column data:

import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [28, 34, 42],
    "salary": [75000.0, 82000.0, 95000.0],
    "active": [True, True, False]
})

print(df)

Output:

shape: (3, 4)
┌─────────┬─────┬──────────┬────────┐
│ name    ┆ age ┆ salary   ┆ active │
│ ---     ┆ --- ┆ ---      ┆ ---    │
│ str     ┆ i64 ┆ f64      ┆ bool   │
╞═════════╪═════╪══════════╪════════╡
│ Alice   ┆ 28  ┆ 75000.0  ┆ true   │
│ Bob     ┆ 34  ┆ 82000.0  ┆ true   │
│ Charlie ┆ 42  ┆ 95000.0  ┆ false  │
└─────────┴─────┴──────────┴────────┘

Polars infers types automatically. Notice how integers become i64, floats become f64, strings become str, and booleans become bool. This inference is usually correct, but you can override it when needed:

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "value": [100, 200, 300],
    },
    schema={
        "id": pl.Int32,
        "value": pl.Float32,
    }
)

Specifying the schema explicitly gives you control over memory usage. An Int32 uses half the memory of an Int64. For large datasets, these decisions matter.

Creating DataFrames from Lists and Sequences

Sometimes your data arrives as a list of records rather than columns. Polars handles this with a slight syntax variation:

records = [
    {"name": "Alice", "score": 95},
    {"name": "Bob", "score": 87},
    {"name": "Charlie", "score": 92},
]

df = pl.DataFrame(records)
print(df)

This row-oriented approach works but performs worse than the column-oriented dictionary method. Polars stores data in columnar format internally, so providing data by column avoids an internal transpose operation.

For NumPy arrays, you’ll need to provide column names since arrays don’t carry that metadata:

import numpy as np

data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

df = pl.DataFrame(
    data,
    schema=["a", "b", "c"],
    orient="row"
)
print(df)

The orient parameter tells Polars how to interpret the array dimensions. Use "row" when each inner array is a row, or "col" when each inner array is a column:

# Column-oriented NumPy array
column_data = np.array([
    [1, 4, 7],  # column a
    [2, 5, 8],  # column b
    [3, 6, 9],  # column c
])

df = pl.DataFrame(
    column_data,
    schema=["a", "b", "c"],
    orient="col"
)

When working with NumPy, Polars can often use zero-copy operations if the array is already in the right memory layout. Column-oriented data with contiguous memory gets the best performance.

Reading Data from Files

Real-world data rarely starts as Python dictionaries. Polars provides optimized readers for common file formats:

# CSV files
df = pl.read_csv("data.csv")

# Parquet files (highly recommended for large datasets)
df = pl.read_parquet("data.parquet")

# JSON files
df = pl.read_json("data.json")

# NDJSON (newline-delimited JSON, common in logging)
df = pl.read_ndjson("logs.jsonl")

The CSV reader accepts numerous options for handling messy real-world files:

df = pl.read_csv(
    "data.csv",
    separator=";",
    has_header=True,
    skip_rows=2,
    null_values=["NA", "NULL", ""],
    schema_overrides={"date_column": pl.Date},
    try_parse_dates=True,
)

For large files, lazy evaluation changes the game. Instead of loading everything into memory immediately, Polars builds a query plan and optimizes it:

# Lazy evaluation - nothing loads yet
lazy_df = pl.scan_csv("huge_file.csv")

# Add transformations - still nothing loads
result = (
    lazy_df
    .filter(pl.col("status") == "active")
    .select(["id", "name", "value"])
    .group_by("name")
    .agg(pl.col("value").sum())
)

# Now it executes, loading only necessary data
df = result.collect()

The scan_* functions return a LazyFrame instead of a DataFrame. Polars analyzes your entire query chain, pushes filters down to the file reader, projects only needed columns, and parallelizes operations. For a 10GB CSV where you only need 3 columns and 10% of rows, this means loading megabytes instead of gigabytes.

Parquet files deserve special mention. They’re columnar, compressed, and carry schema information. Polars reads them faster than CSV and uses less memory:

# Write a DataFrame to Parquet
df.write_parquet("output.parquet", compression="zstd")

# Read with predicate pushdown (filter applied during read)
lazy_df = pl.scan_parquet("output.parquet")
filtered = lazy_df.filter(pl.col("value") > 100).collect()

Creating Empty and Schema-Defined DataFrames

Sometimes you need an empty DataFrame with a predefined structure—for collecting results in a loop, defining an interface, or initializing before conditional logic:

schema = {
    "id": pl.Int64,
    "name": pl.String,
    "timestamp": pl.Datetime("us"),
    "value": pl.Float64,
    "tags": pl.List(pl.String),
}

empty_df = pl.DataFrame(schema=schema)
print(empty_df)

Output:

shape: (0, 5)
┌─────┬──────┬──────────────┬───────┬───────────┐
│ id  ┆ name ┆ timestamp    ┆ value ┆ tags      │
│ --- ┆ ---  ┆ ---          ┆ ---   ┆ ---       │
│ i64 ┆ str  ┆ datetime[μs] ┆ f64   ┆ list[str] │
╞═════╪══════╪══════════════╪═══════╪═══════════╡
└─────┴──────┴──────────────┴───────┴───────────┘

The DataFrame has zero rows but a fully defined schema. You can concatenate data into it later:

from datetime import datetime

new_data = pl.DataFrame({
    "id": [1, 2],
    "name": ["test1", "test2"],
    "timestamp": [datetime.now(), datetime.now()],
    "value": [1.5, 2.5],
    "tags": [["a", "b"], ["c"]],
})

combined = pl.concat([empty_df, new_data])

Polars supports complex nested types that pandas struggles with. Lists, structs, and nested combinations work naturally:

df = pl.DataFrame({
    "user_id": [1, 2],
    "metadata": [
        {"role": "admin", "permissions": ["read", "write"]},
        {"role": "user", "permissions": ["read"]},
    ]
})

Quick Tips and Best Practices

Specify types when they matter. Type inference is convenient but not free. For DataFrames you’ll create repeatedly (like in a data pipeline), define the schema once and reuse it:

USER_SCHEMA = {
    "id": pl.UInt32,
    "email": pl.String,
    "created_at": pl.Datetime("us", "UTC"),
}

def load_users(path: str) -> pl.DataFrame:
    return pl.read_csv(path, schema_overrides=USER_SCHEMA)

Prefer column-oriented data. When you control the data format, structure it by columns. This matches Polars’ internal representation and avoids unnecessary transformations.

Use lazy evaluation for file operations. Unless you’re working with small files that fit comfortably in memory, start with scan_* functions. The optimization benefits compound as your queries get more complex.

Watch for silent type coercion. Polars is stricter than pandas but will still coerce types in some situations. A column of [1, 2, None] becomes nullable Int64, not a mix of int and NoneType. This is usually what you want, but verify your schemas in production code.

Avoid row-by-row operations. If you find yourself iterating over rows to build a DataFrame, step back and find a vectorized approach. Polars is fast because it operates on columns, not because it makes row iteration fast.

Polars DataFrames are the foundation of everything else you’ll do with the library. Master these creation patterns, and you’ll spend less time fighting with data loading and more time on actual analysis.
