How to Select Columns in Polars
Key Insights
- Polars provides multiple column selection methods—select(), pl.col(), and column selectors—each optimized for different use cases, from simple name-based selection to complex pattern matching.
- Column selectors (cs) offer the most expressive way to select columns by data type, name patterns, or combinations thereof, and they compose naturally with set operations.
- Unlike pandas, Polars column selection is lazy-evaluation friendly, meaning your selection logic integrates seamlessly with query optimization for better performance on large datasets.
Introduction
Polars has rapidly become the go-to DataFrame library for Python developers who need speed. Built in Rust with a lazy execution engine, it consistently outperforms pandas by 10-100x on common operations. But before you can transform, aggregate, or analyze data, you need to select the right columns.
Column selection in Polars is more powerful than what you’re used to from pandas. Instead of simple bracket notation, Polars provides a composable expression system that integrates with its query optimizer. This means your column selection logic can be optimized alongside your transformations—something pandas simply cannot do.
This article covers every column selection method in Polars, from basic name-based selection to advanced pattern matching with selectors. By the end, you’ll know exactly which approach to use for any situation.
Basic Column Selection with select()
The select() method is your entry point for column selection. It returns a new DataFrame containing only the columns you specify.
import polars as pl
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"salary": [50000, 60000, 70000],
"department": ["Engineering", "Sales", "Engineering"]
})
# Select a single column
single_col = df.select("name")
print(single_col)
# shape: (3, 1)
# ┌─────────┐
# │ name │
# │ --- │
# │ str │
# ╞═════════╡
# │ Alice │
# │ Bob │
# │ Charlie │
# └─────────┘
# Select multiple columns
multi_col = df.select(["name", "salary"])
print(multi_col)
# shape: (3, 2)
# ┌─────────┬────────┐
# │ name ┆ salary │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════════╪════════╡
# │ Alice ┆ 50000 │
# │ Bob ┆ 60000 │
# │ Charlie ┆ 70000 │
# └─────────┴────────┘
Note that select() always returns a DataFrame, even when selecting a single column. If you need a Series, use get_column() instead:
name_series = df.get_column("name")
print(type(name_series)) # <class 'polars.series.series.Series'>
One key difference from pandas: bracket notation exists in Polars but is limited. df["name"] returns a Series and df[["name", "age"]] returns a DataFrame, but bracket indexing is eager-only and bypasses the expression system. Stick with select() for consistency and clarity.
Using pl.col() for Expression-Based Selection
The pl.col() function creates a column expression, which is the foundation of Polars’ expression system. While passing strings directly to select() works, pl.col() unlocks chaining operations.
# Basic pl.col() usage
result = df.select(pl.col("name"))
# Select multiple columns with pl.col() (it accepts several names)
result = df.select(pl.col("name", "age"))
# The power: chain operations on selected columns
result = df.select(
pl.col("name"),
pl.col("salary").mul(1.1).alias("salary_with_raise")
)
print(result)
# shape: (3, 2)
# ┌─────────┬───────────────────┐
# │ name ┆ salary_with_raise │
# │ --- ┆ --- │
# │ str ┆ f64 │
# ╞═════════╪═══════════════════╡
# │ Alice ┆ 55000.0 │
# │ Bob ┆ 66000.0 │
# │ Charlie ┆ 77000.0 │
# └─────────┴───────────────────┘
You can also use pl.col() to select all columns:
# Select all columns
all_cols = df.select(pl.col("*"))
# Apply an expression to all columns at once
result = df.select(pl.all())  # pl.all() is shorthand for pl.col("*")
The expression-based approach becomes essential when you need to apply transformations during selection, which is a common pattern in data pipelines.
Selecting Columns by Data Type
Polars provides column selectors through the polars.selectors module (conventionally imported as cs). These selectors let you choose columns based on their data type.
import polars.selectors as cs
df = pl.DataFrame({
"name": ["Alice", "Bob"],
"age": [25, 30],
"salary": [50000.0, 60000.0],
"hired_date": ["2020-01-15", "2019-06-01"],
"is_active": [True, False]
}).with_columns(
pl.col("hired_date").str.to_date()
)
# Select all numeric columns
numeric_cols = df.select(cs.numeric())
print(numeric_cols.columns) # ['age', 'salary']
# Select string columns
string_cols = df.select(cs.string())
print(string_cols.columns) # ['name']
# Select temporal columns (dates, times, datetimes)
temporal_cols = df.select(cs.temporal())
print(temporal_cols.columns) # ['hired_date']
# Select boolean columns
bool_cols = df.select(cs.boolean())
print(bool_cols.columns) # ['is_active']
# Select integer columns specifically
int_cols = df.select(cs.integer())
print(int_cols.columns) # ['age']
# Select float columns specifically
float_cols = df.select(cs.float())
print(float_cols.columns) # ['salary']
Type-based selection is invaluable when processing datasets with many columns. Instead of manually listing column names, you can select all numeric columns for normalization or all string columns for text cleaning.
Pattern-Based Selection with Regex and Wildcards
Real-world datasets often have naming conventions—columns prefixed with feature_, suffixed with _id, or following other patterns. Polars handles these elegantly.
df = pl.DataFrame({
"user_id": [1, 2, 3],
"user_name": ["Alice", "Bob", "Charlie"],
"feature_age": [25, 30, 35],
"feature_income": [50000, 60000, 70000],
"feature_score": [0.8, 0.6, 0.9],
"target_label": [1, 0, 1]
})
# Using regex with pl.col()
feature_cols = df.select(pl.col("^feature_.*$"))
print(feature_cols.columns) # ['feature_age', 'feature_income', 'feature_score']
# Using selectors for prefix matching
feature_cols = df.select(cs.starts_with("feature_"))
print(feature_cols.columns) # ['feature_age', 'feature_income', 'feature_score']
# Suffix matching
id_cols = df.select(cs.ends_with("_id"))
print(id_cols.columns) # ['user_id']
# Contains matching
user_cols = df.select(cs.contains("user"))
print(user_cols.columns) # ['user_id', 'user_name']
# Regex with selectors
pattern_cols = df.select(cs.matches("^(user|target)_"))
print(pattern_cols.columns) # ['user_id', 'user_name', 'target_label']
The selector-based approach (cs.starts_with()) is generally more readable than regex, but regex gives you maximum flexibility for complex patterns.
Excluding and Combining Selections
Sometimes it’s easier to specify what you don’t want. Polars supports exclusion and set operations on selections.
# Exclude specific columns
without_id = df.select(pl.exclude("user_id"))
print(without_id.columns)
# ['user_name', 'feature_age', 'feature_income', 'feature_score', 'target_label']
# Exclude multiple columns
without_user = df.select(pl.exclude(["user_id", "user_name"]))
print(without_user.columns)
# ['feature_age', 'feature_income', 'feature_score', 'target_label']
# Combine selectors with set operations
# Union: columns matching either condition
combined = df.select(cs.starts_with("user_") | cs.starts_with("target_"))
print(combined.columns) # ['user_id', 'user_name', 'target_label']
# Intersection: columns matching both conditions
# (numeric AND starts with feature_)
numeric_features = df.select(cs.numeric() & cs.starts_with("feature_"))
print(numeric_features.columns) # ['feature_age', 'feature_income', 'feature_score']
# Difference: exclude from selection
# All columns except those starting with feature_
non_features = df.select(cs.all() - cs.starts_with("feature_"))
print(non_features.columns) # ['user_id', 'user_name', 'target_label']
# Complement: everything NOT matching
non_numeric = df.select(~cs.numeric())
print(non_numeric.columns) # ['user_name', 'target_label']
These set operations make complex selection logic readable and maintainable. Compare cs.numeric() & cs.starts_with("feature_") to the equivalent pandas code—there’s no contest.
Practical Tips and Performance Considerations
Understanding when to use each selection method matters for both code clarity and performance.
Use string names for simple cases. When selecting known columns by name, just pass strings. It’s the most readable option:
# Clear and simple
df.select(["name", "age", "salary"])
Use selectors for dynamic selection. When column names aren’t known ahead of time or follow patterns, selectors prevent hardcoding:
# Works regardless of how many feature columns exist
df.select(cs.starts_with("feature_"))
Leverage lazy evaluation. Column selection in lazy frames gets optimized with the rest of your query:
# Lazy frame: selection is part of the query plan
lazy_df = pl.scan_parquet("large_dataset.parquet")
result = (
lazy_df
.select(cs.numeric() | cs.by_name("department"))  # keep the group key
.filter(pl.col("age") > 25)
.group_by("department")
.agg(pl.all().mean())
.collect() # Execution happens here
)
# Polars only reads the selected columns from disk
# The filter and aggregation are optimized together
Avoid repeated selection. Each select() call creates a new DataFrame. Chain operations instead:
# Less efficient: multiple intermediate DataFrames
step1 = df.select(["a", "b", "c"])
step2 = step1.with_columns(pl.col("a") * 2)
# More efficient: single expression
result = df.select(
pl.col("a") * 2,
pl.col("b"),
pl.col("c")
)
Watch out for empty selections. Selecting columns that don’t exist raises an error. Use cs.by_name() with require_all=False for optional columns:
# This raises an error if "optional_col" doesn't exist
# df.select(cs.by_name("optional_col"))
# This returns empty DataFrame if column doesn't exist
df.select(cs.by_name("optional_col", require_all=False))
Column selection in Polars is more than syntax—it’s a gateway to the expression system that makes Polars fast. Master these patterns, and you’ll write cleaner, faster data processing code.