PySpark - Create DataFrame from List
Key Insights
- PySpark offers multiple methods to create DataFrames from Python lists, each suited for different data structures: tuples for simple tabular data, dictionaries for named fields, and explicit schemas for type control
- Always define explicit schemas for production code to avoid schema inference overhead and ensure type safety, especially when dealing with nullable fields or specific data types like dates and decimals
- Creating DataFrames from lists is ideal for testing and prototyping but should be avoided for datasets larger than a few thousand rows due to memory constraints on the driver node
Introduction
PySpark DataFrames are the fundamental data structure for distributed data processing, but you don’t always need massive datasets to leverage their power. Creating DataFrames from Python lists is a critical skill for unit testing, rapid prototyping, and working with small reference datasets. This approach lets you quickly mock data structures, validate transformations, and develop pipeline logic before connecting to production data sources.
The flexibility of creating DataFrames from lists makes it invaluable during development. Whether you’re testing a complex aggregation, validating a join condition, or demonstrating a concept to colleagues, being able to spin up a DataFrame from hardcoded data in seconds accelerates your workflow significantly.
Prerequisites and Setup
Before creating DataFrames, you need an active SparkSession. This is your entry point to PySpark functionality and manages the connection to your Spark cluster (or local Spark instance).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from datetime import date
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("DataFrame from List") \
    .master("local[*]") \
    .getOrCreate()
The master("local[*]") configuration runs Spark locally using all available cores. For production environments, you’d connect to a cluster instead. The getOrCreate() method ensures you reuse an existing session if one exists, preventing resource conflicts.
Creating DataFrame from Simple List
The most straightforward approach uses a list of tuples, where each tuple represents a row. This mirrors the traditional tabular data structure you’d find in a database.
# List of tuples
data = [
    ("Alice", 34, "Engineering"),
    ("Bob", 45, "Sales"),
    ("Charlie", 28, "Marketing"),
    ("Diana", 52, "Engineering")
]
# Method 1: Using createDataFrame with column names
df = spark.createDataFrame(data, ["name", "age", "department"])
df.show()
# Output:
# +-------+---+-----------+
# |   name|age| department|
# +-------+---+-----------+
# |  Alice| 34|Engineering|
# |    Bob| 45|      Sales|
# |Charlie| 28|  Marketing|
# |  Diana| 52|Engineering|
# +-------+---+-----------+
Alternatively, you can use the toDF() method after creating an RDD:
# Method 2: Using toDF()
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(["name", "age", "department"])
df.printSchema()
# Output:
# root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
# |-- department: string (nullable = true)
Notice that PySpark infers the schema automatically. While convenient, this inference can be unpredictable with complex data types, which is why explicit schemas are preferred for production code.
Creating DataFrame from List of Dictionaries
Lists of dictionaries map naturally to DataFrames since each dictionary represents a row with named fields. This approach is particularly readable and self-documenting.
# List of dictionaries
data_dict = [
    {"name": "Alice", "age": 34, "department": "Engineering", "salary": 95000},
    {"name": "Bob", "age": 45, "department": "Sales", "salary": 78000},
    {"name": "Charlie", "age": 28, "department": "Marketing", "salary": 65000},
    {"name": "Diana", "age": 52, "department": "Engineering", "salary": 120000}
]
df = spark.createDataFrame(data_dict)
df.show()
Dictionaries also handle missing values gracefully. If a key is missing from some dictionaries, PySpark creates a nullable column:
# Handling incomplete data
data_partial = [
    {"name": "Alice", "age": 34, "city": "New York"},
    {"name": "Bob", "age": 45},  # Missing city
    {"name": "Charlie", "city": "Chicago"}  # Missing age
]
df_partial = spark.createDataFrame(data_partial)
df_partial.show()
# Output (dict keys are sorted alphabetically during inference):
# +----+--------+-------+
# | age|    city|   name|
# +----+--------+-------+
# |  34|New York|  Alice|
# |  45|    null|    Bob|
# |null| Chicago|Charlie|
# +----+--------+-------+
For nested dictionaries, PySpark creates struct types automatically:
# Nested dictionaries
data_nested = [
    {"name": "Alice", "address": {"city": "New York", "zip": "10001"}},
    {"name": "Bob", "address": {"city": "Boston", "zip": "02101"}}
]
df_nested = spark.createDataFrame(data_nested)
df_nested.printSchema()
# Output:
# root
# |-- address: struct (nullable = true)
# |    |-- city: string (nullable = true)
# |    |-- zip: string (nullable = true)
# |-- name: string (nullable = true)
Defining and Applying Schemas
Explicit schemas give you complete control over data types, nullability, and structure. This is critical for production pipelines where data type mismatches can cause silent failures or incorrect results.
# Define explicit schema
schema = StructType([
    StructField("employee_id", IntegerType(), False),  # Non-nullable
    StructField("name", StringType(), False),
    StructField("hire_date", DateType(), True),  # Nullable
    StructField("salary", DoubleType(), True),
    StructField("department", StringType(), True)
])
# Data matching the schema
data = [
    (1, "Alice", date(2020, 1, 15), 95000.0, "Engineering"),
    (2, "Bob", date(2019, 3, 22), 78000.0, "Sales"),
    (3, "Charlie", date(2021, 7, 10), 65000.0, "Marketing"),
    (4, "Diana", None, 120000.0, "Engineering")  # Null hire_date
]
df_schema = spark.createDataFrame(data, schema)
df_schema.printSchema()
# Output:
# root
# |-- employee_id: integer (nullable = false)
# |-- name: string (nullable = false)
# |-- hire_date: date (nullable = true)
# |-- salary: double (nullable = true)
# |-- department: string (nullable = true)
The explicit schema prevents type coercion issues and makes your data contract clear. For instance, without an explicit schema, PySpark infers Python integers as long (bigint), which can cause type mismatches when joining with DataFrames that use 32-bit integer columns.
Here’s a more complex schema with multiple data types:
from pyspark.sql.types import ArrayType, MapType
# Complex schema with arrays and maps
complex_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("skills", ArrayType(StringType()), True),
    StructField("certifications", MapType(StringType(), DateType()), True)
])
complex_data = [
    (1, "Alice", ["Python", "Spark", "SQL"], {"AWS": date(2021, 5, 1), "Azure": date(2022, 3, 15)}),
    (2, "Bob", ["Java", "Kafka"], {"GCP": date(2020, 8, 10)}),
    (3, "Charlie", None, None)
]
df_complex = spark.createDataFrame(complex_data, complex_schema)
df_complex.show(truncate=False)
Creating DataFrame from List of Lists
When you have data as a list of lists (common when reading from CSV-like sources), you need to specify column names separately:
# List of lists
data_lists = [
    [1, "Alice", 34, "Engineering"],
    [2, "Bob", 45, "Sales"],
    [3, "Charlie", 28, "Marketing"],
    [4, "Diana", 52, "Engineering"]
]
columns = ["id", "name", "age", "department"]
df_lists = spark.createDataFrame(data_lists, columns)
df_lists.show()
You can also combine this with an explicit schema:
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), False),
    StructField("department", StringType(), True)
])
df_lists_schema = spark.createDataFrame(data_lists, schema)
df_lists_schema.printSchema()
Best Practices and Performance Considerations
When to Use Lists: Creating DataFrames from lists is perfect for unit tests, demonstrations, and datasets under 10,000 rows. Beyond that, you’re better off reading from files or databases to avoid overwhelming the driver node’s memory.
Always Define Schemas in Production: Schema inference requires PySpark to scan your data, adding overhead. More importantly, inferred schemas can change unexpectedly if your data changes, breaking downstream transformations. Explicit schemas serve as documentation and prevent surprises.
Memory Limitations: When you create a DataFrame from a list, all data initially resides on the driver node before being distributed. For large lists, this can cause OutOfMemoryErrors. If you’re working with substantial data, use spark.read methods instead.
Type Safety Matters: Be explicit about nullable fields. Setting nullable=False for required fields helps catch data quality issues early. PySpark will throw an error if you try to insert null values into non-nullable columns, which is exactly what you want.
Testing Strategy: For unit tests, create small DataFrames from lists to validate transformation logic. Use schemas that match your production data structure to ensure your tests are meaningful:
def test_salary_calculation():
    schema = StructType([
        StructField("base_salary", DoubleType(), False),
        StructField("bonus_pct", DoubleType(), False)
    ])
    test_data = [
        (50000.0, 0.10),
        (75000.0, 0.15),
        (100000.0, 0.20)
    ]
    df = spark.createDataFrame(test_data, schema)
    # Test your transformation logic here
Alternative Approaches: For larger datasets, consider spark.read.csv(), spark.read.json(), or spark.read.parquet(). These methods distribute the data reading process and handle larger volumes efficiently.
Creating DataFrames from lists is a fundamental skill that bridges Python’s native data structures with PySpark’s distributed computing capabilities. Master these patterns, use explicit schemas, and you’ll write more robust, maintainable PySpark code.