PySpark - Create RDD from List (parallelize)
Key Insights
- The parallelize() method converts in-memory Python collections into distributed RDDs, enabling parallel processing across your Spark cluster
- Partition count directly impacts parallelism: aim for 2-4 partitions per CPU core for optimal performance
- Use parallelize() for testing and small datasets only; for production workloads, load data from distributed storage like HDFS or S3
Introduction
Resilient Distributed Datasets (RDDs) are the fundamental data structure in PySpark, representing immutable, distributed collections that can be processed in parallel across cluster nodes. While production Spark applications typically load data from distributed file systems, the parallelize() method serves a critical role: it converts local Python collections into distributed RDDs.
You’ll use parallelize() primarily in three scenarios: prototyping transformations with sample data, writing unit tests for Spark jobs, and broadcasting small reference datasets across your cluster. The method takes your in-memory list, tuple, or range and distributes it across partitions, making it immediately available for parallel operations.
Understanding parallelize() is essential because it reveals how Spark thinks about data distribution. Every RDD operation you perform—whether loaded from a file or created from a list—follows the same partition-based execution model.
Basic Syntax and Simple Example
The parallelize() method belongs to the SparkContext object, which serves as your entry point to Spark functionality. Here’s the straightforward syntax:
```python
from pyspark import SparkContext

# Initialize SparkContext
sc = SparkContext("local[*]", "ParallelizeExample")

# Create RDD from a list
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
numbers_rdd = sc.parallelize(numbers)

# Retrieve data back to driver
result = numbers_rdd.collect()
print(result)
# Output: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

sc.stop()
```
The collect() action retrieves all RDD elements back to the driver program. Use it cautiously—calling collect() on large RDDs will overwhelm your driver’s memory. For production code, prefer actions like take(n) to sample data or write results directly to storage.
If you’re using SparkSession (the modern API), access SparkContext through the sparkContext property:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ParallelizeExample") \
    .master("local[*]") \
    .getOrCreate()

sc = spark.sparkContext
data_rdd = sc.parallelize([1, 2, 3, 4, 5])
```
Controlling Partitions
Partitions determine how your data splits across the cluster. Each partition processes independently, potentially on different executor cores. The partition count fundamentally controls your parallelism level.
```python
# Default partitioning (Spark decides based on cluster configuration)
default_rdd = sc.parallelize(range(100))
print(f"Default partitions: {default_rdd.getNumPartitions()}")
# Output varies: typically 8 on local[*] with 8 cores

# Explicit partition count
partitioned_rdd = sc.parallelize(range(100), numSlices=4)
print(f"Explicit partitions: {partitioned_rdd.getNumPartitions()}")
# Output: 4

# Inspect data distribution across partitions
partition_contents = partitioned_rdd.glom().collect()
for idx, partition in enumerate(partition_contents):
    print(f"Partition {idx}: {len(partition)} elements")
# Output shows roughly equal distribution
```
The glom() transformation returns an RDD where each partition becomes a single list, letting you visualize distribution. Here’s what you’ll see:
```python
data = list(range(10))
rdd = sc.parallelize(data, numSlices=3)
for idx, partition in enumerate(rdd.glom().collect()):
    print(f"Partition {idx}: {partition}")
# Output:
# Partition 0: [0, 1, 2]
# Partition 1: [3, 4, 5]
# Partition 2: [6, 7, 8, 9]
```

Note that when the collection doesn't divide evenly, Spark's slicing puts the extra elements in the later partitions.
Choose partition counts based on your cluster resources. Too few partitions underutilize your cluster; too many create excessive scheduling overhead. The sweet spot is typically 2-4 partitions per CPU core.
Parallelizing Different Data Types
The parallelize() method handles any Python iterable, making it versatile for various data structures.
Strings:
```python
words = ["spark", "hadoop", "python", "scala", "java"]
words_rdd = sc.parallelize(words)
uppercase_rdd = words_rdd.map(lambda word: word.upper())
print(uppercase_rdd.collect())
# Output: ['SPARK', 'HADOOP', 'PYTHON', 'SCALA', 'JAVA']
```
Tuples (Key-Value Pairs):
```python
# Perfect for pair RDD operations
key_value_pairs = [("apple", 5), ("banana", 3), ("apple", 2), ("orange", 4)]
pairs_rdd = sc.parallelize(key_value_pairs)

# Use pair RDD transformations
totals = pairs_rdd.reduceByKey(lambda a, b: a + b)
print(totals.collect())
# Output (order may vary): [('apple', 7), ('banana', 3), ('orange', 4)]
```
Dictionaries:
```python
users = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Charlie", "age": 35}
]
users_rdd = sc.parallelize(users)
names = users_rdd.map(lambda user: user["name"])
print(names.collect())
# Output: ['Alice', 'Bob', 'Charlie']
```
Nested Lists:
```python
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
matrix_rdd = sc.parallelize(matrix)

# Flatten nested structure
flattened = matrix_rdd.flatMap(lambda row: row)
print(flattened.collect())
# Output: [1, 2, 3, 4, 5, 6, 7, 8, 9]
```
Common Operations on Parallelized RDDs
Once parallelized, RDDs support the full range of Spark transformations and actions. Here are the operations you’ll use constantly:
```python
numbers = sc.parallelize(range(1, 11), numSlices=2)

# Transformation: map (lazy evaluation)
squared = numbers.map(lambda x: x ** 2)

# Transformation: filter (lazy evaluation)
even_squares = squared.filter(lambda x: x % 2 == 0)

# Action: collect (triggers execution)
result = even_squares.collect()
print(result)
# Output: [4, 16, 36, 64, 100]
```
Chaining Operations:
```python
# Process in a single pipeline
result = sc.parallelize([1, 2, 3, 4, 5]) \
    .map(lambda x: x * 2) \
    .filter(lambda x: x > 5) \
    .reduce(lambda a, b: a + b)
print(result)
# Output: 24 (6 + 8 + 10)
```
Word Count Example:
```python
text = ["hello world", "hello spark", "spark is fast"]
text_rdd = sc.parallelize(text)
word_counts = text_rdd \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
print(word_counts.collect())
# Output (order may vary):
# [('hello', 2), ('world', 1), ('spark', 2), ('is', 1), ('fast', 1)]
```
Best Practices and Performance Considerations
When to Use Parallelize:
Use parallelize() for small datasets under 100MB. For anything larger, load from distributed storage. Parallelizing large local collections creates a bottleneck—all data must transfer from the driver to executors.
Optimal Partition Sizing:
Each partition should contain 100-200MB of data for optimal performance. Calculate partitions accordingly:
```python
# If you have 1000 records averaging 100KB each
# Total: ~100MB, aim for 1-2 partitions
data = list(range(1000))
rdd = sc.parallelize(data, numSlices=2)
```
Memory Implications:
The entire collection must fit in driver memory before parallelization. If you’re hitting memory limits, you’re using the wrong approach—load from files instead.
Partition Count Impact:
```python
import time

data = list(range(1000000))

# Too few partitions
start = time.time()
rdd1 = sc.parallelize(data, numSlices=1)
result1 = rdd1.map(lambda x: x * 2).count()
time1 = time.time() - start

# Optimal partitions (for an 8-core machine)
start = time.time()
rdd2 = sc.parallelize(data, numSlices=16)
result2 = rdd2.map(lambda x: x * 2).count()
time2 = time.time() - start

print(f"1 partition: {time1:.2f}s")
print(f"16 partitions: {time2:.2f}s")
# 16 partitions is typically significantly faster
```
Avoid Repeated Collect:
```python
# Bad: Multiple collects
rdd = sc.parallelize(range(10000))
sum_result = sum(rdd.collect())    # Collects all data
count_result = len(rdd.collect())  # Collects again!

# Good: Use RDD actions
sum_result = rdd.reduce(lambda a, b: a + b)
count_result = rdd.count()
```
Conclusion
The parallelize() method bridges Python’s local collections and Spark’s distributed computing model. Master it for testing and prototyping, but remember its limitations—production Spark applications read from distributed storage systems like HDFS, S3, or cloud data warehouses.
Key takeaways: control your partitions explicitly, understand that partition count equals maximum parallelism, and reserve parallelize() for datasets that comfortably fit in driver memory. When your data grows beyond a few hundred megabytes, transition to textFile(), parquet(), or other file-based readers that distribute the loading process itself across your cluster.