PySpark - Add Auto-Increment Column to DataFrame

Key Insights

  • monotonically_increasing_id() is fastest but generates non-consecutive IDs with gaps, making it ideal for unique identifiers but poor for sequential numbering
  • row_number() with window functions creates true sequential integers starting from 1, at the cost of triggering a full dataset shuffle
  • zipWithIndex() on RDDs guarantees consecutive numbering but runs an extra Spark job to count records per partition and drops out of the DataFrame API, adding overhead on large datasets

Understanding the Need for Auto-Increment Columns

PySpark DataFrames don’t have a native auto-increment column like traditional SQL databases. This becomes problematic when you need unique row identifiers for tracking, joining datasets, or maintaining insertion order. Unlike pandas where you have a default index, distributed DataFrames in Spark require explicit strategies to add sequential or unique identifiers.

Common scenarios include creating surrogate keys for dimension tables, assigning unique transaction IDs, or simply numbering rows for debugging and analysis. Each approach has distinct trade-offs between performance, sequence guarantees, and consecutive numbering.

Let’s start with a sample DataFrame to demonstrate each method:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("AutoIncrement").getOrCreate()

# Sample data
data = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Marketing", 65000),
    ("Charlie", "Engineering", 85000),
    ("Diana", "Sales", 75000),
    ("Eve", "Engineering", 90000)
]

df = spark.createDataFrame(data, ["name", "department", "salary"])
df.show()
+-------+------------+------+
|   name|  department|salary|
+-------+------------+------+
|  Alice| Engineering| 95000|
|    Bob|   Marketing| 65000|
|Charlie| Engineering| 85000|
|  Diana|       Sales| 75000|
|    Eve| Engineering| 90000|
+-------+------------+------+

Using monotonically_increasing_id()

The monotonically_increasing_id() function is PySpark’s built-in solution for generating unique identifiers. It creates a 64-bit integer that increases monotonically and is unique across all partitions. However, the values are not consecutive and don’t start from 1.

The function works by combining partition IDs with a counter within each partition, ensuring uniqueness without requiring coordination between partitions. This makes it extremely fast and scalable.

df_with_id = df.withColumn("id", monotonically_increasing_id())
df_with_id.show()
+-------+------------+------+------------+
|   name|  department|salary|          id|
+-------+------------+------+------------+
|  Alice| Engineering| 95000|           0|
|    Bob|   Marketing| 65000|  8589934592|
|Charlie| Engineering| 85000| 17179869184|
|  Diana|       Sales| 75000| 25769803776|
|    Eve| Engineering| 90000| 34359738368|
+-------+------------+------+------------+

Notice the large gaps between IDs. These gaps correspond to partition boundaries. The ID is a 64-bit integer whose upper 31 bits hold the partition index and whose lower 33 bits hold a per-partition counter, so each partition owns a block of 2^33 (8,589,934,592) possible IDs and the counter increments within that block.

When to use this approach:

  • You need unique identifiers but don’t care about consecutiveness
  • Performance is critical and you’re working with large datasets
  • The IDs are used for joining or deduplication, not display

Limitations:

  • IDs have large gaps and aren’t human-readable
  • The starting value and gaps depend on partitioning
  • Not suitable when you need sequential numbering (1, 2, 3…)

Using row_number() Window Function

For true sequential numbering starting from 1, use the row_number() window function. This approach gives you complete control over ordering and allows partitioned numbering within groups.

# Create a window specification ordered by name
window_spec = Window.orderBy("name")

df_with_row_number = df.withColumn("row_num", row_number().over(window_spec))
df_with_row_number.show()
+-------+------------+------+-------+
|   name|  department|salary|row_num|
+-------+------------+------+-------+
|  Alice| Engineering| 95000|      1|
|    Bob|   Marketing| 65000|      2|
|Charlie| Engineering| 85000|      3|
|  Diana|       Sales| 75000|      4|
|    Eve| Engineering| 90000|      5|
+-------+------------+------+-------+

You can also partition the numbering by groups. For example, to number employees within each department:

# Partition by department and order by salary
window_spec_partitioned = Window.partitionBy("department").orderBy("salary")

df_partitioned = df.withColumn(
    "dept_rank", 
    row_number().over(window_spec_partitioned)
)
df_partitioned.orderBy("department", "dept_rank").show()
+-------+------------+------+---------+
|   name|  department|salary|dept_rank|
+-------+------------+------+---------+
|Charlie| Engineering| 85000|        1|
|    Eve| Engineering| 90000|        2|
|  Alice| Engineering| 95000|        3|
|    Bob|   Marketing| 65000|        1|
|  Diana|       Sales| 75000|        1|
+-------+------------+------+---------+

When to use this approach:

  • You need consecutive integers starting from 1
  • The ordering of rows matters for your use case
  • You need to partition numbering within groups
  • Dataset size is moderate (window operations trigger shuffles)

Limitations:

  • A global orderBy pulls every row into a single partition for the window computation (Spark logs a "No Partition Defined for Window operation" warning), which hurts performance and can exhaust memory on large datasets
  • Requires explicit ordering (no guaranteed order without orderBy)
  • Slower than monotonically_increasing_id() on large datasets

Using zipWithIndex() on RDD

The RDD-based zipWithIndex() method provides consecutive numbering starting from 0. This requires converting the DataFrame to an RDD, applying the operation, and converting back.

# Convert to RDD, apply zipWithIndex, convert back
rdd_with_index = df.rdd.zipWithIndex()

# Transform to include index as a column
df_with_index = rdd_with_index.map(
    lambda x: (x[1],) + tuple(x[0])
).toDF(["index", "name", "department", "salary"])

df_with_index.show()
+-----+-------+------------+------+
|index|   name|  department|salary|
+-----+-------+------------+------+
|    0|  Alice| Engineering| 95000|
|    1|    Bob|   Marketing| 65000|
|    2|Charlie| Engineering| 85000|
|    3|  Diana|       Sales| 75000|
|    4|    Eve| Engineering| 90000|
+-----+-------+------------+------+

When to use this approach:

  • You need consecutive integers starting from 0
  • You’re already working with RDDs
  • Dataset is small to medium-sized

Limitations:

  • Triggers an extra Spark job to count each partition's records (the counts are gathered on the driver) before any index is assigned
  • Not scalable for very large datasets
  • Breaks DataFrame API optimizations
  • Order depends on partition order, which may not be deterministic

Performance Considerations

The performance characteristics differ significantly between these methods:

import time

# Create a larger dataset for timing
large_data = [(f"Person_{i}", f"Dept_{i%10}", i * 1000) 
              for i in range(100000)]
large_df = spark.createDataFrame(large_data, ["name", "department", "salary"])

# Time monotonically_increasing_id
start = time.time()
result1 = large_df.withColumn("id", monotonically_increasing_id())
result1.count()  # Force evaluation
time1 = time.time() - start
print(f"monotonically_increasing_id: {time1:.2f}s")

# Time row_number
start = time.time()
window_spec = Window.orderBy("name")
result2 = large_df.withColumn("row_num", row_number().over(window_spec))
result2.count()  # Force evaluation
time2 = time.time() - start
print(f"row_number: {time2:.2f}s")

# Time zipWithIndex (on smaller subset to avoid driver OOM)
small_df = large_df.limit(10000)
start = time.time()
result3 = small_df.rdd.zipWithIndex().map(
    lambda x: (x[1],) + tuple(x[0])
).toDF(["index", "name", "department", "salary"])
result3.count()
time3 = time.time() - start
print(f"zipWithIndex: {time3:.2f}s")

Performance hierarchy (fastest to slowest):

  1. monotonically_increasing_id() - No shuffle, partition-local operation
  2. zipWithIndex() - Single pass over the data, plus one extra job to count partition sizes
  3. row_number() - Full shuffle required for global ordering

For datasets with millions of rows, monotonically_increasing_id() can be 5-10x faster than row_number(). The shuffle operation in window functions becomes the bottleneck, especially when ordering across the entire dataset.

Practical Use Cases and Best Practices

Use Case 1: Creating Surrogate Keys for Joins

When joining DataFrames without natural keys, monotonically_increasing_id() can supply surrogate keys. Be aware, though, that the generated IDs only line up row-for-row when both DataFrames happen to share the same partitioning and row distribution, so this positional join is fragile on real clusters:

df1 = spark.createDataFrame([("A",), ("B",), ("C",)], ["value"])
df2 = spark.createDataFrame([(100,), (200,), (300,)], ["score"])

df1_indexed = df1.withColumn("id", monotonically_increasing_id())
df2_indexed = df2.withColumn("id", monotonically_increasing_id())

joined = df1_indexed.join(df2_indexed, "id")
joined.show()

Use Case 2: Ranking with Ties

When you need to handle ties in rankings, combine row_number() with other window functions:

from pyspark.sql.functions import rank, dense_rank

window_spec = Window.orderBy(df["salary"].desc())

df_ranked = df.withColumn("row_num", row_number().over(window_spec)) \
              .withColumn("rank", rank().over(window_spec)) \
              .withColumn("dense_rank", dense_rank().over(window_spec))

df_ranked.show()

Best Practices:

  1. Cache before windowing: If you’ll reuse the DataFrame, cache it before applying window functions to avoid recomputation
  2. Minimize partitions: Reduce the number of partitions before zipWithIndex() to minimize driver overhead
  3. Choose based on requirements: Don’t use row_number() if monotonically_increasing_id() suffices
  4. Consider repartitioning: For monotonically_increasing_id(), repartition first if you need predictable ID ranges

The right choice depends on your specific requirements: uniqueness vs. consecutiveness, performance vs. readability, and whether ordering matters for your use case.
