PySpark - NTILE Window Function
Key Insights
- NTILE divides ordered data into N equal-sized buckets, making it ideal for percentile analysis, customer segmentation, and performance tiering without complex calculations
- Unlike RANK() which assigns sequential numbers based on value, NTILE guarantees roughly equal bucket sizes even with duplicate values, distributing remainder rows to earlier buckets
- Combining NTILE with partitionBy() enables independent bucketing within categories (like departments or regions), perfect for relative performance analysis across different groups
Introduction to NTILE Window Function
NTILE is a window function that divides an ordered dataset into N roughly equal buckets or tiles, assigning each row a bucket number from 1 to N. Think of it as automatically creating quartiles (4 buckets), deciles (10 buckets), or any custom segmentation you need.
The primary use cases are straightforward and powerful: identifying top 25% performers, segmenting customers into price sensitivity tiers, creating A/B/C/D grade distributions, or dividing any ordered dataset into equal groups for analysis. Unlike percentile calculations that require aggregations, NTILE gives you immediate bucket assignments in a single pass.
Here’s a simple example showing quartile distribution:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import ntile, col
spark = SparkSession.builder.appName("NTILEExample").getOrCreate()
# Simple dataset
data = [
    ("Alice", 95),
    ("Bob", 87),
    ("Charlie", 76),
    ("David", 92),
    ("Eve", 68),
    ("Frank", 81),
    ("Grace", 89),
    ("Henry", 73),
]
df = spark.createDataFrame(data, ["name", "score"])
# Create quartiles
window_spec = Window.orderBy(col("score").desc())
df_with_quartile = df.withColumn("quartile", ntile(4).over(window_spec))
df_with_quartile.orderBy(col("score").desc()).show()
Output:
+-------+-----+--------+
|   name|score|quartile|
+-------+-----+--------+
|  Alice|   95|       1|
|  David|   92|       1|
|  Grace|   89|       2|
|    Bob|   87|       2|
|  Frank|   81|       3|
|Charlie|   76|       3|
|  Henry|   73|       4|
|    Eve|   68|       4|
+-------+-----+--------+
Basic Syntax and Setup
The NTILE function requires proper window specification using PySpark’s Window class. The essential imports and basic syntax pattern are:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import ntile, col, desc
# Initialize Spark session
spark = SparkSession.builder \
    .appName("NTILE_Analysis") \
    .getOrCreate()
# Basic NTILE syntax
window_spec = Window.orderBy(col("column_name").desc())
df.withColumn("bucket", ntile(N).over(window_spec))
# With partitioning
window_spec = Window \
    .partitionBy("category_column") \
    .orderBy(col("value_column").desc())
df.withColumn("bucket", ntile(N).over(window_spec))
The orderBy() clause is mandatory for NTILE: it determines how rows are sorted before bucketing. Sorting is ascending by default; use desc() to put the highest values in bucket 1. The partitionBy() clause is optional but powerful, creating an independent set of buckets within each partition.
Practical Example: Sales Data Analysis
Let’s analyze sales performance to identify top, middle, and bottom-tier performers. This is a common business scenario where NTILE excels:
from pyspark.sql.functions import ntile, col, desc, round as spark_round
# Sales representative data
sales_data = [
    ("John", "East", 145000),
    ("Sarah", "West", 198000),
    ("Mike", "East", 167000),
    ("Emily", "West", 223000),
    ("David", "Central", 134000),
    ("Lisa", "Central", 189000),
    ("Tom", "East", 156000),
    ("Anna", "West", 201000),
    ("Chris", "Central", 178000),
    ("Jessica", "East", 142000),
    ("Mark", "West", 165000),
    ("Rachel", "Central", 191000),
]
sales_df = spark.createDataFrame(
    sales_data,
    ["rep_name", "region", "total_sales"]
)
# Create performance tiers (quartiles)
performance_window = Window.orderBy(col("total_sales").desc())
tiered_sales = sales_df.withColumn(
    "performance_tier",
    ntile(4).over(performance_window)
)
# Add tier labels
from pyspark.sql.functions import when
final_df = tiered_sales.withColumn(
    "tier_label",
    when(col("performance_tier") == 1, "Top Performer")
    .when(col("performance_tier") == 2, "Above Average")
    .when(col("performance_tier") == 3, "Average")
    .otherwise("Needs Improvement")
)
final_df.orderBy("performance_tier", col("total_sales").desc()).show()
This creates clear performance categories. You can now filter for performance_tier == 1 to identify your top performers for bonuses, or performance_tier == 4 for coaching opportunities.
NTILE with Multiple Partitions
NTILE becomes even more powerful when partitioning by categories. This lets you create relative rankings within each group:
# Employee salary data by department
employee_data = [
    ("Alice", "Engineering", 125000),
    ("Bob", "Engineering", 115000),
    ("Charlie", "Engineering", 135000),
    ("David", "Engineering", 105000),
    ("Eve", "Engineering", 145000),
    ("Frank", "Engineering", 120000),
    ("Grace", "Sales", 95000),
    ("Henry", "Sales", 110000),
    ("Iris", "Sales", 88000),
    ("Jack", "Sales", 105000),
    ("Kate", "Sales", 98000),
    ("Leo", "Sales", 92000),
    ("Mary", "Marketing", 78000),
    ("Nathan", "Marketing", 85000),
    ("Olivia", "Marketing", 72000),
    ("Paul", "Marketing", 82000),
    ("Quinn", "Marketing", 88000),
    ("Rita", "Marketing", 75000),
]
emp_df = spark.createDataFrame(
    employee_data,
    ["name", "department", "salary"]
)
# Create terciles within each department
dept_window = Window \
    .partitionBy("department") \
    .orderBy(col("salary").desc())
emp_with_tercile = emp_df.withColumn(
    "salary_tercile",
    ntile(3).over(dept_window)
)
emp_with_tercile.orderBy("department", "salary_tercile").show(20)
This shows who’s in the top, middle, or bottom third of their department—far more useful than company-wide rankings when departments have different salary ranges.
Handling Edge Cases
NTILE handles uneven distributions intelligently. When the total row count isn’t evenly divisible by N, the remainder rows are distributed to the earlier buckets (1, 2, 3, etc.):
# Dataset with 10 rows, creating 3 buckets
edge_case_data = [
    ("A", 100), ("B", 95), ("C", 90), ("D", 85), ("E", 80),
    ("F", 75), ("G", 70), ("H", 65), ("I", 60), ("J", 55),
]
edge_df = spark.createDataFrame(edge_case_data, ["id", "value"])
window = Window.orderBy(col("value").desc())
result = edge_df.withColumn("bucket", ntile(3).over(window))
result.groupBy("bucket").count().orderBy("bucket").show()
Output:
+------+-----+
|bucket|count|
+------+-----+
|     1|    4|  # gets the extra row (10 % 3 = 1 remainder)
|     2|    3|
|     3|    3|
+------+-----+
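The bucket sizes follow directly from integer division; here is a plain-Python sketch of the rule (an illustration, not part of the PySpark API):

```python
def ntile_sizes(n_rows: int, n_buckets: int) -> list[int]:
    # Each bucket gets n_rows // n_buckets rows; the first
    # n_rows % n_buckets buckets each absorb one extra row.
    base, remainder = divmod(n_rows, n_buckets)
    return [base + 1 if i < remainder else base for i in range(n_buckets)]

print(ntile_sizes(10, 3))  # [4, 3, 3], matching the output above
```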
For duplicate values in the ordering column, NTILE doesn’t guarantee which bucket they land in—it depends on the physical row order. If you need consistent tie-breaking, add a secondary sort column:
# Consistent tie-breaking
window = Window.orderBy(col("value").desc(), col("id"))
Null values in the ordering column sort first with the default ascending order and last with desc(), so with a descending sort they end up in the final buckets. Use asc_nulls_last() or desc_nulls_first() to override the default placement.
Comparison with Other Window Functions
Understanding when to use NTILE versus alternatives is crucial:
from pyspark.sql.functions import rank, dense_rank, percent_rank, round as spark_round
comparison_data = [
    ("A", 100), ("B", 100), ("C", 90),
    ("D", 80), ("E", 80), ("F", 70),
]
comp_df = spark.createDataFrame(comparison_data, ["id", "score"])
window = Window.orderBy(col("score").desc())
result = comp_df \
    .withColumn("ntile_3", ntile(3).over(window)) \
    .withColumn("rank", rank().over(window)) \
    .withColumn("dense_rank", dense_rank().over(window)) \
    .withColumn("percent_rank", spark_round(percent_rank().over(window), 2))
result.show()
Output:
+---+-----+-------+----+----------+------------+
| id|score|ntile_3|rank|dense_rank|percent_rank|
+---+-----+-------+----+----------+------------+
|  A|  100|      1|   1|         1|         0.0|
|  B|  100|      1|   1|         1|         0.0|
|  C|   90|      2|   3|         2|         0.4|
|  D|   80|      2|   4|         3|         0.6|
|  E|   80|      3|   4|         3|         0.6|
|  F|   70|      3|   6|         4|         1.0|
+---+-----+-------+----+----------+------------+
Use NTILE when you need equal-sized groups regardless of value distribution. Use RANK/DENSE_RANK when you need to preserve value-based ordering. Use PERCENT_RANK for normalized percentile values.
Performance Considerations and Best Practices
Window functions without partitionBy() pull every row into a single partition, which does not scale to large datasets; with partitionBy(), NTILE instead triggers a shuffle keyed on the partition columns. Follow these practices for optimal performance:
# 1. Cache the source DataFrame if using multiple window functions
sales_df.cache()
# 2. Combine multiple window operations with the same partition/order
window = Window.partitionBy("region").orderBy(col("sales").desc())
result = sales_df \
    .withColumn("quartile", ntile(4).over(window)) \
    .withColumn("rank", rank().over(window))
# 3. Filter before windowing when possible
filtered_df = sales_df.filter(col("sales") > 50000)
result = filtered_df.withColumn("tier", ntile(5).over(window))
# 4. Use appropriate partition keys to avoid skew
# Good: partition by region (balanced distribution)
# Bad: partition by country (if one country dominates)
# 5. Repartition large datasets before complex window operations
sales_df = sales_df.repartition(200, "region")
For very large datasets, consider sampling for exploratory analysis before running full NTILE operations. The computation cost scales linearly with data size, but shuffle operations can become bottlenecks.
NTILE is your go-to function for dividing ordered data into equal buckets. It’s simpler than percentile calculations, more intuitive than rank functions for bucketing, and handles edge cases gracefully. Use it whenever you need to segment data into performance tiers, price bands, or any N-way split of ordered data.