SQL - PERCENT_RANK() and CUME_DIST()
PERCENT_RANK() calculates the relative rank of each row within a result set as a percentage. The formula is: (rank - 1) / (total rows - 1). This means the first row always gets 0, the last row gets...
Key Insights
- PERCENT_RANK() calculates the relative rank of a row as a percentage from 0 to 1, useful for identifying top/bottom performers in datasets
- CUME_DIST() returns the cumulative distribution of values, showing what percentage of rows have values less than or equal to the current row
- Both functions are window functions that require OVER() clause with ORDER BY, and they handle ties differently—PERCENT_RANK() gives tied rows the same rank while CUME_DIST() counts all tied rows in the distribution
Understanding PERCENT_RANK()
PERCENT_RANK() calculates the relative rank of each row within a result set as a percentage. The formula is: (rank - 1) / (total rows - 1). This means the first row always gets 0, the last row gets 1, and everything else falls proportionally in between.
SELECT
employee_id,
department,
salary,
PERCENT_RANK() OVER (ORDER BY salary) as salary_percentile
FROM employees;
This query shows where each employee’s salary falls in the overall distribution. An employee with a percent_rank of 0.75 earns more than 75% of other employees.
The real power comes when partitioning data:
SELECT
employee_id,
department,
salary,
PERCENT_RANK() OVER (
PARTITION BY department
ORDER BY salary
) as dept_salary_percentile
FROM employees;
Now each department has its own ranking scale. A developer earning $80K might be at the 60th percentile in engineering but would be at the 90th percentile in marketing.
Understanding CUME_DIST()
CUME_DIST() calculates cumulative distribution—the percentage of rows with values less than or equal to the current row. The formula is: (number of rows with values <= current row) / (total rows).
SELECT
product_id,
product_name,
price,
CUME_DIST() OVER (ORDER BY price) as price_distribution
FROM products;
If a product has a CUME_DIST() of 0.80, it means 80% of all products cost the same or less than this product.
The key difference from PERCENT_RANK(): CUME_DIST() never returns 0 (minimum is 1/n where n is the number of rows), and it accounts for tied values differently.
SELECT
student_id,
exam_score,
PERCENT_RANK() OVER (ORDER BY exam_score) as percent_rank,
CUME_DIST() OVER (ORDER BY exam_score) as cumulative_dist
FROM exam_results
ORDER BY exam_score;
With scores like [70, 80, 80, 90, 100], PERCENT_RANK() for the first 80 would be 0.25, but CUME_DIST() would be 0.60 because 3 out of 5 students scored 80 or below.
Practical Use Case: Sales Performance Analysis
Analyzing sales rep performance across regions requires understanding both relative ranking and cumulative distribution:
WITH sales_data AS (
SELECT
rep_id,
rep_name,
region,
SUM(sale_amount) as total_sales
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY rep_id, rep_name, region
)
SELECT
rep_name,
region,
total_sales,
PERCENT_RANK() OVER (
PARTITION BY region
ORDER BY total_sales
) as regional_percentile,
CUME_DIST() OVER (
PARTITION BY region
ORDER BY total_sales
) as regional_cumulative,
CASE
WHEN PERCENT_RANK() OVER (
PARTITION BY region
ORDER BY total_sales
) >= 0.90 THEN 'Top 10%'
WHEN PERCENT_RANK() OVER (
PARTITION BY region
ORDER BY total_sales
) >= 0.75 THEN 'Top 25%'
ELSE 'Standard'
END as performance_tier
FROM sales_data
ORDER BY region, total_sales DESC;
This query segments sales reps into performance tiers within their regions. The PERCENT_RANK() identifies top performers, while CUME_DIST() shows the actual distribution of sales performance.
Identifying Outliers and Anomalies
Both functions excel at detecting outliers in datasets:
WITH response_times AS (
SELECT
endpoint,
request_timestamp,
response_time_ms,
PERCENT_RANK() OVER (
PARTITION BY endpoint
ORDER BY response_time_ms
) as response_percentile
FROM api_logs
WHERE request_timestamp >= CURRENT_DATE - INTERVAL '7 days'
)
SELECT
endpoint,
request_timestamp,
response_time_ms,
response_percentile
FROM response_times
WHERE response_percentile >= 0.95
ORDER BY endpoint, response_time_ms DESC;
This identifies the slowest 5% of API responses per endpoint. You can set alerts when too many requests fall into this category.
For cumulative analysis:
SELECT
order_amount,
COUNT(*) as order_count,
CUME_DIST() OVER (ORDER BY order_amount) as cumulative_pct
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY order_amount
HAVING CUME_DIST() OVER (ORDER BY order_amount) <= 0.80;
This shows all order amounts that represent the bottom 80% of orders by value, useful for understanding where most of your revenue concentration lies.
Handling Ties and Edge Cases
Understanding how these functions handle ties is critical:
CREATE TABLE test_scores (
student_id INT,
score INT
);
INSERT INTO test_scores VALUES
(1, 70), (2, 80), (3, 80), (4, 80), (5, 90), (6, 100);
SELECT
student_id,
score,
PERCENT_RANK() OVER (ORDER BY score) as pr,
CUME_DIST() OVER (ORDER BY score) as cd,
RANK() OVER (ORDER BY score) as rnk,
ROW_NUMBER() OVER (ORDER BY score) as rn
FROM test_scores;
Results:
student_id | score | pr | cd | rnk | rn
-----------|-------|------|------|-----|---
1 | 70 | 0.00 | 0.17 | 1 | 1
2 | 80 | 0.20 | 0.67 | 2 | 2
3 | 80 | 0.20 | 0.67 | 2 | 3
4 | 80 | 0.20 | 0.67 | 2 | 4
5 | 90 | 0.80 | 0.83 | 5 | 5
6 | 100 | 1.00 | 1.00 | 6 | 6
All three students with score 80 get the same PERCENT_RANK (0.20) but their CUME_DIST is 0.67 because 4 out of 6 students scored 80 or below.
Performance Optimization
Window functions can be expensive. Optimize by limiting the partition size and using appropriate indexes:
-- Inefficient: calculating over entire table
SELECT
product_id,
sale_date,
revenue,
PERCENT_RANK() OVER (ORDER BY revenue) as revenue_rank
FROM sales; -- millions of rows
-- Efficient: partition by time period
SELECT
product_id,
sale_date,
revenue,
PERCENT_RANK() OVER (
PARTITION BY DATE_TRUNC('month', sale_date)
ORDER BY revenue
) as monthly_revenue_rank
FROM sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '3 months';
Create indexes on columns used in PARTITION BY and ORDER BY clauses:
CREATE INDEX idx_sales_date_revenue
ON sales(sale_date, revenue);
Combining with Other Window Functions
PERCENT_RANK() and CUME_DIST() work well alongside other analytical functions:
SELECT
product_category,
product_name,
units_sold,
PERCENT_RANK() OVER (
PARTITION BY product_category
ORDER BY units_sold
) as sales_percentile,
NTILE(4) OVER (
PARTITION BY product_category
ORDER BY units_sold
) as quartile,
LAG(units_sold) OVER (
PARTITION BY product_category
ORDER BY units_sold
) as previous_product_sales,
AVG(units_sold) OVER (
PARTITION BY product_category
) as category_avg
FROM product_sales
WHERE sale_year = 2024;
This provides multiple perspectives on product performance: percentile ranking, quartile grouping, comparison with adjacent products, and category averages.
Real-World Application: Dynamic Pricing
Use these functions to implement dynamic pricing based on demand distribution:
WITH booking_demand AS (
SELECT
hotel_id,
check_in_date,
COUNT(*) as bookings,
PERCENT_RANK() OVER (
PARTITION BY hotel_id
ORDER BY COUNT(*)
) as demand_percentile
FROM reservations
WHERE check_in_date BETWEEN CURRENT_DATE AND CURRENT_DATE + INTERVAL '90 days'
GROUP BY hotel_id, check_in_date
)
SELECT
hotel_id,
check_in_date,
bookings,
demand_percentile,
CASE
WHEN demand_percentile >= 0.90 THEN 1.50 -- 50% premium
WHEN demand_percentile >= 0.75 THEN 1.25 -- 25% premium
WHEN demand_percentile >= 0.50 THEN 1.10 -- 10% premium
WHEN demand_percentile >= 0.25 THEN 1.00 -- base price
ELSE 0.85 -- 15% discount
END as price_multiplier
FROM booking_demand
ORDER BY hotel_id, check_in_date;
This creates a data-driven pricing strategy based on booking demand distribution, automatically adjusting prices for high-demand dates while discounting low-demand periods.