SQL - PERCENT_RANK() and CUME_DIST()

Key Insights

PERCENT_RANK() calculates the relative rank of a row as a percentage from 0 to 1, useful for identifying top/bottom performers in datasets
CUME_DIST() returns the cumulative distribution of values, showing what percentage of rows have values less than or equal to the current row
Both functions are window functions that require OVER() clause with ORDER BY, and they handle ties differently—PERCENT_RANK() gives tied rows the same rank while CUME_DIST() counts all tied rows in the distribution

Understanding PERCENT_RANK()

PERCENT_RANK() calculates the relative rank of each row within a result set as a percentage. The formula is: (rank - 1) / (total rows - 1). This means the first row always gets 0, the last row gets 1, and everything else falls proportionally in between.

SELECT 
    employee_id,
    department,
    salary,
    PERCENT_RANK() OVER (ORDER BY salary) as salary_percentile
FROM employees;

This query shows where each employee’s salary falls in the overall distribution. An employee with a percent_rank of 0.75 earns more than 75% of other employees.

The real power comes when partitioning data:

SELECT 
    employee_id,
    department,
    salary,
    PERCENT_RANK() OVER (
        PARTITION BY department 
        ORDER BY salary
    ) as dept_salary_percentile
FROM employees;

Now each department has its own ranking scale. A developer earning $80K might be at the 60th percentile in engineering but would be at the 90th percentile in marketing.

Understanding CUME_DIST()

CUME_DIST() calculates cumulative distribution—the percentage of rows with values less than or equal to the current row. The formula is: (number of rows with values <= current row) / (total rows).

SELECT 
    product_id,
    product_name,
    price,
    CUME_DIST() OVER (ORDER BY price) as price_distribution
FROM products;

If a product has a CUME_DIST() of 0.80, it means 80% of all products cost the same or less than this product.

The key difference from PERCENT_RANK(): CUME_DIST() never returns 0 (minimum is 1/n where n is the number of rows), and it accounts for tied values differently.

SELECT 
    student_id,
    exam_score,
    PERCENT_RANK() OVER (ORDER BY exam_score) as percent_rank,
    CUME_DIST() OVER (ORDER BY exam_score) as cumulative_dist
FROM exam_results
ORDER BY exam_score;

With scores like [70, 80, 80, 90, 100], PERCENT_RANK() for the first 80 would be 0.25, but CUME_DIST() would be 0.60 because 3 out of 5 students scored 80 or below.

Practical Use Case: Sales Performance Analysis

Analyzing sales rep performance across regions requires understanding both relative ranking and cumulative distribution:

WITH sales_data AS (
    SELECT 
        rep_id,
        rep_name,
        region,
        SUM(sale_amount) as total_sales
    FROM sales
    WHERE sale_date >= '2024-01-01'
    GROUP BY rep_id, rep_name, region
)
SELECT 
    rep_name,
    region,
    total_sales,
    PERCENT_RANK() OVER (
        PARTITION BY region 
        ORDER BY total_sales
    ) as regional_percentile,
    CUME_DIST() OVER (
        PARTITION BY region 
        ORDER BY total_sales
    ) as regional_cumulative,
    CASE 
        WHEN PERCENT_RANK() OVER (
            PARTITION BY region 
            ORDER BY total_sales
        ) >= 0.90 THEN 'Top 10%'
        WHEN PERCENT_RANK() OVER (
            PARTITION BY region 
            ORDER BY total_sales
        ) >= 0.75 THEN 'Top 25%'
        ELSE 'Standard'
    END as performance_tier
FROM sales_data
ORDER BY region, total_sales DESC;

This query segments sales reps into performance tiers within their regions. The PERCENT_RANK() identifies top performers, while CUME_DIST() shows the actual distribution of sales performance.

Identifying Outliers and Anomalies

Both functions excel at detecting outliers in datasets:

WITH response_times AS (
    SELECT 
        endpoint,
        request_timestamp,
        response_time_ms,
        PERCENT_RANK() OVER (
            PARTITION BY endpoint 
            ORDER BY response_time_ms
        ) as response_percentile
    FROM api_logs
    WHERE request_timestamp >= CURRENT_DATE - INTERVAL '7 days'
)
SELECT 
    endpoint,
    request_timestamp,
    response_time_ms,
    response_percentile
FROM response_times
WHERE response_percentile >= 0.95
ORDER BY endpoint, response_time_ms DESC;

This identifies the slowest 5% of API responses per endpoint. You can set alerts when too many requests fall into this category.

For cumulative analysis:

SELECT 
    order_amount,
    COUNT(*) as order_count,
    CUME_DIST() OVER (ORDER BY order_amount) as cumulative_pct
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY order_amount
HAVING CUME_DIST() OVER (ORDER BY order_amount) <= 0.80;

This shows all order amounts that represent the bottom 80% of orders by value, useful for understanding where most of your revenue concentration lies.

Handling Ties and Edge Cases

Understanding how these functions handle ties is critical:

CREATE TABLE test_scores (
    student_id INT,
    score INT
);

INSERT INTO test_scores VALUES 
(1, 70), (2, 80), (3, 80), (4, 80), (5, 90), (6, 100);

SELECT 
    student_id,
    score,
    PERCENT_RANK() OVER (ORDER BY score) as pr,
    CUME_DIST() OVER (ORDER BY score) as cd,
    RANK() OVER (ORDER BY score) as rnk,
    ROW_NUMBER() OVER (ORDER BY score) as rn
FROM test_scores;

Results:

student_id | score | pr   | cd   | rnk | rn
-----------|-------|------|------|-----|---
1          | 70    | 0.00 | 0.17 | 1   | 1
2          | 80    | 0.20 | 0.67 | 2   | 2
3          | 80    | 0.20 | 0.67 | 2   | 3
4          | 80    | 0.20 | 0.67 | 2   | 4
5          | 90    | 0.80 | 0.83 | 5   | 5
6          | 100   | 1.00 | 1.00 | 6   | 6

All three students with score 80 get the same PERCENT_RANK (0.20) but their CUME_DIST is 0.67 because 4 out of 6 students scored 80 or below.

Performance Optimization

Window functions can be expensive. Optimize by limiting the partition size and using appropriate indexes:

-- Inefficient: calculating over entire table
SELECT 
    product_id,
    sale_date,
    revenue,
    PERCENT_RANK() OVER (ORDER BY revenue) as revenue_rank
FROM sales; -- millions of rows

-- Efficient: partition by time period
SELECT 
    product_id,
    sale_date,
    revenue,
    PERCENT_RANK() OVER (
        PARTITION BY DATE_TRUNC('month', sale_date)
        ORDER BY revenue
    ) as monthly_revenue_rank
FROM sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '3 months';

Create indexes on columns used in PARTITION BY and ORDER BY clauses:

CREATE INDEX idx_sales_date_revenue 
ON sales(sale_date, revenue);

Combining with Other Window Functions

PERCENT_RANK() and CUME_DIST() work well alongside other analytical functions:

SELECT 
    product_category,
    product_name,
    units_sold,
    PERCENT_RANK() OVER (
        PARTITION BY product_category 
        ORDER BY units_sold
    ) as sales_percentile,
    NTILE(4) OVER (
        PARTITION BY product_category 
        ORDER BY units_sold
    ) as quartile,
    LAG(units_sold) OVER (
        PARTITION BY product_category 
        ORDER BY units_sold
    ) as previous_product_sales,
    AVG(units_sold) OVER (
        PARTITION BY product_category
    ) as category_avg
FROM product_sales
WHERE sale_year = 2024;

This provides multiple perspectives on product performance: percentile ranking, quartile grouping, comparison with adjacent products, and category averages.

Real-World Application: Dynamic Pricing

Use these functions to implement dynamic pricing based on demand distribution:

WITH booking_demand AS (
    SELECT 
        hotel_id,
        check_in_date,
        COUNT(*) as bookings,
        PERCENT_RANK() OVER (
            PARTITION BY hotel_id
            ORDER BY COUNT(*)
        ) as demand_percentile
    FROM reservations
    WHERE check_in_date BETWEEN CURRENT_DATE AND CURRENT_DATE + INTERVAL '90 days'
    GROUP BY hotel_id, check_in_date
)
SELECT 
    hotel_id,
    check_in_date,
    bookings,
    demand_percentile,
    CASE 
        WHEN demand_percentile >= 0.90 THEN 1.50  -- 50% premium
        WHEN demand_percentile >= 0.75 THEN 1.25  -- 25% premium
        WHEN demand_percentile >= 0.50 THEN 1.10  -- 10% premium
        WHEN demand_percentile >= 0.25 THEN 1.00  -- base price
        ELSE 0.85  -- 15% discount
    END as price_multiplier
FROM booking_demand
ORDER BY hotel_id, check_in_date;

This creates a data-driven pricing strategy based on booking demand distribution, automatically adjusting prices for high-demand dates while discounting low-demand periods.