SQL - PERCENT_RANK() and CUME_DIST()

PERCENT_RANK() calculates the relative rank of each row within a result set as a percentage. The formula is: (rank - 1) / (total rows - 1). This means the first row always gets 0, the last row gets...

Key Insights

  • PERCENT_RANK() calculates the relative rank of a row as a percentage from 0 to 1, useful for identifying top/bottom performers in datasets
  • CUME_DIST() returns the cumulative distribution of values, showing what percentage of rows have values less than or equal to the current row
  • Both functions are window functions that require OVER() clause with ORDER BY, and they handle ties differently—PERCENT_RANK() gives tied rows the same rank while CUME_DIST() counts all tied rows in the distribution

Understanding PERCENT_RANK()

PERCENT_RANK() calculates the relative rank of each row within a result set as a percentage. The formula is: (rank - 1) / (total rows - 1). This means the first row always gets 0, the last row gets 1, and everything else falls proportionally in between.

SELECT 
    employee_id,
    department,
    salary,
    PERCENT_RANK() OVER (ORDER BY salary) as salary_percentile
FROM employees;

This query shows where each employee’s salary falls in the overall distribution. An employee with a percent_rank of 0.75 earns more than 75% of other employees.

The real power comes when partitioning data:

SELECT 
    employee_id,
    department,
    salary,
    PERCENT_RANK() OVER (
        PARTITION BY department 
        ORDER BY salary
    ) as dept_salary_percentile
FROM employees;

Now each department has its own ranking scale. A developer earning $80K might be at the 60th percentile in engineering but would be at the 90th percentile in marketing.

Understanding CUME_DIST()

CUME_DIST() calculates cumulative distribution—the percentage of rows with values less than or equal to the current row. The formula is: (number of rows with values <= current row) / (total rows).

SELECT 
    product_id,
    product_name,
    price,
    CUME_DIST() OVER (ORDER BY price) as price_distribution
FROM products;

If a product has a CUME_DIST() of 0.80, it means 80% of all products cost the same or less than this product.

The key difference from PERCENT_RANK(): CUME_DIST() never returns 0 (minimum is 1/n where n is the number of rows), and it accounts for tied values differently.

SELECT 
    student_id,
    exam_score,
    PERCENT_RANK() OVER (ORDER BY exam_score) as percent_rank,
    CUME_DIST() OVER (ORDER BY exam_score) as cumulative_dist
FROM exam_results
ORDER BY exam_score;

With scores like [70, 80, 80, 90, 100], PERCENT_RANK() for the first 80 would be 0.25, but CUME_DIST() would be 0.60 because 3 out of 5 students scored 80 or below.

Practical Use Case: Sales Performance Analysis

Analyzing sales rep performance across regions requires understanding both relative ranking and cumulative distribution:

WITH sales_data AS (
    SELECT 
        rep_id,
        rep_name,
        region,
        SUM(sale_amount) as total_sales
    FROM sales
    WHERE sale_date >= '2024-01-01'
    GROUP BY rep_id, rep_name, region
)
SELECT 
    rep_name,
    region,
    total_sales,
    PERCENT_RANK() OVER (
        PARTITION BY region 
        ORDER BY total_sales
    ) as regional_percentile,
    CUME_DIST() OVER (
        PARTITION BY region 
        ORDER BY total_sales
    ) as regional_cumulative,
    CASE 
        WHEN PERCENT_RANK() OVER (
            PARTITION BY region 
            ORDER BY total_sales
        ) >= 0.90 THEN 'Top 10%'
        WHEN PERCENT_RANK() OVER (
            PARTITION BY region 
            ORDER BY total_sales
        ) >= 0.75 THEN 'Top 25%'
        ELSE 'Standard'
    END as performance_tier
FROM sales_data
ORDER BY region, total_sales DESC;

This query segments sales reps into performance tiers within their regions. The PERCENT_RANK() identifies top performers, while CUME_DIST() shows the actual distribution of sales performance.

Identifying Outliers and Anomalies

Both functions excel at detecting outliers in datasets:

WITH response_times AS (
    SELECT 
        endpoint,
        request_timestamp,
        response_time_ms,
        PERCENT_RANK() OVER (
            PARTITION BY endpoint 
            ORDER BY response_time_ms
        ) as response_percentile
    FROM api_logs
    WHERE request_timestamp >= CURRENT_DATE - INTERVAL '7 days'
)
SELECT 
    endpoint,
    request_timestamp,
    response_time_ms,
    response_percentile
FROM response_times
WHERE response_percentile >= 0.95
ORDER BY endpoint, response_time_ms DESC;

This identifies the slowest 5% of API responses per endpoint. You can set alerts when too many requests fall into this category.

For cumulative analysis:

SELECT 
    order_amount,
    COUNT(*) as order_count,
    CUME_DIST() OVER (ORDER BY order_amount) as cumulative_pct
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY order_amount
HAVING CUME_DIST() OVER (ORDER BY order_amount) <= 0.80;

This shows all order amounts that represent the bottom 80% of orders by value, useful for understanding where most of your revenue concentration lies.

Handling Ties and Edge Cases

Understanding how these functions handle ties is critical:

CREATE TABLE test_scores (
    student_id INT,
    score INT
);

INSERT INTO test_scores VALUES 
(1, 70), (2, 80), (3, 80), (4, 80), (5, 90), (6, 100);

SELECT 
    student_id,
    score,
    PERCENT_RANK() OVER (ORDER BY score) as pr,
    CUME_DIST() OVER (ORDER BY score) as cd,
    RANK() OVER (ORDER BY score) as rnk,
    ROW_NUMBER() OVER (ORDER BY score) as rn
FROM test_scores;

Results:

student_id | score | pr   | cd   | rnk | rn
-----------|-------|------|------|-----|---
1          | 70    | 0.00 | 0.17 | 1   | 1
2          | 80    | 0.20 | 0.67 | 2   | 2
3          | 80    | 0.20 | 0.67 | 2   | 3
4          | 80    | 0.20 | 0.67 | 2   | 4
5          | 90    | 0.80 | 0.83 | 5   | 5
6          | 100   | 1.00 | 1.00 | 6   | 6

All three students with score 80 get the same PERCENT_RANK (0.20) but their CUME_DIST is 0.67 because 4 out of 6 students scored 80 or below.

Performance Optimization

Window functions can be expensive. Optimize by limiting the partition size and using appropriate indexes:

-- Inefficient: calculating over entire table
SELECT 
    product_id,
    sale_date,
    revenue,
    PERCENT_RANK() OVER (ORDER BY revenue) as revenue_rank
FROM sales; -- millions of rows

-- Efficient: partition by time period
SELECT 
    product_id,
    sale_date,
    revenue,
    PERCENT_RANK() OVER (
        PARTITION BY DATE_TRUNC('month', sale_date)
        ORDER BY revenue
    ) as monthly_revenue_rank
FROM sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '3 months';

Create indexes on columns used in PARTITION BY and ORDER BY clauses:

CREATE INDEX idx_sales_date_revenue 
ON sales(sale_date, revenue);

Combining with Other Window Functions

PERCENT_RANK() and CUME_DIST() work well alongside other analytical functions:

SELECT 
    product_category,
    product_name,
    units_sold,
    PERCENT_RANK() OVER (
        PARTITION BY product_category 
        ORDER BY units_sold
    ) as sales_percentile,
    NTILE(4) OVER (
        PARTITION BY product_category 
        ORDER BY units_sold
    ) as quartile,
    LAG(units_sold) OVER (
        PARTITION BY product_category 
        ORDER BY units_sold
    ) as previous_product_sales,
    AVG(units_sold) OVER (
        PARTITION BY product_category
    ) as category_avg
FROM product_sales
WHERE sale_year = 2024;

This provides multiple perspectives on product performance: percentile ranking, quartile grouping, comparison with adjacent products, and category averages.

Real-World Application: Dynamic Pricing

Use these functions to implement dynamic pricing based on demand distribution:

WITH booking_demand AS (
    SELECT 
        hotel_id,
        check_in_date,
        COUNT(*) as bookings,
        PERCENT_RANK() OVER (
            PARTITION BY hotel_id
            ORDER BY COUNT(*)
        ) as demand_percentile
    FROM reservations
    WHERE check_in_date BETWEEN CURRENT_DATE AND CURRENT_DATE + INTERVAL '90 days'
    GROUP BY hotel_id, check_in_date
)
SELECT 
    hotel_id,
    check_in_date,
    bookings,
    demand_percentile,
    CASE 
        WHEN demand_percentile >= 0.90 THEN 1.50  -- 50% premium
        WHEN demand_percentile >= 0.75 THEN 1.25  -- 25% premium
        WHEN demand_percentile >= 0.50 THEN 1.10  -- 10% premium
        WHEN demand_percentile >= 0.25 THEN 1.00  -- base price
        ELSE 0.85  -- 15% discount
    END as price_multiplier
FROM booking_demand
ORDER BY hotel_id, check_in_date;

This creates a data-driven pricing strategy based on booking demand distribution, automatically adjusting prices for high-demand dates while discounting low-demand periods.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.