
How to Properly Sort Content Based on User Ratings

Why simple upvote-minus-downvote and average rating approaches fail, and how the Wilson score confidence interval provides the statistically correct way to rank user-rated content.

Problem Statement

You're developing web applications where users rate content. You want highly-rated content to appear at the top and poorly-rated content at the bottom. You need to calculate a "rating" based on user assessments.

Incorrect Solution #1

Rating = (Number of positive votes) − (Number of negative votes)

Why this fails: An item with 600 positive and 400 negative votes (60% positive, rating 200) should rank higher than one with 5500 positive and 4500 negative votes (55% positive, rating 1000). This algorithm reverses the correct order.
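The reversal is easy to check numerically. The short Python sketch below uses the two vote counts from the example; the item names "A" and "B" and the helper names are made up for illustration.

```python
# Vote counts from the example above, as (positive, negative).
# The labels "A" and "B" are hypothetical.
items = {"A": (600, 400), "B": (5500, 4500)}

def difference_score(pos, neg):
    """Incorrect Solution #1: positive votes minus negative votes."""
    return pos - neg

def positive_share(pos, neg):
    """Fraction of all votes that are positive."""
    return pos / (pos + neg)

# Sort best-first under each scoring rule.
by_difference = sorted(items, key=lambda k: difference_score(*items[k]), reverse=True)
by_share = sorted(items, key=lambda k: positive_share(*items[k]), reverse=True)

print(by_difference)  # ['B', 'A']: the 55% item wins on raw difference
print(by_share)       # ['A', 'B']: the 60% item has the better share
```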

Sites making this error: Urban Dictionary

Incorrect Solution #2

Rating = Average rating = (Positive votes) / (Total votes)

Why this fails: An item with 2 positive votes and 0 negative votes (100%) ranks above an item with 100 positive votes and 1 negative vote (99%). This penalizes items with more feedback.
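The same kind of check works for the average. This is a minimal Python sketch using the two vote counts from the example; the function name is hypothetical.

```python
def average_rating(pos, neg):
    """Incorrect Solution #2: share of positive votes, ignoring sample size."""
    return pos / (pos + neg)

few_votes = average_rating(2, 0)     # two votes, both positive
many_votes = average_rating(100, 1)  # 101 votes, one negative

# The two-vote item outranks the item with far more evidence behind it.
print(few_votes > many_votes)  # True
```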

Sites making this error: Amazon

The Correct Solution

Rating = Lower bound of the Wilson score confidence interval for a Bernoulli parameter

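The mathematical framework, developed by Edwin B. Wilson in 1927, answers the question: "Given the votes I have observed, what is the lowest value the true proportion of positive votes plausibly takes at my chosen confidence level (for example, 95%)?" Ranking by that lower bound balances the observed proportion against the uncertainty that comes with a small number of votes.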

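Formula:

\[ \frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p}) + \frac{z_{\alpha/2}^2}{4n}}{n}}}{1 + \frac{z_{\alpha/2}^2}{n}} \]

Use the minus sign to compute the lower bound. Here p̂ is the observed positive proportion, z_α/2 is the (1−α/2) quantile of the standard normal distribution, and n is the total number of votes.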
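As a rough illustration, the lower bound can be sketched in Python and applied to the vote counts from the two incorrect solutions. The function name `wilson_lower_bound` and the default z = 1.96 (roughly 95% confidence) are choices made for this sketch.

```python
from math import sqrt

def wilson_lower_bound(pos, n, z=1.96):
    """Lower bound of the Wilson score interval (z=1.96 is roughly 95%)."""
    if n == 0:
        return 0.0  # no votes: nothing can be claimed about the item
    phat = pos / n                      # observed positive proportion
    centre = phat + z * z / (2 * n)     # shifted centre of the interval
    spread = z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - spread) / (1 + z * z / n)

# (positive votes, total votes) from the examples in the incorrect solutions
examples = {
    "2 of 2": (2, 2),
    "100 of 101": (100, 101),
    "600 of 1000": (600, 1000),
    "5500 of 10000": (5500, 10000),
}
for label, (pos, n) in examples.items():
    print(f"{label}: {wilson_lower_bound(pos, n):.3f}")
```

With these inputs the 100-of-101 item scores above the 2-of-2 item, and the 600-of-1000 item scores above the 5500-of-10000 item, which is the ordering both incorrect solutions failed to produce.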

Ruby Implementation

require 'statistics2'

def ci_lower_bound(pos, n, confidence)
    if n == 0
        return 0
    end
    z = Statistics2.pnormaldist(1-(1-confidence)/2)
    phat = 1.0*pos/n
    (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end

Parameters: pos = number of positive votes, n = total number of votes, confidence = confidence level (use 0.95 for 95% confidence). Use z = 1.96 for 95% confidence if you don't have a statistical library.
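For comparison, the z value for a given confidence level can also be obtained from Python's standard library (Python 3.8 or later). This is a sketch of the same quantile lookup the Ruby code performs; the helper name `z_for_confidence` is hypothetical.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """(1 - alpha/2) quantile of the standard normal, where alpha = 1 - confidence."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

print(round(z_for_confidence(0.95), 2))  # 1.96
```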

SQL Query

SELECT
    widget_id,
    -- 1.0 * forces floating-point division when positive and negative are integer columns
    ((positive + 1.9208) / (positive + negative) -
        1.96 * SQRT((1.0 * positive * negative) / (positive + negative) + 0.9604) /
        (positive + negative)) / (1 + 3.8416 / (positive + negative))
    AS ci_lower_bound
FROM widgets WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
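The constants in the query are the 95% z value folded into the formula: 3.8416 = 1.96², 1.9208 = 1.96²/2, and 0.9604 = 1.96²/4. The Python sketch below transcribes the SQL expression and checks it against the unsimplified formula; both function names are hypothetical.

```python
from math import sqrt

Z = 1.96  # z value for roughly 95% confidence

def sql_style(positive, negative):
    """Direct transcription of the SQL expression, constants included."""
    total = positive + negative
    return ((positive + 1.9208) / total
            - 1.96 * sqrt((positive * negative) / total + 0.9604) / total
            ) / (1 + 3.8416 / total)

def formula_style(positive, negative):
    """The same lower bound written in terms of phat, n and z."""
    n = positive + negative
    phat = positive / n
    return (phat + Z * Z / (2 * n)
            - Z * sqrt((phat * (1 - phat) + Z * Z / (4 * n)) / n)
            ) / (1 + Z * Z / n)

# The two forms agree up to floating-point rounding.
for pos, neg in [(2, 0), (100, 1), (600, 400), (5500, 4500)]:
    assert abs(sql_style(pos, neg) - formula_style(pos, neg)) < 1e-9
```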

Alternative Applications

The Wilson confidence interval applies wherever you want to "confidently determine the proportion of people performing a specific action":

Spam/abuse detection: What proportion of viewers flagged this as spam?
"Best of" lists: What proportion of users marked this as "best"?
"Most shared" lists: What proportion of users clicked share?

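For "best of" lists, it can work better to measure positive ratings relative to views, downloads, or purchases rather than relative to the total number of ratings, since user inaction itself contains information about quality: someone who views or buys an item and then declines to rate it is also telling you something.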
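As an illustration of the "most shared" case, the sketch below ranks two made-up items by the Wilson lower bound of shares per view. The item names and counts are invented for the example, and z = 1.96 is assumed.

```python
from math import sqrt

def wilson_lower_bound(successes, trials, z=1.96):
    """Lower bound of the Wilson score interval for successes out of trials."""
    if trials == 0:
        return 0.0
    phat = successes / trials
    return (phat + z * z / (2 * trials)
            - z * sqrt((phat * (1 - phat) + z * z / (4 * trials)) / trials)
            ) / (1 + z * z / trials)

# Hypothetical data: (shares, views) per item.
items = {"new_post": (5, 10), "old_post": (300, 1000)}

raw = sorted(items, key=lambda k: items[k][0] / items[k][1], reverse=True)
wilson = sorted(items, key=lambda k: wilson_lower_bound(*items[k]), reverse=True)

print(raw)     # ['new_post', 'old_post']: 50% of 10 views beats 30% of 1000
print(wilson)  # ['old_post', 'new_post']: the larger sample is more trustworthy
```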

References

Binomial proportion confidence interval (Wikipedia)
Agresti, Alan and Brent A. Coull (1998), "Approximate is Better than 'Exact' for Interval Estimation of Binomial Proportions," The American Statistician, 52, 119-126
Wilson, E. B. (1927), "Probable Inference, the Law of Succession, and Statistical Inference," Journal of the American Statistical Association, 22, 209-212

JavaScript Implementation

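function wilson_score(up, down) {
    // With no upvotes, rank by downvotes alone: more downvotes sorts lower.
    if (!up) return -down;
    var n = up + down;
    // z = 1.64485 is the one-sided 95% quantile (a two-sided 90% interval);
    // use 1.96 to match the two-sided 95% level of the Ruby and SQL versions.
    var z = 1.64485;
    var phat = up / n;
    return (phat+z*z/(2*n)-z*Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n);
}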

Python Implementation (Arbitrary Scale)

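from math import sqrt

def wilson_score(sum_rating, n, votes_range=(0, 1)):
    v_min = min(votes_range)
    if n == 0:
        return v_min  # no votes yet: fall back to the bottom of the scale
    z = 1.64485  # one-sided 95% quantile of the standard normal
    v_width = float(max(votes_range) - v_min)
    # Rescale the mean rating to [0, 1] so it can stand in for a proportion.
    phat = (sum_rating - n * v_min) / v_width / float(n)
    rating = (phat+z*z/(2*n)-z*sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
    return rating * v_width + v_min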

Parameters: sum_rating = total sum of all votes, n = number of votes, votes_range = the possible rating range (e.g. [0, 1] for binary, [1, 5] for a 5-star system). Returns a value within the specified range.
