How to Properly Sort Content Based on User Ratings

Why simple upvote-minus-downvote and average rating approaches fail, and how the Wilson score confidence interval provides the statistically correct way to rank user-rated content.

Problem Statement

You're developing web applications where users rate content. You want highly-rated content to appear at the top and poorly-rated content at the bottom. You need to calculate a "rating" based on user assessments.

Incorrect Solution #1

Rating = (Number of positive votes) − (Number of negative votes)

Why this fails: An item with 600 positive and 400 negative votes (60% positive, rating 200) should rank higher than one with 5500 positive and 4500 negative votes (55% positive, rating 1000). This algorithm reverses the correct order.

Sites making this error: Urban Dictionary
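
The reversal is easy to reproduce. The following is a minimal sketch using the two vote counts from the example above; the function names and item labels are illustrative only.

```python
def net_score(positive, negative):
    """Incorrect Solution #1: positive votes minus negative votes."""
    return positive - negative

def fraction_positive(positive, negative):
    return positive / (positive + negative)

# (positive votes, negative votes) from the example above
items = {
    "small": (600, 400),    # 60% positive
    "large": (5500, 4500),  # 55% positive
}

# Sorting by net score puts "large" first, although "small" has the
# higher share of positive votes.
ranked = sorted(items, key=lambda name: net_score(*items[name]), reverse=True)
```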

Incorrect Solution #2

Rating = Average rating = (Positive votes) / (Total votes)

Why this fails: An item with 2 positive votes and 0 negative votes (100%) ranks above an item with 100 positive votes and 1 negative vote (99%). This penalizes items with more feedback.

Sites making this error: Amazon
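
The same kind of check shows the problem with a plain average. This sketch uses the vote counts from the example above; the names are illustrative only.

```python
def average_rating(positive, negative):
    """Incorrect Solution #2: share of positive votes, ignoring sample size."""
    total = positive + negative
    return positive / total if total else 0.0

# (positive votes, negative votes) from the example above
items = {"two_votes": (2, 0), "many_votes": (100, 1)}

# The item with only two votes wins because 100% > 99%.
ranked = sorted(items, key=lambda name: average_rating(*items[name]), reverse=True)
```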

The Correct Solution

Rating = Lower bound of the Wilson score confidence interval for a Bernoulli parameter

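The method, developed by Edwin B. Wilson in 1927, answers the question: "Given the votes I have observed, what is the smallest value that I can be 95% confident the true positive proportion exceeds?" An item with few votes gets a wide interval and therefore a low lower bound, so it cannot outrank a well-established item on the strength of a handful of ratings.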

Formula:

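\[
\frac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p}) + \dfrac{z_{\alpha/2}^2}{4n}}{n}}}{1 + \dfrac{z_{\alpha/2}^2}{n}}
\]

Use the minus sign to compute the lower bound. Here p̂ (written \hat{p} in the formula) is the observed positive proportion, zα/2 is the (1−α/2) quantile of the standard normal distribution, and n is the total number of votes.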

Ruby Implementation

require 'statistics2'

def ci_lower_bound(pos, n, confidence)
    if n == 0
        return 0
    end
    z = Statistics2.pnormaldist(1-(1-confidence)/2)
    phat = 1.0*pos/n
    (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end

Parameters: pos = number of positive votes, n = total number of votes, confidence = confidence level (use 0.95 for 95% confidence). Use z = 1.96 for 95% confidence if you don't have a statistical library.
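
For comparison, here is a Python sketch of the same function. It assumes Python 3.8 or later, where the standard library's statistics.NormalDist provides the normal quantile, and it adds a default of 0.95 for confidence, which the Ruby version does not have. The sample vote counts are the ones from the two incorrect solutions above.

```python
from math import sqrt
from statistics import NormalDist  # standard library, Python 3.8+

def ci_lower_bound(pos, n, confidence=0.95):
    """Lower bound of the Wilson score interval for pos positive votes out of n."""
    if n == 0:
        return 0.0
    # Same quantile the Ruby code obtains from Statistics2.pnormaldist
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = pos / n
    return (phat + z*z/(2*n) - z * sqrt((phat*(1 - phat) + z*z/(4*n)) / n)) / (1 + z*z/n)

few_votes = ci_lower_bound(2, 2)          # 2 positive, 0 negative
many_votes = ci_lower_bound(100, 101)     # 100 positive, 1 negative
small_item = ci_lower_bound(600, 1000)    # 60% positive
large_item = ci_lower_bound(5500, 10000)  # 55% positive
```

With these inputs the item with 100 positive votes out of 101 scores higher than the item with 2 out of 2, and the 60%-positive item scores higher than the 55%-positive one, which is the ordering both incorrect solutions failed to produce.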

SQL Query

SELECT
    widget_id,
    ((positive + 1.9208) / (positive + negative) -
        -- 1.0 * forces floating-point division on engines that truncate integer division
        1.96 * SQRT((1.0 * positive * negative) / (positive + negative) + 0.9604) /
        (positive + negative)) / (1 + 3.8416 / (positive + negative))
    AS ci_lower_bound
FROM widgets WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
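
The constants in the query are the formula with z = 1.96 substituted in: 1.9208 is z²/2, 0.9604 is z²/4, and 3.8416 is z². The following sketch, with hypothetical function names, compares the query's arithmetic against the formula for a few vote counts.

```python
from math import sqrt, isclose

Z = 1.96  # 95% confidence, as hardcoded in the query

def wilson_lower(pos, neg):
    """Lower bound written the same way as in the formula."""
    n = pos + neg
    phat = pos / n
    return (phat + Z*Z/(2*n) - Z * sqrt((phat*(1 - phat) + Z*Z/(4*n)) / n)) / (1 + Z*Z/n)

def sql_expression(positive, negative):
    """Mirrors the SELECT expression term by term, using floating-point division."""
    total = positive + negative
    return ((positive + 1.9208) / total
            - 1.96 * sqrt((positive * negative) / total + 0.9604) / total) / (1 + 3.8416 / total)

checks = [(2, 0), (100, 1), (600, 400), (5500, 4500)]
all_match = all(
    isclose(wilson_lower(p, q), sql_expression(p, q), rel_tol=1e-9)
    for p, q in checks
)
```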

Alternative Applications

The Wilson confidence interval applies wherever you want to "confidently determine the proportion of people performing a specific action":

  • Spam/abuse detection: What proportion of viewers flagged this as spam?
  • "Best of" lists: What proportion of users marked this as "best"?
  • "Most shared" lists: What proportion of users clicked share?

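For "best of" lists in particular, it can be more useful to measure the proportion of "best" ratings relative to views, downloads, or purchases rather than relative to the number of ratings alone, since a user who views or buys an item and then declines to rate it is also telling you something about its quality.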
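
As an illustration of the spam/abuse case from the list above, the sketch below flags posts whose lower bound on the flag rate exceeds a cutoff. The post data, the 0.05 threshold, and all names are hypothetical.

```python
from math import sqrt

def wilson_lower_bound(successes, n, z=1.96):
    """Wilson lower bound for `successes` occurrences out of `n` trials."""
    if n == 0:
        return 0.0
    phat = successes / n
    return (phat + z*z/(2*n) - z * sqrt((phat*(1 - phat) + z*z/(4*n)) / n)) / (1 + z*z/n)

# Hypothetical data: (spam flags, views) per post
posts = {"post_a": (3, 4), "post_b": (40, 400), "post_c": (0, 250)}
SPAM_THRESHOLD = 0.05  # illustrative cutoff, not a recommendation

# A post is flagged only when even the pessimistic estimate of its
# flag rate is above the cutoff.
likely_spam = sorted(
    name for name, (flags, views) in posts.items()
    if wilson_lower_bound(flags, views) > SPAM_THRESHOLD
)
```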

References

  • Binomial proportion confidence interval (Wikipedia)
  • Agresti, Alan and Brent A. Coull (1998), "Approximate is Better than 'Exact' for Interval Estimation of Binomial Proportions," The American Statistician, 52, 119-126
  • Wilson, E. B. (1927), "Probable Inference, the Law of Succession, and Statistical Inference," Journal of the American Statistical Association, 22, 209-212

JavaScript Implementation

function wilson_score(up, down) {
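    // With no upvotes, return -down so these items sort below every item
    // that has at least one upvote (this also avoids dividing by zero).
    if (!up) return -down;
    var n = up + down;
    // z = 1.64485 is the 95% one-sided quantile (a 90% two-sided interval);
    // z = 1.0 is about 84% one-sided, z = 1.96 is 97.5% one-sided.
    var z = 1.64485;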
    var phat = up / n;
    return (phat+z*z/(2*n)-z*Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n);
}

Python Implementation (Arbitrary Scale)

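from math import sqrt

def wilson_score(sum_rating, n, votes_range = [0, 1]):
    z = 1.64485  # 95% one-sided lower bound (90% two-sided interval)
    v_min = min(votes_range)
    v_width = float(max(votes_range) - v_min)
    if n == 0:
        return v_min  # no votes: bottom of the scale, like the Ruby version's 0
    # Rescale the mean rating to [0, 1] so it can be treated as a proportion
    phat = (sum_rating - n * v_min) / v_width / float(n)
    rating = (phat+z*z/(2*n)-z*sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
    # Map the lower bound back onto the original rating scale
    return rating * v_width + v_min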

Parameters: sum_rating = total sum of all votes, n = number of votes, votes_range = the possible rating range (e.g. [0, 1] for binary, [1, 5] for a 5-star system). Returns a value within the specified range.
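
A usage sketch for a 5-star system follows. It repeats a compact copy of the function so that it runs on its own, with the range defaulting to 1 to 5, and the two items are hypothetical. Note that the function treats the rescaled mean rating as if it were a proportion, which is an approximation for multi-level ratings.

```python
from math import sqrt

def wilson_score(sum_rating, n, votes_range=(1, 5), z=1.64485):
    v_min = min(votes_range)
    v_width = float(max(votes_range) - v_min)
    # Rescale the mean rating to [0, 1]
    phat = (sum_rating - n * v_min) / v_width / n
    rating = (phat + z*z/(2*n) - z * sqrt((phat*(1 - phat) + z*z/(4*n)) / n)) / (1 + z*z/n)
    # Map the lower bound back onto the 1-5 scale
    return rating * v_width + v_min

# Hypothetical items on a 1-5 star scale
new_item = wilson_score(sum_rating=25, n=5)             # five 5-star ratings, mean 5.0
established_item = wilson_score(sum_rating=920, n=200)  # 200 ratings, mean 4.6
```

Although the new item has a perfect mean, its score is lower than that of the established item, because five ratings leave much more uncertainty than two hundred.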