Prediction Star System

Other possible names: “Cost to outperform”?

“Forecast quality rating system?”

“Cost to achieve accuracy”?

“Adjusted Forecasting Outperformance Cost”

“Effort score/index/rating”

“Robustness”

Some group forecasts are the result of so little investigation as to be near useless, and others are near impossible to outperform without large data and analyst teams.

This topic parallels discussion around the efficient market hypothesis. It’s quite apparent that there are professional groups that can outperform the stock market in highly specific areas (as done by hedge funds and select traders), but it takes significant fixed and marginal costs to do so. Good financial institutions find areas that require relatively few intellectual resources for a set amount of return. This could be formalized in equations for something like the “quality adjusted intellectual effort” necessary to make specific amounts of money in various parts of financial markets.

[Note: if someone can recommend discussions of the economics of the costs & benefits of pursuing different trading strategies, and how that impacts the greater stock market, let me know!]

Different questions on PredictIt, Metaculus, Polymarket, and other platforms get dramatically different levels of activity. It should be assumed that this, along with other factors, can dramatically affect the “quality” of forecasts. For example, in mid-2020, “2020 Democratic VP nominee” had 39.7M shares traded, but “Which party will win DE in 2020?” had 82 shares traded. If these two markets were assumed to be similarly biased in other ways, then a reader should assume that the first question may be fairly difficult to outperform, but the second much easier.

On Metaculus there is an “interest” score for each question and a count of the “total predictions”. These numbers vary dramatically by question. It’s not apparent exactly how to translate this into the question of how much observers of various kinds should trust different aggregates.

An example in finance may be the current Apple stock price. This is a very heavily investigated metric, with large teams of analysts working hard to forecast it. A 20-person full-time smart forecasting team could spend 2 years trying to forecast Apple’s stock price 1 year out, and wind up not being able to outperform the existing stock price. If one spent $1 trillion setting up an effective institution to predict Apple’s stock price better than the market, success seems quite possible, though of course it would not be profitable.

Quality here means something roughly like “the challenge to outperform”, which is different from calibration. It’s quite possible to have high calibration but provide an estimate that’s easy to outperform. A naive example would be something like an estimate of “50%” on each of a long list of randomly chosen binary policy questions.
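To make the distinction concrete, here is a minimal sketch with made-up data. The ten outcomes and both forecast lists are illustrative assumptions, not real questions. Both forecasters are perfectly calibrated on this toy set, yet the one who always answers 50% has a worse (higher) Brier score and is therefore easy to outperform.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)


def observed_frequency(forecasts, outcomes, p):
    """Share of questions resolving 'yes' among those forecast at probability p."""
    resolved = [o for f, o in zip(forecasts, outcomes) if f == p]
    return sum(resolved) / len(resolved)


# Hypothetical resolutions for ten binary questions (1 = yes, 0 = no).
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# Naive forecaster: 50% on everything. Half the questions resolve yes,
# so the 50% bucket is calibrated.
naive = [0.5] * 10

# Informed forecaster: 80% on the first five (4 of 5 resolve yes) and
# 20% on the last five (1 of 5 resolves yes), so both buckets are calibrated.
informed = [0.8] * 5 + [0.2] * 5

print(observed_frequency(naive, outcomes, 0.5))     # calibration of the 50% bucket
print(observed_frequency(informed, outcomes, 0.8))  # calibration of the 80% bucket
print(brier_score(naive, outcomes), brier_score(informed, outcomes))
```

In this toy case the informed forecaster matches the naive one on calibration but beats it on Brier score, which is the sense in which the 50% estimate is “easy to outperform”.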

In a very simple model, we can imagine that forecast quality is solely a function of how many forecaster-hours went into a given investigation.

Score_simple = number of forecaster hours

This is probably approximately proportional to Metaculus’s interest and number of forecasts, which probably correlate well with hours spent forecasting.

Perhaps we can assume that each prediction on Metaculus corresponds to 20 minutes of investigation. Then a Metaculus question with 300 predictions would have a score of “100 research hours”. An outsider who wanted to beat this prediction would then be expected to need roughly 100 research hours to do it.

Say one question is asked both on Metaculus and PredictIt. On Metaculus, 40 people spend a total of 80 hours on that question over the course of a year. On PredictIt, 80 people spend a total of 50 hours on the question over the last three months. According to Score_simple, score(Metaculus) > score(PredictIt).
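The arithmetic above can be written out as a short sketch. The function names are hypothetical, and the 20-minutes-per-prediction figure is the guess from the text rather than a measured value.

```python
MINUTES_PER_PREDICTION = 20  # the text's rough guess, not a measured value


def score_simple(forecaster_hours):
    """Score_simple: the score is just the number of forecaster hours."""
    return forecaster_hours


def hours_from_predictions(n_predictions, minutes_per_prediction=MINUTES_PER_PREDICTION):
    """Estimate forecaster hours from a count of predictions."""
    return n_predictions * minutes_per_prediction / 60


# A Metaculus question with 300 predictions.
metaculus_300 = score_simple(hours_from_predictions(300))

# The same question asked on two platforms, using the total hours from the text.
metaculus_score = score_simple(80)   # 40 people, 80 hours in total
predictit_score = score_simple(50)   # 80 people, 50 hours in total

print(metaculus_300)                      # 100.0 research hours
print(metaculus_score > predictit_score)  # True
```

Note that Score_simple ignores both the number of people and the time window, which is part of why it is too crude.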

Obviously the previous score is not quite right. It’s incredibly simple. Let’s point out problems, then try to get around them.

Problem 1: New information may have emerged

It could be the case that the 100 research hours spent on a Metaculus question happened 2 years ago, and significant new evidence has come out since. The previous forecast should still remain calibrated, but maybe now it could be outperformed by a 2 research hour forecast.

Potential Solution: Timestamps

Prediction scores could have timestamps of when they were made and checked. It would be up to the readers to determine how much things have changed and what they should consider the current score to be.

Potential Solution: Continuous decay

Prediction scores could be represented as functions that decrease over time. Maybe these are hooked up to algorithms that take in feeds of possibly relevant data and can trigger reductions when necessary. For instance, a prediction of the global population in 2040 could be stable for many months, but a news spike around the term “global pandemic” could raise an alarm, dramatically decreasing the score. Humans could also review these from time to time and change the scores manually.
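One way to picture this is a score that decays exponentially and can also be cut by discrete “shocks”. This is only a sketch: the exponential form, the one-year half-life, and the shock multipliers are all assumptions made for illustration, not something the proposal specifies.

```python
def decayed_score(initial_hours, days_elapsed, half_life_days=365.0, shocks=()):
    """Score after `days_elapsed`, assuming exponential decay plus optional shocks.

    half_life_days: assumed time for the score to halve with no news.
    shocks: multipliers in (0, 1], one per alarm (for example a news spike
            or a manual downgrade by a human reviewer).
    """
    score = initial_hours * 0.5 ** (days_elapsed / half_life_days)
    for multiplier in shocks:
        if not 0 < multiplier <= 1:
            raise ValueError("shock multipliers must be in (0, 1]")
        score *= multiplier
    return score


# A 100-research-hour forecast, one year later, with no major news.
print(decayed_score(100, 365))

# The same forecast after a hypothetical alarm that removes 90% of its value.
print(decayed_score(100, 365, shocks=(0.1,)))
```

A real system would need to choose the decay rate per question type, since some forecasts go stale far faster than others.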

Problem 2: Forecasters differ in quality

It could be that one Metaculus question has 1000 forecasts, but they all come from very new and inexperienced forecasters. There’s quite a bit of evidence that Superforecasters are not just more calibrated but also achieve better resolution than other forecasters.

Potential Solution: Quality Adjustment

“Quality adjustment” is used in “Quality Adjusted Life Years” to help cross compare various quality and duration of life adjustments. Here it can be used to compare forecasting setups. Perhaps it’s expected that 500 inexperienced forecasters working for 1 hour each would achieve the same amount of accuracy as an experienced team of 5 forecasters, working for 5 hours each.

Each forecasting team is different. Even the same team will vary from question set to question set, as they may become more experienced over time. It probably would be a bit much to try to quantitatively compare every variation of these teams, but one could do this with a much coarser granularity. For example, one method could involve having different weights for different forecasting platforms, and leaving it at that.
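A coarse version of this adjustment could look like the sketch below. The weight for inexperienced forecasters is derived from the example above (500 inexperienced hours treated as equal to 5 × 5 = 25 experienced hours); the group labels and function name are hypothetical, and the same structure could hold per-platform weights instead.

```python
# Weights convert raw hours into "experienced-forecaster hours".
# 500 inexperienced hours ~ 25 experienced hours, per the example in the text.
QUALITY_WEIGHTS = {
    "experienced": 1.0,
    "inexperienced": 25 / 500,
}


def quality_adjusted_hours(contributions, weights=QUALITY_WEIGHTS):
    """Sum weighted hours over (group, n_forecasters, hours_each) tuples."""
    return sum(weights[group] * n * hours for group, n, hours in contributions)


crowd = [("inexperienced", 500, 1)]
small_team = [("experienced", 5, 5)]

print(quality_adjusted_hours(crowd))       # about 25 adjusted hours
print(quality_adjusted_hours(small_team))  # 25.0 adjusted hours
```

The linear weighting is itself an assumption; it ignores diminishing returns and any interaction between forecasters.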

Problem 3: Different arrangements of forecasters would produce outputs of differing quality

The wisdom of crowds is well-researched. It could well be the case that 20 great forecasters working for 2 hours each are expected to achieve significantly better resolution than 1 great forecaster working for 40 hours.

It’s also recognized that putting forecasters into collaborative teams improves accuracy over situations where they all work individually. Collaboration matters and should help total accuracy. In particular, collaboration might reduce redundancy and allow for specialization: in particularly adversarial prediction market setups, each participant might have to research all pieces of information by themselves.

Potential Solution: Quality Adjustment

This could broadly be solved using methods similar to those in Problem 2.

Problem 4: Available information needs to be accounted for

Imagine that a question on Metaculus only has 20 predictions, but is exactly the same as a heavily traded question on PredictIt. The Metaculus forecasters seem to be simply tracking the PredictIt score.

This brings up a tricky clarification. A reader would have to spend 1 hour to replicate the Metaculus forecast’s accuracy if they had access to the PredictIt score, but 5000 hours if they couldn’t see the PredictIt score.
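One way to express this is to make the score conditional on what the reader can already see. The sketch below is an illustration only; the function name and the idea of taking the cheapest available route are assumptions, and the hour figures are the ones from the example above.

```python
def hours_to_match(own_research_hours, reference_costs, visible_references):
    """Cheapest way to match a forecast, given which references are visible.

    own_research_hours: hours needed to match the forecast from scratch.
    reference_costs: hours needed to simply track each reference aggregate.
    visible_references: names of the references the reader can actually see.
    """
    options = [own_research_hours]
    options += [
        cost for name, cost in reference_costs.items() if name in visible_references
    ]
    return min(options)


reference_costs = {"PredictIt": 1}  # about 1 hour to copy the PredictIt price

print(hours_to_match(5000, reference_costs, {"PredictIt"}))  # 1
print(hours_to_match(5000, reference_costs, set()))          # 5000
```

Under this framing a single question has several scores, one per information set, rather than a single number.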

Problem 5: Problems differ in difficulty profiles

Problem 6: Difficulty to outperform is not the only measure of value

Applying Prediction Star Systems In Practice

In practice, some current platforms apply some of these ideas some of the time. Metaculus displays an interest score and the number of forecasts, but this isn’t adjusted for forecaster quality. PredictIt and Polymarket display trade volume and make the number of trades available. Various Good Judgment dashboards display neither the number of superforecasters nor the amount of time they spent, but presumably these remain relatively standardized from question to question.

Ideally, a third party, such as Metaforecast, could come in and determine a common unit (such as stars), and make these comparable. In practice this is difficult, because the solutions proposed above are difficult to automate efficiently. So far, we have instead resorted to asking people familiar with multiple platforms to directly give their quality assessment.

Restored with permission (Nuño’s comments, with Ozzie’s replies).

Nuño Sempere: I’d go with “Prediction Star Systems”

Nuño Sempere: Random thought: In practice, I notice that something which would highly correlate with the forecast rating would be the amount I’m willing to bet. I’d be willing to bet $1 : $5 on a prediction I spent 10 mins coming up with, and say $1000 : $5000 on a prediction I’d spent a year researching, even though the implied odds are the same.