Prediction Star System

A 2020 draft proposing a way to rate forecasts by how hard they are to outperform — roughly, the analyst effort required to beat them — rather than by calibration. It opens with candidate names and an analogy to the efficient-market hypothesis, then proposes a simple baseline score equal to the number of (quality-adjusted) forecaster-hours invested, estimated from platform signals such as Metaculus prediction counts or trade volume. It then lists complications: stale forecasts when new information arrives (with timestamp and decay fixes), variation in forecaster quality, different arrangements of forecasters, and the availability of information from other platforms. The draft becomes progressively less developed toward the end: Problems 5 and 6 (“Problems differ in difficulty profiles,” “Difficulty to outperform is not the only measure of value”) are headings with no content. A closing section notes how existing platforms partially display such signals and suggests a third party like Metaforecast could compute a common unit, while acknowledging the proposed adjustments are hard to automate. Nuño Sempere’s comments are restored at the end.

Other possible names: “Cost to outperform”?

“Forecast quality rating system?”

“Cost to achieve accuracy”?

“Adjusted Forecasting Outperformance Cost”

“Effort score/index/rating”

“Robustness”

Background

Some group forecasts are the result of so little investigation as to be near useless, and others are near impossible to outperform without large data and analyst teams.

This topic parallels discussion around the efficient market hypothesis. It’s quite apparent that there are professional groups that can outperform the stock market in highly specific areas (as done by hedge funds and select traders), but it takes significant fixed and marginal costs to do so. Good financial institutions find areas that require relatively few intellectual resources for a set amount of return. This could be formalized in equations for something like the “quality adjusted intellectual effort” necessary to make specific amounts of money in various parts of financial markets.

[Note: if someone can recommend discussions of the economics of the costs & benefits of pursuing different trading strategies, and how that impacts the greater stock market, let me know!]

Different questions on PredictIt, Metaculus, Polymarket, and other platforms get dramatically different levels of activity. This, along with other factors, should be expected to change the “quality” of forecasts dramatically. For example, in mid 2020, “2020 Democratic VP nominee” had 39.7M shares traded, but “Which party will win DE in 2020?” had 82 shares traded. If these two markets were assumed to be similarly biased in other ways, then a reader should assume that the first question may be fairly difficult to outperform, but the second much easier.

On Metaculus there is an “interest” score for each question and a count of the “total predictions”. These numbers vary dramatically by question. It’s not apparent exactly how to translate these numbers into how much observers of various kinds should trust different aggregates.

An example in finance may be the current Apple stock price. This is a very heavily investigated metric, with large teams of analysts working hard to forecast it. A smart, full-time, 20-person forecasting team could spend 2 years trying to forecast Apple’s stock price 1 year out, and wind up unable to outperform the existing stock price. If one spent $1 trillion setting up an effective institution to predict Apple’s stock price better than the market, success seems quite possible, though of course not profitable.

Quality here means something roughly like “the challenge to outperform”, which is different from calibration. It’s quite possible to be well calibrated yet provide an estimate that’s easy to outperform. A naive example would be an estimate of “50%” for every question in a long list of randomly chosen binary policy questions.

Starting with a simple score

In a very simple model, we can imagine that forecast quality is solely a function of how many forecaster-hours went into a given investigation.

Score_simple = number of forecaster hours

This is probably approximately proportional to Metaculus’s interest and number of forecasts, which probably correlate well with hours spent forecasting.

Perhaps we can assume that each prediction on Metaculus corresponds to 20 minutes of investigation. Then a Metaculus question with 300 predictions would have a score of “100 research hours”. It would be expected that as an outsider, if you wanted to beat this prediction, you would need to spend 100 research hours to do it.

Say one question is asked both on Metaculus and PredictIt. On Metaculus, 40 people spend a total of 80 hours on that question over the course of a year. On PredictIt, 80 people spend a total of 50 hours on the question over the last 3 months. According to Score_simple, score(Metaculus) > score(PredictIt).
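The arithmetic in the two examples above can be written out as a minimal sketch. The function name and the 20-minutes-per-prediction constant are illustrative assumptions taken from the guesses in the text, not measured values.

```python
# Illustrative guess from the text: each Metaculus prediction ~ 20 minutes of work.
MINUTES_PER_PREDICTION = 20


def score_simple_from_predictions(num_predictions, minutes_per_prediction=MINUTES_PER_PREDICTION):
    """Estimate Score_simple (forecaster-hours) from a raw prediction count."""
    return num_predictions * minutes_per_prediction / 60


# A question with 300 predictions at 20 minutes each.
hours_300 = score_simple_from_predictions(300)
print(hours_300)  # 100.0

# Cross-platform example: total hours are given directly in the text.
scores = {"Metaculus": 80.0, "PredictIt": 50.0}
harder_to_beat = max(scores, key=scores.get)
print(harder_to_beat)  # Metaculus
```

Note that Score_simple ignores both the number of people involved and how recently the hours were spent, which is exactly what the complications below address.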

Complications

Obviously the previous score is not quite right. It’s incredibly simple. Let’s point out problems, then try to help get around them.

Problem 1: New information may have emerged

It could be the case that the 100 research hours spent on a Metaculus question happened 2 years ago, and significant new evidence has come out since. The previous forecast should still remain calibrated, but maybe now it could be outperformed by a 2 research hour forecast.

Potential Solution: Timestamps

Prediction scores could have timestamps of when they were made and checked. It would be up to the readers to determine how much things have changed and what they should consider the current score to be.

Potential Solution: Continuous decay

Prediction scores could be represented as functions that decrease over time. Maybe these are hooked up to algorithms that take in feeds of possibly relevant data and can trigger reductions when necessary. For instance, a prediction of the global population in 2040 could be stable for many months, but a news spike around the term “global pandemic” could raise an alarm, dramatically decreasing the score. Humans could also review these from time to time and change the scores manually.
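One way to picture such a decaying score is the sketch below. It assumes an exponential decay with a fixed half-life plus multiplicative "shocks" applied when an alarm fires or a human reviewer intervenes. The decay form, the one-year half-life, and the shock sizes are all hypothetical choices for illustration; the draft does not specify any of them.

```python
from datetime import date
from typing import Sequence


def decayed_score(initial_hours: float, made_on: date, today: date,
                  half_life_days: float = 365.0,
                  shock_multipliers: Sequence[float] = ()) -> float:
    """Score after smooth time decay, then any discrete news/manual shocks.

    half_life_days and shock_multipliers are assumed parameters, not from the draft.
    """
    age_days = max((today - made_on).days, 0)
    score = initial_hours * 0.5 ** (age_days / half_life_days)
    for multiplier in shock_multipliers:  # e.g. 0.08 after a major news alarm
        score *= multiplier
    return score


# 100 research hours invested two years ago, no major news since.
quiet = decayed_score(100, date(2021, 1, 1), date(2023, 1, 1))

# Same forecast, but an alarm (say, a "global pandemic" news spike) cut the score sharply.
shocked = decayed_score(100, date(2021, 1, 1), date(2023, 1, 1),
                        shock_multipliers=[0.08])
print(quiet, shocked)
```

Under these made-up parameters the shocked score lands near the "2 research hour" figure used in Problem 1, but the point is only the shape: slow decay by default, sharp drops on relevant news.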

Problem 2: Forecasters vary in quality

It could be that one Metaculus question has 1000 forecasts, but they all come from very new and inexperienced forecasters. There’s quite a bit of evidence that Superforecasters are not just more calibrated but also achieve better resolution than worse forecasters.

Potential Solution: Quality Adjustment

“Quality adjustment” is used in “Quality Adjusted Life Years” to help cross compare various quality and duration of life adjustments. Here it can be used to compare forecasting setups. Perhaps it’s expected that 500 inexperienced forecasters working for 1 hour each would achieve the same amount of accuracy as an experienced team of 5 forecasters, working for 5 hours each.

Each forecasting team is different. Even the same team will vary from question set to question set, as they may become more experienced over time. It probably would be a bit much to try to quantitatively compare every variation of these teams, but one could do this with a much coarser granularity. For example, one method could involve having different weights for different forecasting platforms, and leaving it at that.
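A coarse version of this adjustment might look like the sketch below. The weights are hypothetical: the 20x figure for experienced forecasters is simply what the illustrative equivalence above implies (500 inexperienced hours matching 25 experienced hours), and in practice the groups could just as well be whole platforms.

```python
from typing import Mapping

# Hypothetical weights, in units of "inexperienced forecaster-hours".
# 500 inexperienced hours ~ 25 experienced hours implies a weight of 500 / 25 = 20.
QUALITY_WEIGHTS = {"inexperienced": 1.0, "experienced": 20.0}


def quality_adjusted_hours(hours_by_group: Mapping[str, float],
                           weights: Mapping[str, float] = QUALITY_WEIGHTS) -> float:
    """Sum raw hours per group, each scaled by that group's quality weight."""
    return sum(weights[group] * hours for group, hours in hours_by_group.items())


crowd = quality_adjusted_hours({"inexperienced": 500 * 1})  # 500 people, 1 hour each
team = quality_adjusted_hours({"experienced": 5 * 5})       # 5 people, 5 hours each
print(crowd, team)  # 500.0 500.0
```

A linear weight like this ignores diminishing returns and team structure, which is part of what Problem 3 raises.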

Problem 3: Different arrangements of forecasters would produce outputs of differing quality

The wisdom of crowds is well-researched. It could well be the case that 20 great forecasters working for 2 hours each are expected to achieve significantly higher resolution than 1 great forecaster working for 40 hours.

It’s also recognized that putting forecasters into collaborative teams improves accuracy, over situations where they are all working individually. Collaboration matters and should help total accuracy. In particular, collaboration might reduce redundancy and allow for specialization: in particularly adversarial prediction market setups, each participant might have to research all pieces of information by themselves.

Potential Solution: Quality Adjustment

This could broadly be solved using similar methods as in problem 2.

Problem 4: Available information needs to be accounted for

Imagine that a question on Metaculus only has 20 predictions, but is exactly the same as a heavily traded question on PredictIt. The Metaculus forecasters seem to be simply tracking the PredictIt score.

This brings up a tricky clarification. A reader would have to spend 1 hour to replicate the Metaculus forecast’s accuracy if they had access to the PredictIt score, but 5000 hours if they couldn’t see the PredictIt score.
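Put differently, the effort needed to match a forecast depends on what the reader can already see. A minimal sketch of that dependence follows, using the 1-hour and 5000-hour figures from the example; the function and parameter names are hypothetical.

```python
def hours_to_replicate(underlying_hours: float, reader_sees_source: bool,
                       copy_hours: float = 1.0) -> float:
    """Hours a reader needs to match a forecast that tracks another source.

    underlying_hours: effort embedded in the source being tracked (assumed known).
    copy_hours: effort to find and copy the source, if it is visible (assumed).
    """
    if reader_sees_source:
        # Copying is never more work than redoing the research from scratch.
        return min(copy_hours, underlying_hours)
    return underlying_hours


with_access = hours_to_replicate(5000, reader_sees_source=True)
without_access = hours_to_replicate(5000, reader_sees_source=False)
print(with_access, without_access)  # 1.0 5000
```

So a single score per forecast is ambiguous unless it also states which outside information the reader is assumed to have.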

Problem 5: Problems differ in difficulty profiles

Problem 6: Difficulty to outperform is not the only measure of value

Applying Prediction Star Systems In Practice

In practice, some current platforms apply some of these ideas some of the time. Metaculus displays an interest score and the number of forecasts, but this isn’t adjusted for forecaster quality. PredictIt and Polymarket display trade volume and make the number of trades available. Various Good Judgment dashboards display neither the number of superforecasters nor the amount of time they spent, but presumably these remain relatively standardized from question to question.

Ideally, a third party, such as Metaforecast, could come in and determine a common unit (such as stars), and make these comparable. In practice this is difficult, because the solutions proposed above are difficult to automate efficiently. So far, we have instead resorted to asking people familiar with multiple platforms to directly give their quality assessment.

Comments from Nuño Sempere

Restored with permission (Nuño’s comments, with Ozzie’s replies).

Nuño Sempere: I’d go with “Prediction Star Systems”

Nuño Sempere: Random thought: In practice, I notice that something which would highly correlate with the forecast rating would be the amount I’m willing to bet. I’d be willing to bet $1 : $5 on a prediction I spent 10 mins coming up with, and say $1000 : $5000 on a prediction I’d spent a year researching, even though the implied odds are the same.