Considerations for Time-Scoring

A September 2019 sketch laying out considerations for scoring forecasts on questions that span time (“time-scoring”). It is organized as lists rather than prose: a taxonomy of scenarios (iterated vs. single-shot, group size, timescales, question quantity, user skill), a set of about a dozen properties a good scoring rule should satisfy (e.g. not incentivizing waiting until the end, not rewarding duplicate forecasts, not letting users copy a respected forecaster for free points), and a short list of assumptions and candidate models. One noted property is an anti-gaming concern: because influencing the aggregate reduces one’s own scoring opportunities, skilled forecasters may be incentivized to conceal their skill or use multiple accounts. The document is brief and consists almost entirely of enumerated requirements with little elaboration; it ends abruptly on two modeling options.

“Time-scoring” is shorthand for the question of how to score forecasts on questions that stay open for some time. This corresponds to the “Agent-Question” scoring layer.

First, I want to lay out a few different clusters of situations where people/researchers may care about scores. I think that there are many kinds of scenarios that are likely to work best with different scoring systems. Identifying the main distinctions that require different scoring systems seems useful.

Scenarios:

1. Iterated vs. single-shot

Iterated scenarios are those where users will make updates to their forecasts based on either new information, new forecasts, or more thought. In these situations, users will be expected to gain information over time.

Single-shot scenarios are those where each user makes a single prediction, with or without discussion among users, and without seeing anyone else’s prediction.

2. 1-person, small group, very large group.

1-person scenarios are those where only one person makes forecasts. In these cases, there may be no market for them to compare against (unless we have an AI that’s decent).

Very large groups should have some amount of competition on all questions at roughly all times.

Small groups are in-between and can be hard to predict. There can be some questions where only 1 person makes predictions, and others where lots of people make predictions. Markets may be relatively inefficient for long periods of time.

3. Multiple timescales: 1min, 3 hours, 1 year, 30 years

4. Things we may be able to trust:

The market will be relatively efficient.
The question writers will not be corrupted to favor specific participants.
Goodwill; players generally want the project to go well.

5. Multiple question quantities:

1-5 questions (Key important indicators)

5-20 questions (A cluster of important questions)

20-400 questions, scattered

6. User skill:

Amateurs vs. experienced users.
Many “good” users vs. a few “expert” users.

Cases to handle well:

People shouldn’t be incentivized to wait until the end to predict.
People shouldn’t be required to keep on forecasting (with the same exact value), once they make one forecast.
People shouldn’t be incentivized to make many of the same, or near-the-same forecasts.
People shouldn’t be disincentivized from contributing useful forecasts, from their expected perspective.
People should be incentivized to share information that would be useful for others.
Making a contribution which strictly improves the aggregate at the point it was made should not give negative points (unless at a later time it makes the aggregate worse).
Making generally useful predictions should generally result in positive scores.
The score should be combinable with other factors (such as when clients will read the predictions).
If 2-4 people provide a lot of value by predicting and no one else predicts, their combined score should be positive.
Users shouldn’t be incentivized to track the aggregate too much.

Actually, there could be clever ways of doing this where it’s not very computationally expensive.

Users shouldn’t be able to copy the inputs of a respected user and get free points.
If the aggregate exactly matches one really good player, that player should still get positive expected value (EV).
Users should be incentivized to become more trusted by the system.

For example, currently, the better your past performance, the more you influence the aggregate. But the more you influence the aggregate, the closer the aggregate will be to your predictions, and as a result you will get fewer points. This creates an incentive to hide how good you actually are, e.g. by using different accounts.
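The incentive problem in this example can be illustrated with a small sketch. It rests on assumptions that are not in the notes above: a linear weighted aggregate, and points awarded as the log score of the user minus the log score of the aggregate. The names aggregate, relative_log_score, and points_by_weight are hypothetical and chosen only for illustration.

```python
import math

def aggregate(user_p, others_p, weight):
    """Assumed linear aggregate: the user gets `weight`, everyone else the rest."""
    return weight * user_p + (1.0 - weight) * others_p

def relative_log_score(user_p, agg_p, outcome):
    """Assumed points rule: user's log score minus the aggregate's log score."""
    p_user = user_p if outcome else 1.0 - user_p
    p_agg = agg_p if outcome else 1.0 - agg_p
    return math.log(p_user) - math.log(p_agg)

# A skilled user says 90%, the rest of the crowd says 60%, and the event happens.
weights = [0.0, 0.25, 0.5, 0.75, 1.0]
points_by_weight = {
    w: relative_log_score(0.9, aggregate(0.9, 0.6, w), True) for w in weights
}

for w, pts in points_by_weight.items():
    print(f"weight in aggregate = {w:.2f} -> points = {pts:.3f}")
```

Under these assumptions the same correct forecast earns fewer points as the user's weight grows, and nothing at all once the aggregate equals the user's forecast. A second account with zero weight would earn the full amount, which is the gaming concern described above.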

Assumptions:

Certainty will increase over time.

Models:

Forecast every x hours
Continuous forecast, then take the integral/average
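
A minimal sketch of both models, under assumptions the notes do not specify: binary questions, a logarithmic score, a forecast that stays in force until the user replaces it, and a default of 50% for any time before the user's first forecast. All function names and the default parameter are hypothetical.

```python
import math

def log_score(p, outcome):
    """Logarithmic score for a binary outcome (higher is better)."""
    return math.log(p if outcome else 1.0 - p)

def standing_forecast(forecasts, t, default=0.5):
    """Most recent forecast made at or before time t, else the assumed default."""
    current = default
    for made_at, p in sorted(forecasts):
        if made_at <= t:
            current = p
        else:
            break
    return current

def sampled_score(forecasts, open_time, close_time, outcome, every, default=0.5):
    """Model 1: score the standing forecast every `every` hours, then average."""
    n = int((close_time - open_time) / every)
    times = [open_time + k * every for k in range(n)]
    scores = [log_score(standing_forecast(forecasts, t, default), outcome)
              for t in times]
    return sum(scores) / len(scores)

def integrated_score(forecasts, open_time, close_time, outcome, default=0.5):
    """Model 2: time-weighted average score of the piecewise-constant forecast."""
    points = [(open_time, default)] + sorted(forecasts)
    total = 0.0
    for i, (t, p) in enumerate(points):
        t_next = points[i + 1][0] if i + 1 < len(points) else close_time
        total += log_score(p, outcome) * (t_next - t)
    return total / (close_time - open_time)

# Times are hours since the question opened; the question closes at hour 24.
base = [(0, 0.6), (12, 0.8)]
repeated = [(0, 0.6), (6, 0.6), (12, 0.8)]  # re-enters the same value at hour 6
early = [(0, 0.8)]
late = [(23, 0.99)]                          # waits until the last hour

print(integrated_score(base, 0, 24, True))
print(integrated_score(repeated, 0, 24, True))
print(sampled_score(base, 0, 24, True, every=1))
print(integrated_score(early, 0, 24, True), integrated_score(late, 0, 24, True))
```

With these choices, repeating an unchanged forecast does not alter the score, a user does not need to keep re-forecasting, and a confident last-hour forecast does worse than a reasonable early one because the waiting time is scored at the default. Whether a 50% default is the right penalty for waiting is a design choice left open here.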