
Considerations for Time-Scoring

“Time-scoring” is shorthand for the question of how to score forecasts on questions that stay open for some time. This corresponds to the “Agent-Question” scoring layer.

First, I want to lay out a few different clusters of situations where people/researchers may care about scores. I think that there are many kinds of scenarios that are likely to work best with different scoring systems. Identifying the main distinctions that require different scoring systems seems useful.

1. Iterated vs. single-shot

Iterated scenarios are those where users will make updates to their forecasts based on either new information, new forecasts, or more thought. In these situations, users will be expected to gain information over time.

Single-shot scenarios are those where each user makes a single prediction, with or without discussion with others. Users make these predictions without seeing each other’s predictions.

2. 1-person, small group, very large group.

1-person scenarios are those where only one person makes forecasts. In these cases, there may be no market for them to compare against (unless we have an AI that’s decent).

Very large groups should have some amount of competition on all questions at roughly all times.

Small groups are in-between and can be hard to predict. There can be some questions where only 1 person makes predictions, and others where lots of people make predictions. Markets may be relatively inefficient for long periods of time.

3. Multiple timescales: 1min, 3 hours, 1 year, 30 years

4. Things we may be able to trust:

  • The market will be relatively efficient.
  • The question writers will not be corrupted to favor specific participants.
  • Goodwill; players generally want the project to go well.

5. Multiple question quantities:

1-5 questions (Key important indicators)

5-20 questions (A cluster of important questions)

20-400 questions, scattered

6. User skill:
  • Amateurs vs. experienced users.
  • Many “good” users vs. a few “expert” users.
  1. People shouldn’t be incentivized to wait until the end to predict.

  2. People shouldn’t be required to keep on forecasting (with the same exact value), once they make one forecast.

  3. People shouldn’t be incentivized to make many of the same, or near-the-same forecasts.

  4. People shouldn’t be disincentivized from contributing useful forecasts, from their expected perspective.

  5. People should be incentivized to share information that would be useful for others.

  6. Making a contribution that strictly improves the aggregate at the time it is made should not yield negative points (unless it later makes the aggregate worse).

  7. Making generally useful predictions should generally result in positive scores.

  8. The score should be combinable with other factors (such as when clients will read the predictions).

  9. If 2-4 people provide a lot of value by predicting, and no one else predicts, their combined score should be positive.

  10. Users shouldn’t be incentivized to track the aggregate too much.

  • Actually, there could be clever ways of doing this where it’s not very computationally expensive.
  11. Users shouldn’t be able to copy the inputs of a respected user and get free points.

  12. If the aggregate exactly matches one really good player, that player should still get EV.

  13. Users should be incentivized to become more trusted by the system.

  • For example, currently, the better your past performance, the more you influence the aggregate. But the more you influence the aggregate, the closer your predictions will be to the aggregate, and as a result you will get fewer points. This creates an incentive to hide how good you actually are, e.g., by using different accounts.
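The incentive problem in the last example can be made concrete with a small sketch. Everything in it is an assumption made for illustration, not a description of any actual system: the aggregate is taken to be a weighted arithmetic mean of probabilities, points are the user's log score minus the aggregate's log score, and the names `aggregate`, `points_vs_aggregate`, and `expert_points` are hypothetical.

```python
import math

def aggregate(probs, weights):
    """Weighted arithmetic mean of probabilities (an assumed aggregation rule)."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

def points_vs_aggregate(p_user, p_agg, outcome):
    """Log score of the user minus log score of the aggregate (assumed point rule)."""
    def log_score(p):
        return math.log(p if outcome else 1.0 - p)
    return log_score(p_user) - log_score(p_agg)

def expert_points(expert_weight, expert_p=0.9, crowd_p=0.6, outcome=True):
    """Points earned by one skilled user whose forecast is part of the aggregate.

    The rest of the crowd is modeled as a single forecast with weight 1.
    """
    p_agg = aggregate([expert_p, crowd_p], [expert_weight, 1.0])
    return points_vs_aggregate(expert_p, p_agg, outcome)

# The same correct forecast earns fewer points as the system trusts the user more.
for w in (1.0, 3.0, 9.0):
    print(w, round(expert_points(w), 4))
```

Under these assumptions, the skilled user's points shrink toward zero as their weight grows, even though the forecast itself has not changed. That is the pressure toward hiding skill that the example describes.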

Certainty will increase over time.

Models:

  • Forecast every x hours

  • Continuous forecast, then take the integral/average
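The two models above can be compared with a rough sketch. It assumes a binary question, a log scoring rule, and that each forecast stands until the user's next one; the function names and the `(time, probability)` input format are hypothetical choices for illustration.

```python
import math

def log_score(p, outcome):
    """Log score (natural log) of probability p for a binary outcome."""
    return math.log(p if outcome else 1.0 - p)

def time_averaged_score(forecasts, t_open, t_close, outcome):
    """'Continuous forecast' model: integrate the score over time, then average.

    forecasts is a list of (time, probability) pairs. Each forecast is held
    until the next one. Time before the first forecast is skipped here.
    """
    forecasts = sorted(forecasts)
    total = 0.0
    covered = 0.0
    for i, (t, p) in enumerate(forecasts):
        t_next = forecasts[i + 1][0] if i + 1 < len(forecasts) else t_close
        start, end = max(t, t_open), min(t_next, t_close)
        if end > start:
            total += (end - start) * log_score(p, outcome)
            covered += end - start
    return total / covered if covered > 0 else None

def sampled_score(forecasts, t_open, t_close, outcome, every):
    """'Forecast every x' model: score the standing forecast at fixed intervals."""
    forecasts = sorted(forecasts)
    scores = []
    t = t_open
    while t < t_close:
        standing = [p for (ft, p) in forecasts if ft <= t]
        if standing:
            scores.append(log_score(standing[-1], outcome))
        t += every
    return sum(scores) / len(scores) if scores else None

# One user updates from 50% to 90% halfway through a question that resolves true.
history = [(0, 0.5), (50, 0.9)]
print(time_averaged_score(history, 0, 100, True))
print(sampled_score(history, 0, 100, True, every=25))
print(sampled_score(history, 0, 100, True, every=40))
```

When the sampling interval lines up with the update times, the two models agree; otherwise the sampled version depends on where the sample points happen to fall. Note that this sketch simply skips the time before a user's first forecast, which would reward waiting until the end to predict. A real rule would need some treatment of that uncovered time.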