“Forecasting Force”: Early Ideation

A 2020 proposal for an “Experimental Forecasting Task Force” (EFTF): a dedicated team of 3–9 full-time-equivalents doing end-to-end judgmental forecasting on a rapid, sprint-based turnaround, proposed as QURI’s primary focus for the following few years. The draft argues that existing actors (Good Judgment Project, Metaculus, GJP Open) work on slow timescales or rely on unreliable volunteers, and that question development is itself under-developed. It applies an Agile-vs-Waterfall analogy, sketches two team compositions with budget ranges, discusses cohesion vs. specialization, and gives a fictional worked example involving a nuclear-risk research group. It closes with comparisons to engineering consultancies, EA research teams, superforecasting teams, and open prediction tournaments. The piece is fairly complete as a pitch but remains an internal ideation document; it assumes human forecaster labor throughout.

Experimental Forecasting Task Force (EFTF): Early Ideation

Summary:

The EFTF is an idea for a dedicated group of 3-9 full-time equivalents that performs end-to-end forecasting with a rapid turnaround time. It would work on a range of topics, prioritizing Effective Altruist concerns and globally important decisions.

This is proposed as the primary focus of the Quantified Uncertainty Research Institute for the next 2-4 years.

Description of need

There’s currently a lot of excitement around judgmental forecasting but relatively little implementation. It’s not completely clear why this is, but it seems like there is a lot of innovation yet to occur. There are only a few actors in the space, and these operate on fairly long timescales. The Good Judgment Project runs experiments under year-long, multi-million-dollar contracts. Good Judgment Open and Metaculus both rely on volunteer forecasters, who often can’t be counted on for rapid, dependable turnaround (on the scale of hours or days).

I believe that at this stage it is not only group forecasting methods that need improvement, but also the process of figuring out which questions to ask. Question development can be quite tricky and can require significant iteration. Questions need to be clear, useful, actually interesting to clients, and cost-effective for forecasters to work on.

This thinking very much follows that of Agile software development. Rather than aiming for a Waterfall-style forecasting competition lasting 4-20 months, the EFTF would work in sprints of 1 to 4 weeks. Each sprint would include a full cycle of questions being written, forecasted, delivered to clients, and discussed to make sure they are useful and decision-relevant.

The goal would be to learn how forecasting can be cost-effective by rapidly iterating with different setups and on different kinds of questions.

Possible Team Composition

Note that full-time means 40-hour weeks. Team members could be either employees or contractors.

Small Team (~$200K-$500K/yr)

1 general manager
2 full-time forecaster equivalents
1 part-time engineer
A network of advisors in various domains
1-2 virtual assistants with different skill sets

Medium Sized Team (~$400K-$900K/yr)

1 product manager
1 client lead (identifies clients, communicates with clients, writes questions, delivers results)
2-3 full-time judgmental forecasters
2-20 part-time forecasting contractors with various specialties, on an as-needed basis
1 full-time interface/scientific engineer
1 part-time data engineer
1 part-time data scientist
A network of advisors in various domains
1-5 virtual assistants with different skill sets

Cohesion vs. Specialization

Some forecasting topics may require a fair bit of background knowledge. Traditional research groups achieve this by specialization. A team of 10 researchers may focus on 10 different areas and barely need to communicate with each other. This is good if they each have all the skills needed for their work, but in our case things are more complicated: different members will focus on question writing, research, forecasting, data science, and engineering, so a single engagement needs several of them working together.

In our situation the team will represent a diverse set of domain-agnostic skill sets and would be interested in experimenting across a wide range of fields. Fortunately, research and practice suggest that GJP superforecasting teams have done very well in many fields outside of their focus areas, so this approach would be in line with that work. However, over longer timespans the forecasters may specialize into sub-teams, rather than everyone working on the same topic during each sprint.

Engineering & Long-Term Work

It’s likely that much of the engineering work will operate on a different schedule than the forecasting work. For example, some forecasting projects may require no engineering work, while others may require many weeks’ worth.

Example engagement

The (fictional) nuclear safety research team Researchers On Nuclear Risk (RONR) recently completed an analysis of Pakistan-India relations. In it, they established an index for measuring nuclear risk between the two countries. They provide an estimate of the likelihood of nuclear attacks at various levels of this risk index; for example, a score of 50 means there is a 0.001% chance per year of a nuclear attack.
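
To get a feel for what a per-year figure like this implies over longer horizons, here is a minimal sketch in Python. It assumes each year is independent and carries the same probability, which is a simplification added here and not something the (fictional) RONR analysis states; the function name is made up for illustration.

```python
def cumulative_probability(annual_probability: float, years: int) -> float:
    """Chance of at least one event over `years` years.

    Assumption: years are independent and share the same annual probability.
    """
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return 1.0 - (1.0 - annual_probability) ** years


# 0.001% per year, the example figure for an index score of 50.
annual = 0.001 / 100
for horizon in (1, 10, 50):
    print(f"{horizon:>2} years: {cumulative_probability(annual, horizon):.6%}")
```

At probabilities this small the result is close to the annual figure multiplied by the number of years; the gap widens as the annual probability rises.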

RONR works with the EFTF client lead to turn these estimates into well-specified questions. The EFTF forecasters then start work to verify the estimates: first the estimates are made very specific, then the forecasters give their own takes after consideration. The EFTF forecasters determine that the researchers seem significantly pessimistic about some elements of the index, but otherwise seem reasonably calibrated. These disagreements lead to the realization that “nuclear attack” was understood differently by different individuals in both organizations, so the definition is clarified and more predictions are made. The results are made public.
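
The comparison step can be illustrated with a small sketch. The draft does not say how the forecasters’ individual takes would be combined, so the geometric mean of odds used here is an assumption (it is one common pooling method, not necessarily what the EFTF would use), and all the numbers are invented for illustration.

```python
import math
from typing import Sequence


def pool_geometric_mean_of_odds(probabilities: Sequence[float]) -> float:
    """Pool probabilities by averaging their log-odds (geometric mean of odds)."""
    if not probabilities:
        raise ValueError("need at least one probability")
    log_odds = []
    for p in probabilities:
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must be strictly between 0 and 1")
        log_odds.append(math.log(p / (1.0 - p)))
    mean_log_odds = sum(log_odds) / len(log_odds)
    # Convert the mean log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-mean_log_odds))


# Hypothetical inputs: one researcher estimate and three forecaster takes
# on the same precisely worded question.
researcher_estimate = 0.10
forecaster_takes = [0.03, 0.05, 0.04]

pooled = pool_geometric_mean_of_odds(forecaster_takes)
ratio = researcher_estimate / pooled
print(f"pooled forecast: {pooled:.3f}; researcher estimate is {ratio:.1f}x the pooled value")
```

A large ratio on a given question is the kind of signal that would prompt a closer look at how the question was defined, as happens with “nuclear attack” in this example.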

The next step is to put together a plan for continuous judgmental forecast updates. It’s decided that these forecasts will be updated on a yearly schedule going forward. The EFTF commits to spending at least 2 weeks working with this nuclear safety group and others to update these forecasts on this schedule.

The EFTF engineering group starts related work a few weeks after the judgmental engagement. The main elements of the nuclear safety index are based on publicly accessible data. The process of organizing and tracking them can be automated, but this requires engineering capacity that the research group does not have. The EFTF engineers handle this work. The EFTF also helps build an automated model that combines this data with the judgmental forecasts to produce automatically updated risk index forecasts.

Because both the primary judgmental forecasts and the automated index forecast are publicly accessible in a common format, they are used in other models by EFTF and other organizations that determine total global nuclear risk, which are in turn used by other models that determine total global existential risk. This is done using the forecast APIs, meaning that they will continue to be updated as new predictions are made.
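
As a rough illustration of how one published forecast could feed a broader model, the sketch below combines several component risks into a total. It treats the components as independent, which is a strong simplifying assumption and not something the draft specifies. The component names and numbers are hypothetical, and a real setup would pull current values from the forecast APIs instead of hard-coding them.

```python
from math import prod
from typing import Mapping


def combined_risk(component_risks: Mapping[str, float]) -> float:
    """Probability that at least one component event occurs.

    Assumption: components are independent of one another.
    """
    for name, p in component_risks.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name}: probability must be between 0 and 1")
    return 1.0 - prod(1.0 - p for p in component_risks.values())


# Hypothetical annual probabilities; in practice these would be refreshed
# from upstream forecasts whenever new predictions are made.
nuclear_components = {
    "india_pakistan": 0.00001,
    "other_regions": 0.00004,
}
total_nuclear = combined_risk(nuclear_components)

# Toy chaining: the nuclear total becomes one input to a broader model.
# A real model would also need the chance that a nuclear attack leads to
# an existential outcome, which is omitted here.
broader_components = {
    "nuclear": total_nuclear,
    "other_sources": 0.0002,
}
print(f"total nuclear: {total_nuclear:.6%}")
print(f"broader total: {combined_risk(broader_components):.6%}")
```

Because the downstream total is recomputed from its inputs, any update to an upstream forecast flows through automatically the next time the model runs.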

Comparisons to other services

Engineering Consultancies

Engineering consultancies often have a wide range of specialties on diverse teams: for example, product designers, other designers, product managers, frontend engineers, backend engineers, and tech specialists. They can be most effective when many of these skill sets are important for a project but the client lacks them. They can do intense work to set up new systems, or provide ongoing effort to maintain systems. Often a team of 4-8 will work with 5-20 clients over the course of a year.

In our case things are similar. Combinations of skill sets are important for forecasting setups to be useful, and it doesn’t yet make sense for many other organizations to try to set up full internal forecasting teams.

Similar to these engineering engagements, forecasting engagements would likely entail upfront periods of setup followed by ongoing maintenance. The upfront setup would involve question definitions and initial predictions. The “maintenance” would involve keeping the predictions and relevant data up to date at regular intervals.

Research Teams

EFTF work would look very different from the vast majority of academic research currently being done. The specific part of “clearly defining forecasting questions and then ensuring that a team continues to forecast them” is typically a very small part of the research process. Traditional research papers can be great for coming up with new insights, understanding ways of looking at questions, and summarizing existing work, none of which the EFTF would particularly focus on. Hopefully research teams could be great collaborators for EFTF efforts. Researchers could act both as clients and as advisors.

Superforecasting Teams

The Good Judgment Project charges for the services of small teams of superforecasters. They seem to work in engagements that last several months at a time, focusing on 5-15 questions over that period. From what I can tell, this service is a rather small part of the Good Judgment Project’s client portfolio. Most of their existing contracts are under nondisclosure agreements, and in general they don’t share much research or many learnings about how these groups best operate. These teams also typically consist mainly of superforecasters, rather than being more diverse teams with engineers, data management, and research assistants. The EFTF would be more experimental in technique, follow a compressed workflow, and release its research and process-improvement findings to the public.

Open Prediction Tournaments

There are several examples of large prediction tournaments, for instance Good Judgment Open and Metaculus. These can be cost-effective ways of getting many forecasters on well-defined questions. This seems to be the best way of achieving scale and of relying on inexpensive volunteers, but it can be challenging in other ways:

The turnaround can take a significant amount of time.
It can be difficult to ensure that the necessary questions are properly forecasted on specified schedules, especially when there are many questions.
These systems don’t work well with steep learning curves, so they need simple interfaces that exclude some possible functionality.
The questions, predictions, and aggregations need to be public, or at least available to many people.

I think that open platforms may be a large part of the total solution. It’s possible that use of Foretold can still be encouraged for larger groups.