Ozzie-Forecasting High-Level Thoughts

A long working document (October 2019) collecting the author’s models about forecasting and its use for EA, written deliberately as informal notes with few citations and one embedded image. Part 1 lays out high-level patterns: shaping problems to fit strong tools (the “law of the hammer”), a distinction between “generalist” and “specialist” research, a systematic-vs-nonsystematic reasoning gradient, and a research-vs-development / horizontal-vs-vertical framing. Part 2 contrasts human (judgmental, general, weak) with AI (statistical, narrow, strong) forecasting, argues forecasting is roughly “AGI-complete,” and claims important forecasting questions resemble financially-traded ones. Part 3 addresses misconceptions about predictability and the Good Judgement Project. The draft is unfinished: one section is marked “[Incomplete],” and it trails off into a list of unelaborated claims about forecasts versus estimates.

Ozzie’s Forecasting High-Level Thoughts

Useful work:

Foretold: foretold.io/login

Foretold inputs: https://observablehq.com/@oagr/foretold-inputs

Introduction

I’ve spent a fair bit of time in the last few years (especially the last one) trying to make sense of the forecasting space and how it can be best applied for EA purposes. I’m currently working on an application to help with forecasting.

At this point I have a bunch of models and opinions on the topic. There’s no one linear thread; rather there’s a long list of things that fall into a few clusters.

My current plan is to spend a few more evenings on this document, then share it with several relevant parties for feedback, to later be posted to the EA forum or similar. Feedback at any stage is highly appreciated.

This document uses lots of simple examples and does not have many citations. It is not meant to be as formal as a proper book or paper, but rather to be a time-effective way for me to share a lot of concepts. My primary project at the moment is building and advancing Foretold, which I believe is generally a more effective use of my time.

Previous Material

My LessWrong series on “Prediction-Driven Collaborative Reasoning Systems” has a few posts on my thoughts on how advanced predictions could work.

I had a double-crux with Vaniver in July 2019 about related issues, a transcript of which was posted to LessWrong.

Part 1: High-level Ideas & Patterns

Pattern: “If you have a hammer…” / Strategic Advantages

From Wikipedia: As Abraham Maslow said in 1966, “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.”

The “Law of the hammer” is generally considered to describe a cognitive bias, but it can also be adapted by responsible individuals into a useful pattern. There are cases where you may find a few uniquely great tools; in these cases, it could make a lot of sense to pay a lot of attention to ways you can modify all types of problems into versions that fit your tools.

Example: AI Pathfinding

Humans normally do pathfinding intuitively. We can quickly imagine what a good path for an entity would be from a glance at a diagram. AIs (especially pre-NN) generally don’t have this ability. However, they are very good at simple calculation. This has been used to solve pathfinding problems in ways that never would have come up as possibilities to humans. Instead of intuiting good answers, AIs could first make a simple list of every possible option, and then try simulating each one.

figure from Ozzie-Forecasting High-Level Thoughts
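As a rough illustration of the “list every option, then simulate each one” approach, here is a minimal sketch. The grid, the helper names, and the choice of “fewest cells” as the score are all assumptions made for this example; real pathfinding systems use far more efficient search.

```python
from typing import Iterator

Cell = tuple[int, int]

# Hypothetical 4x4 map: 'S' start, 'G' goal, '.' open floor, '#' wall.
GRID = [
    "S.#.",
    ".##.",
    "....",
    ".#.G",
]


def find(symbol: str) -> Cell:
    """Return the (row, column) of the first cell containing symbol."""
    for r, row in enumerate(GRID):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not found in grid")


def neighbors(cell: Cell) -> Iterator[Cell]:
    """Yield the in-bounds, non-wall cells directly adjacent to cell."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
            yield (nr, nc)


def all_paths(current: Cell, goal: Cell, visited: frozenset) -> Iterator[list[Cell]]:
    """Enumerate every path to goal that never revisits a cell."""
    if current == goal:
        yield [current]
        return
    for nxt in neighbors(current):
        if nxt not in visited:
            for rest in all_paths(nxt, goal, visited | {nxt}):
                yield [current] + rest


start, goal = find("S"), find("G")
# Step 1: make a plain list of every possible option.
candidates = list(all_paths(start, goal, frozenset({start})))
# Step 2: score each option and keep the best (here: fewest cells).
best = min(candidates, key=len)
print(len(candidates), "candidate paths; best:", best)
```

No intuition is involved: the program simply checks everything, which is only feasible because simple calculation is cheap for a computer.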

Example: Digital vs. Analog Electronics

Say you want to make an electronic controller for opening a chicken coop when the sun comes up. You could design an elegant solution with analog electronics; this would involve doing math for resistors and capacitors to make it operate as you’d expect.

Alternatively, you could write a small digital program with a microcontroller. This may seem wasteful; the microcontroller has far more complexity and circuit components in it than any reasonable analog solution. However, you know microcontrollers really well, and because so many are produced they’ve gotten quite cheap.

Rather than use “many custom setups” for different problems, you eventually realize that microcontrollers are almost always simpler overall, and over time stop using analog components almost altogether.

This is basically what has happened in industry.
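To give a sense of how small the digital version can be, here is a sketch of the decision logic in plain Python. The thresholds, the function name, and the simulated light readings are all made up for illustration; on a real microcontroller the readings would come from a light sensor and the result would drive a motor.

```python
def door_should_open(light_level: int, currently_open: bool,
                     open_above: int = 600, close_below: int = 400) -> bool:
    """Decide the door state from one light reading.

    Two thresholds (hysteresis) stop the door from flapping open and shut
    when the light level hovers around a single cut-off at dawn or dusk.
    The threshold values are arbitrary placeholders.
    """
    if light_level >= open_above:
        return True
    if light_level <= close_below:
        return False
    return currently_open  # in between: keep doing whatever we were doing


# Simulated sensor readings over one day (hypothetical units).
readings = [100, 350, 500, 650, 700, 550, 450, 390, 200]

state = False  # door starts closed
history = []
for level in readings:
    state = door_should_open(level, state)
    history.append(state)

print(history)
```

Changing the behaviour means editing a number rather than recalculating resistor and capacitor values, which is a large part of why the general-purpose tool tends to win.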

Example: Functional Programming

Functional programming techniques are very elegant and powerful for pure code, but much less so for non-pure code. Therefore, a lot of work is done to isolate non-pure functionality from pure functionality, and then handle each separately. After this is done, the pure code can often be significantly optimized.
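A minimal sketch of this separation, written in Python rather than a dedicated functional language. The function names and the word-counting task are assumptions chosen only to keep the example short.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_long_words(text: str) -> int:
    """Pure: the result depends only on the argument and nothing else.

    Because of that, caching results is always safe. This is one simple
    example of an optimization that purity makes possible.
    """
    return sum(1 for word in text.split() if len(word) > 4)


def report(path: str) -> None:
    """Non-pure shell: all file access and printing is kept out here."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            print(count_long_words(line.strip()), line.strip())


# The pure part can be exercised without touching any files.
first = count_long_words("a forecast about weather")
second = count_long_words("a forecast about weather")  # served from the cache
print(first, second, count_long_words.cache_info().hits)
```

The same caching applied to `report` would be a bug, since the file contents can change between calls. Keeping the two kinds of code apart is what makes it clear where such optimizations are allowed.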

Relevance for Forecasting:

If we get really good at forecasting and related methods, we may be able to get creative at using them for areas that we normally wouldn’t think of right now. We would also want to strongly distinguish questions where forecasting would be helpful from those where it won’t be, so we can be sure to handle the former accordingly. These are general tools and with some work we may be able to get a whole lot of our current problems to fit into their required shapes.

This is not to say that there aren’t many currently-obvious uses of forecasting, but rather that in the future there may be many more obvious uses.

“Generalist” Research

There is a spectrum in how accessible research is for nonspecialists. Here I call research that is accessible to generally smart people “generalist” research, which is different from “specialist” research. “Generalist” refers to domains (bio, physics, etc), rather than methods (statistical analysis, literature review, etc).

Example: Data Science

Data science is one example. Many “data scientists” get good at “data science” itself: a set of methods, distinct from any specific domain they might be applied to. There are large areas of work where any competent data scientist would be useful, but of course there are other areas where you would need a data scientist deeply trained in a specific topic. The former kind of work would be considered “generalist”, while the latter would be considered “specialist.”

Example: Open Philanthropy Project

Many of the main staff at the Open Philanthropy Project have done cause analyses for many different areas. They are often quite new to the areas they are investigating. Despite that, they seem to have done a competent job at doing prioritization in very different areas. The Open Philanthropy Project has developed a network of specialists to get advice from, and later did hire some domain experts for specific areas. However, it was generalist competence that led to doing these steps.

Example: Many Entrepreneurs

Many of the most successful entrepreneurs were relatively new to their future field. Their comparative advantage is generally in entrepreneurial activities, rather than the specific domains they wind up applying those to. Some examples would include the founders of Uber, Airbnb, Dropbox, Stripe, Twitch, SpaceX, and Tesla.

Claim: Generalist Research is very important and large.

Much of prioritization would fall under generalist research. I believe that people could prioritize things significantly better, which would both be directly useful and speed up the most valuable research.

Many of the main EA problems I can think of strike me as rather general; namely, the strategy and prioritization for many important fields. Right now I get the impression that there’s a ton of useful prioritization and strategy work to be done in EA areas.

Relevance for Forecasting:

I think that superforecasters and other smart forecasting communities can do an obviously good job at some important aspects of generalist research. It’s not yet as obvious how useful new forecasting tools can be for specialist research, but even if it were only useful for generalist research, that could be great if we think there is a lot of useful generalist research to be done.

Systematic vs. Nonsystematic Thinking

(Maybe “Formal” vs. “Informal” thinking?)

I want to highlight a gradient of reasoning methods.

Relatively Systematic Methods:

Formal mathematics
Formal ontologies & taxonomies
Explicit decision calculations
Tables of data and estimates

Relatively Nonsystematic Methods:

Loose group brainstorming
Most blog posts, essays, and popular books (non-textbooks)

Reflection:

Nonsystematic methods are good for small groups with high trust and limited time. However, as group size and research time expand, systematic methods provide higher scalability (or surface area).

As our research endeavors expand, I expect we should get better at, and focus more on, systematic methods.

Innovation: Research vs. Development / Horizontal vs. Vertical Efforts

Research and development are more of a spectrum than two binary options, but it’s an important spectrum.

There are times in some R&D efforts where research is the primary bottleneck, and other times where development is the primary bottleneck. Sometimes research is needed for development, but sometimes development is needed for research.

Pro-Development Example: Microprocessors

The introduction and advancement of microprocessors has led to highly significant advances in other areas. Hypothetically, it would have been possible for the original staff of Intel to have instead decided to do more research into computation. They could have written extensive reports on interesting developments that computers would allow, or performed experiments to see how useful simple computers were in different situations.

Thankfully they didn’t focus on these tasks. By spending a great deal of time and money improving microprocessors, Intel enabled many other research & development groups to make significant advances in many domains. At the time, development seems to have been a much larger bottleneck for the innovation process than more traditional research.

An alternative frame: Horizontal vs. Vertical Efforts

Horizontal efforts describe identifying new techniques and principles around forecasting. Vertical efforts describe creating an end-to-end value chain of forecasting efforts and iterating on it. Vertical efforts are similar to an “Agile” focus on doing a simple job at all parts of an effort up to user engagement, while keeping the breadth of functionality more limited.

Horizontal efforts are obviously a bottleneck if there is no viable vertical effort.

Generally, research institutions focus on horizontal efforts, but startups allocate most resources to vertical efforts.

Relevance for Forecasting:

It’s not obvious if the most useful path in forecasting work is in research or development. Similarly, it’s not obvious if the most useful path is in horizontal or vertical efforts.

I personally think that vertical efforts are quite possible, and also that if they are possible they present the largest bottleneck. The corresponding strategy would look something like making a forecasting system that provides an initially-small amount of value to EA purposes, and then spending a lot of time scaling it up.

Others close to this work disagree. Existing efforts to make predictions for EA purposes have not gone very well and may not be exciting to scale. This is a dilemma we’ll be keeping track of.

Part 2: Understanding Human vs. AI Forecasting

Judgemental Forecasting vs. Statistical Forecasting

There are several ways to divide forecasting methods. One distinction I like is to consider “statistical vs. judgemental” techniques, where “statistical” techniques include AI methods.

When Effective Altruists talk about “forecasting”, they often refer primarily to judgemental techniques. Superforecasting was mostly about judgemental methods, for instance. Yet judgemental techniques represent a small minority of the forecasting literature. One could arguably include most of data science and AI into “forecasting.”

This doesn’t mean that we should ignore judgemental methods, but rather that we really shouldn’t ignore non-judgemental methods, especially when considering the future.

I think that AI represents the most exciting advancements in data science, so will reframe this distinction to “human forecasting vs. AI forecasting”, which I believe will approximate the distinction of “statistical vs. judgemental” techniques over time.

Human forecasting is general & weak. AI forecasting is narrow & strong.

Humans can use intuitions to forecast on a very wide variety of general questions. By this I mean that humans can forecast to some degree of accuracy on almost any question they could understand. AIs typically use significant data sources to get quite good (typically better than humans, where applicable) at narrow/specific questions.

General vs. narrow forecasting closely parallels general vs. narrow intelligence in AI. Forecasting in general is probably AGI-complete, so we can expect humans to be better on at least some questions until AGI.

Therefore, for a long time, we should expect human forecasting to be important relative to AI forecasting. But in general, where we have the option to use AI instead of judgemental techniques, we should go with AI.

Human forecasting is slowly getting better. AI forecasting is quickly getting better.

The Superforecasting studies took several years and were perhaps the most notable advance in judgemental techniques in the last 10 years. The results are interesting, but still not fantastic. There are only around 125 active superforecasters, working part time at relatively high rates for around 10 clients. If others took the lessons from these studies, I could imagine human forecasters improving accuracy rates by around 10%. There aren’t many studies going on now that seem likely to improve accuracy or effectiveness further.

Meanwhile, AI development has a very large industry behind it and major advances are happening every year, which are often entering use in industry shortly after.

I’m quite sure that if superforecasters forecasted the economic efficiency (dollars generated via predictions per dollar spent) of AI forecasting vs. judgemental forecasting per year, AI forecasting would be forecasted to do dramatically better.

AI forecasting & human forecasting work well together in financial systems

Warren Buffett is a great example of a “judgemental” value investor. He decides if companies are overvalued or undervalued based on principled analyses of their fundamentals. He does not make most of his decisions primarily using advanced AI tools.

Jane Street is a strong alternative example. I imagine many people would agree that over the last 20 years computer systems have become much more effective at stock trading. I imagine they would also agree that humans haven’t improved at stock trading to anything like the same degree.

However, value investors still exist (and occasionally flourish) in the market with algorithmic traders. For now both seem quite important.

Important forecasting questions are very similar to financially-traded questions

Consider the following two sets of questions. Which do you think would be more amenable to combining lots of AI systems and human judgement?

Question Set #1:

The GDP of Russia in 2030
The GDP of the United States in 2030, conditional on a Democratic candidate being elected in 2024
The total number of AI papers published in 2025
The average global temperature in 2030

Question Set #2:

The time-discounted expected value of Apple over the distant future.
The time-discounted expected value of JPMorgan Chase & Co. over the distant future.

I think without prior knowledge, it shouldn’t be obvious that question set #2 has any strong net advantages over question set #1.

Question set #1 contains questions we may want to forecast for general use. Currently these questions are typically forecasted using relatively judgemental methods, with very little automation.

Meanwhile question set #2 currently has tons of advanced automation, AI, and collaboration.

The only real difference that I can tell is that question set #2 happens to have a very significant market available, with a lot of money at stake, and question set #1 does not have anything like that. Because we know that advanced methods are used successfully for question set #2, I think we can assume that they could become a large part of solving questions in question set #1, and that this may be needed for us to do a similarly good job with the questions in question set #1.

Part 3: Misconceptions on Forecasting

Misconception: Many things are impossible to predict

Saying that some things are “impossible” to predict treats prediction as a binary ability. It’s instead a gradient called predictability. “Impossible” vs. “possible” is the wrong type signature.

There’s no generic “time threshold”, after which we cannot predict things. Different types of things have very different predictability. The weather isn’t very predictable 3 weeks out. Global population is relatively predictable 5 to 10 years out. Interplanetary orbits are predictable 10,000 years out.

Predictability is a predictable thing. Forecasters can forecast the certainty we can get on different variables conditional on us spending resources to forecast them. With sufficient forecasting work, we could make elegant tables of what kinds of things are the most effective to forecast.

On “Predictability”

I’m using predictability as a fuzzy term. I’d like to provide a clearer definition / split later on.

Misconception: The Good Judgement Project has shown we can’t predict things after 2 years

The Good Judgement Project has found that the geopolitical events they have studied seem to be “somewhat predictable up to two years out but much more difficult to predict five, ten, twenty years out” [AI Impacts].

It’s clear that the kinds of questions asked by the GJP are not very predictable 2+ years out. It’s not clear how well this applies to other questions. Obviously many things 2+ years out are predictable, like population sizes.

First, there was likely some selection effect in the questions. They have historically posed “interesting” questions that seemed significantly uncertain and difficult.

Second, my impression is that GJP’s claims were about resolution, not calibration. The forecasters should generally be well calibrated; it’s just that their resolution is very low on many political events. This is fine and can itself be very useful. As Philip Tetlock mentioned in Expert Political Judgement, many “experts”, “popular figures” and “government figures” commonly make very long-term political forecasts with rather poor calibration. Superforecasters have poor resolution on specific long-term questions, but probably no worse than other people, and their calibration should be much better. Listening to superforecasters on long-term political issues, and on other long-term issues, seems more useful than listening to any other apparent group.
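To make the calibration vs. resolution distinction concrete, here is a sketch using the standard Murphy decomposition of the Brier score (Brier = reliability - resolution + uncertainty). This decomposition is my choice of illustration, not something GJP’s claims were stated in terms of, and the forecasts and outcomes below are invented toy data. A low “reliability” term means good calibration; a high “resolution” term means the forecasts separate events that happen from events that don’t.

```python
from collections import defaultdict


def brier_decomposition(forecasts: list[float], outcomes: list[int]) -> tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are grouped by their exact probability value, so the identity
    Brier = reliability - resolution + uncertainty holds exactly.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups: dict[float, list[int]] = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2 for f, os in groups.items()) / n
    resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in groups.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


def brier(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


outcomes = [1, 0, 1, 0]  # toy data: half of the events happened

# A cautious forecaster who always says 50%: calibrated, but no resolution.
cautious = [0.5, 0.5, 0.5, 0.5]
# A confident pundit who always says 90%: same lack of resolution, poorly calibrated.
pundit = [0.9, 0.9, 0.9, 0.9]

for name, fs in (("cautious", cautious), ("pundit", pundit)):
    rel, res, unc = brier_decomposition(fs, outcomes)
    print(name, "reliability:", rel, "resolution:", res, "brier:", brier(fs, outcomes))
```

In this toy case neither forecaster can tell the events apart, so both have zero resolution, but only the cautious one avoids the extra penalty from miscalibration. That is the sense in which a well-calibrated, low-resolution forecaster can still be the better one to listen to.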

Misconception: Most future value from forecasting will definitely come from forecasters internal to organizations & projects

Prediction market software has been available for internal company and government use for many years, but adoption has been very small. For example, Google tried them for a few projects around 2008, but use has mostly halted. Even after the Good Judgement Project’s work on superforecasting, there are no related large government forecasting projects.

Some may see this as a large amount of evidence that forecasting methods will never be useful for governments and companies.

I’ll begin a counter-argument by clarifying that “forecasting” is a very broad term; the specific thing that is being discussed is typically “internal prediction markets” or “formal internal prediction registries for judgemental forecasts.” Governments and corporations generally use judgemental forecasts very commonly (executive decisions, for instance), and employ many data scientists and similar for statistical forecasts.

Many things that businesses care about are similar between businesses. For instance, many businesses purchase data sets and information sources from third-party providers.

Some questions:

How much money is spent by businesses on internal vs. external analysis?
How much value do businesses get by internal vs. external analysis?

[Incomplete]

Forecasts vs. Estimates

Claim: Forecasts are a subset of Estimates

Claim: Almost all estimates can be reframed as forecasts, which would come with both benefits and drawbacks.

Claim:

Almost all estimates can be turned into forecasts.

Forecasts are more expensive but more powerful than estimates.