AI Tools as Evaluators

Reviewing Notes

  • Article by Ozzie Gooen, intending to publish around Jan 17th.
  • This is somewhere between “rough blog post” and “professional output”, leaning closer to the former.
  • Any comments/takes are appreciated. I probably care most about big-picture ideas/criticisms / sharing the key thoughts with others who may go on to use them.

Today, expert humans are often the most trusted resources for making important determinations on subjective and speculative questions. As AI systems improve, we are likely to defer to these instead.

One immediate use of AI systems as “evaluators” is to resolve complex forecasting questions. We already have prediction platforms that propose and estimate long lists of tricky questions and resolve them with human judges. These platforms are often highly innovative and technical, which makes them a natural place to experiment with AI evaluations.

In this essay we discuss how AI evaluations could work and some potential complications. Casual readers might want to skip ahead and only read the section on “A Short Story of How This Plays Out.” Others might find the details interesting.

AI evaluation systems could be dramatically useful, if they could be both highly optimized and actually trusted (with appropriate levels of trust). They’re also very tractable - there are many experiments and innovations that could likely be done in the next year. We think they deserve more attention, and we expect to focus more on this area going forward.

Note that AI evaluations don’t need to be strong yet to be useful. If they’re just expected to be strong in a few years, we can start setting up prediction tournaments to target them today. This can help us set up decently-incentivized forecasting markets on difficult-to-resolve questions, without needing to arrange humans to resolve them.

Say we want to create a prediction tournament on a partially-subjective or speculative question.

For example:

  • “Will there be over 10 million people killed in global wars between 2025 and 2030?”
  • “Will IT security incidents be a national US concern, on the scale of the top 5 concerns, in 2030?”
  • “Will bottlenecks in power capacity limit AIs by over 50%, in 2030?”

These are all questions that require some amount of subjective judgement. Come 2030, it will likely be possible to argue on either side of any of these questions, but I suspect most reasonable people will wind up agreeing with each other on one side of each.

The current solution to this sort of endeavor is to select some human evaluator to make a judgement call. On Manifold this is the question author, and different authors develop different reputations for doing a good or bad job at this. On Metaculus, often small panels of experts are chosen for the more subjective and important questions.

Humans have a lot of downsides though.

  • Dramatically more expensive than AI systems, especially when expertise is desired.
  • Very short track records or evaluations.
  • Poor accessibility. Most predictors can’t ask them questions, for example.
  • Humans change. Their opinions or motivations can dramatically shift over time.
  • Many humans can’t be guaranteed to be available at points in the future.
  • Humans can have substantial biases and/or be corrupt.

Instead of using humans to resolve thorny questions, we can use AIs. There are many ways we could attempt to do this, so we’ll walk through a few examples.

Option 1: Use an LLM

The first option to consider is to use an LLM as an evaluator. For example, write, “This question will be judged by Claude 3.5 Sonnet (2024-06-20). The specific prompt will be, …”

This style is replicable, simple, and inexpensive. However, it clearly has some downsides. The first obvious one is that Claude 3.5 Sonnet doesn’t perform web searches, so its knowledge would likely be too limited to resolve future forecasting questions.

Option 2: Use an AI Tool with Search

Instead of using a standard LLM, you might want to use a tool that uses both LLMs and web searches. Perplexity might be the most famous one now, but other advanced research assistants are starting to come out. In theory one should be able to set a research budget that’s in line with the importance and complexity of the question.

This is probably better than Option 1 for most things. But there are still problems. The next major one is the risk that Perplexity, or any other single tool we can point to now, won’t be the leading one in the future. The field is moving rapidly, and it’s difficult to tell which tools will even exist in 5 years, let alone be the preferred options.

Option 3: Use an “Epistemic” Selection Protocol

In this case, you don’t select a specific AI tool. Instead you select a process or protocol that selects an AI tool.

For example: “In 2030, we will resolve this question using the leading AI tool on the ‘Forbes 2030 Most trusted AI tools’ list.”

We’re looking for AI tools that are “trusted to reason about complex, often speculative or political matters.” This arguably can be more quickly expressed as searching for the tool with the best epistemics.

Epistemic Selection Protocols (Or, how do we choose the best AI tool to use?)

Arguably, AI Epistemic Selection Protocols can be the best choice of the above options, if one could implement them effectively, for most 2+ year questions. There are a lot of potential processes to choose, though most would be too complicated to be worthwhile. We want to strike a balance between simplicity and optimality.

Let’s first list the most obvious options.

Option 1: Trusted and formalized epistemic evaluations

There’s currently a wide variety of AI benchmarks. But arguably, none of these would be great proxies for which AI tools would be the most trusted question resolvers in the future. Newer, deliberate benchmarks could help here.

Example: “This forecasting question will be resolved using whichever AI Tool does the best on Epistemic Benchmark X and can be used for less than $20.”

Option 2: Human-derived trust rankings

Humans could simply be polled on which AI tools they regard as the most trustworthy. One challenge is that different groups of humans would have different preferences, so the group would need to be specified in advance for an AI Selection Process.

Example: “This forecasting question will be resolved using whichever AI Tool is at the top of the list of ‘Most trusted AI Tools’ on LessWrong and can be used for less than $20.”

Option 3: Inter-AI trust ratings

AI tools could select future AI tools to use. This could be a 1-step solution, where an open-source or standardized (for the sake of ensuring it will be available long-term) solution is asked to identify the best available candidate. Or it could be a multiple-step solution, where perhaps AI tools are asked to recommend each other using some simple algorithm. This can be similar in concept to the Community Notes algorithm.

Example: “This forecasting question will be resolved using whichever AI Tool wins a poll of the ‘Most trusted AI tools’ according to AI tools. In this poll, each AI tool will recommend its favorite of the other available candidates.” (Note: This specific proposal can be gamed, so greater complexity will likely be required.)
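The one-round poll in the example above could be tallied roughly as follows. This is a minimal sketch, not a proposed standard: the tool names are hypothetical, the rule of discarding self-votes and breaking ties alphabetically is an assumption, and the simple plurality count is exactly the kind of scheme that can be gamed.

```python
from collections import Counter


def run_inter_ai_poll(recommendations: dict[str, str]) -> str:
    """Return the tool that receives the most recommendations from other tools.

    `recommendations` maps each tool name to the tool it recommends.
    Self-votes and votes for tools outside the candidate set are discarded.
    """
    candidates = set(recommendations)
    votes = Counter(
        choice
        for voter, choice in recommendations.items()
        if choice != voter and choice in candidates
    )
    if not votes:
        raise ValueError("no valid votes were cast")
    # Break ties alphabetically so the outcome is deterministic (an assumption).
    best = max(votes.values())
    return min(tool for tool, count in votes.items() if count == best)


# Hypothetical tool names, for illustration only.
example_votes = {
    "tool_a": "tool_b",
    "tool_b": "tool_c",
    "tool_c": "tool_b",
    "tool_d": "tool_d",  # self-vote, ignored
}
winner = run_inter_ai_poll(example_votes)
```

A multi-step version would replace the plurality count with something more robust, for example weighting votes by how much each voter is itself trusted.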

A Short Story of How This Plays Out

In 2025, several question writers on Manifold experiment with AI resolution systems. Some questions include:

“Will California fires in 2025 be worse than those in 2024? To answer this, I’ll ask Perplexity.AI on Jan 1, 2026. My prompt will be, [Will California fires in 2025 be worse than those in 2024? Judge this by guessing the total economic loss.]”

“How many employees will OpenAI have in Dec 2025? To answer this, I’ll first ask commenters to write arguments and/or facts that they’ve found on this. I’ll filter this for what seems accurate, then I’ll paste this into Perplexity. I’ll call Perplexity 5 times, and average the results.”
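The “call the tool several times and average” procedure in the last example can be sketched as below. The `ask` callable stands in for whatever AI tool is used; it is a placeholder, not a real Perplexity API, and the stubbed answers are made-up numbers chosen only to show the arithmetic.

```python
from statistics import mean
from typing import Callable


def resolve_by_averaging(
    ask: Callable[[str], float], prompt: str, n_calls: int = 5
) -> float:
    """Query a numeric resolver `n_calls` times and return the mean answer."""
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    return mean(ask(prompt) for _ in range(n_calls))


# Stub resolver returning made-up values, for illustration only.
_fake_answers = iter([90.0, 110.0, 100.0, 105.0, 95.0])


def fake_ask(prompt: str) -> float:
    return next(_fake_answers)


estimate = resolve_by_averaging(fake_ask, "example numeric question")
```

Averaging reduces run-to-run noise but does nothing about systematic bias, so a median or a trimmed mean might be preferable if occasional outlier answers are expected.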

Forecasting users gradually identify the uses and limitations of such systems. It turns out they are surprisingly bad at advanced physics questions, for reasons that are not obvious. There are a few clever prompting strategies that help ensure that these AIs put out more consistent results.

AI tools like Perplexity also get very good at hunting down and answering questions that are straightforward to resolve. Manifold adds custom functionality to do this. For example, say someone writes a question, “What Movie Will Win The 2025 Oscars For Best Picture?” When they do, they’ll be given the option to have a Manifold AI system automatically make a suggested guess for them, at the time of expected question resolution. These guesses will begin with high error rates (10%), but these will gradually drop.

Separately, various epistemic evaluations are established. There are multiple public and private rankings. There are also surveys of the “Most Trusted AIs”, held on various platforms such as Manifold, LessWrong, and The Verge. Leading consumer product review websites such as Consumer Reports and Wirecutter begin to have ratings for AI tools, using defined categories such as “accuracy” and “reasonableness.”

One example question from this is: “In 2030, will it seem like o1 was an important AI development, that was at least as innovative and important as GPT4? This will be resolved using whichever AI leads the ‘Most trusted AIs’ poll on Manifold.”

There will be a long tail of AI tools that are proposed as contenders for epistemic benchmarks. Most of the options are simply minor tweaks on other options or light routers. Few of these will get the full standard evaluations, but good proxies will emerge. It turns out that you can get a decent measure by using the top fully-evaluated AI systems to evaluate more niche systems.

In 2027, there will be a significant amount of understanding, buy-in, and sophistication with such systems (at least among a few niche communities, like Manifold users). This will make it possible to scale them for more ambitious uses.

Metaculus runs some competitions that include:

“What is the relative value of each of [the top 100 AI safety papers of 2026]? This will be resolved in 2030 by using the most trusted AI system, via LessWrong or The Economist, at that time. This AI will order all of the papers - forecasters should estimate the percentile that each paper will achieve.”

“What is the expected value of every biosafety organization, estimated as what Open Philanthropy would have paid for it from their biosafety funding pool in 2027? This will be judged in 2029, by the most trusted AI system, for a random 1/10th of the organizations, with a budget of $1,000 for each evaluation.”

Around this time, some researchers will begin to make wider kinds of analyses, and forecast compressions.

“How will the SOTA epistemic model of 2030 evaluate the accuracy and value of the claims of each of the top 100 intellectuals from 2027?”

“Will the SOTA epistemic model of 2030 consider the current SOTA epistemic models to be ‘highly overconfident’ for at least 10% of the normative questions they are asked?”

The top trusted AI tools start to become frequent ways to second-guess humans. For example, if a boss makes a controversial decision, people could contest the decision if top AI tools back them up. Similar analyses would be used within governments.

As these AI tools become even more trusted, they will replace many humans for important analyses and decisions.

Protocol Complications & Potential Solutions

Complication 1: Lack of Sufficient AI Tools

In the beginning, we expect that many people won’t trust any AI tools to be adequate in resolving many questions. Even if tools look good in evaluations, it will take time for them to build trust.

One option is to set certain criteria for sufficiency. For example, one might say, “This question will be resolved using whichever AI system first gets to a 90/100 on the Epistemic Benchmark Evaluation…” This would clearly require understanding and trust in the evaluations, rather than in a specific tool, so this would require strong evaluations.
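A sufficiency rule of this kind might be sketched as follows. The benchmark, the 0-100 scale, the tool names, and the dates are all hypothetical; the only logic shown is “pick the earliest tool whose score reaches the threshold.”

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class BenchmarkResult:
    tool: str            # hypothetical tool name
    evaluated_on: date   # when the score was published
    score: float         # 0-100 on the hypothetical benchmark


def first_sufficient_tool(
    results: list[BenchmarkResult], threshold: float = 90.0
) -> Optional[str]:
    """Return the earliest tool to reach the threshold, or None if none has."""
    passing = [r for r in results if r.score >= threshold]
    if not passing:
        return None  # the question stays unresolved for now
    return min(passing, key=lambda r: r.evaluated_on).tool


results = [
    BenchmarkResult("tool_a", date(2026, 3, 1), 84.0),
    BenchmarkResult("tool_b", date(2027, 6, 1), 93.0),
    BenchmarkResult("tool_c", date(2027, 1, 15), 91.0),
]
chosen = first_sufficient_tool(results)
```

Note that this picks the first tool to qualify, not the best one available at resolution time; a question author would need to state which of the two is intended.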

Complication 2: Lack of Ground Truth

One standard difficulty facing subjective and/or speculative questions is the problem of getting a specific answer. There are many questions for which the correct answers will only be found far after they are needed, and others where there will never be correct answers.

The bar to do a good job should arguably be “do better than alternatives,” rather than trying to be fully precise, in situations where the latter is impossible.

The goal of question resolutions is to do the best we can with the available resources (financial costs, compute, time, etc). In order to be useful to clients, a resolution method should generally outperform other question resolution strategies they have access to.

It arguably seems important and tractable for these tools to be calibrated, at least in ways that reflect a client’s belief system. The next step is to have as high-resolution an answer as possible, given strong calibration.

In a forecasting environment, resolutions don’t need to be precise. The main thing is to make sure that they are calibrated, and that they represent more information and deliberation than the predictions.

Complication 3: Goodharting

We’d want to avoid a situation where one tool technically maximizes a narrow “Epistemic Selection Protocol”, but is actually poor at doing many of the things we want from a resolver AI. Perhaps some tools have Goodharted the Protocol.

To get around this, the Protocol could have restrictions, like to ask:

“What will be the most epistemically-capable service in [Date] that satisfies the following requirements?”

  1. Costs under $20 per run.
  2. Is publicly available.
  3. Has over 1000 human users per month (this is to ensure there’s no bottleneck that’s hard to otherwise specify).
  4. Completes runs within 10 minutes.
  5. Has been separately reviewed to not have significantly and deceivingly goodharted on this specific benchmark.
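The requirements above can be read as a filter followed by a ranking step. The sketch below assumes each candidate comes with a single `epistemic_score` from some benchmark and a boolean outcome from a separate Goodharting review; both fields, and all the names, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    epistemic_score: float        # hypothetical benchmark score, higher is better
    cost_per_run_usd: float
    publicly_available: bool
    monthly_human_users: int
    run_minutes: float
    passed_goodhart_review: bool  # outcome of a separate, assumed review


def is_eligible(c: Candidate) -> bool:
    """Apply the five restrictions listed in the protocol."""
    return (
        c.cost_per_run_usd < 20
        and c.publicly_available
        and c.monthly_human_users > 1000
        and c.run_minutes <= 10
        and c.passed_goodhart_review
    )


def select_resolver(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the highest-scoring eligible candidate, or None if none qualify."""
    eligible = [c for c in candidates if is_eligible(c)]
    return max(eligible, key=lambda c: c.epistemic_score, default=None)


pool = [
    Candidate("tool_a", 95.0, 50.0, True, 5000, 3.0, True),    # too expensive
    Candidate("tool_b", 90.0, 10.0, True, 20000, 5.0, True),
    Candidate("tool_c", 97.0, 5.0, True, 8000, 2.0, False),    # failed review
]
selected = select_resolver(pool)
```

In this made-up pool the two highest-scoring tools are excluded by the restrictions, which is the intended effect: the constraints trade a little benchmark performance for robustness.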

It’s often possible to get around Goodharting by applying additional layers of complexity. Whether it’s worth it depends on the situation.

Complication 4: Different Ideologies

Say there’s a question on the moral costs and benefits of a policy change that cuts taxes. People from different philosophical or ideological backgrounds are likely to disagree on many of the core assumptions that could lead to an answer.

One solution is to not provide an answer. Instead, express that it’s a “difficult question, with many valid answers.” However, that’s often not very useful.

A second solution is to allow for ideologically-representative AI tools to compete. This could mean a bunch of separate tools with different capabilities, or it could mean one AI tool that has settings that allow it to represent these beliefs.

A more complex setup could involve a tool that generates answers individualized for specific people, with different levels of study on certain questions. So for example, “In 2030, how valuable was California Proposition 10, according to an arbitrary person X, having studied the topic for [10, 100, 1000] hours?” This question would be estimated with an algorithm (i.e. a Scorable Function) that takes in attributes of a given person and how much time they spent studying the question.

Complication 5: AI Tools with Different Strengths

One might ask: “What if different AI tools are epistemically dominant in different areas? For example, one is great at political science, and another is great at advanced mathematics.”

An obvious answer is to then create simple compositions of AI tools. A router can be used to send specific requests or subrequests to other AI tools that are best equipped to handle them.

[Figure: One possible AI tool resolution workflow]
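A router of the kind described in this section could look roughly like this. The keyword classifier and the lambda “tools” are stand-ins invented for illustration; in practice the classification step would probably itself be done by an AI tool.

```python
from typing import Callable

Resolver = Callable[[str], str]


def make_router(
    specialists: dict[str, Resolver],
    classify: Callable[[str], str],
    fallback: Resolver,
) -> Resolver:
    """Build a resolver that forwards each question to the best-suited tool."""

    def route(question: str) -> str:
        domain = classify(question)
        # Use the general-purpose tool when no specialist covers the domain.
        return specialists.get(domain, fallback)(question)

    return route


def classify_by_keyword(question: str) -> str:
    """Toy classifier; a hypothetical placeholder for a real domain detector."""
    text = question.lower()
    if "election" in text or "policy" in text:
        return "political_science"
    if "theorem" in text or "proof" in text:
        return "mathematics"
    return "general"


router = make_router(
    specialists={
        "political_science": lambda q: f"[politics tool] {q}",
        "mathematics": lambda q: f"[math tool] {q}",
    },
    classify=classify_by_keyword,
    fallback=lambda q: f"[general tool] {q}",
)
answer = router("Is this proof of the theorem valid?")
```

Because the router has the same signature as the tools it wraps, the composed system can be benchmarked and selected by a protocol just like any single tool.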

Complication 6: AI Tools that Recommend Other AI Tools

Imagine there’s a situation where one AI tool is chosen, but that tool recommends a different tool instead. For example, Perplexity 3.0 is asked a question, and it responds by stating that Claude 4.5 could do a better job than it could. Arguably it would make a lot of sense that if an AI tool were highly trusted to make speculative judgements, it could be trusted to be correct when claiming that a different tool is superior to itself.

This probably won’t be a major bottleneck. If AI tools can delegate specific questions to other tools, that delegation can simply be treated as part of the tool’s behavior when it is evaluated.

End of Article

(The rest of this is scrap / stuff that might be better in future posts)

Complication: AI Tools with Adjustable Parameters

The name “Epistemic Front-Runner” implies that there’s some specific and discrete AI process that produces results. But this might not be the case.

Today, Perplexity has two modes: regular and pro. It’s easy to assume that the pro version is the front-runner.

But what if, instead, Perplexity allowed you to enter an arbitrary budget? In that case, how would one define the “Front-Runner”?

Or what if there were several more tunable parameters to choose? Perhaps some are configurations that might impact performance, but where optimization is very difficult.

This highlights that the idea of “Epistemic Front-Runner” is likely not as clean as the phrase suggests. But we can still make approximations, and use the term as a placeholder.

A more precise name would probably be something more like, “Widely-accessible, evaluated, and reasonably-priced epistemic standard.”

Evaluating AIs using Epistemically Dominant AIs

Say you have two broadly capable AI tools that can resolve generic forecasting questions or make subjective calls. You broadly trust one more than the other. We can then call these AI_weak and AI_strong.

One obvious thing to do here would be to use AI_strong to evaluate AI_weak. AI_strong could probe AI_weak over a large domain of questions. It would then separately generate its own response to these questions and have AI_weak generate a response, compare the responses, then evaluate AI_weak for its capabilities. In most situations, it’s expected that AI_strong will outperform AI_weak - but there might also be situations where AI_weak outperforms AI_strong in ways that AI_strong will understand.

For example, AI_strong might generate 10,000 binary forecasting questions among a wide range of topics. Both AI_strong and AI_weak would forecast on all of these questions. In the case that there’s a disagreement, AI_strong might converse with AI_weak to see if it could be convinced. After that, one would use a proper scoring rule on the values of AI_weak, where we assume that the new best-guess values of AI_strong (after deliberating with AI_weak) are correct.
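The scoring step can be made concrete with a small sketch. The essay does not name a scoring rule, so the Brier score is used here as one example of a proper scoring rule, and AI_strong’s post-deliberation probability is treated as the true probability of each binary question. The function names and forecast values are illustrative only.

```python
def expected_brier(weak: float, strong: float) -> float:
    """Expected Brier score of forecast `weak` if the event occurs with
    probability `strong`. Lower is better; minimized when weak == strong."""
    return strong * (1.0 - weak) ** 2 + (1.0 - strong) * weak ** 2


def score_weak_against_strong(
    weak_forecasts: list[float], strong_forecasts: list[float]
) -> float:
    """Average expected Brier score of AI_weak across all questions."""
    if not weak_forecasts or len(weak_forecasts) != len(strong_forecasts):
        raise ValueError("forecast lists must be non-empty and of equal length")
    total = sum(
        expected_brier(w, s) for w, s in zip(weak_forecasts, strong_forecasts)
    )
    return total / len(weak_forecasts)


# Made-up forecasts on three binary questions.
strong = [0.9, 0.2, 0.5]
weak_close = [0.85, 0.25, 0.5]
weak_far = [0.5, 0.7, 0.1]
close_score = score_weak_against_strong(weak_close, strong)
far_score = score_weak_against_strong(weak_far, strong)
```

Even a perfect match with AI_strong does not score zero under this rule, because AI_strong’s own uncertainty sets a floor. Comparing tools by their gap above that floor is one way to read the results.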

What this means is that, while humans might have a lot of uncertainty about subjective and speculative questions, we could get a gauge for how every AI tool compares on them, with respect to a certain Epistemic Front-Runner.

If such a system were implemented and well-understood, it would be possible to then add other layers to it. For example, prediction markets could forecast how well certain AI tools will do against future Epistemic Front-Runners that don’t yet exist.

Some key questions:

  • When will AI tools be “Epistemically Sufficient” to resolve various forecasting or speculative questions, in a manner that’s useful? If these don’t exist now, can we predict when they may exist?
  • What are some clean Epistemic Selection Protocol options, for choosing potential Epistemic Front-Runners?
  • What are the best methods for evaluating AIs using Epistemically Dominant AIs?

But what protocol might be best? There are some clear concerns to worry about.

  • If the protocol is mediocre, it could result in poor AI tools doing the evaluation.
  • If the protocol is noisy, it could sometimes select AI tools known for specific biases. While any specific tool could at least be predictably biased, having uncertainty about which tool is used would lead to noisier predictions.
  • If the protocol is complex, forecasters and viewers might refuse to engage with it. They might default to not trusting it.

We might ideally have a protocol similar to Coherent Extrapolated Volition.

If protocols might be complicated, then it would make sense to not make new protocols for every forecasting question. Instead a forecasting question would just refer to a commonly used protocol. For example, “Will there be over 10 million people killed in global wars between 2025 and 2030, according to the AI chosen by Common Epistemic Selection Protocol #7?” In practice, it seems likely that such protocols might be simple, in which case the specifics can be used instead. For example, “Will there be over 10 million people killed in global wars between 2025 and 2030, according to the AI ranked highest on the LessWrong Epistemic AI Leaderboard?”

What if there are no trusted AI tools, but it’s expected there might be in the future? In that case, there could be Epistemic Selection Protocols that simply wait until certain criteria are satisfied. For example, one might say, “This question will be resolved using whichever AI system first gets to a 90/100 on the Epistemic Benchmark Evaluation…” This might be called, “Epistemic Sufficiency.”

To clarify this more, we can consider the idea of an “Epistemic Front-Runner.” This is the most trusted AI tool that exists at a certain point for broad question resolution. We might broadly want an “Epistemic Selection Protocol” that can reliably select the future “Epistemic Front-Runner.”