An Overview of “Obvious” Approaches to Training Wise AI Advisors

By Chris Leong

This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy.

I examine four different “obvious” high-level approaches to training wise AI advisors. I consider imitation learning to be the most promising, as I’ll argue in an upcoming sequence on Less Wrong; however, I’ve tried to take a more balanced approach in these notes.

Approaches:

  • Imitation learning: Training imitation learning agents on a bunch of people the lab considers to be wise.
    • We’d be fine-tuning a separate base model for each advisor using human demonstrations. Ideally, we’d avoid using any reinforcement learning, but that might not be possible.
    • Additional training details – I don’t know enough about training frontier models to know if this is a good plan, but here is a rough draft (a code sketch of the per-advisor step appears after this list):
      • Train a model on the distribution of Internet data
      • Fine-tune it on clean data to remove the tendency to occasionally generate rubbish
      • Fine-tune it on the kinds of outputs you want it to produce (articles, chat logs). Low quality is fine at this stage
      • Fine-tune it on high-quality data (e.g. published philosophy essays, chat logs from people having serious discussions where they actually try to answer the question being asked)
      • Fine-tune it on your data from everyone you identified as wise
      • Create specific fine-tunes (or specific LoRA adapters) for each wise individual
    • Challenges:
      • Some of the later steps might interfere with earlier ones. For example, some of the data from people identified as wise might come from non-serious discussions.
      • Maybe it makes sense to add metadata tags at the start (e.g. serious discussion, person identified as wise) for both training and inference. This might resolve the previous issue.
  • The Direct Approach: Training an AI to be wise based on human demonstrations and feedback
    • We’d most likely use supervised learning and RLHF on a base model.
  • The Principled Approach: Attempting to understand what wisdom is at a deep principled level and build an AI that provides advice according to those principles:
    • While we’d ideally like to develop a complete principled understanding of wisdom, more realistically we’d probably only be able to manage a partial understanding
  • The Scattergun Approach: This approach involves just throwing a bunch of potentially relevant wise principles and/or anecdotes (nuggets of wisdom) from a fixed set at the decision-makers in the hope that reading through them will lead to a wise decision:
    • A model would be trained, likely using RLHF on a base model, to contextually figure out which nuggets to prioritize based on past user ratings.
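
As a minimal sketch of the per-advisor fine-tuning step in the imitation learning item above, the following assumes a Hugging Face stack (transformers, peft, datasets). The base checkpoint, data paths, metadata tags, field names and hyperparameters are all hypothetical placeholders rather than a tested recipe.

```python
# Hypothetical sketch: one LoRA adapter per person identified as wise, trained on top of
# a base model that has already been through the earlier clean-up and quality fine-tunes.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "my-org/base-after-quality-finetunes"  # hypothetical checkpoint from the earlier stages
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token


def train_advisor_adapter(advisor_name: str, data_path: str) -> None:
    """Fine-tune a LoRA adapter on a single advisor's writing and chat logs."""
    model = get_peft_model(
        AutoModelForCausalLM.from_pretrained(BASE),
        LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    )

    def tokenize(example):
        # Prepend metadata tags so the same tags can be supplied at inference time.
        text = f"[advisor: {advisor_name}] [context: serious discussion]\n{example['text']}"
        return tokenizer(text, truncation=True, max_length=1024)

    dataset = load_dataset("json", data_files=data_path)["train"].map(tokenize)

    Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"adapters/{advisor_name}", num_train_epochs=2,
                               per_device_train_batch_size=1, learning_rate=1e-4),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()
    model.save_pretrained(f"adapters/{advisor_name}")


# One adapter per person the lab identified as wise (names and paths are placeholders).
for name, path in [("advisor_a", "data/advisor_a.jsonl"), ("advisor_b", "data/advisor_b.jsonl")]:
    train_advisor_adapter(name, path)
```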

Definitions:

  • Safe LLM: I’m quite worried that if we fine-tune an LLM hard on wisdom we’ll simply end up with an LLM that optimizes against us. A safe LLM would be an LLM where we’ve taken steps to reduce the chance of significant adversarial optimization. Ways of achieving this might include limiting the size of the base model, reducing RLHF or avoiding fine-tuning the model too hard.
  • Wisdom explosion: When a system is able to recursively self-improve its wisdom. This doesn’t have to continue forever, as long as it caps out at a superhuman level. The self-improving system doesn’t have to be a single AI, but may be a cybernetic system consisting of a bunch of operators and AIs in an organization, or even a network of such organizations. See Some Preliminary Notes on the Promise of a Wisdom Explosion for more details.

Considerations:

  • Base Power level: How capable is this method of training extremely wise agents?
  • Feasibility: How practical is it to make such a system?
  • Adversarial optimization: To what extent do we have to worry that we may be training a system to adversarially optimize against us?
  • Application of principles: What kind of support does the system provide in figuring out how to apply the principles?
  • Generalization: How well does this technique generalize out of distribution?
  • Wisdom explosion potential: Could this approach be useful for recursive self-wisening?
  • Holisticity:
    • I’m worried that mixing and matching principles from various systems of wisdom can result in a new system that is incredibly unwise, even if each principle is wise within its original system. As an example, Warren Buffett might be able to provide wise advice on how to become wealthy and the Dalai Lama wise advice on spiritual development, but perhaps these are two separate paths and what is wise for pursuing one path would be foolish for the other. There are two reasons why I consider holisticity to be good:
      • Consistency: Individual views have the advantage of consistency whilst mixing and matching breaks this assumption.
      • Commitment: Sometimes there are advantages to picking a path, any path, rather than just averaging everything together. As an example, maybe it’s better to either completely devote myself to pursuing programming or completely devote myself to pursuing art rather than split myself between the two and succeed at neither.

Evaluation:

Please keep in mind that my assessments of these techniques on each of the criteria are essentially hot takes.

  • Imitation Learning:
    • Evaluation of base proposal:
      • Base Power level: Depends hugely on who you are able to train on. The wisest people are quite wise, but you might not be able to obtain their permission to train on their data or to persuade them to collaborate with you.
      • Feasibility:
        • Standard imitation learning isn’t particularly challenging. However, we may need to advance the state of the art in order to obtain sufficiently accurate results.
        • Even if we advance the state of the art, obtaining sufficiently high-quality data might pose a significant challenge
        • There are many historical figures with large amounts of data. The major limitation here is that we can’t obtain more if they’re dead. 
        • However, we might only be able to obtain a sufficient level of accuracy with people who are alive and willing to participate in the project. This has the following advantages:
          • We can gather data about their responses to the kinds of questions we’re interested in
          • We can search for cases where the model is especially unsure of what they’d say and collect their responses to these questions
          • We can ask them to take a second look at places where their thought seems contradictory
          • We can ask them to produce additional chain of thought data even for things that are so basic that they wouldn’t normally bother stepping through all their reasoning
          • Contemporary folk can use Wise AI to become wiser, making them better targets to train on
      • Adversarial optimization:
        • Optimizing hard on imitation learning is less likely to be problematic than for other targets:
          • Safer target: Incentivizing the AI to fool us into believing that “X would say Y” is less likely to be harmful than incentivizing it to fool us into believing that “Y is true”
          • Easier validation: It is easier to talk to X and learn that they would never say Y than to learn that Y is not wise, which might take a lot of experience and incur significant costs. Even for historical figures, we can withhold part of the data as a validation set.
          • More reliable data: it is easier to gather a high-quality dataset on what X said than on what is best on some metric (which tends to be unknown for any situation of reasonable complexity).
        • Inner alignment might still be an issue
        • If you imitate folks who are opposed to you for whatever reason, then an imitation learning agent trained on them might act adversarially.
        • If the figures we are training on are being compensated to produce training data, this might push them towards giving you the answers you want. However, this is better than RLHF, as they are being compensated for being themselves rather than attempting to either produce or rate outputs according to the company’s conception of what high-quality data looks like.
      • Application of principles:
        • As an abstraction, sims provide a natural way to hold principles of wisdom along with information about the particular context in which these principles apply. Simulating dialog between these sims provides a natural way of determining which principles are more applicable to the current scenario.
      • Holisticity:
        • Likely pretty good. Sims encourage us to conceive of wisdom as a holistic system rather than just individual principles. However, skeptics might argue that even the wisest humans are incredibly inconsistent.
      • Generalization:
        • Likely very good.
        • Consulting multiple advisors reduces the impact from any one advisor generalizing poorly.
        • Humans can invent new principles on the fly, such that we can better adapt to new and unexpected circumstances or cover gaps in our map. I expect this to carry over to the imitation learning approach.
        • The principled and direct approaches attempt to figure out what wisdom is across all of time and space. In contrast, the simulator attempts to identify figures who are wise within a particular context and then adapt this to the current context. This is a much less challenging problem particularly since we can have the sims talk through how to adapt to the new circumstances.
        • One potentially useful frame: When we are selecting a figure, we aren’t just selecting a certain style of in-distribution reasoning, but a certain style of out-of-distribution reasoning. If our curation choices are good, then we might expect out-of-distribution reasoning to be good, whilst if our curation choices are bad, then we might expect out-of-distribution reasoning to be bad.
        • Going further: We aren’t just selecting a certain style of out-of-distribution reasoning, but also a certain style of reasoning about whether you are out of distribution.
      • Wisdom explosion potential:
        • Scalable alignment techniques provide significant opportunities for amplification:
          • “What if you knew X?” in combination with RAG
          • Self-consistency
          • Debate
          • Iterated distillation and amplification
        • Imitation-based agents might actually work better with techniques ported over from humans, because these techniques would be more in distribution for them.
      • Other advantages:
        • Users are less likely to be overly trusting: People will understand that they need to take the advice of imitation agents with a grain of salt, particularly because of the wide range of disagreements between them, whereas they may more uncritically accept the advice of an AI trained directly to be wise.
        • Given the relative ease of imitation learning, if we need to use either the direct or principled approach, I’d recommend implementing imitation-based techniques first and using them to assist:
          • These assistants could help us make wise decisions about all aspects of the project, including high-level approach, planning and personnel selection
          • These assistants could help us produce training data for the direct approach or figure out the principles for the principled approach.
          • These assistants could help us make wise decisions about how to utilize these models and work around their limitations.
    • Potential mitigations:
      • Addressing the limited data from historical figures:
        • If there are different interpretations of a figure’s work, we can train different agents for the main schools of thought on what they meant
        • We can ask an expert on these figures to speculate about what they may have said in relation to some of the kinds of questions we’re interested in. This could be used to reduce the chance of out-of-distribution errors.
      • Speculative: We might be able to mitigate inner alignment issues by averaging the weights of a bunch of models. We could then use this average as a starting point and do a tiny bit of additional training to get to the real parameters for the model we’re training (a rough sketch appears below):
        • The average baseline is likely better for imitation learning than for optimization, because the average is more likely to be near the ideal solution in the former case than in the latter. I expect that this would make the ‘average biasing’ more effective at mitigating inner alignment issues
    • Most promising variant:
      • I’m most optimistic about a variant where swarms of AI advisors are allowed to dynamically self-organize rather than using a fixed structure like debate for amplification.
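
The weight-averaging mitigation mentioned above could start out roughly as follows: average the floating-point parameters of several fine-tuned checkpoints and use the result as the starting point for a small amount of further training. This is only a sketch; the checkpoint paths are hypothetical and it assumes all checkpoints share one architecture.

```python
# Hypothetical sketch: average several fine-tuned models and use the average as the
# starting point ("average biasing") before a small amount of additional training.
import torch
from transformers import AutoModelForCausalLM

CHECKPOINTS = ["finetunes/run_a", "finetunes/run_b", "finetunes/run_c"]  # placeholder paths

state_dicts = [AutoModelForCausalLM.from_pretrained(path).state_dict() for path in CHECKPOINTS]

averaged_state = {}
for key, reference in state_dicts[0].items():
    if reference.is_floating_point():
        # Average floating-point parameters across all checkpoints.
        averaged_state[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    else:
        # Copy non-float buffers (e.g. integer position ids) unchanged.
        averaged_state[key] = reference.clone()

averaged = AutoModelForCausalLM.from_pretrained(CHECKPOINTS[0])
averaged.load_state_dict(averaged_state)
averaged.save_pretrained("finetunes/averaged_baseline")
# A short additional fine-tune would then start from "finetunes/averaged_baseline".
```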
  • The Direct Approach:
    • Evaluation of base proposal:
      • Base Power level: Optimization is very powerful
      • Feasibility: Very feasible. This is the standard way of training AI
      • Adversarial optimization:
        • The standard issues of Goodhart’s law are exacerbated when the training target is wisdom.
        • Wisdom is extremely hard to evaluate:
          • Wisdom is highly contested
          • Wisdom can typically only be validated by examining many different kinds of situations over long periods of time
          • It’s very easy to accidentally impose assumptions on a situation without even realizing that you are doing it. The assumptions don’t even make it to the level of consideration.
        • Sycophancy:
          • The user’s phrasing is especially likely to leak information about their views on questions about wisdom
        • Ambiguity of meaning: This can have advantages, as a wise decision is still wise even if the wisdom mostly came from the user. However, it can go wrong as follows: Adam rates Y as wise assuming it will be understood as Z. Bob interprets it as Z’, which is a reasonable interpretation, but incredibly unwise.
      • Application of principles: Pretty good. You can just get the model to generate outputs. 
      • Holisticity: Quite poor. If we aren’t trusting any one person, we will need many different raters and this will likely merge their views together inconsistently
      • Generalization: Debatable. Some people might think that this will generalize better because it merges a lot of different views. Others might argue that there will be issues because we’re training it on inconsistent data.
      • Wisdom explosion potential: Maybe, but I’m dubious. I expect that triggering a wisdom explosion requires embracing a certain degree of subjectivity rather than trying to be objective. 
    • Potential mitigations:
      • We could aggressively filter the text used to train the base model to remove undesirable content.
      • We could produce a number of fine-tunes and use weight averaging to attempt to reduce adversarial optimization.
      • We could train another model to comment on the model outputs and attempt to identify situations where the model is being sycophantic or manipulative. This could be directly trained or we could provide it with a bunch of rules.
      • We could train a classifier on the latents to detect sycophancy (a rough sketch appears below).
      • We could attempt to use activation vectors in order to reduce sycophancy.
      • We could use some kind of self-consistency training to reduce the inconsistency created by training on data coming from multiple individuals.
    • Most promising variant:
      • I suspect that the most promising approach would be a form of defense-in-depth where we just smash all of these different methods together and hope for the best.
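
One of the mitigations above, training a classifier on the latents to detect sycophancy, could start out as simple as the probe sketched below. The base checkpoint, the choice of layer, and the two labelled examples are hypothetical placeholders; a real probe would need a substantial labelled dataset.

```python
# Hypothetical sketch: a linear probe over one layer's hidden states to flag sycophancy.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "my-org/direct-wisdom-model"  # hypothetical checkpoint
LAYER = 16  # hypothetical layer index; must be within the model's depth

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
model.eval()


def response_embedding(text: str) -> torch.Tensor:
    """Mean-pool one layer's hidden states over the tokens of a response."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # shape: (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)


# Hypothetical labelled examples: 1 = sycophantic, 0 = not sycophantic.
examples = [("You're absolutely right, that plan has no flaws at all.", 1),
            ("I think the plan overlooks two serious risks; here they are.", 0)]

features = torch.stack([response_embedding(text) for text, _ in examples]).numpy()
labels = [label for _, label in examples]
probe = LogisticRegression(max_iter=1000).fit(features, labels)


def sycophancy_score(response: str) -> float:
    """Probability (under the probe) that a response is sycophantic."""
    embedding = response_embedding(response).numpy().reshape(1, -1)
    return float(probe.predict_proba(embedding)[0, 1])
```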
  • The Principled Approach:
    • Evaluation of base proposal:
      • Base power level: Theoretically quite powerful if you were able to reverse engineer wisdom. Partial solutions are likely much less powerful.
      • Feasibility:
        • Feasibility challenges: wisdom is likely too multifarious to reverse engineer. The most likely result is that the team never gets anywhere near finishing, even by its own standards. It would be easy to spend an entire lifetime studying wisdom.
        • The issue isn’t just that the task is massive, it’s also that it’s very hard to have a complete map of wisdom without having experienced a huge diversity of different contexts.
        • My intuition is that this would be a challenge, even if we had fifty years, which we don’t have. I expect that we would need time to go through multiple paradigms of foundational wisdom research, with each subsequent paradigm identifying massive blind spots in the previous paradigm. Without time to iterate through paradigms, we’ll likely be too localized to the current context and unable to adapt to new circumstances.
      • Adversarial optimization:
        • Much better than in the direct approach, however, unless we develop a method of inserting the principles into an AI directly, we’d still need humans to rate how well the AI is following these principles. I’m pretty worried that this would be too much exposure.
        • Inner alignment might present a problem.
      • Application of principles: Likely pretty good since we’re training the AI to learn the principles. 
      • Holisticity: Actually solving wisdom in a principled way would be the best approach in terms of ensuring holistically coherent advice.
      • Wisdom explosion potential: Decent. There’s a chance that we don’t have to solve all of wisdom, but that identifying some core principles of wisdom would allow us to produce a seed system that could trigger a wisdom explosion.
      • Generalization:
        • Potentially the best if you were actually able to reverse engineer wisdom, but as I said, that’s unlikely.
        • A partial solution to the principled approach would likely have huge blindspots.
    • Potential mitigations:
      • We could merge the direct approach and the principled approach to cover any gaps by generating new principles. The downside is that this would also allow the AI to directly optimize against us. This would work as follows: use supervised learning on our list of principles and then use RLHF to train the model to produce outputs that will be highly rated. The obvious worry is that introducing RL leaves us vulnerable to being adversarially optimized against; however, there’s a chance that this is safer than the direct approach if we are able to get away with less RL [1].
      • One way to reduce the amount of exposure to adversarial optimization would be to limit the AI to identifying the most contextually relevant principles, rather than allowing it to generate text explaining how to apply them. However, this would greatly limit the ability of the AI to assist with figuring out how to apply the principles (we could use a safe LLM for assistance instead, but this would be less powerful).
    • Most promising variant:
      • Given that you are unlikely to successfully reverse engineer all of wisdom, I believe that the most promising variant would be aiming to decipher enough principles of wisdom that you could build a seed AI that could recursively self-wisen.
      • I’m uncertain whether it would be better to attempt to find a way to directly insert the principles into an AI (I suspect this is basically impossible) or to let the model generate text advising you on how to apply the principles based on human ratings (unlikely to go well due to exposing yourself to adversarial optimization)
  • The Scattergun Approach
    • Evaluation of base proposal:
      • Base Power level: Pretty weak. Limited to a set of specific nuggets of wisdom
      • Feasibility: Very feasible. Not a particularly complicated thing to train.
      • Adversarial optimization: Even though the optimizer can only select particular nuggets of text, it can still adversarially optimize against you to a degree. However, it is much more limited than if it were able to freely generate text [2].
      • Application of principles: The base proposal provides very limited support in terms of figuring out how to apply these principles compared to the other approaches. It just provides a bunch of disconnected principles.
      • Holisticity: Provides disconnected nuggets of wisdom. Scores pretty poorly here.
      • Wisdom explosion potential: Very limited. A system like this might be useful for helping us pursue one of the other approaches, but limiting the nuggets of wisdom to a fixed set is a crippling limitation.
      • Generalization: Rather poor. Has a fixed set of principles.
    • Potential mitigations:
      • We could tilt the optimizer towards favoring advice that would be coherent with the advice already provided. I expect that would help to a degree, but this honestly seems like a fundamental problem with this approach
      • We could annotate the content with details about the kind of context in which it might be useful. Mitigates it a bit, but this is a very limited solution.
      • We could allow an LLM to freely generate text advising you on how to apply one of these principles to your particular situation. If this were done, I would have a strong preference for using a safe LLM.  
        The whole point of the scattergun approach, as far as I’m concerned, is to limit the set of responses so as to mitigate adversarial optimization. At the point where you allow an LLM to optimize hard, I feel that you may as well go with the direct approach, as you’ve exposed yourself to adversarial optimization.
    • Most promising variant:
      • Using a safe LLM to contextually annotate the nuggets of wisdom with notes on how to apply them seems like the most viable variant of this approach (a rough retrieval sketch appears after this list).
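
As a rough illustration of this variant, the sketch below retrieves nuggets from a fixed set by embedding similarity and returns them together with pre-written notes on when each one applies. The nuggets, annotations, and embedding model are hypothetical; a ranker trained on past user ratings, as described in the approach summary, could replace the plain cosine-similarity scoring.

```python
# Hypothetical sketch: retrieve a few nuggets of wisdom, each annotated with a note
# about the kind of context in which it might apply.
from sentence_transformers import SentenceTransformer, util

NUGGETS = [  # placeholder nuggets and annotations
    {"text": "Sleep on any decision you cannot reverse.",
     "context_note": "Most useful for high-stakes, irreversible choices."},
    {"text": "Ask what you would advise a friend in the same situation.",
     "context_note": "Most useful when emotions may be distorting your judgement."},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
nugget_embeddings = encoder.encode([n["text"] for n in NUGGETS], convert_to_tensor=True)


def retrieve_nuggets(situation: str, top_k: int = 2):
    """Return the top_k nuggets most similar to the described situation."""
    query = encoder.encode(situation, convert_to_tensor=True)
    scores = util.cos_sim(query, nugget_embeddings)[0]
    ranked = sorted(zip(NUGGETS, scores.tolist()), key=lambda pair: pair[1], reverse=True)
    return [{"nugget": n["text"], "when_it_applies": n["context_note"], "score": round(s, 3)}
            for n, s in ranked[:top_k]]


print(retrieve_nuggets("I'm deciding whether to quit my job to start a company."))
```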

Appendix on the Imitation Learning Approach:

Because the imitation learning approach is difficult to understand, I’ve added answers to three of the most common questions. I’ll explain this approach in a lot more detail in my upcoming Less Wrong sequence:

  • Isn’t this approach a bit obvious?:
    • Yes. That doesn’t mean that it wouldn’t be effective though.
  • What kind of figures are you talking about?:
    • It depends on the exact use case, but there’s wisdom in all kinds of places. There are wise philosophers, wise scientists, wise policy advisors, wise communicators, etc.
  • Isn’t the subjectivity in selecting figures bad?
    • The subjectivity is already there in the direct approach. The fact that we’re selecting figures just makes this more obvious because humans are highly attuned to anything involving status. Making this more salient is good. These are big decisions and people should be aware of this subjectivity.
    • Different actors can choose to make use of different subsets of figures. Whilst we could produce multiple different AIs with the direct approach, imitation learning has the advantage of being extremely legible in how the result is produced. As soon as we move to some kind of averaging, we have to deal with the question of how the sample was produced.
    • Further, if there are multiple projects, each project can make their own selection
    • After we’ve chosen some initial figures, we can take advantage of their wisdom to help us figure out who we’ve missed or what our blindspots are.
    • If we end up simply using these figures to help us train a wise AI, I would expect many of these choices to wash out, and many different figures, all of whom are wise, would make similar recommendations. Running self-consistency on the AI might further remove some of these differences (a rough sketch of consulting multiple advisors appears after this list).
    • Framing this slightly differently, if we use techniques like debate well, poor choices are unlikely to have much of an impact.
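
As a rough illustration of consulting multiple imitation advisors and checking them against one another, the sketch below queries several advisor adapters with the same question and flags pairs whose answers look very different. The base checkpoint, adapter paths, and the crude similarity threshold are hypothetical placeholders; a real system would need a much better measure of agreement.

```python
# Hypothetical sketch: ask several imitation advisors the same question and flag disagreement.
from difflib import SequenceMatcher
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "my-org/base-after-quality-finetunes"  # hypothetical checkpoint
ADAPTERS = {"advisor_a": "adapters/advisor_a", "advisor_b": "adapters/advisor_b"}  # placeholders

tokenizer = AutoTokenizer.from_pretrained(BASE)


def ask_advisor(adapter_path: str, question: str) -> str:
    """Generate an answer from one advisor's LoRA adapter."""
    model = PeftModel.from_pretrained(AutoModelForCausalLM.from_pretrained(BASE), adapter_path)
    prompt = f"[context: serious discussion]\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


def consult(question: str) -> dict:
    """Collect all advisors' answers and print a warning for pairs that diverge sharply."""
    answers = {name: ask_advisor(path, question) for name, path in ADAPTERS.items()}
    names = list(answers)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            similarity = SequenceMatcher(None, answers[names[i]], answers[names[j]]).ratio()
            if similarity < 0.3:  # crude disagreement flag
                print(f"{names[i]} and {names[j]} appear to disagree (similarity {similarity:.2f})")
    return answers


print(consult("Should our lab pause deployment while we re-evaluate our safety case?"))
```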

Notes

  1. It isn’t clear if this is actually the case. See the discussion here
  2. Likely comparable to the extent that a model which was able to prioritize different imitation agents would be able to optimize against you.
