We may soon build superintelligent AI. Such AI poses an existential threat to humanity, and all life on Earth, if it is not aligned with our flourishing. Aligning superintelligent AI is likely to be difficult because smarts and values are mostly orthogonal and because Goodhart effects are robust, so we can neither rely on AI to naturally decide to be safe on its own nor can we expect to train it to stay safe. We stand a better chance of creating safe, aligned, superintelligent AI if we create AI that is "wise", in the sense that it knows how to do the right things to achieve desired outcomes and doesn't fall into intellectual or optimization traps.
Unfortunately, I'm not sure how to create wise AI, because I'm not exactly sure what it is to be wise myself. My current, high-level plan for creating wise AI is to first get wiser myself, then help people working on AI safety to get wiser, and finally hope that wise AI safety researchers can create wise, aligned AI that is safe.
For close to a decade now I've been working on getting wiser, and in that time I've figured out a bit of what it is to be wise. I'm starting to work on helping others get wiser by writing a book that explains some useful epistemological insights I picked up while pursuing wisdom and trying to solve a subproblem in AI alignment, and I have vague plans for another book that will be more directly focused on the wisdom I've found. So far I have only limited ideas about how to create wise AI, but I'll share my current thinking anyway in the hope that it inspires thoughts for others.
Why would wise AI safety researchers matter?
My theory is that it would be hard for someone to know what's needed to build a wise AI without first being wise themself, or at least having a wiser person to check ideas against. Wisdom clearly isn't sufficient for knowing how to build wise and aligned AI, but it does seem necessary under my assumptions, in the same way that it would be hard to develop a good decision theory for AI if one could not reason for oneself how to maximize expected value in games.
How did I get wiser?
Mostly by practicing Zen Buddhism, but also by studying philosophy, psychology, mathematics, and game theory to help me think about how to build aligned AI.
I started practicing Zen in 2017. I picked Zen with much reluctance after trying many other things that didn't work for me, or worked for a while and then had nothing else to offer me. Things I tried included Less Wrong style rationality training, therapy, secular meditation, and various positive psychology practices. I even tried other forms of Buddhism, but Zen was the only tradition I felt at home with.
Consequently, my understanding of wisdom is biased by Zen, but I don't think Zen has a monopoly on wisdom, and other traditions might produce different but equally useful theories of wisdom from the one I discuss below.
What does it mean to be wise?
I roughly define wisdom as doing the right thing at the right time for the right reasons. This definition puts the word "right" through a strenuous workout, so let's break it down.
The "right thing" is doing that which causes outcomes that we like upon reflection. The "right time" is doing the right thing when it will have the desired impact. And the "right reasons" is having an accurate model of the world that correctly predicts the right thing and time.
How can the right reasons be known?
The straightforward method is to have true beliefs and use correct logic. Alas, we're constantly uncertain about what's true and, in real-world scenarios, the logic becomes computationally intractable, so instead we often rely on heuristics that lead to good outcomes and avoid optimization traps. That we rely on heuristics doesn't mean facts and logic are not useful, only that heuristics are necessary to fill in their gaps in most cases. This need for heuristics suggests that the root of wisdom is finding the right heuristics: those that will generate good reasoning, which will in turn generate good actions that lead to good outcomes.
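To make the computational point concrete, here's a toy sketch (my own construction, not anything from a wisdom tradition; the action space, horizon, and scoring function are all invented for illustration). Exhaustively maximizing even a tiny toy objective means enumerating exponentially many plans, while a simple heuristic rule does well in a handful of steps:

```python
import itertools

ACTIONS = [-1, 0, 1]   # toy action space
HORIZON = 12           # plan length

def score(plan):
    """Arbitrary toy objective: reward staying near position 3."""
    pos, total = 0, 0.0
    for a in plan:
        pos += a
        total -= abs(pos - 3)  # penalty for distance from the target
    return total

# Exact method: enumerate all |ACTIONS|^HORIZON plans (3^12 = 531,441 here;
# hopeless for realistic horizons and action spaces).
best_plan = max(itertools.product(ACTIONS, repeat=HORIZON), key=score)

# Heuristic method: a simple rule ("move toward the target") that never
# enumerates plans at all, yet does well on this toy problem.
pos, heuristic_plan = 0, []
for _ in range(HORIZON):
    a = 1 if pos < 3 else (-1 if pos > 3 else 0)
    heuristic_plan.append(a)
    pos += a

print("exact search:  ", score(best_plan))
print("heuristic rule:", score(heuristic_plan))
```

The point isn't that a heuristic always matches exhaustive search; it's that in realistic settings exhaustive search isn't on the menu at all, so a good heuristic is the only game in town.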
What are some wise heuristics?
The two that have been the most important for me are humility and kindness. By humility I mean not deceiving myself into believing my own unjustified beliefs: for example, by keeping my beliefs well calibrated, not falling for the typical mind fallacy, and not seeing myself as separate from reality. By kindness I mean a willingness to take the actions that most benefit myself and others, rather than actions that optimize for something else, like minimizing the risk of personal suffering or maximizing the chance of personal gain.
I don't have a rigorous argument for these two heuristics, other than I tried a lot of different ones and these two have so far worked the best to help me find the right things, times, and reasons. Other heuristics might work better for others, or might be better for me, or might be better for everyone who adopts them. But, for what it's worth, humility and kindness are almost universally recommended by religions and other wisdom traditions, so I suspect that many others agree that these are two very useful wisdom heuristics, even if they are not the set of maximally useful ones.
How do we find wise heuristics?
Existing wisdom traditions, like religions, provide us with a large set of heuristics we can choose from. For any particular person looking to become wiser, the problem of finding wise heuristics is mostly one of experimentation. A person can adopt one or more heuristics, then see if those heuristics help them achieve the outcomes they desire upon reflection. If not, they can try again with different heuristics until they find ones that work well for them.
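If it helps to see the shape of this process, here's a loose analogy in Python, entirely my own invention (the heuristics' "true usefulness" numbers are made up): treat candidate heuristics like arms of a bandit, adopt one, observe noisy outcomes, and keep what works.

```python
import random

random.seed(0)

# Hypothetical "true" usefulness of each heuristic (unknown to the agent).
true_value = {"humility": 0.8, "kindness": 0.7, "stoicism": 0.5, "ambition": 0.4}

estimates = {h: 0.0 for h in true_value}
counts = {h: 0 for h in true_value}

for trial in range(1000):
    # Mostly exploit the best-looking heuristic, sometimes explore others.
    if random.random() < 0.1:
        h = random.choice(list(true_value))
    else:
        h = max(estimates, key=estimates.get)
    # Noisy outcome: did living by the heuristic lead to a desired result?
    outcome = true_value[h] + random.gauss(0, 0.3)
    counts[h] += 1
    estimates[h] += (outcome - estimates[h]) / counts[h]  # running mean

print(max(estimates, key=estimates.get))  # usually "humility"
```

Real life is messier, of course: outcomes are slow, noisy, and not reducible to a single number. But the trial-and-adopt structure is the same.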
We collectively benefit from these individual experiments. For millennia our ancestors ran similar experiments with their lives and have passed on to us the wisdom heuristics that most reliably served them well. Thus the set of heuristics they've provided us with has already been tested and found effective.
There may be other, better wisdom heuristics that cultural evolution could not find, but we should expect no more than marginal improvements over existing heuristics. That's because most wisdom heuristics are shared as simple, fuzzy concepts, so any "new" heuristics are not going to be clearly distinct from existing ones. For example, if someone were to propose I replace my heuristics of humility and kindness with meekness and goodwill, they would have to explain how meekness and goodwill are more than restatements of humility and kindness such that I would have reason to adopt them. Even if these heuristics were different, they would likely only be different on the margin, and would be unlikely to offer Pareto improvements over my existing heuristics.
Therefore I expect that, for the most part, we've already adequately explored the space of wise heuristics for humans, and the challenge of becoming wise is not so much in finding wise heuristics as it is in learning how to effectively apply the ones we already know about.
How do we train wise humans?
I don't know all the ways we might do it, but here's a rough outline of how we train wisdom in Zen.
A student works closely with a teacher over many years. That work includes thousands of hours of meditation, instruction from their teacher in both meditation and ethics, and putting that instruction into practice where the teacher can observe and offer corrections. One purpose of this work is to help the student wake up to the reality of life and free themselves from suffering ("enlightenment"), which requires the cultivation of "wisdom beyond wisdom". The result is that the teacher is said to transmit their wisdom mind-to-mind over the course of this training period.
I'm not going to claim that Zen teachers know how to psychically transmit thoughts directly from one mind to another. Instead, they are slowly and gradually training the student to develop the same generators of thought and action as they have. Thus, rather than training the student to appear wise and enlightened, they are attempting to remake the student into the type of person who is wise and enlightened, and thereby avoid Goodharting wisdom.
As will perhaps be obvious, this is not reliably possible, and yet Zen has managed to transmit its wisdom from one generation to the next without collapsing from runaway Goodharting, so its methods must work to some extent.
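Here's a toy numpy sketch of my own (not anything from Zen or from this post's sources) showing why matching outputs isn't enough: a student can agree with a teacher's visible answers perfectly while having a very different underlying distribution over responses, i.e., a different generator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_situations, n_responses = 5, 4

# The "teacher": a probability distribution over responses in each situation.
teacher = rng.dirichlet(np.ones(n_responses), size=n_situations)

# Student A: puts all its mass on the teacher's top answer in each situation.
student_a = np.zeros_like(teacher)
student_a[np.arange(n_situations), teacher.argmax(axis=1)] = 1.0

# Student B: actually matches the teacher's distributions (same generator).
student_b = teacher.copy()

top_answer_agreement = (student_a.argmax(1) == teacher.argmax(1)).mean()
generator_gap_a = np.abs(student_a - teacher).sum(axis=1).mean()
generator_gap_b = np.abs(student_b - teacher).sum(axis=1).mean()

print(f"Student A agrees with the teacher's top answers {top_answer_agreement:.0%} of the time,")
print(f"but its generator gap is {generator_gap_a:.2f} vs {generator_gap_b:.2f} for Student B.")
```

Output-only evaluation can't see that gap, and that gap is exactly the opening Goodharting walks through.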
How do we know when a human is wise?
Again, I'll answer from my experience with Zen.
It is generally up to a Zen teacher to testify to the wisdom of their students. This is achieved through the rituals of jukai and dharma transmission. In jukai, a student takes vows to behave ethically and follow the wisdom of the teachings, and it comes after a period of training to learn to live those vows. Later, a student may receive dharma transmission and permission to teach if their teacher deems them sufficiently capable and wise, creating a lineage of wisdom certification stretching back hundreds of years.
What's important about these processes is that they place the authority to recognize wisdom in another person. A person is not a reliable judge of their own wisdom, because it is too easy to self-deceive, so we instead rely on another person's judgement. And the people best positioned to recognize wisdom are those who are wise themselves, as judged in turn by others.
Thus we know someone is wise because other people, and wise people in particular, recognize wisdom in them.
Could we train wise AI the way we train wise humans?
Maybe. It would seem to mostly depend on how similar AI are to us, and to what extent training methods that work on humans would work on AI. Importantly, the answer will hinge on whether or not AI can be trained without Goodharting on wisdom, and whether or not we can tell if an AI has Goodharted wisdom rather than become actually wise.
If we can train an AI to be wise, that would imply an ability to automate the training, because in theory a wise AI could train other AIs to be wise in the same way wise humans train other humans. In such a scheme, we would only need to train a single wise AI, which could then pass on its wisdom to other AIs.
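Schematically, the scheme looks like the following (a deliberately crude sketch of my own; the fidelity number is an arbitrary placeholder, not an empirical claim): a verified-wise seed AI teaches the next AI, which teaches the next. The obvious failure mode is drift at each hand-off.

```python
def train_student(teacher_wisdom, fidelity=0.98):
    """Hypothetical hand-off: the student acquires most of the teacher's wisdom."""
    return teacher_wisdom * fidelity

wisdom = 1.0  # the seed AI, certified wise by human teachers
for generation in range(1, 11):
    wisdom = train_student(wisdom)
    print(f"generation {generation}: wisdom {wisdom:.3f}")
# After 10 hand-offs wisdom is ~0.82 in this toy model: fine if per-generation
# fidelity is high, ruinous if it isn't, since losses compound exponentially.
```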
Can wisdom recognition be automated?
I'm not sure. Automation generally requires the use of measurable, legible signals, but in Zen we mostly avoid legibility. Instead we rely on the conservative application of intuitive pattern matching, like a teacher closely observing a student for years. My theory is that this is a culturally evolved defense against Goodharting.
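If one did try to automate recognition, I imagine it would have to preserve that conservatism. As a sketch (the thresholds are entirely made up, not Zen doctrine), the recognizer's default answer should be "keep watching", with certification only after long observation and a near-perfect record:

```python
def certify(observations, min_years=5, per_year=100, min_rate=0.99):
    """Return 'wise', 'not wise', or 'keep watching' (the default).

    observations: booleans, one per observed action, True if it looked wise.
    """
    if len(observations) < min_years * per_year:
        return "keep watching"   # never certify on thin evidence
    rate = sum(observations) / len(observations)
    return "wise" if rate >= min_rate else "not wise"

print(certify([True] * 50))                    # 'keep watching'
print(certify([True] * 500))                   # 'wise'
print(certify([True] * 490 + [False] * 10))    # 'not wise'
```

The asymmetry is the point: the cheap answer is to withhold judgement, which is much harder to Goodhart than an eager metric.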
In theory, it might be possible to train an LLM to recognize wisdom in the same way a Zen teacher would, but that would require first finding a way to train this LLM in Zen with a teacher willing to give it dharma transmission. I'm doubtful that we can use training methods that work on humans, but they do offer inspiration. In particular, I suspect Zen's model of mind-to-mind transmission only works because, typical mind fallacy notwithstanding, some people really do think very similarly. When a student is similar enough to what the teacher was like prior to their own training, the teacher can train the student the same way they were trained and be largely successful. In short, training succeeds because student and teacher have sufficiently similar minds.
It's always possible that the path to superintelligent AI will pass through designs that closely mimic human minds, but that seems unlikely given we've made tremendous AI progress already with non-human-like designs. Thus it's more likely that, if we were to attempt training wisdom into AIs, we would need to look for ways to do it that would generalize to minds not like ours.
Can we use Reinforcement Learning to train wisdom?
I'm doubtful that we can successfully train wisdom using known RL techniques. The big risk with RL is Goodharting, and I don't see signs that we've found RL methods that are likely to be sufficiently robust to Goodharting under the extreme optimization pressure of superintelligent AI. At best we might be able to use RL to train wise AI that helps us build wise superintelligent AI, but it would be inadvisable to use RL to directly train a superintelligent AI to be wise.
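The failure mode I'm worried about is easy to demonstrate in miniature. Here's a standard toy construction (mine, not from this post's sources): optimize a proxy equal to the true objective plus heavy-tailed error, and as optimization pressure grows, the proxy score of the selected candidate climbs while its true value typically fails to keep pace.

```python
import numpy as np

rng = np.random.default_rng(0)

# More candidates considered = more optimization pressure on the proxy.
for n_candidates in [10, 1_000, 100_000]:
    true_value = rng.normal(size=n_candidates)
    # Proxy = true value + heavy-tailed measurement error.
    proxy = true_value + rng.standard_t(df=3, size=n_candidates)
    best = proxy.argmax()  # what a reward-maximizing optimizer selects
    print(f"n={n_candidates:>7}: proxy={proxy[best]:6.2f}, true={true_value[best]:6.2f}")
```

With heavy-tailed error, the proxy's argmax is increasingly an error-dominated outlier: the measured score soars while the thing we actually cared about stagnates. That divergence, scaled up to superintelligent optimization, is why I don't trust direct RL on a wisdom signal.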
How else might we create wise AI?
I don't have a solid answer, but given the ultimate goal is to create superintelligent AI that is aligned with human flourishing, there might be a way to use relatively wise AI to help us bootstrap a safe, superintelligent AI.
One way this could go is the following:
1. Train an LLM on a curated corpus of research papers and wisdom texts to produce a relatively wise AI assistant.
2. Have wise AI safety researchers work with this assistant to make progress on building safe, aligned AI.
3. Once enough progress has been made, use what was learned to safely create superintelligent AI.
This is an extremely hand-wavy plan, so I offer it only as inspiration. The actual implementation of such a plan will require resolving many difficult questions such as what research papers and wisdom texts the LLM should be trained on, which AI safety researchers are wise enough to succeed in making progress towards safe AI, and when enough progress will have been made that superintelligent AI can safely be created.
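For concreteness, here is the plan rendered as a loop, with every function left as a stub precisely because those are the unresolved questions. All the names here are hypothetical, and the stubs would have to be filled in before anything could actually run.

```python
def train_wisdom_assistant(corpus):
    """Stub: fine-tune an LLM on research papers plus wisdom texts."""
    ...

def assisted_safety_research(assistant, researchers):
    """Stub: wise researchers use the assistant to make alignment progress."""
    ...

def enough_progress(state):
    """Stub: the hardest open question -- when is it safe to proceed?"""
    return False  # placeholder; a real criterion is exactly what's missing

def bootstrap(corpus, researchers):
    # Loop until the (currently undefined) progress criterion is met.
    assistant = train_wisdom_assistant(corpus)
    state = None
    while not enough_progress(state):
        state = assisted_safety_research(assistant, researchers)
    return state  # hoped-for result: a recipe for safe superintelligent AI
```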
Doesn't this plan still risk Goodharting wisdom?
Yep! As with many problems in AI safety, the fundamental problem is preventing Goodharting. The hope I hold on to is that people sometimes manage to avoid Goodharting, such as when a Zen teacher successfully transmits wisdom to their students. Based on such examples of non-Goodharting training regimes, we may find a way to train superintelligent AI that stays safe because it doesn't succumb to Goodhart Curse or other forms of Goodharting.
What's next?
Personally, I'm going to continue to focus on helping myself and others get wiser. I seriously doubt it's the most impactful thing we can do to ensure the creation of safe, aligned, superintelligent AI, but it's the most impactful thing I expect to be able to make progress on right now.
As for you and other readers, I see a few paths forward:
Thanks to Owen Cotton-Barratt and Justis Mills for helpful comments on earlier drafts.