Edit February 2024: Now I think maybe we can't do better.

Lately, "alignment" means "follow admin rules and do what users mean". Admins can put in rules like "don't give instructions for bombs, hacking, or bioweapons" and "don't take sides in politics". As AI gets more powerful, we can use it to write better rules and test for loopholes. And "do what users mean" implies a preference for low impact: it won't factory-reset your computer when you ask it to fix the font size.

In other words, the plan has two steps:

  1. Make a machine that can do literally anything
  2. Configure it to do good things

I don't like this plan very much because it puts power in the hands of the few and has big security liabilities.


Imagine you're on the team of three people that writes the constitution for the training run for the gigaAI1. Or you're the one actually typing it in. Or you do infra and write all the code that strings pass through on the way to and from the training cluster.

Might you want to take advantage of the opportunity? I can think of a few subtle things I might add.

If I think about the people in my life I know well, I can name only one whom I would nominate for the constitution-writing position.

One solution I've heard for this is to use the powerful DWIM (do-what-I-mean) machine to construct a second, unalterable machine, then delete the first machine. You might call the time between creating the first and second machines the temptation mountain (picture a plot where X is time and Y is the maximum power of any individual person). You then want to minimize the area under this mountain. It is difficult to shrink the mountain very much, though.
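One rough way to write down the size of the temptation mountain, purely as an illustrative sketch: the symbols P(t), P_0, t_1, t_2, and P_max are my own assumptions and not something from the original proposal. Let P(t) be the maximum power held by any individual at time t, P_0 the baseline before the first machine exists, t_1 the moment the first machine is created, and t_2 the moment the second machine is locked in and the first is deleted.

```latex
% Hypothetical notation, introduced only for illustration.
% A is the size of the temptation mountain in the time/power plot.
\[
  A \;=\; \int_{t_1}^{t_2} \bigl( P(t) - P_0 \bigr)\, dt
  \;\le\; (t_2 - t_1)\,\bigl( P_{\max} - P_0 \bigr),
  \qquad P_{\max} = \max_{t_1 \le t \le t_2} P(t).
\]
```

Under these assumptions there are only two levers: shorten the window t_2 - t_1, or lower the peak P_max. Neither looks easy to push very far, since the first machine has to be powerful enough, for long enough, to build the second.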

Security risks

If you're 90% through your DWIM AI development and it seems very promising, then you must be very careful with your code and data. If someone else gets your secrets, then they can make it do what they mean, and they might mean for it to do something quite different. You should be careful with your team members too, since threats, blackmail, and bribes are all quite cheap and effective. (It is hard to be careful with them, though, since they put "on constitution AI team" on their LinkedIn profiles, their phone numbers are practically public information, and they live in a group house with employees from your competitor.)

If you're doing open source AI development and start having big successes, then you might start looking at the GitHub profiles of your stargazers and wondering what they mean for it to do.

If altLab is also getting close to AGI then you have to figure out whether your stop-clause means you help them or their stop-clause means they help you.

Doing better

Could we come up with a plan so rock solid that we'd be comfortable giving the code, the data, or the team to anyone? Something like how you could give Gandhi to Hitler without stressing.

For lack of a better term[1], let's call such a system inextricably kind AI. The key innovation required might be a means to strongly couple capabilities and benevolence. This is in contrast to the current regime where we build Do Anything Now then tell it to do good.

Such an innovation would unblock coordination and ease security burdens, and it would not require the board of directors and the engineers to be saints. It would also give the open source community something to work on that doesn't give AI safety folks gray hairs.

To be clear, I don't have a good idea for how to achieve this. I'm not certain it is even possible.

However, it's not obviously impossible to me. And I would be much more at ease if all the AI labs were building inextricably kind AI instead of something that can be stolen and directed to arbitrary goals.[2]

Let's at least give it a shot.

  1. Please let me know if there's a better or pre-existing term.

  2. Apple has a pretty serious security team, but they can't seem to help getting JS and kernel exploits on 3 billion phones and computers every few weeks.



2 comments

I agree that it would be nice to get to a place where it is known (technically) how to make a kind AGI, but nobody knows how to make an unkind AGI. That's what you're saying, right? If so, yes, that would be nice, but I see it as extraordinarily unlikely. I’m optimistic that there are technical ways to make a kind AGI, but I’m unaware of any remotely plausible approach to doing so that could not be straightforwardly modified into a way to make an unkind AGI.

It is just as ambitious/implausible as you say. I am hoping to get out some rough ideas in my next post anyways.