(I have a health problem that is acting up and making it hard to type for long periods of time, so I'm condensing three posts into one.)

1. AI design as opportunity and obligation to address human safety problems

Many AI safety problems are likely to have counterparts in humans. AI designers and safety researchers shouldn't start by assuming that humans are safe (and then try to inductively prove that increasingly powerful AI systems are safe when developed/trained by and added to a team of humans), nor should they try to solve AI safety problems without considering whether their designs or safety approaches exacerbate human safety problems relative to other designs / safety approaches. At the same time, the development of AI may be a huge opportunity to address human safety problems, for example by transferring power from probably unsafe humans to de novo AIs that are designed from the ground up to be safe, or by assisting humans' built-in safety mechanisms (such as moral and philosophical reflection).

2. A hybrid approach to the human-AI safety problem

Idealized humans can be safer than actual humans. An example of an idealized human is a human whole-brain emulation that is placed in a familiar, safe, and supportive virtual environment (along with other humans for socialization), so that they are not subject to problematic "distributional shifts" nor vulnerable to manipulation from other powerful agents in the physical world. One way to take advantage of this is to design an AI that is ultimately controlled by a group of idealized humans (for example, one whose terminal goal refers to the reflective equilibrium of the idealized humans), but this seems impractical due to computational constraints. One idea for getting around this is to give the AI a hint: that it can serve that terminal goal by learning from actual humans as an instrumental goal. This learning can include imitation learning, value learning, or other kinds of learning. Then, even if the actual humans become corrupted, the AI has a chance of becoming powerful enough to discard its dependence on actual humans and recompute its instrumental goals directly from its terminal goal. (Thanks to Vladimir Nesov for giving me a hint that led to this idea.)
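To make the structure concrete, here is a minimal toy sketch in Python. Everything in it is a hypothetical placeholder invented for illustration, and the genuinely hard parts (emulating idealized humans, value learning, planning) are stubbed out:

```python
# Toy sketch of the hybrid goal structure described above. Every name is a
# hypothetical placeholder; nothing here corresponds to a real system.

EVAL_COST = 10**9  # assumed compute needed to evaluate the terminal goal directly


def idealized_reflective_equilibrium(outcome: str) -> float:
    """Stub for the terminal goal: how much the idealized humans (emulations
    in a safe virtual environment) would endorse this outcome on reflection.
    Assumed intractable for a young AI."""
    raise NotImplementedError("intractable until the AI is powerful enough")


def policy_learned_from_actual_humans(observation: str) -> str:
    """Stub for the instrumental policy: imitation learning, value learning,
    etc., from actual (possibly corruptible) humans."""
    return f"do what actual humans would advise, given {observation!r}"


def plan_from_terminal_goal(observation: str) -> str:
    """Stub for planning directly against the terminal goal, once the
    dependence on actual humans can be discarded."""
    return f"optimize the idealized humans' equilibrium, given {observation!r}"


def act(observation: str, compute_budget: int) -> str:
    if compute_budget >= EVAL_COST:
        # Powerful enough: recompute instrumental goals directly from the
        # terminal goal, so corruption of the actual humans no longer
        # propagates into behavior.
        return plan_from_terminal_goal(observation)
    # Early on: follow the designer-supplied hint that learning from actual
    # humans instrumentally serves the terminal goal.
    return policy_learned_from_actual_humans(observation)
```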

3. Several approaches to AI alignment will differentially accelerate kinds of intellectual progress that are analogous to solving problems low in the polynomial hierarchy.

This is bad if the "good" kind of intellectual progress (such as philosophical progress) is disproportionately high in the hierarchy or outside PH entirely, or if we just don't know how to formulate such progress as problems low in PH. I think this issue needs to be on the radar of more AI safety researchers.
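For reference, the hierarchy meant here is the standard complexity-theoretic polynomial hierarchy, defined by

$$\Sigma_0^p = \Pi_0^p = \mathrm{P}, \qquad \Sigma_{k+1}^p = \mathrm{NP}^{\Sigma_k^p}, \qquad \Pi_{k+1}^p = \mathrm{coNP}^{\Sigma_k^p}, \qquad \mathrm{PH} = \bigcup_k \Sigma_k^p.$$

"Low in the hierarchy" means few quantifier alternations: a $\Sigma_1^p = \mathrm{NP}$ question asks whether $\exists x\, \phi(x)$ for a polynomial-time checkable $\phi$, a $\Sigma_2^p$ question asks whether $\exists x\, \forall y\, \phi(x, y)$, and so on.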

(A reader might ask, "differentially accelerate relative to what?" An "aligned" AI could accelerate progress in a bad direction relative to a world with no AI, but still in a good direction relative to a world with only unaligned AI. I'm referring to the former here.)

Hope you get better soon!

humans’ built-in safety mechanisms (such as moral and philosophical reflection)

I feel the opposite way: reflection is divergent and unsafe, while low-level instincts are safety checks. For example, "my grand theory of society says we must kill a class of people to usher in a golden age, but my gut says killing so many people is wrong."

Hope you get better soon!

Thanks, I think I'm getting better. The exercises I was using to keep my condition under control stopped working as well as they used to, but I may have figured out a new routine that works better.

I feel the opposite way: reflection is divergent and unsafe, while low-level instincts are safety checks.

The problem as I see it is that your gut instincts were trained on a very narrow set of data, and therefore can't be trusted when you move outside of that narrow range. For example, suppose some time in the future you're playing a game on a very powerful computer and you discover that the game has accidentally evolved a species of seemingly sentient artificial life forms. Would you trust your gut to answer the following questions?

  1. Is it ok to shut down the game?
  2. Is it ok to shut down the game if you first save its state?
  3. Is it ok to shut down the game if you first save its state but don't plan to ever resume it?
  4. What if you can't save the game but you can recreate the creatures using a pseudorandom seed? Is it ok to shut down the game in that case? What if you don't plan to actually recreate the creatures?
  5. What if the same creatures are recreated in every run of the game and there are plenty of other copies of the game running in the galaxy (including some that you know will keep running forever)? Is it ok to shut down your copy then?
  6. What if there are two copies of the game running on the same computer with identical creatures in them? Is it ok to shut down one of the copies?
  7. What moral obligations do you have towards these creatures in general? For example are you obligated to prevent their suffering or to provide them with happier lives than they would "naturally" have?

(Note that you can't just "play it safe" and answer "no" to shutting down the game, because if the correct answer is "yes" and you don't shut down the game, you could end up wasting a lot of resources that can be used to create more value in the universe.)

I do agree that reflection can go wrong pretty often. But I don't see what we can do about that except to try to figure out how to do it better (including how to use AI to help us do it better).

It occurs to me that we should actually view system 1 and system 2 as safety checks for each other. See this comment for further discussion.

Attempted rewrite of this post in my language:

We might worry that if any particular human got a lot of power, or was able to think a lot faster, then there's a decent chance that they do something that we would consider bad. Perhaps power corrupts them, or perhaps they get so excited about the potential technologies they can develop that they do so without thinking seriously about the consequences. We now have the opportunity to design AI systems that operate more cautiously, that aren't prone to the same biases of reasoning and heuristics that we are, such that the future actually goes better than it would if we magically made humans more intelligent.

If it's too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones. Human values don't extrapolate well -- just look at the myriad answers that people give to the various hypotheticals like the trolley problem. So, it's better to learn from humans that are kept in a safe, familiar environment with all their basic needs taken care of. These are our idealized humans -- and in practice the AI system would learn preferences from real humans, since that should be a very good indicator of the preferences of idealized humans. But if real humans end up in situations they've never encountered before, and start contradicting each other a lot, then the AI system can ignore these "corrupted" values and try to infer what the idealized humans would say about the situation.

More generally, it seems important for our AI systems to help us figure out what we care about before we make drastic and irreversible changes to our environment, especially changes that prevent us from figuring out what we care about. For example, if we create a hedonic paradise where everyone is on side-effect-free recreational drugs all the time, it seems unlikely that we would check whether this is actually what we wanted. This suggests that we need to work on AI systems that differentially advance our philosophical capabilities relative to other capabilities, such as technological ones.

And my actual comment on the post:

It seems quite difficult to me to build AI systems that are safe, without having them rely on humans in some way to determine what is and is not good behavior. As a result, I'm pessimistic about our chances at creating AI systems that can address "human safety problems". Learning from idealized humans might address this to some extent, but in many circumstances I think I would trust the real humans who are actually in those circumstances more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).

I do think we want to have a general approach where we try to figure out how AIs and humans should reason, such that the resulting system behaves well. On the human side, this might mean that the human needs to be more cautious for longer timescales, or to have more epistemic and moral humility. Idealized humans can be thought of as an instance of this approach where rather than change the policy of humans in reality, we change their policy in a hypothetical.

Overall, I'm hoping that we can solve "human safety problems" by training the humans supervising the AI to not have those problems, because it sure does make the technical problem of aligning AI seem a lot easier.

Overall, I'm hoping that we can solve "human safety problems" by training the humans supervising the AI to not have those problems, because it sure does make the technical problem of aligning AI seem a lot easier.

Note that humans play two distinct roles in IDA, and I think it's important to separate them:

1. They are used to train corrigible reasoning, because we don't have a sufficiently good explicit understanding of corrigible reasoning. This role could be removed if e.g. MIRI's work on agent foundations were sufficiently successful.

2. The AI that we've trained is then tasked with the job of helping the user get what they "really" want, which is indirectly encoded in the user.

Solving safety problems for humans in step #1 is necessary to solve intent alignment. This likely involves both training (whether to reduce failure probabilities or to reach appropriate universality thresholds), and using humans in a way that is robust to their remaining safety problems (since it seems clear that most of them cannot be removed).

Solving safety problems for humans in step #2 is something else altogether. At this point you have a bunch of humans in the world who want AIs that are going to help them get what they want, and I don't think it makes that much sense to talk about replacing those humans with highly-trained supervisors---the supervisors might play a role in step #1 as a way of getting an AI that is trying to help the user get what they want, but can't replace the user themselves in step #2. I think relevant measures at this point are things like:

  • Learn more about how to deliberate "correctly," or about what kinds of pressures corrupt human values, or about how to avoid such corruption, or etc. If more such understanding is available, then both AIs and humans can use them to avoid corruption. In the long run AI systems will do much more work on this problem than we will, but a lot of damage could be done between now and the time when AI systems are powerful enough to obsolete all of the thinking that we do today on this topic.
  • Figure out how to build AIs that are better at tasks like "help humans clarify what they really want." Differential progress in this area could be a huge win. (Again, in the long run all of the AI-design work will itself be done by AI systems, but lots of damage could be dealt in the interim as we deploy human-designed AIs that are particularly good at manipulation relative to helping humans clarify and realize their "real" values.)
  • Change institutions/policy/environment to reduce the risk of value corruption, especially for users that don't have strong short-term preferences about how their short-term preferences change, or who don't have a clear picture of how their current choices will affect that. For example, the designers of potentially-manipulative technologies may be able to set defaults that make a huge difference in how humanity's values evolve.
  • You could also try to give highly wise people more influence over what actually happens, whether by acquiring resources, earning others' trust, or whatever.

Learning from idealized humans might address this to some extent, but in many circumstances I think I would trust the real humans who are actually in those circumstances more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).

This objection may work for some forms of idealization, but I don't think it holds up in general. If you think that experiencing X makes your views better, then your idealization can opt to experience X. The whole point of the idealization is that the idealized humans get to have the set of experiences that they believe are best for arriving at correct views, rather than a set of experiences that are constrained by technological feasibility / competitiveness constraints / etc.

(I agree that there can be some senses in which the idealization itself unavoidably "breaks" the idealized human---e.g. Vladimir Slepnev points out that an idealized human might conclude that they are most likely in a simulation, which may change their behavior; Wei Dai points out that they may behave selfishly towards the idealized human rather than towards the unidealized human, if selfishness is part of the values we'd converge to---but I don't think this is one of them.)

Note that humans play two distinct roles in IDA, and I think it's important to separate them

Yeah, I was talking entirely about the first role, thanks for the clarification.

This objection may work for some forms of idealization, but I don't think it holds up in general. If you think that experiencing X makes your views better, then your idealization can opt to experience X.

I agree now, I misunderstood what the point of the idealization in the original point was. (I thought it was to avoid having experiences that could cause value corruption, whereas it was actually about having experiences only when ready for them.)

Wei Dai points out that they may behave selfishly towards the idealized human rather than towards the unidealized human

I think this was tangentially related to my objection, for example that an idealized human would choose eg. not to be waterboarded even though that experience is important for deciding what to do for the unidealized human. Though the particular objection I wrote was based on a misunderstanding.

Note that humans play two distinct roles in IDA, and I think it’s important to separate them

This seems like a really important clarification, but in your article on corrigibility, you only ever talk about one human, the overseer, and the whole argument about "basin of attraction" seems to rely on having one human be the trainer for corrigibility, the target of corrigibility, and the source of preferences:

But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.

I think in that post, the overseer is training the AI to specifically be corrigible to herself, which makes the AI aligned to herself. I'm not sure what is happening in the new scheme with two humans. Is the overseer now still training the AI to be corrigible to herself, which produces an AI that's aligned to the overseer which then helps out the user because the overseer has a preference to help out the user? Or is the overseer training the AI to be corrigible to a generic user and then "plugging in" a real user into the system at a later time? If the latter, have you checked that the "basin of attraction" argument still applies? If it does, maybe that post needs to be rewritten to make that clearer?

This seems like a really important clarification, but in your article on corrigibility, you only ever talk about one human, the overseer, and the whole argument about "basin of attraction" seems to rely on having one human be the trainer for corrigibility, the target of corrigibility, and the source of preferences:

Corrigibility plays a role both within amplification and in the final agent.

The post is mostly talking about the final agent without talking about IDA specifically.

The section titled Amplification is about the internal dynamics, where behavior is corrigible by the question-asker. It doesn't seem important to me that these be the same. Corrigibility to the overseer only leads to corrigibility to the end user if the overseer is appropriately motivated. I usually imagine the overseer as something like a Google engineer and the end user as something like a visitor to google.com today. The resulting agent will likely be imperfectly corrigible because of the imperfect motives of Google engineers (this is pretty similar to human relationships around other technologies).

I'm no longer as convinced that corrigibility is the right abstraction for reasoning about internal behavior within amplification (but am still pretty convinced that it's a good way to reason about the external behavior, and I do think "corrigible" is closer to what we want than "benign" was). I've been thinking about these issues recently and it will be touched on in an upcoming post.

Is the overseer now still training the AI to be corrigible to herself, which produces an AI that's aligned to the overseer which then helps out the user because the overseer has a preference to help out the user?

This is basically right. I'm usually imagining the overseer training a general question-answering system, with the AI trained to be corrigible to the question-asker. We then use that question-answering system to implement a corrigible agent, by using it to answer questions like "What should the agent do next?" (with an appropriate specification of 'should'), which is where external corrigibility comes in.

This is basically right. I’m usually imagining the overseer training a general question-answering system, with the AI trained to be corrigible to the question-asker.

This confuses me because you're saying "basically right" to something but then you say something that seems very different, and which actually seems closer to the other option I was suggesting. Isn't it very different for the overseer to train the AI to be corrigible to herself as a specific individual, versus training the AI to be corrigible to whoever is asking the current question? Since the AI can't know who is asking the current question (which seems necessary to be corrigible to them?) without that being passed in as additional information, this seems closer to 'overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time'.

I also have a bunch of other confusions, but it's probably easier to talk about them after resolving this one.

(Also, just in case, is there a difference between "corrigible to" and "corrigible by"?)

this seems closer to 'overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time'.

The overseer asks the question "what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?", and indeed even at training time the overseer is training the system to answer this question. There is no swapping out at test time. (The distributions at train and test time are identical, and I normally talk about the version where you keep training online.)

When the user asks a question to the agent it is being answered by indirection, by using the question-answering system to answer "what should the agent do [in the situation when it has been asked question Q by the user]?"
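A toy sketch of this indirection, with every name hypothetical and the trained question-answering system reduced to a stub:

```python
def qa_system(question: str) -> str:
    """Stub for the corrigible question-answering system trained by the
    overseer; a real version would be a learned model."""
    return f"<answer to: {question}>"


def agent_act(user: str, user_question: str) -> str:
    # The agent never answers the user directly. It routes every query
    # through the question-answerer, asking what it *should* do in the
    # situation where this user asked this question (with an appropriate
    # specification of "should" baked in during training).
    meta_question = (
        f"What should the agent do in the situation where it has been "
        f"asked {user_question!r} by the user {user}?"
    )
    return qa_system(meta_question)


# For example, the Google customer Alice from the discussion above:
print(agent_act("Alice", "How do I reset my password?"))
```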

The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?“

Ok, I've been trying to figure out what would make the most sense and came to the same conclusion. I would also note that this "corrigible" is substantially different from the "corrigible" in "the AI is corrigible to the question asker" because it has to be an explicit form of corrigibility that is limited by things like corporate policy. For example, if Alice asks "What are your design specs and source code?" or "How do I hack into this bank?" then the AI wouldn't answer even though it's supposed to be "corrigible" to the user, right? Maybe we need modifiers to indicate which corrigibility we're talking about, like "full corrigibility" vs "limited corrigibility"?

ETA: Actually, does it even make sense to use the word "corrigible" in "to be corrigible to the Google customer Alice it is currently working for"? Originally "corrigible" meant:

A corrigible agent experiences no preference or instrumental pressure to interfere with attempts by the programmers or operators to modify the agent, impede its operation, or halt its execution.

But obviously Google's AI is not going to allow a user to "modify the agent, impede its operation, or halt its execution". Why use "corrigible" here instead of different language altogether, like "helpful to the extent allowed by Google policies"?

(Also, just in case, is there a difference between "corrigible to" and "corrigible by"?)

No. I was just saying "corrigible by" originally because that seems more grammatical, and sometimes saying "corrigible to" because it seems more natural. Probably "to" is better.

It seems quite difficult to me to build AI systems that are safe, without having them rely on humans in some way to determine what is and is not good behavior. As a result, I’m pessimistic about our chances at creating AI systems that can address “human safety problems”.

The second sentence doesn't seem to follow logically from the first. There are many ways that an AI can rely on humans to determine what is and is not good behavior, some of which put more stress on human safety than others. At the most ideal (least reliant on humans being safe) end of the spectrum, we could solve metaphilosophy in a white-box way, and have the AI extract true/normative values from humans using its own philosophical understanding of what that means.

Aside from the "as a result" part, I'm also pessimistic, but I don't see a reason to be so pessimistic that the current level of neglect would be justified.

in many circumstances I think I would trust the real humans who are actually in those circumstances more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).

This is not obvious to me. Can you elaborate or give an example?

Idealized humans can be thought of as an instance of this approach where rather than change the policy of humans in reality, we change their policy in a hypothetical.

Idealized humans can be safer not just because they have better policies, but also because they have safer environments.

Overall, I’m hoping that we can solve “human safety problems” by training the humans supervising the AI to not have those problems

Can you explain how one might train a human not to have the second problem in this post?

ETA: Also, it seems like you're saying that a small group of highly trained supervisors will act as gatekeepers to all AIs, and use their judgement to, for example, deny requests for technologies that they don't think humans are ready for yet. Is that right?

The second sentence doesn't seem to follow logically from the first.

True. I don't know exactly how to convey this intuition. I think the part that seems very hard to do is to get the AI system to outperform humans at figuring out what is good to do. But this is not exactly true -- certainly at chess an AI system can outperform us at choosing the best action. So maybe we could say that it's very hard to get the AI system to outperform humans at figuring out the best criteria for figuring out what is good to do. But I don't really believe that either -- I expect that AI systems will be able to do philosophy better than we will at some point.

Coming at it from another angle, I think I would say that thousands of years of human intellectual effort has been put into trying to figure out what is good to do, without much consensus. It seems very hard to build an AI system that is going to outperform all of that intellectual effort and be completely correct. (This very directly applies to the most ideal scenario that you laid out.)

This is not obvious to me. Can you elaborate or give an example?

Skin in the Game (SSC)

A particular example from the post: "A really good example of this is Hitchens’ waterboarding – he said he thought it was an acceptable interrogation technique, some people offered to waterboard him to see if he changed his mind, and he quickly did. I’m fascinated by this incident because it’s hard for me to understand, in a Mary’s Room style sense, what he learned from the experience. He must have already known it was very unpleasant – otherwise why would you even support it as useful for interrogation? But somehow there’s a difference between having someone explain to you that waterboarding is horrible, and undergoing it yourself."

Also all of the worries about paternalism in the charity sector.

Also there's the simple distributional shift argument: the circumstances that real humans are in form a distributional shift for the idealized humans, so we shouldn't trust their judgment.

Idealized humans can be safer not just because they have better policies, but also because they have safer environments.

I was imagining that the better policies are a result of the safer environments. (Like, if you're using something like IRL to infer values, then the safe environments cause different behavior and answers to questions which cause better inferred values.)

Changed "we change their policy in a hypothetical" to "we change their policy in a hypothetical by putting them in safer environments".

Can you explain how one might train a human not to have the second problem in this post?

Good epistemics goes some of the way, but I mostly wasn't thinking about those sorts of situations. I'll comment on that post as well, let's discuss further there.

ETA: Also, it seems like you're saying that a small group of highly trained supervisors will act as gatekeepers to all AIs, and use their judgement to, for example, deny requests for technologies that they don't think humans are ready for yet. Is that right?

Basically yes. I was imagining that each AI has its own highly trained supervisor(s). I'm not sure that the supervisors will be denying requests for technologies, more that the AI will learn patterns of reasoning and thinking from the supervisors that cause it to actually evaluate whether humans are ready for some particular technology rather than just giving it to humans without thinking about the safety implications. (This does require that the supervisors would make a similar decision if the problem was posed to them.)

Coming at it from another angle, I think I would say that thousands of years of human intellectual effort has been put into trying to figure out what is good to do, without much consensus. It seems very hard to build an AI system that is going to outperform all of that intellectual effort and be completely correct.

This doesn't seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans? And even if the white-box metaphilosophical approach doesn't work out, there are likely other ways to improve upon human performance, even if the result isn't "completely correct". (Did you perhaps interpret "address human safety problems" as "fully solve human safety problems"? I actually picked the word "address" over "solve" so as to not imply that a full solution must be possible.)

Personally, I'm pessimistic because there probably won't be enough time to figure out and implement these ideas in a safe way before someone builds an unaligned or aligned-but-unsafe AI.

Also there’s the simple distributional shift argument: the circumstances that real humans are in form a distributional shift for the idealized humans, so we shouldn’t trust their judgment.

But the idealized humans are free to broaden their "distribution" (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won't be forced to deal with strange new inputs before they are ready.

I was imagining that each AI has its own highly trained supervisor(s).

And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right? I guess this is another approach that should be considered, but don't you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?

This doesn't seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans?

I meant more that a lot of effort has gone into figuring out how to normatively talk about what "good" is, and it seems necessary to figure that out if we want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values. Now admittedly in the past people weren't specifically thinking about the problem of how to use a lot of computation to do this, but many of the thought experiments they propose seem to be of a similar nature (e.g., suppose that we were logically omniscient and knew the consequences of our beliefs).

It's possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.

But the idealized humans are free to broaden their "distribution" (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won't be forced to deal with strange new inputs before they are ready.

Thanks, that clarifies things a lot. I take back the statement about trusting real humans more.

And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right?

Probably? I haven't thought about it much.

don't you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?

I agree this seems potentially problematic, I'm not sure that it's a deal-breaker.

I meant more that there has been a lot of effort into figuring out how to normatively talk about what “good” is, and it seems necessary for us to figure that out if you want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values.

I don't think it's necessary for us to figure out what "good" is, instead my "white-box metaphilosophical approach" is to figure out what "doing philosophy" is, program/teach an AI to "do philosophy" which would let it figure out what "good" is on its own. I think there has not been a lot of effort invested into that problem so there's a somewhat reasonable chance that it might be tractable. It seems worth investigating especially given that it might be the only way to fully solve the human safety problem.

It’s possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.

To clarify, this is a separate approach from the white-box metaphilosophical approach. And I'm not sure whether I want the humans to discover their values through experience or through deliberation. I guess they should first deliberate on that choice and then do whatever they decide is best.

(I'm not sure if you're just checking your own understanding of the post, or if you're offering suggestions for how to express the ideas more clearly, or if you're trying to improve the ideas. If the latter two, I'd also welcome more direct feedback pointing out issues in my use of language or my ideas.)

I think the first paragraph of your rewrite is missing the "obligation" part of my post. It seems that even aligned AI could exacerbate human safety problems (and make the future worse than if we magically or technologically made humans more intelligent) so I think AI designers at least have an obligation to prevent that.

For the second paragraph, I think under the proposed approach, the AI should start inferring what the idealized humans would say (or calculate how it should optimize for the idealized humans' values-in-reflective-equilibrium, depending on details of how the AI is designed) as soon as it can, and not wait until the real humans start contradicting each other a lot, because the real humans could all be corrupted in the same direction. Even before that, it should start taking measures to protect itself from the real humans (under the assumption that the real humans might become corrupt at any time in a way that it can't yet detect). For example, it should resist any attempts by the real humans to change its terminal goal.

(I'm not sure if you're just checking your own understanding of the post, or if you're offering suggestions for how to express the ideas more clearly, or if you're trying to improve the ideas. If the latter two, I'd also welcome more direct feedback pointing out issues in my use of language or my ideas.)

Sorry, I should have explained. With most posts, there are enough details and examples that when I summarize the post for the newsletter, I'm quite confident that I got the details mostly right. This post was short enough that I wasn't confident this was true, so I pasted it here to make sure I wasn't changing the meaning too much.

I suppose you could think of this as a suggestion on how to express the ideas more clearly to me/the audience of the newsletter, but I think that's misleading. It's more that I try to use consistent language in the newsletter to make it easier for readers to follow, and the language you use is different from the language I use. (For example, you have short words like "human safety problems" for a large class of concepts, each of which I spell out in full sentences with examples.)

I think the first paragraph of your rewrite is missing the "obligation" part of my post. It seems that even aligned AI could exacerbate human safety problems (and make the future worse than if we magically or technologically made humans more intelligent) so I think AI designers at least have an obligation to prevent that.

Good point, added it in the newsletter summary.

For the second paragraph, I think under the proposed approach, the AI should start inferring what the idealized humans would say (or calculate how it should optimize for the idealized humans' values-in-reflective-equilibrium, depending on details of how the AI is designed) as soon as it can, and not wait until the real humans start contradicting each other a lot, because the real humans could all be corrupted in the same direction. Even before that, it should start taking measures to protect itself from the real humans (under the assumption that the real humans might become corrupt at any time in a way that it can't yet detect). For example, it should resist any attempts by the real humans to change its terminal goal.

Hmm, that's what I was trying to say. I've changed the last sentence of that paragraph to:

But if the idealized humans begin to have different preferences from real humans, then the AI system should ignore the "corrupted" values of the real humans.

If it's too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones.

My interpretation of how the term is used here and elsewhere is that idealized humans, taken in themselves and ignoring costs, are usually worse than real ones. For example, they could be based on predictions of human behavior that are not quite accurate, or they may only remain sane for an hour of continuous operation from some initial state. They are only better because they can be used in situations where real humans can't be used, such as in an infinite HCH, an indirect normativity style definition of AI goals, or a simulation of how a human develops when exposed to a certain environment (training). Their nature as inaccurate predictions may make them much more computationally tractable and actually available in situations where real humans aren't, and so more useful when we can compensate for the errors. So a better term might be "abstract humans" or "models of humans".

If these artificial environments with models of humans are good enough, they may also be able to bootstrap more accurate models of humans and put them into environments that produce better decisions, so that the initial errors in prediction won't affect the eventual outcomes.

Perhaps Wei Dai could clarify, but I thought the point of idealized humans was to avoid problems of value corruption or manipulation, which makes them better than real ones.

I agree that idealized humans have the benefit of making things like infinite HCH possible, but that doesn't seem to be a main point of this post.

I thought the point of idealized humans was to avoid problems of value corruption or manipulation

Among other things, yes.

which makes them better than real ones

This framing loses the distinction I'm making. They are more useful when taken together with their environment, but not necessarily better in themselves. These are essentially real humans who behave better because of the environments where they operate and the lack of direct influence from the outside world, which in some settings could also apply to the environment where they were raised. But they share the same vulnerabilities (to outside influence or unusual situations) as real humans, which can affect them if they are taken outside their safe environments. And in themselves, when abstracted from their environment, they may be worse than real humans, in the sense that they make less aligned or correct decisions, if the idealized humans are inaccurate predictions of hypothetical behavior of real humans.

Yeah, I agree with all of this. How would you rewrite my sentence/paragraph to be clearer, without making it too much longer?

Please reword your last idea. There is a possible aligned AI that is biased in its research and will ignore people telling it so?

I think that section will only make sense if you're familiar with the concept of differential intellectual progress. The wiki page I linked to is a bit outdated, so try https://concepts.effectivealtruism.org/concepts/differential-progress/ and its references instead.

Reading the link and some reference abstracts, I think my last comment already had that in mind. The idea here is that a certain kind of AI would accelerate a certain kind of progress more than another, because of the approach we used to align it, and on reflection we would not want this. But surely if it is aligned, and therefore corrigible, this should be no problem?

Here's a toy example that might make the idea clearer. Suppose we lived in a world that hasn't invented nuclear weapons yet, and someone creates an aligned AI that is really good at developing nuclear weapon technology and only a little bit better than humans on everything else. Even though everyone would prefer that nobody develops nuclear weapons, the invention of this aligned AI (if more than one nation had access to it, and "aligned" means aligned to the user) would accelerate the development of nuclear weapons relative to every other kind of intellectual progress and thereby reduce the expected value of the universe.

Does that make more sense now?

So you want to align the AI with us rather than its user by choosing the alignment approach it uses. If it's corrigible towards its user, won't it acquire the capabilities of the other approach in short order to better serve its user? Or is retrofitting the other approach also a blind spot of your proposed approach?

If it’s corrigible towards its user, won’t it acquire the capabilities of the other approach in short order to better serve its user?

Yes, that seems like an issue.

Or is retrofitting the other approach also a blind spot of your proposed approach?

That's one possible solution. Another one might be to create an aligned AI that is especially good at coordinating with other AIs, so that these AIs can make an agreement with each other to not develop nuclear weapons before they invent the AI that is especially good at developing nuclear weapons. (But would corrigibility imply that the user can always override such agreements?) There may be other solutions that I'm not thinking of. If all else fails, it may be that the only way to avoid AI-caused differential intellectual progress in a bad direction is to stop the development of AI.