Wei Dai


My research methodology

Why did you write "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)

For example, at the beginning of modern cryptography you could describe the methodology as “Tell a story about how someone learns something about your secret” and that only gradually crystallized into definitions like semantic security (and still people sometimes retreat to this informal process in order to define and clarify new security notions).

Why doesn't the analogy with cryptography make you a lot more pessimistic about AI alignment, as it did for me?

The best case is that we end up with a precise algorithm for which we still can’t tell any failure story. In that case we should implement it (in some sense this is just the final step of making it precise) and see how it works in practice.

Would you do anything else to make sure it's safe, before letting it become potentially superintelligent? For example would you want to see "alignment proofs" similar to "security proofs" in cryptography? What if such things do not seem feasible or you can't reach very high confidence that the definitions/assumptions/proofs are correct?

Persuasion Tools: AI takeover without AGI or agency?

You mention "defenses will improve" a few times. Can you go into more detail about this? What kind of defenses do you have in mind? I keep thinking that in the long run, the only defenses are either to solve meta-philosophy so our AIs can distinguish between correct arguments and merely persuasive ones and filter out the latter for us (and for themselves), or go into an info bubble with trusted AIs and humans and block off any communications from the outside. But maybe I'm not being imaginative enough.

Alignment By Default

So similarly, a human could try to understand Alice's values in two ways. The first, equivalent to what you describe here for AI, is to just apply whatever learning algorithm their brain uses when observing Alice, and form an intuitive notion of "Alice's values". And the second is to apply explicit philosophical reasoning to this problem. So sure, you can possibly go a long way towards understanding Alice's values by just doing the former, but is that enough to avoid disaster? (See Two Neglected Problems in Human-AI Safety for the kind of disaster I have in mind here.)

(I keep bringing up metaphilosophy but I'm pretty much resigned to living in a part of the multiverse where civilization will just throw the dice and bet on AI safety not depending on solving it. What hope is there for our civilization to do what I think is the prudent thing, when no professional philosophers, even ones in EA who are concerned about AI safety, ever talk about it?)

Alignment By Default

To help me check my understanding of what you're saying, we train an AI on a bunch of videos/media about Alice's life, in the hope that it learns an internal concept of "Alice's values". Then we use SL/RL to train the AI, e.g., give it a positive reward whenever it does something that the supervisor thinks benefits Alice's values. The hope here is that the AI learns to optimize the world according to its internal concept of "Alice's values" that it learned in the previous step. And we hope that its concept of "Alice's values" includes the idea that Alice wants AIs, including any future AIs, to keep improving their understanding of Alice's values and to serve those values, and that this solves alignment in the long run.

Assuming the above is basically correct, this (in part) depends on the AI learning a good enough understanding of "improving understanding of Alice's values" in step 1. This in turn (assuming "improving understanding of Alice's values" involves "using philosophical reasoning to solve various confusions related to understanding Alice's values, including Alice's own confusions") depends on the AI being able to learn a correct or good enough concept of "philosophical reasoning" from unsupervised training. Correct?

If AI can learn "philosophical reasoning" from unsupervised training, GPT-N should be able to do philosophy (e.g., solve open philosophical problems), right?

Inaccessible information

or we need to figure out some way to access the inaccessible information that “A* leads to lots of human flourishing.”

To help check my understanding, your previously described proposal to access this "inaccessible" information involves building corrigible AI via iterated amplification, then using that AI to capture "flexible influence over the future", right? Have you become more pessimistic about this proposal, or are you just explaining some existing doubts? Can you explain in more detail why you think it may fail?

(I'll try to guess.) Is it that corrigibility is about short-term preferences-on-reflection and short-term preferences-on-reflection may themselves be inaccessible information?

I can pay inaccessible costs for an accessible gain — for example leaking critical information, or alienating an important ally, or going into debt, or making short-sighted tradeoffs. Moreover, if there are other actors in the world, they can try to get me to make bad tradeoffs by hiding real costs.

This seems similar to what I wrote in an earlier thread: "What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)" At the time I thought you proposed to solve this problem by using the user's "preferences-on-reflection", which presumably would correctly value all resources/costs. So again is it just that "preferences-on-reflection" may itself be inaccessible?

Overall I don’t think it’s very plausible that amplification or debate can be a scalable AI alignment solution on their own, mostly for the kinds of reasons discussed in this post — we will eventually run into some inaccessible knowledge that is never produced by amplification, and so never winds up in your distilled agents.

Besides the above, can you give some more examples of (what you think may be) "inaccessible knowledge that is never produced by amplification"?

(I guess one piece of overall feedback is that in most of the post you discuss inaccessible information without talking about amplification, and then quickly talk about amplification in the last section, but it's not easy to see how the two ideas relate without more explanations and examples.)

Possible takeaways from the coronavirus pandemic for slow AI takeoff

Thanks for writing this. I've been thinking along similar lines since the pandemic started. Another takeaway for me: Under our current political system, AI risk will become politicized. It will be very easy for unaligned or otherwise dangerous AI to find human "allies" who will help to prevent effective social response. Given this, "more competent institutions" has to include large-scale and highly effective reforms to our democratic political structures, but political dysfunction is such a well-known problem (i.e., not particularly neglected) that if there were easy fixes, they would have been found and applied already.

So whereas you're careful to condition your pessimism on "unless our institutions improve", I'm just pessimistic. (To clarify, I was already pessimistic before COVID-19, so it just provided more details about how coordination/institutions are likely to fail, which I didn't have a clear picture of. I'm curious if COVID-19 was an update for you as far as your overall assessment of AI risk. That wasn't totally clear from the post.)

On a related note, I recall Paul said the risk from failure of AI alignment (I think he said or meant "intent alignment") is 10%; Toby Ord gave a similar number for AI risk in his recent book; 80,000 Hours, based on interviews with multiple AI risk researchers, said "We estimate that the risk of a serious catastrophe caused by machine intelligence within the next 100 years is between 1 and 10%." Until now 1-10% seems to have been the consensus view among the most prominent AI risk researchers. I wonder if that has changed due to recent events.

AGIs as collectives

Having said this, I’m open to trying it for one of your arguments. So perhaps you can point me to one that you particularly want engagement on?

Perhaps you could read all three of these posts (they're pretty short :) and then either write a quick response to each one and then I'll decide which one to dive into, or pick one yourself (that you find particularly interesting, or you have something to say about).

Also, let me know if you prefer to do this here, via email, or text/audio/video chat. (Also, apologies ahead of time for any issues/delays as my kid is home all the time now, and looking after my investments is a much bigger distraction / time-sink than usual, after I updated away from "just put everything into an index fund".)

AGIs as collectives

This seems about right. In general when someone proposes a mechanism by which the world might end, I think the burden of proof is on them. You’re not just claiming “dangerous”, you’re claiming something like “more dangerous than anything else has ever been, even if it’s intent-aligned”. This is an incredibly bold claim and requires correspondingly thorough support.

  1. "More dangerous than anything else has ever been" does not seem incredibly bold to me, given that superhuman AI will be more powerful than anything else the world has seen. Historically the risk of civilization doing damage to itself seems to grow with the power that it has access to (e.g., the two world wars, substantial risks of nuclear war and man-made pandemic that continue to accumulate each year, climate change) so I think I'm just extrapolating a clear trend. (Past risks like these could not have been eliminated by solving a single straightforward, self-contained, technical problem analogous to "intent alignment" so why expect that now?)

To risk being uncharitable, your position seems analogous to someone saying, before the start of the nuclear era, "I think we should have a low prior that developing any particular kind of nuclear weapon will greatly increase the risk of global devastation in the future, because (1) that would be unprecedentedly dangerous and (2) nobody wants global devastation so everyone will work to prevent it. The only argument that has been developed well enough to overcome this low prior is that some types of nuclear weapons could potentially ignite the atmosphere, so to be safe we'll just make sure to only build bombs that definitely can't do that." (What would be a charitable historical analogy to your position if this one is not?)

  2. "The world might end" is not the only or even the main thing I'm worried about, especially because there are more people who can be expected to worry about "the world might end" and try to do something about it. My focus is more on the possibility that humanity survives but the values of people like me (or human values, or objective morality, depending on what the correct metaethics turn out to be) end up controlling only a small fraction of the universe so we end up with astronomical waste or Beyond Astronomical Waste as a result. (Or our values become corrupted and the universe ends up being optimized for completely alien or wrong values.) There is plenty of precedent for the world becoming quite suboptimal according to some group's values, and there is no apparent reason to think the universe has to evolve according to objective morality (if such a thing exists), so my claim also doesn't seem very extraordinary from this perspective.

First because quite a few countries are handling it well. Secondly because I wasn’t even sure that lockdowns were a tool in the arsenal of democracies, and it seemed pretty wild to shut the economy down for so long.

If you think societal response to a risk like a pandemic (and presumably AI) is substantially suboptimal by default (and it clearly is given that large swaths of humanity are incurring a lot of needless deaths), doesn't that imply significant residual risks, and plenty of room for people like us to try to improve the response? To a first approximation, the default suboptimal social response reduces all risks by some constant amount, so if some particular x-risk is important to work on without considering default social response, it's probably still important to work on after considering "whatever efforts people will make when the problem starts becoming more apparent". Do you disagree with this argument? Did you have some other reason for saying that, that I'm not getting?

AGIs as collectives

To try to encourage you to engage with my arguments more (as far as pointing out where you're not convinced), I think I'm pretty good at being skeptical of my own ideas and have a good track record in terms of not spewing off a lot of random ideas that turn out to be far off the mark. But I am too lazy / have too many interests / am too easily distracted to write long papers/posts where I lay out every step of my reasoning and address every possible counterargument in detail.

So what I'd like to do is to just amend my posts to address the main objections that many people actually have, enough for more readers like you to "assign moderate probability that the argument is true". In order to do that, I need to have a better idea what objections people actually have or what counterarguments they currently find convincing. Does this make sense to you?

AGIs as collectives

but when we’re trying to make claims that a given effect will be pivotal for the entire future of humanity despite whatever efforts people will make when the problem starts becoming more apparent, we need higher standards to get to the part of the logistic curve with non-negligible gradient.

I guess a lot of this comes down to priors and burden of proof. (I guess I have a high prior that making something smarter than human is dangerous unless we know exactly what we're doing including the social/political aspects, and you don't, so you think the burden of proof is on me?) But (1) I did write a bunch of blog posts which are linked to in the second post (maybe you didn't click on that one?) and it would help if you could point out more where you're not convinced, and (2) does the current COVID-19 disaster not make you more pessimistic about "whatever efforts people will make when the problem starts becoming more apparent"?

When you think about the arguments made in your disjunctive post, how hard do you try to imagine each one conditional on the knowledge that the other arguments are false? Are they actually compelling in a world where Eliezer is wrong about intelligence explosions and Paul is wrong about influence-seeking agents?

I think I did? Eliezer being wrong about intelligence explosions just means we live in a world without intelligence explosions, and Paul being wrong about influence-seeking agents just means he (or someone) succeeds in building intent-aligned AGI, right? Many of my "disjunctive" arguments were written specifically with that scenario in mind.
