Wei Dai


Persuasion Tools: AI takeover without AGI or agency?

You mention "defenses will improve" a few times. Can you go into more detail about this? What kind of defenses do you have in mind? I keep thinking that in the long run, the only defenses are either to solve meta-philosophy so our AIs can distinguish between correct arguments and merely persuasive ones and filter out the latter for us (and for themselves), or to go into an info bubble with trusted AIs and humans and block off any communications from the outside. But maybe I'm not being imaginative enough.

Alignment By Default

So similarly, a human could try to understand Alice's values in two ways. The first, equivalent to what you describe here for AI, is to just apply whatever learning algorithm their brain uses when observing Alice, and form an intuitive notion of "Alice's values". And the second is to apply explicit philosophical reasoning to this problem. So sure, you can possibly go a long way towards understanding Alice's values by just doing the former, but is that enough to avoid disaster? (See Two Neglected Problems in Human-AI Safety for the kind of disaster I have in mind here.)

(I keep bringing up metaphilosophy but I'm pretty much resigned to living in a part of the multiverse where civilization will just throw the dice and bet on AI safety not depending on solving it. What hope is there for our civilization to do what I think is the prudent thing, when no professional philosophers, even ones in EA who are concerned about AI safety, ever talk about it?)

Alignment By Default

To help me check my understanding of what you're saying, we train an AI on a bunch of videos/media about Alice's life, in the hope that it learns an internal concept of "Alice's values". Then we use SL/RL to train the AI, e.g., give it a positive reward whenever it does something that the supervisor thinks benefits Alice's values. The hope here is that the AI learns to optimize the world according to its internal concept of "Alice's values" that it learned in the previous step. And we hope that its concept of "Alice's values" includes the idea that Alice wants AIs, including any future AIs, to keep improving their understanding of Alice's values and to serve those values, and that this solves alignment in the long run.

Assuming the above is basically correct, this (in part) depends on the AI learning a good enough understanding of "improving understanding of Alice's values" in step 1. This in turn (assuming "improving understanding of Alice's values" involves "using philosophical reasoning to solve various confusions related to understanding Alice's values, including Alice's own confusions") depends on the AI being able to learn a correct or good enough concept of "philosophical reasoning" from unsupervised training. Correct?
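(As a toy caricature of the two-step setup I'm describing, here's a minimal sketch. Everything in it is an illustrative assumption of mine, not anyone's actual proposal: "learning a concept" is reduced to estimating a mean, and "RL training" to hill-climbing on a supervisor's reward.)

```python
# Toy caricature of the two-step training setup described above.
# All names and mechanics are illustrative assumptions, not a real proposal.
import random

random.seed(0)

# Step 1 (unsupervised): the model forms an internal concept of
# "Alice's values" from observational data. Here this is caricatured
# as estimating the mean of observed value-signals.
observations = [0.9, 1.1, 1.0, 0.95, 1.05]  # stand-in for videos/media of Alice's life
alice_values_concept = sum(observations) / len(observations)

# Step 2 (SL/RL): the supervisor rewards actions they judge to benefit
# Alice's values; the hope is that the policy comes to optimize the
# internal concept learned in step 1, not just the raw reward signal.
def supervisor_reward(action):
    # supervisor approves of actions close to (their view of) Alice's values
    return -abs(action - 1.0)

policy = 0.0  # a single policy parameter: which action to take
for _ in range(1000):
    candidate = policy + random.uniform(-0.1, 0.1)
    if supervisor_reward(candidate) > supervisor_reward(policy):
        policy = candidate  # hill-climb on the supervisor's reward

# The alignment hope: the reward-shaped policy ends up near the
# step-1 concept rather than gaming the supervisor's signal.
print(abs(policy - alice_values_concept) < 0.1)
```

In this toy world the supervisor's judgment and the step-1 concept happen to coincide, so the hope trivially holds; the worry in the surrounding discussion is precisely the cases where they come apart.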

If AI can learn "philosophical reasoning" from unsupervised training, GPT-N should be able to do philosophy (e.g., solve open philosophical problems), right?

Inaccessible information

or we need to figure out some way to access the inaccessible information that “A* leads to lots of human flourishing.”

To help check my understanding, your previously described proposal to access this "inaccessible" information involves building corrigible AI via iterated amplification, then using that AI to capture "flexible influence over the future", right? Have you become more pessimistic about this proposal, or are you just explaining some existing doubts? Can you explain in more detail why you think it may fail?

(I'll try to guess.) Is it that corrigibility is about short-term preferences-on-reflection, and those preferences may themselves be inaccessible information?

I can pay inaccessible costs for an accessible gain — for example leaking critical information, or alienating an important ally, or going into debt, or making short-sighted tradeoffs. Moreover, if there are other actors in the world, they can try to get me to make bad tradeoffs by hiding real costs.

This seems similar to what I wrote in an earlier thread: "What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)" At the time I thought you proposed to solve this problem by using the user's "preferences-on-reflection", which presumably would correctly value all resources/costs. So again is it just that "preferences-on-reflection" may itself be inaccessible?

Overall I don’t think it’s very plausible that amplification or debate can be a scalable AI alignment solution on their own, mostly for the kinds of reasons discussed in this post — we will eventually run into some inaccessible knowledge that is never produced by amplification, and so never winds up in your distilled agents.

Besides the above, can you give some more examples of (what you think may be) "inaccessible knowledge that is never produced by amplification"?

(I guess an overall feedback is that in most of the post you discuss inaccessible information without talking about amplification, and then quickly talk about amplification in the last section, but it's not easy to see how the two ideas relate without more explanations and examples.)

Possible takeaways from the coronavirus pandemic for slow AI takeoff

Thanks for writing this. I've been thinking along similar lines since the pandemic started. Another takeaway for me: Under our current political system, AI risk will become politicized. It will be very easy for unaligned or otherwise dangerous AI to find human "allies" who will help to prevent effective social response. Given this, "more competent institutions" has to include large-scale and highly effective reforms to our democratic political structures, but political dysfunction is such a well-known problem (i.e., not particularly neglected) that if there were easy fixes, they would have been found and applied already.

So whereas you're careful to condition your pessimism on "unless our institutions improve", I'm just pessimistic. (To clarify, I was already pessimistic before COVID-19, so it just provided more details about how coordination/institutions are likely to fail, which I didn't have a clear picture of. I'm curious if COVID-19 was an update for you as far as your overall assessment of AI risk. That wasn't totally clear from the post.)

On a related note, I recall Paul said the risk from failure of AI alignment (I think he said or meant "intent alignment") is 10%; Toby Ord gave a similar number for AI risk in his recent book; 80,000 Hours, based on interviews with multiple AI risk researchers, said "We estimate that the risk of a serious catastrophe caused by machine intelligence within the next 100 years is between 1 and 10%." Until now 1-10% seems to have been the consensus view among the most prominent AI risk researchers. I wonder if that has changed due to recent events.

AGIs as collectives

Having said this, I’m open to trying it for one of your arguments. So perhaps you can point me to one that you particularly want engagement on?

Perhaps you could read all three of these posts (they're pretty short :) and then either write a quick response to each one and then I'll decide which one to dive into, or pick one yourself (that you find particularly interesting, or you have something to say about).

Also, let me know if you prefer to do this here, via email, or text/audio/video chat. (Also, apologies ahead of time for any issues/delays as my kid is home all the time now, and looking after my investments is a much bigger distraction / time-sink than usual, after I updated away from "just put everything into an index fund".)

AGIs as collectives

This seems about right. In general when someone proposes a mechanism by which the world might end, I think the burden of proof is on them. You’re not just claiming “dangerous”, you’re claiming something like “more dangerous than anything else has ever been, even if it’s intent-aligned”. This is an incredibly bold claim and requires correspondingly thorough support.

  1. "More dangerous than anything else has ever been" does not seem incredibly bold to me, given that superhuman AI will be more powerful than anything else the world has seen. Historically the risk of civilization doing damage to itself seems to grow with the power that it has access to (e.g., the two world wars, substantial risks of nuclear war and man-made pandemic that continue to accumulate each year, climate change) so I think I'm just extrapolating a clear trend. (Past risks like these could not have been eliminated by solving a single straightforward, self-contained, technical problem analogous to "intent alignment" so why expect that now?)

To risk being uncharitable, your position seems analogous to someone saying, before the start of the nuclear era, "I think we should have a low prior that developing any particular kind of nuclear weapon will greatly increase the risk of global devastation in the future, because (1) that would be unprecedentedly dangerous and (2) nobody wants global devastation so everyone will work to prevent it. The only argument that has been developed well enough to overcome this low prior is that some types of nuclear weapons could potentially ignite the atmosphere, so to be safe we'll just make sure to only build bombs that definitely can't do that." (What would be a charitable historical analogy to your position if this one is not?)

  1. "The world might end" is not the only or even the main thing I'm worried about, especially because there are more people who can be expected to worry about "the world might end" and try to do something about it. My focus is more on the possibility that humanity survives but the values of people like me (or human values, or objective morality, depending on what the correct metaethics turn out to be) end up controlling only a small fraction of the universe, so we end up with astronomical waste or Beyond Astronomical Waste as a result. (Or our values become corrupted and the universe ends up being optimized for completely alien or wrong values.) There is plenty of precedent for the world becoming quite suboptimal according to some group's values, and there is no apparent reason to think the universe has to evolve according to objective morality (if such a thing exists), so my claim also doesn't seem very extraordinary from this perspective.

First because quite a few countries are handling it well. Secondly because I wasn’t even sure that lockdowns were a tool in the arsenal of democracies, and it seemed pretty wild to shut the economy down for so long.

If you think societal response to a risk like pandemic (and presumably AI) is substantially suboptimal by default (and it clearly is, given that large swaths of humanity are incurring a lot of needless deaths), doesn't that imply significant residual risks, and plenty of room for people like us to try to improve the response? To a first approximation, the default suboptimal social response reduces all risks by some constant amount, so if some particular x-risk is important to work on without considering default social response, it's probably still important to work on after considering "whatever efforts people will make when the problem starts becoming more apparent". Do you disagree with this argument? Did you have some other reason for saying that, that I'm not getting?

AGIs as collectives

To try to encourage you to engage with my arguments more (as far as pointing out where you're not convinced), I think I'm pretty good at being skeptical of my own ideas and have a good track record in terms of not spewing off a lot of random ideas that turn out to be far off the mark. But I am too lazy / have too many interests / am too easily distracted to write long papers/posts where I lay out every step of my reasoning and address every possible counterargument in detail.

So what I'd like to do is to just amend my posts to address the main objections that many people actually have, enough for more readers like you to "assign moderate probability that the argument is true". In order to do that, I need to have a better idea what objections people actually have or what counterarguments they currently find convincing. Does this make sense to you?

AGIs as collectives

but when we’re trying to make claims that a given effect will be pivotal for the entire future of humanity despite whatever efforts people will make when the problem starts becoming more apparent, we need higher standards to get to the part of the logistic curve with non-negligible gradient.

I guess a lot of this comes down to priors and burden of proof. (I guess I have a high prior that making something smarter than human is dangerous unless we know exactly what we're doing including the social/political aspects, and you don't, so you think the burden of proof is on me?) But (1) I did write a bunch of blog posts which are linked to in the second post (maybe you didn't click on that one?) and it would help if you could point out more where you're not convinced, and (2) does the current COVID-19 disaster not make you more pessimistic about "whatever efforts people will make when the problem starts becoming more apparent"?

When you think about the arguments made in your disjunctive post, how hard do you try to imagine each one conditional on the knowledge that the other arguments are false? Are they actually compelling in a world where Eliezer is wrong about intelligence explosions and Paul is wrong about influence-seeking agents?

I think I did? Eliezer being wrong about intelligence explosions just means we live in a world without intelligence explosions, and Paul being wrong about influence-seeking agents just means he (or someone) succeeds in building intent-aligned AGI, right? Many of my "disjunctive" arguments were written specifically with that scenario in mind.

AGIs as collectives

For now my epistemic state is: extreme agency is an important component of the main argument for risk, so all else equal reducing it should reduce risk.

I appreciate the explanation, but this is pretty far from my own epistemic state, which is that the arguments for AI risk are highly disjunctive; most types of AGI (not just highly agentic ones) are probably unsafe (i.e., likely to lead us away from rather than toward a success story); and at best only a few very specific AGI designs (which may well be agentic if combined with other properties) are both feasible and safe (i.e., can count as success stories). So it doesn't make sense to say that an AGI is "safer" just because it's less agentic.

Having said that, I also believe that most safety work will be done by AGIs, and so I want to remain open-minded to success stories that are beyond my capability to predict.

Getting to an AGI that can safely do human or superhuman level safety work would be a success story in itself, which I labeled "Research Assistant" in my post.
