Matthew "Vaniver" Graves


Let's See You Write That Corrigibility Tag

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."

The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (even looking at the messy plan-space of the real world, we'll be able to make a simple classifier that rates corrigibility)? Or, if not, when do you think the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?

[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul's sense that they're disconnected in 1D, or when do you think the difficulty comes in?]

AGI Ruin: A List of Lethalities

I agree this list doesn't seem to contain much unpublished material. I think the main value of having it in one numbered list is that "all of it is in one, short place", and that it's not an "intro to computers can think" but rather "these are a bunch of the reasons computers thinking is difficult to align".

The thing that I understand to be Eliezer's "main complaint" is something like: "why does it seem like No One Else is discovering new elements to add to this list?". Like, I think Risks From Learned Optimization was great, and am glad you and others wrote it! But also my memory is that it was "prompted" instead of "written from scratch", and I imagine Eliezer reading it more had the sense of "ah, someone made 'demons' palatable enough to publish" instead of "ah, I am learning something new about the structure of intelligence and alignment."

[I do think the claim that Eliezer 'figured it out from the empty string' doesn't quite jibe with Yudkowsky's Coming of Age sequence.]

AGI Ruin: A List of Lethalities

Why is the process by which humans come to reliably care about the real world

IMO this process seems pretty unreliable and fragile. Drugs are popular; video games are popular; people-in-aggregate put more effort into obtaining imaginary afterlives than life extension or cryonics.

But also humans have a much harder time 'optimizing against themselves' than AIs will, I think. I don't have a great mechanistic sense of what it will look like for an AI to reliably care about the real world.

AGI Ruin: A List of Lethalities

Not saying that this should be MIRI's job, rather stating that I'm confused because I feel like we as a community are not taking an action that would seem obvious to me. 

I wrote about this a bit before, but in the current world my impression is that actually we're pretty capacity-limited, and so the threshold is not "would be good to do" but "is better than my current top undone item". If you see something that seems good to do that doesn't have much in the way of unilateralist risk, you doing it is probably the right call. [How else is the field going to get more capacity?]

AGI Ruin: A List of Lethalities

Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Similarly, our ability to coordinate through language also plays a huge role in explaining our power compared to other animals. But, on a first approximation, other animals can't coordinate at all, making this distinction much less impressive. The first AGIs we construct will be born into a culture already capable of coordinating, and sharing knowledge, making the potential power difference between AGI and humans relatively much smaller than between humans and other animals, at least at first.

I basically buy the story that human intelligence is less useful than human coordination; i.e. it's the intelligence of "humanity" the entity that matters, with the intelligence of individual humans relevant only as, like, subcomponents of that entity.

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

Like, one scenario I visualize a lot is the NHS having a single 'DocBot', i.e. an artificial doctor run on datacenters that provides medical advice and decision-making for everyone in the UK (while still working with nurses and maybe surgeons and so on). Normally I focus on the way that it gets about three centuries of experience treating human patients per day, but imagine the difference in coordination capacity between DocBot and the BMA.

Having seen undeniable, large economic effects from AI, policymakers will eventually realize that AGI is important, and will launch massive efforts to regulate it. 

I think everyone expects this; people mostly disagree on the timescale on which it will arrive. See, for example, Elon Musk's speech to the US National Governors Association, where he argues that the reactive regulation model will be too slow to handle the crisis.

But I think the even more important disagreement is on whether or not regulations should be expected to work. Ok, so you make it so that only corporations with large compliance departments can run AGI. How does that help? There was a tweet by Matt Yglesias a while ago that I can't find now, which went something like: "a lot of smart people are worried about AI, and when you ask them what the government can do about it, they have no idea; this is an extremely wild situation from the perspective of a policy person." A law that says "don't run the bad code" is predicated on the ability to tell the good code from the bad code, which is the main thing we're missing and don't know how to get!

And if you say something like "ok, one major self-driving car accident will be enough to convince everyone to do the Butlerian Jihad and smash all the computers", that's really not how it looks to me. Like, the experience of COVID seems a lot like "people who were doing risky research in labs got out in front of everyone else to claim that the lab leak hypothesis was terrible and unscientific, and all of the anti-disinformation machinery was launched to suppress it, and it took a shockingly long time to even be able to raise the hypothesis, and it hasn't clearly swept the field, and legislation to do something about risky research seems like it definitely isn't a slam dunk." 

When we get some AI warning signs, I expect there are going to be people with the ability to generate pro-AI disinfo and a strong incentive to do so. I expect there to be significant latent political polarization which will tangle up any attempt to do something useful about it. I expect there won't be anything like the international coordination that was necessary to set up anti-nuclear-proliferation efforts available for the probably harder problem of anti-AGI-proliferation efforts.

AGI Ruin: A List of Lethalities

Not sure exactly what this means.

My read was that for systems where you have rock-solid checking steps, you can throw arbitrary amounts of compute at searching for things that check out and trust them, but if there's any crack in the checking steps, then things that 'check out' aren't trustable, because the proposer can have searched an unimaginably large space (from the rater's perspective) to find them. [And from the proposer's perspective, the checking steps are the real spec, not whatever's in your head.]

In general, I think we can get a minor edge from "checking AI work" instead of "generating our own work" and that doesn't seem like enough to tackle 'cognitive megaprojects' (like 'cure cancer' or 'develop a pathway from our current society to one that can reliably handle x-risk' or so on). Like, I'm optimistic about "current human scientists use software assistance to attempt to cure cancer" and "an artificial scientist attempts to cure cancer" and pretty pessimistic about "current human scientists attempt to check the work of an artificial scientist that is attempting to cure cancer." It reminds me of translators who complained pretty bitterly about being given machine-translated work to 'correct'; they basically still had to do it all over again themselves in order to determine whether or not the machine had gotten it right, and so it wasn't nearly as much of a savings as hoped.

Like the value of 'DocBot attempts to cure cancer' is that DocBot can think larger and wider thoughts than humans, and natively manipulate an opaque-to-us dense causal graph of the biochemical pathways in the human body, and so on; if you insist on DocBot only thinking legible-to-human thoughts, then it's not obvious it will significantly outperform humans.

AGI Ruin: A List of Lethalities

Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes.

I agree with the "X is safer than Y" claim; I am uncertain whether it's practically available to us, and much more worried in worlds where it isn't available.

incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy

For this specific proposal, when I reframe it as "give the system a KL-divergence budget to spend on each change to its policy" I worry that it works against a stochastic attacker but not an optimizing attacker; it may be the case that every known-to-be-safe policy has some unsafe policy within a reasonable KL-divergence of it, because the danger can be localized in changes to some small part of the overall policy-space.
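To make the "localized danger" worry concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the original proposal: the three-action policy, the probabilities, the helper names, and the choice to measure KL(new || safe) in nats.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shift_mass_to(policy, index, target_prob):
    """Move probability onto one action, shrinking the others proportionally."""
    scale = (1.0 - target_prob) / (1.0 - policy[index])
    shifted = [prob * scale for prob in policy]
    shifted[index] = target_prob
    return shifted

# Hypothetical policy at a single state: two benign actions, one dangerous one.
safe_policy = [0.5, 0.499999, 0.000001]
DANGEROUS = 2

# An optimizing attacker spends the whole budget on the dangerous action.
attacked_policy = shift_mass_to(safe_policy, DANGEROUS, 0.01)

divergence = kl_divergence(attacked_policy, safe_policy)
risk_ratio = attacked_policy[DANGEROUS] / safe_policy[DANGEROUS]

print(f"KL(new || safe) = {divergence:.4f} nats")
print(f"dangerous action is {risk_ratio:.0f}x more likely")
```

Under these assumed numbers, a divergence of under a tenth of a nat buys roughly a ten-thousand-fold increase in the dangerous action's probability. That is the sense in which a budget that looks small on average can still be spent adversarially.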

the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn't really have goals and isn't good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.

Yeah, I agree that this seems pretty good. I do naively guess that when you do the fine-tuning, it's the concepts that are most related to the goals that change the most (as they have the most gradient pressure on them); it'd be nice to know how much this is the case, vs. most of the relevant concepts being durable parts of the environment that were already very important for goal-free prediction.

AGI Ruin: A List of Lethalities

But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.

My model of Eliezer claims that there are some capabilities that are 'smooth', like "how large a times table you've memorized", and some are 'lumpy', like "whether or not you see the axioms behind arithmetic." While it seems plausible that we can iteratively increase smooth capabilities, it seems much less plausible for lumpy capabilities. 

A specific example: if you have a neural network with enough capacity to 1) memorize specific multiplication Q+As and 2) implement a multiplication calculator, my guess is that during training you'll see a discontinuity in how many pairs of numbers it can successfully multiply.[1] It is not obvious to me whether or not there are relevant capabilities like this that we'll "find with neural nets" instead of "explicitly programming in"; probably we will just build AlphaZero so that it uses MCTS instead of finding MCTS with gradient descent, for example.

[edit: actually, also I don't think I get how you'd use a 'smaller times table' to oversee a 'bigger times table' unless you already knew how arithmetic worked, at which point it's not obvious why you're not just writing an arithmetic program.]

That it might be possible to establish an agent's inner objective when training on easy problems, when the agent isn't very capable, such that this objective remains stable as the agent becomes more powerful.

IMO this runs into two large classes of problems, both of which I put under the heading 'ontological collapse'.

First, suppose the agent's inner objective is internally located: "seek out pleasant tastes." Then you run into 16 and 17, where you can't quite be sure what it means by "pleasant tastes", and you don't have a great sense of what "pleasant tastes" will extrapolate to at the next level of capabilities. [One running "joke" in EA is that, on some theories of what morality is about, the highest-value universe is one which contains an extremely large number of rat brains on heroin. I think this is the correct extrapolation / maximization of at least one theory which produces good behavior when implemented by humans today, which makes me pretty worried about this sort of extrapolation.]

Second, suppose the agent's inner objective is externally located: "seek out mom pressing the reward button". Then you run into 18, which argues that once the agent realizes that the 'reward button' is an object in its environment instead of a communication channel between the human and itself, it may optimize for the object instead of 'being able to hear what the human would freely communicate' or whatever philosophically complicated variable it is that we care about. [Note that attempts to express this often need multiple patches and still aren't fixed; "mom approves of you" can be coerced, "mom would freely approve of you" has the problem that you have some freedom in identifying your concept of 'mom', which means you might pick one who happens to approve of you.]

there's lots of ongoing research and promising ideas for fixing it.

I'm optimistic about this too, but... I want to make sure we're looking at the same problem, or something? I think my sense is best expressed in Stanovich and West, where they talk about four responses to the presence of systematic human misjudgments. The 'performance error' response is basically the 'epsilon-rationality' assumption; 1-ε of the time humans make the right call, and ε of the time they make a random call. While a fine model of performance errors, it doesn't accurately predict what's happening with systematic errors, which are predictable instead of stochastic.
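A small Python sketch of why the 'performance error' model misses systematic errors. The two judges, the "gain"/"loss" framings, and ε = 0.2 are all hypothetical choices made for illustration; none of them come from Stanovich and West.

```python
import random

OPTIONS = ["right", "wrong"]

def epsilon_rational(framing, epsilon, rng):
    """With probability 1 - epsilon make the right call; otherwise pick at random."""
    if rng.random() < epsilon:
        return rng.choice(OPTIONS)
    return "right"

def systematically_biased(framing):
    """Hypothetical bias: always errs under one framing, never under the other."""
    return "wrong" if framing == "loss" else "right"

def error_rate(judge, framing, trials=10_000):
    """Fraction of trials on which the judge fails to make the right call."""
    errors = sum(judge(framing) != "right" for _ in range(trials))
    return errors / trials

rng = random.Random(0)
noisy = lambda framing: epsilon_rational(framing, 0.2, rng)

noisy_errors = {f: error_rate(noisy, f) for f in ("gain", "loss")}
biased_errors = {f: error_rate(systematically_biased, f) for f in ("gain", "loss")}

print("epsilon-rational:", noisy_errors)
print("systematic:      ", biased_errors)
```

The ε-rational judge errs at about the same rate whatever the framing (roughly ε/2 here, since a random call is right half the time), while the biased judge's errors are entirely predictable from the framing. A model fit under the first assumption has no way to represent the second pattern.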

I sometimes see people suggest that the model should always or never conform to the human's systematic errors, but it seems to me like we need to somehow distinguish between systematic "errors" that are 'value judgments' ("oh, it's not that the human prefers 5 deaths to 1 death, it's that they are opposed to this 'murder' thing that I should figure out") and systematic errors that are 'bounded rationality' or 'developmental levels' ("oh, it's not that the (very young) human prefers less water to more water, it's that they haven't figured out conservation of mass yet"). It seems pretty sad if we embed all of our confusions into the AI forever--and also pretty sad if we end up not able to transfer any values because all of them look like confusions.[2]

[1] This might depend on what sort of curriculum you train it on; I was imagining something like 1) set the number of digits N=1, 2) generate two numbers uniformly at random between 1 and 2^N, pass them as inputs (sequence of digits?), 3) compare the sequence of digits outputted to the correct answer, either with a binary pass/fail or some sort of continuous similarity metric (so it gets some points for 12x12 = 140 or w/e); once it performs at 90% success, check the performance for increased N until you get one with below 80% success and continue training. In that scenario, I think it just memorizes until N is moderately sized (8?), at which point it figures out how to multiply, and then you can increase N lots without losing accuracy (until you hit some overflow error in its implementation of multiplication from having large numbers).
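A rough Python sketch of the curriculum loop described in [1], using binary pass/fail scoring. The two "models" are hypothetical stand-ins, not neural networks: one can only memorize Q+As it has seen, the other already implements multiplication. The class names, round sizes, and the cap on N are likewise assumptions made to keep the sketch small.

```python
import random

class LookupTableModel:
    """Toy stand-in for pure memorization of seen multiplication Q+As."""
    def __init__(self):
        self.table = {}
    def train(self, a, b, answer):
        self.table[(a, b)] = answer
    def predict(self, a, b):
        return self.table.get((a, b))  # None for pairs never seen

class CalculatorModel:
    """Toy stand-in for a model that has 'figured out how to multiply'."""
    def train(self, a, b, answer):
        pass  # nothing left to learn
    def predict(self, a, b):
        return a * b

def sample_pair(n, rng):
    """Two numbers drawn uniformly at random between 1 and 2^N."""
    return rng.randint(1, 2 ** n), rng.randint(1, 2 ** n)

def success_rate(model, n, rng, trials=200):
    """Binary pass/fail accuracy on fresh pairs at difficulty N."""
    hits = 0
    for _ in range(trials):
        a, b = sample_pair(n, rng)
        hits += model.predict(a, b) == a * b
    return hits / trials

def run_curriculum(model, rng, max_n=4, steps_per_round=500, max_rounds=100):
    n, history = 1, []
    for _ in range(max_rounds):
        for _ in range(steps_per_round):
            a, b = sample_pair(n, rng)
            model.train(a, b, a * b)
        rate = success_rate(model, n, rng)
        history.append((n, rate))
        if rate >= 0.9:
            # Raise N until accuracy drops below 80%, then resume training there.
            while n < max_n and success_rate(model, n, rng) >= 0.8:
                n += 1
            if n == max_n and success_rate(model, n, rng) >= 0.8:
                break
    return history

lookup_history = run_curriculum(LookupTableModel(), random.Random(0))
calculator_history = run_curriculum(CalculatorModel(), random.Random(0))
print("memorizer: ", lookup_history)
print("calculator:", calculator_history)
```

The memorizer has to be retrained at every new N, while the calculator clears every level after the first check. The guess in [1] is that a real network would start out behaving like the former and at some point jump to behaving like the latter; this sketch only shows the two regimes, not the transition.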

[2] I'm being a little unfair in using the trolley problem as an example of a value judgment, because in my mind the people who think you shouldn't pull the lever because it's murder are confused or missing a developmental jump--but I have the sense that for most value judgments we could find, we can find some coherent position which views it as confused in this way.

AGI Ruin: A List of Lethalities

I'm very glad this list is finally published; I think it's pretty great at covering the space (tho I won't be surprised if we discover a few more points), and making it so that plans can say "yeah, we're targeting a hole we see in number X."

[In particular, I think most of my current hope is targeted at 5 and 6, specifically that we need an AI to do a pivotal act at all; it seems to me like we might be able to transition from this world to a world sophisticated enough to survive on human power. But this is, uh, a pretty remote possibility and I was much happier when I was optimistic about technical alignment.]

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

This mostly seems to be an argument for: "It'd be nice if no pivotal act is necessary", but I don't think anyone disagrees with that.

It's arguing that, given that your organization has scary (near) AGI capabilities, it is not so much harder (to get a legitimate authority to impose an off-switch on the world's compute) than (to 'manufacture your own authority' to impose that off-switch) such that it's worth avoiding the cost of (developing those capabilities while planning to manufacture authority). Obviously there can be civilizations where that's true, and civilizations where that's not true.
