RobertM — AI Alignment Forum

They imagine writing small and carefully locked-down infrastructure and allowing the AIs to interact with it.

That's surprising and concerning. As you say, if these companies expect their AIs to do end-to-end engineering and R&D tasks internally, it seems difficult to imagine how they could do that without having employee-level privileges. Any place where they don't is a place where humans turn into a bottleneck. I can imagine a few possible objections to this:

They don't expect those bottlenecks to impose that much of a penalty.
1. I'm not sure how this could be true unless you think that AI systems will be pretty weak, which is sort of fighting the premise. What are we worried about, again?
They expect the bottlenecks to impose a large penalty and plan on biting that bullet when the time comes.
1. I currently roll to disbelieve based on both the publicly-held positions of the relevant organizations and also their historical track records. Incentives here seem quite bad.
They plan on trying to thread the needle by employing some control schemes where (for example) different "agents" have different permissions. i.e. a "code writing" agent has read permissions for (some parts of) the codebase, the ability to write, deploy, and test changes to that code in a sandboxed dev environment, and the ability to open a pull request with those changes. Another set of agents have permissions to review pull requests, and then request changes, approve/merge them, or flag the PR as suspicious. Yet another set of agents act as gatekeepers to sensitive data that might be needed for some experiments but only with good justification.
1. This still suffers from the incentive gradient pushing quite hard to just build end-to-end agents. Not only will it probably work better, but it'll be straight up cheaper and easier!

Like, to be clear, I would definitely prefer a world where these organizations wrote "small and carefully locked-down infrastructure" as the limited surface their AIs were allowed to interact with; I just don't expect that to actually happen in practice.

LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that.

RobertM11mo816

I understand this post to be claiming (roughly speaking) that you assign >90% likelihood in some cases and ~50% in other cases that LLMs have internal subjective experiences of varying kinds. The evidence you present in each case is outputs generated by LLMs.

The referents of consciousness for which I understand you to be making claims re: internal subjective experiences are 1, 4, 6, 12, 13, and 14. I'm unsure about 5.

Do you have sources of evidence (even illegible) other than LLM outputs that updated you that much? Those seem like very surprisingly large updates to make on the basis of LLM outputs (especially in cases where those outputs are self-reports about the internal subjective experience itself, which are subject to substantial pressure from post-training).

Separately, I have some questions about claims like this:

The Big 3 LLMs are somewhat aware of what their own words and/or thoughts are referring to with regards to their previous words and/or thoughts. In other words, they can think about the thoughts "behind" the previous words they wrote.

This doesn't seem constructively ruled out by e.g. basic transformer architectures, but as justification you say this:

If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its "attention" modules are actually intentionally designed to know this sort of thing, using using key/query/value lookups that occur "behind the scenes" of the text you actually see on screen.

How would you distinguish an LLM both successfully extracting and then faithfully representing whatever internal reasoning generated a specific part of its outputs, vs. conditioning on its previous outputs to give you plausible "explanation" for what it meant? The second seems much more likely to me (and this behavior isn't that hard to elicit, i.e. by asking an LLM to give you a one-word answer to a complicated question, and then asking it for its reasoning).

My motivation and theory of change for working in AI healthtech

RobertM1y1515

Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios? Most of my models for how we might go extinct in next decade from loss of control scenarios require the kinds of technological advancement which make "industrial dehumanization" redundant, with highly unfavorable offense/defense balances, so I don't see how industrial dehumanization itself ends up being the cause of human extinction if we (nominally) solve the control problem, rather than a deliberate or accidental use of technology that ends up killing all humans pretty quickly.

Separately, I don't understand how encouraging human-specific industries is supposed to work in practice. Do you have a model for maintaining "regulatory capture" in a sustained way, despite having no economic, political, or military power by which to enforce it? (Also, even if we do succeed at that, it doesn't sound like we get more than the Earth as a retirement home, but I'm confused enough about the proposed equilibrium that I'm not sure that's the intended implication.)

The Checklist: What Succeeding at AI Safety Will Involve

RobertM1y20

Is your perspective something like:

Something like that, though I'm much less sure about "non-norms-violating", because many possible solutions seem like they'd involve something qualitatively new (and therefore de-facto norm-violating, like nearly all new technology). Maybe a very superhuman TAI could arrange matters such that things just seem to randomly end up going well rather than badly, without introducing any new^[1] social or material technology, but that does seem quite a bit harder.

I'm pretty uncertain about, if something like that ended up looking norm-violating, it'd be norm-violating like Uber was^[2], or like super-persuasian. That question seems very contingent on empirical questions that I think we don't have much insight into, right now.

I'm unsure about the claim that if you put this aside, there is a way to end the acute risk period without needing truly insanely smart AIs.

I didn't mean to make the claim that there's a way to end the acute risk period without needing truly insanely smart AIs (if you put aside centrally-illegal methods); rather, that an AI would probably need to be relatively low on the "smarter than humans" scale to need to resort to methods that were obviously illegal to end the acute risk period.

^{^}
In ways that are obvious to humans.
^{^}
Minus the part where Uber was pretty obviously illegal in many places where it operated.

The Checklist: What Succeeding at AI Safety Will Involve

RobertM1y40

(Responding in a consolidated way just to this comment.)

Ok, got it. I don't think the US government will be able and willing to coordinate and enforce a worldwide moratorium on superhuman TAI development, if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan. It might become more willing than it is now (though I'm not hugely optimistic), but I currently don't think as an institution it's capable of executing on that kind of plan and don't see why that will change in the next five years.

Another way to put this is that the story for needing much smarter AIs is presumably that you need to build crazy weapons/defenses to defend against someone else's crazily powerful AI.

I think I disagree with the framing ("crazy weapons/defenses") but it does seem like you need some kind of qualitatively new technology. This could very well be social technology, rather than something more material.

Building insane weapons/defenses requires US government consent (unless you're commiting massive crimes which seems like a bad idea).

I don't think this is actually true, except in the trivial sense where we have a legal system that allows the government to decide approximately arbitrary behaviors are post-facto illegal if it feels strongly enough about it. Most new things are not explicitly illegal. But even putting that aside^[1], I think this is ignoring the legal routes by which a qualitatively superhuman TAI might find to ending the Acute Risk Period, if it was so motivated.

(A reminder that I am not claiming this is Anthropic's plan, nor would I endorse someone trying to build ASI to execute on this kind of plan.)

TBC, I don't think there are plausible alternatives to at least some US government involvement which don't require commiting a bunch of massive crimes.

I think there's a very large difference between plans that involve nominal US government signoff on private actors doing things, in order to avoid comitting massive crimes (or to avoid the appearance of doing so), plans that involve the US government mostly just slowing things down or stopping people from doing things, and plans that involve the US government actually being the entity that makes high-context decisions about e.g. what values to to optimize for, given a slot into which to put values.

^{^}
I agree that stories which require building things that look very obviously like "insane weapons/defenses" seem bad, both for obvious deontological reasons, but also I wouldn't expect them to work well enough be worth it even under "naive" consequentialist analysis.

The Checklist: What Succeeding at AI Safety Will Involve

RobertM1y24

I agree with large parts of this comment, but am confused by this:

I think you should instead plan on not building such systems as there isn't a clear reason why you need such systems and they seem super dangerous. That's not to say that you shouldn't also do research into aligning such systems, I just think the focus should instead be on measures to avoid needing to build them.

While I don't endorse it due to disagreeing with some (stated and unstated) premises, I think there's a locally valid line of reasoning that goes something like this:

if Anthropic finds itself in a world where it's successfully built not-vastly-superhuman TAI, it seems pretty likely that other actors have also done so, or will do so relatively soon
it is now legible (to those paying attention) that we are in the Acute Risk Period
most other actors who have or will soon have TAI will be less safety-conscious than Anthropic
if nobody ends the Acute Risk Period, it seems pretty likely that one of those actors will do something stupid (like turn over their AI R&D efforts to their unaligned TAI), and then we all die
not-vastly-superhuman TAI will not be sufficient to prevent those actors from doing something stupid that ends the world
unfortunately, it seems like we have no choice but to make sure we're the first to build superhuman TAI, to make sure the Acute Risk Period has a good ending

This seems like the pretty straightforward argument for racing, and if you have a pretty specific combination of beliefs about alignment difficulty, coordination difficulty, capability profiles, etc, I think it basically checks out.

I don't know what set of beliefs implies that it's much more important to avoid building superhuman TAI once you have just-barely TAI, than to avoid building just-barely TAI in the first place. (In particular, how does this end up with the world in a stable equilibrium that doesn't immediately get knocked over by the second actor to reach TAI?)

A framework for thinking about AI power-seeking

RobertM1y32

Here the alignment concern is that we aren’t, actually, able to exert adequate selection pressure in this manner. But this, to me, seems like a notably open empirical question.

I think the usual concern is not whether this is possible in principle, but whether we're likely to make it happen the first time we develop an AI that is both motivated to attempt and likely to succeed at takeover. (My guess is that you understand this, based on your previous writing addressing the idea of first critical tries, but there does exist a niche view that alignment in the relevant sense is impossible and not merely very difficult to achieve under the relevant constraints, and arguments against that view look very different from arguments about the empirical difficulty of value alignment, likelihood of various default outcomes, etc).

I agree that it's useful to model AI's incentives for takeover in worlds where it's not sufficiently superhuman to have a very high likelihood of success. I've tried to do some of that, though I didn't attend to questions about how likely it is that we'd be able to "block off" the (hopefully much smaller number of) plausible routes to takeover for AIs which have a level of capabilities that don't imply an overdetermined success.

I think I am more pessimistic than you are about how much such AIs would value the "best benign alternatives" - my guess is very close to zero, since I expect ~no overlap in values and that we won't be able to succesfully engage in schemes like pre-committing to sharing the future value of the Lightcone conditional on the AI being cooperative^[1]. Separately, I expect that if we attempt to maneuver such AIs into positions where their highest-EV plan is something we'd consider to have benign long-run consequences, we will instead end up in situations where their plans are optimized to hit the pareto-frontier of "look benign" and "tilt the playing field further in the AI's favor". (This is part of what the Control agenda is trying to address.)

^{^}
Credit-assignment actually doesn't seem like the hard part, conditional on reaching aligned ASI. I'm skeptical of the part where we have a sufficiently capable AI that its help is useful in us reaching an aligned ASI, but it still prefers to help us because it thinks that its estimated odds of a successful takeover imply less future utility for itself than a fair post-facto credit assignment would give it, for its help. Having that calculation come out in our favor feels pretty doomed to me, if you've got the AI as a core part of your loop for developing future AIs, since it relies on some kind of scalable verification scheme and none of the existing proposals make me very optimistic.

Many arguments for AI x-risk are wrong

RobertM1y20

I think you tried to embed images hosted on some Google product, which our editor should've tried to re-upload to our own image host if you pasted them in as images but might not have if you inserted the images by URL. Hotlinking to images on Google domains often fails, unfortunately.

Matthew Barnett's Shortform

RobertM1y1-2

I'm happy to use a functional definition of "understanding" or "intelligence" or "situational awareness".

But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they're giving us the desired behavior now will continue to give us desired behavior in the future.

My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you're importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it. Consider Anthropic's Sleeper Agents. Would a situationally aware model use a provided scratch pad to think about how it's in training and needs to pretend to be helpful? No, and neither does the model "understand" your intentions in a way that generalizes out of distribution the way you might expect a human's "understanding" to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the "right" responses during RLHF are not anything like human reasoning.

I'd prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.

Are you asking for a capabilities threshold, beyond which I'd be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is "can it replace humans at all economically valuable tasks", which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we'll be able to train models capable of doing a lot of economically useful work, but which don't actively try to steer the future. I think we still probably die in those worlds, because automating capabilities research seems much easier than automating alignment research.

Matthew Barnett's Shortform

RobertM1y20

GPT-4 seems like a "generic system" that essentially "understands our intentions"

I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don't have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.

In this case, I don't know why you think that GPT-4 "understands our intentions", unless you mean something very different by that than what you'd mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that'd generate it in a human and is probably missing most of the relevant properties that we care about when it comes to "understanding". Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship^[1] to its internal state, since (as far as we know) it doesn't have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that's not the modality I'm talking about.)

It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it "understanding our intentions".

^{^}
That is known to us right now; possibly one exists and could be derived.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments