LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.
In your mind what are the biggest bottlenecks/issues in "making fast, philosophically competent alignment researchers?"
nod. I'm not sure I agreed with all the steps there, but I agree with the general premise of "accept the premise that Claude is just a bit away from AGI, and is reasonably aligned, and see where that goes when you look at each next step."
I think you are saying something that shares at least some structure with Buck's comment that
It seems like as AIs get more powerful, two things change:
- They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
- They get better, so their wanting to kill you is more of a problem.
I don't see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over
(But where you're pointing at two different sets of properties that may not arise at the same time)
I'm actually not sure I get what the two properties you're talking about are, though. Seems like you're contrasting "claude++ crosses the agi (= can kick off rsi) threshold" with "crosses the 'dangerous-core-of-generalization' threshold"
I'm confused because I think the word "agi" basically does mean "cross the core-of-generalization threshold" (which isn't immediately dangerous, but, puts us into "things could quickly get dangerous at any time" territory)
I do agree "able to do a loop of RSI doesn't intrinsically mean 'agi' or 'core-of-generalization'"; there could be narrow skills for doing a loop of RSI. I'm not sure if you meant more like "non-agi RSI", or if you see something different between "AGI" and "core-of-generalization". Or if you think there's a particular "dangerous core-of-generalization" separate from AGI.
(I think "the sharp left turn" is when the core-of-generalization starts to reflect on what it wants, which might come immediately after a core-of-generalization, but could also come after narrow-introspection + ad hoc agency, or might just take a while for it to notice)
((I can't tell if this comment is getting way further into the weeds than is necessary, but, it seemed like the nuances of exactly what you meant were probably load-bearing))
It seems like the reason Claude's level of misalignment is fine is that its capabilities aren't very good, and there's not much/any reason to assume it'd be fine if you held alignment constant but dialed up capabilities.
Do you not think that?
(I don't really see why it's relevant how aligned Claude is if we're not thinking about that as part of it)
Mmm nod. (I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?)
There's a version of Short Timeline World (which I think is more likely? but, not confidently), which is: "the current paradigm does basically work... but, the way we get to ASI, as opposed to AGI, routes through 'the current paradigm helps invent a new better paradigm, real fast'."
In that world, GPT5 has the possibility-of-true-generality, but, not necessarily very efficiently, and once you get to the sharper part of the AI 2027 curve, the mechanism by which the next generation of improvement comes is via figuring out alternate algorithms.
certainly if AI systems were only ever roughly this misaligned we'd be doing pretty well.
I think this is an important disagreement with the "alignment is hard" crowd. I particularly disagree with "certainly."
The question is "what exactly is the AI trying to do, and what happens if it magnified its capabilities a millionfold and it and its descendants were running open-endedly?", and are any of the instances catastrophically bad?
Some things you might mean that are raising your position to "certainly" (whereas I'd say "most likely not, or, it's too dumb to even count as 'aligned' or 'misaligned'")
Were any of those what you meant? Or are you thinking about it in an entirely different way?
I would naively expect, if you took LLM agents' current degree of alignment, and ran a lotta copies trying to help you with end-to-end alignment research with dialed up capabilities, at least a couple instances would end up trying to subtly sabotage you and/or escape.
This framing feels reasonable-ish, with some caveats.[1]
I am assuming we're starting the question at the first stage where either "shut it down" or "have a strong degree of control over global takeoff" becomes plausibly politically viable. (i.e. assume early stages of Shut It Down and Controlled Takeoff both include various partial measures that are more immediately viable and don't give you the ability to steer capability-growth that hard)
But, once it becomes a serious question "how quickly should we progress through capabilities", then one thing to flag is, it's not like you know "we get 5 years, therefore, we want to proceed through those years at X rate." It's "we seem to have this amount of buy-in currently..." and the amount of buy-in could change (positively or negatively).
Some random thoughts on things that seem important:
If it's not viable to do that, well, then we don't. (but, then we're not really having a real convo about how slow the takeoff should ideally be, just riding the same incentive wave we're currently riding with slightly more steering). ((We can instead have a convo about how to best steer given various murky conditions, which seems like a real important convo, I'm just responding here to this comment's framing))[3]
If we reach a point where humanity has demonstrated the capability of "stop training on purpose, stop uncontrolled compute production, and noticeably improve our ability to predict the next training run", then I'm not obviously opposed to doing relatively rapid advancement, but, it's not obviously better to do "rapid to the edge" than "do one round where there are predictions/incentives/prizes somehow for people to accurately predict how the next training rounds go, then evaluate that, then do it again."
I think there's at least some confusion where people are imagining the simplest/dumbest version of Shut It Down, and imagining "Plan A" is nuanced and complicated. I think the actual draft treaty has levers that are approximately the same levers you'd want to do this sort of controlled takeoff.
I'm not sure how powerful Nvidia is as an interest group. Maybe it is important to avoid them getting a deal like this so they're less of an interest group with power at the negotiating table.
FYI my "Ray detects some political bs motivations in himself" alarm is tripping as I write this paragraph. It currently seems right to me but let me know if I'm missing something here.
(Having otherwise complained a bunch about some of the commentary/framing around Plan A vs Shut It Down, I do overall like this post and think having the lens of the different worlds is pretty good for planning).
(I am also appreciating how people are using inline reacts)
Nod.
FYI, I think Shut It Down is approximately as likely to happen as "Full-fledged Plan A that is sufficiently careful to actually help much more than [the first several stages of Plan A that Plan A and Shut It Down share]", on account of being simple enough that it's even really possible to coordinate on it.
I agree they are both pretty unlikely to happen. (Regardless, I think the thing to do is probably "reach for whatever wins seem achievable near term and try to build coordination capital for more wins")
I think a major possible failure mode of Plan A is "it turns into a giant regulatory-capture molochian boondoggle that both slows things down for a long time in confused bad ways and reads to the public as a somewhat weirdly cynical plot, which makes people turn against tech progress comparably to or more than the average Shut It Down would." (I don't have a strong belief about the relative likelihoods of that.)
None of those beliefs are particularly strong and I could easily learn a lot that would change all my beliefs.
Seems fine to leave it here. I don't have more arguments I didn't already write up in "Shut It Down" is simpler than "Controlled Takeoff"; just stating for the record that I don't think you've put forth an argument that justifies the 3x increase in difficulty of Shut It Down over the full-fledged version of Plan A. (We might still be imagining different things re: Shut It Down)
Nod, I agree centralizing part is harder than non-centralized fab monitoring. But, I think a sufficient amount of "non-centralized" fab monitoring is still a much bigger ask than export controls, and, the centralization was part of at least one writeup of Plan A, and it seemed pretty weird to include that bit but write off "actual shutdown" as politically intractable.
(Putting the previous Wei Dai answer to What are the open problems in Human Rationality? here for easy reference, since it seemed like it might contain relevant stuff)