I agree with the claim that deception could arise without deceptive alignment, and I mostly agree with the post, but I still think it's very important to recognize that if/when deceptive alignment fails to work, it changes a lot of the conversation around alignment.
The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
The point is that it requires a human to execute the jailbreak: the AI is not the jailbreaker, and the examples show that humans can still retain control of the model.
The AI is not jailbreaking itself here.
This link explains it better than I can:
https://www.aisnakeoil.com/p/model-alignment-protects-against
I think a lot of this probably comes back to way overestimating the complexity of human values. A very deeply held belief of a lot of LWers is that human values are intractably complicated and gene/society-specific, and if this were the case, the argument would actually be a little concerning, as we'd have to rely on massive speed biases to punish deception.
These posts gave me good intuition for why human values are likely to be quite simple. One of them talks about how most of the complexity of values is inaccessible to the genome, so the genome needs to start from far less complexity than people realize, because nearly all of it needs to be learned. Some other posts from Steven Byrnes, which talk about how simple the brain is, are also relevant. A potential difference between me and Steven Byrnes is that I think the same learning-from-scratch algorithms that generate capabilities also apply to values. The complexity of value is thus upper-bounded by the complexity of the learning-from-scratch algorithms plus genetic priors, both of which are likely very low: at the very least not billions of lines complex, and closer to thousands of lines/hundreds of bits.
The reason this matters is that we no longer have good reason to assume that the deceptive model is so favored on priors like Evan Hubinger says here, as the complexity is likely massively lower than LWers assume.
Putting it another way, the deceptive and aligned models have very similar complexities, so much so that the aligned model might be outright lower complexity. Even if that fails, the desired goal has a complexity very similar to that of the undesired goal, so the relative difficulty of actual alignment compared to deceptive alignment is quite low.
https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8/p/wBHSYwqssBGCnwvHg
https://www.lesswrong.com/posts/aodPs8H9dQxpXAcwk/heritability-behaviorism-and-within-lifetime-rl
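The comparison between the deceptive and aligned models above can be made a bit more concrete. This is only a sketch, assuming a simplicity prior in which a model's prior probability falls off exponentially with its description length in bits; the symbols $K$, $M_{\mathrm{aligned}}$ and $M_{\mathrm{deceptive}}$ are introduced here for illustration and are not from the linked posts.

```latex
\[
P(M) \propto 2^{-K(M)}
\quad\Longrightarrow\quad
\frac{P(M_{\mathrm{aligned}})}{P(M_{\mathrm{deceptive}})}
  = 2^{\,K(M_{\mathrm{deceptive}}) - K(M_{\mathrm{aligned}})}
\]
```

Under that assumption, a gap of only a few bits gives a modest prior ratio (3 bits is a factor of 8 in either direction), whereas a strong prior preference for the deceptive model would need the aligned model to be many bits more complex.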
(2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities.
I actually do expect this to happen, and importantly I think this result is basically of academic interest, primarily because it is probably known why this adversarial attack can happen at all: it's the large-scale cycles on a game board. This is almost certainly going to be solved with new training, so I find it a curiosity at best.
I strongly downvoted this post, primarily because, contra you, I do actually think reframing/reinventing is valuable, and IMO the case for reframing/reinventing things is strawmanned here.
There is one valuable part of this post, and that is the point that interpretability doesn't have good result-incentives. I agree with this criticism, but given the other points of the post, I would still strongly downvote it.
I disagree with this post for one reason:
On Amdahl's law, John Wentworth's post on the long tail is very relevant, as it limits the use of cyborgism:
https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail
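For readers who want the Amdahl's law point spelled out, here is a minimal sketch. The function names and the example fractions are my own illustrative assumptions, not numbers from the linked post: the idea is just that if only part of the work can be accelerated, the part that can't be accelerated caps the overall speedup.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster.

    Both parameter names are hypothetical, chosen for this illustration.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    remaining = 1.0 - accelerated_fraction  # work that is not sped up at all
    return 1.0 / (remaining + accelerated_fraction / factor)


def amdahl_limit(accelerated_fraction: float) -> float:
    """Upper bound on the speedup as `factor` goes to infinity."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if accelerated_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - accelerated_fraction)


if __name__ == "__main__":
    # Illustrative assumption: 90% of the work is sped up 100x.
    print(amdahl_speedup(0.9, 100.0))  # a bit over 9x, not 100x
    print(amdahl_limit(0.9))           # about 10x, however large the factor
```

So even a 100x boost on 90% of the work yields only about a 9x overall speedup, and the long tail of remaining tasks sets the ceiling.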
I think that the big claim the post relies on is that values are a natural abstraction, and that the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.
It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.
This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.
The robust values hypothesis from DragonGod is worth looking at, too.
From the link below, I'll quote:
Consider the following hypothesis:
1. There exists a "broad basin of attraction" around a privileged subset of human values[1] (henceforth "ideal values")
   - The larger the basin the more robust values are
   - Example operationalisations[2] of "privileged subset" that gesture in the right direction:
     - Minimal set that encompasses most of the informational content of "benevolent"/"universal"[3] human values
     - The "minimal latents" of "benevolent"/"universal" human values
   - Example operationalisations of "broad basin of attraction" that gesture in the right direction:
     - A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in #3)
       - Larger neighbourhood → larger basin
2. Said subset is a "naturalish" abstraction
   - The more natural the abstraction, the more robust values are
   - Example operationalisations of "naturalish abstraction"
     - The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
       - More privileged → more natural
     - Most efficient representations of our universe contain a simple embedding of the subset
       - Simpler embeddings → more natural
3. Points within this basin are suitable targets for optimisation
   - The stronger the optimisation pressure applied for which the target is still suitable, the more robust values are.
   - Example operationalisations of "suitable targets for optimisation":
     - Optimisation of this target is existentially safe[4]
     - More strongly, we would be "happy" (where we fully informed) for the system to optimise for these points.

This is an important hypothesis, since if it has a non-trivial chance of being correct, then AI Alignment gets quite a bit easier. And given the shortening timelines, I think this is an important hypothesis to test.
Here's the link for the robust values hypothesis:
https://www.lesswrong.com/posts/YoFLKyTJ7o4ApcKXR/disc-are-values-robust
In the human case, what holds things together is that capabilities differences are very bounded, rather than any alignment success. If we had capabilities differentials as wide as 1 order of magnitude, then I think our attempted alignment solutions would fail miserably, leading to mass death or worse.
That's the problem with AI: differences in capabilities of multiple orders of magnitude are pretty likely, and all real alignment technologies fail hard once we get anywhere near, say, 3x differences, let alone 10x differentials.
You're welcome, though did you miss a period here or did you want to write more?
See a Twitter thread of some brief explorations I and Alex Silverstein did on this
Yep, that's what I was talking about, Seth Herd.