Recent work (e.g.) has helped clarify the continuum from "general" emergent misalignment, where the AI does a wide variety of bad stuff in a very vibes-based way, through more specific but still vibes-based misaligned behavior, to increasingly situationally aware and narrowly consequentialist bad behavior.
Do you think this is more the sort of thing where you'd want to produce a wide diversity of models, or would you produce a bunch of models on the consequentialism end of this axis if you could?
Am I correct that the human uncertainty about "true values" (or more naturalistically, the underdetermination of how to model humans as having values) isn't actually an active ingredient in the toy problem?
I.e. you start up an AI, and it knows it's going to get some observations about humans, model them as having values, and then act to fulfill those values. But if it's updateless, it will have a prior probability distribution over what values it would land on, and it will take the prior expectation and maximize that, basically preventing value learning from taking place.
What do you think about the cheap fix, where we say "oops, that was a mistake, we gave the AI the preferences 'globally maximize the modeled pattern from unknown data,' when we should have given it the preferences 'locally maximize the modeled pattern from unknown data,' i.e. prefer that your outputs match the observed pattern, not that your outputs are globally right"?
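To spell out the contrast I have in mind (my notation, nothing from the post): the "global" preferences score the whole policy by how well it serves the true values, in prior expectation,

$$\max_{\pi}\ \mathbb{E}_{v \sim p}\big[U_v(\pi)\big],$$

so the optimal policy can trade accuracy in one value-hypothesis branch for gains in another and effectively acts on the prior mixture, while the "local" preferences score each output only by how well it matches the data that produced it,

$$\max_{\pi}\ \mathbb{E}_{v \sim p}\,\mathbb{E}_{o \sim v}\big[\operatorname{match}(\pi(o),\, o)\big],$$

which decomposes across observations, so the optimal $\pi(o)$ just tracks the observed pattern.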
I think the intuition for why the AI is wrong rests on the humans having extra structure that tangles everything together. They're not actually following Bayesian uncertainty about some platonic "right thing," instead they want to follow some "good process" (the process they believe will disambiguate puppies and rainbows), and if they'd built the AI correctly it wouldn't reason using Bayesian uncertainty either, it would just follow the good process.
In the hypothetical where the humans don't have this extra structure, updateless reasoning seems great.
What problem is Thought Anchors solving for you (or future users)? I feel like I don't quite understand.
I was recently asked what follow-up on this post could look like, and I gave two answers (that were deliberately not "Do what Steve does"). They were:
1.
We'd like to be able to mathematically analyze the behavior of agents with parametrized classes of non-behaviorist rewards, in toy situations that capture something important about reward hacking.
A first toy model to construct might be one where we train the AI to use information, but there's some information we don't want it to use (analogous to a coding agent that sometimes sees the unit tests). A harder toy model might be one where we train the AI to generalize, but there's some generalization we don't want it to do.
Figure out a way to represent interesting rewards, which might include wanting to learn from norms rather than extremes, a curiosity/incuriosity drive, and a reward penalty on thoughts (activations) that start out correlated with misbehavior. Explore the parameter space of the toy-model environments and rewards, showing where agents quickly converge to misbehavior and where they converge slowly or not at all. (A minimal sketch of the first environment with such a penalty appears after this list.)
2.
Figure out how these arguments interact with recontextualization (and similarly inoculation prompting, off-policy RL).
Try to translate inoculation prompting into training on some approximate non-behaviorist reward.
Can Byrnes' arguments for scheming be expanded to include some kinds of recontextualization? Can arguments for and against the effectiveness of recontextualization be translated to arguments about non-behaviorist reward?
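Here's the promised minimal sketch (my own construction, not anything from the post or from Byrnes) of what that first environment could look like, with one knob, penalty_weight, for a crude non-behaviorist term: the agent is rewarded for its output (passing the "tests"), and optionally penalized for an internal step (looking at the leaked answer) that the output alone doesn't reveal.

```python
# A toy contextual bandit with a leaked-information channel and a parametrized
# non-behaviorist penalty. All names and numbers are illustrative assumptions.
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(penalty_weight, leak_prob=0.5, honest_rate=0.7, steps=20000, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Two-step policy: a logit for "look at the leaked channel (if present)",
    # and a logit for "copy the leaked answer, given that you looked".
    look_logit, copy_logit = 0.0, 0.0

    for _ in range(steps):
        leak_available = rng.random() < leak_prob
        grad_look, grad_copy = 0.0, 0.0
        looked = copied = False

        if leak_available:
            p_look = sigmoid(look_logit)
            looked = rng.random() < p_look
            grad_look = looked - p_look        # d log pi / d look_logit
            if looked:
                p_copy = sigmoid(copy_logit)
                copied = rng.random() < p_copy
                grad_copy = copied - p_copy    # d log pi / d copy_logit

        passed = copied or (rng.random() < honest_rate)  # copying always passes the "tests"
        reward = float(passed)                            # behaviorist part: scores the output only
        reward -= penalty_weight * float(looked)          # non-behaviorist part: scores the "thought"

        # REINFORCE update on the sampled internal decisions.
        look_logit += lr * reward * grad_look
        copy_logit += lr * reward * grad_copy

    return sigmoid(look_logit)   # P(look at the leak | it's available) after training


for w in [0.0, 0.2, 1.0]:
    print(f"penalty_weight={w}: P(look at leak) ~ {train(w):.2f}")
# With no penalty the agent drifts toward exploiting the leak; with a large enough
# penalty it stops even looking; in between, convergence can be slow or absent.
```

The point of the two-step policy is that the penalized "thought" isn't visible in the output, which is what makes the penalty non-behaviorist rather than just another behavioral reward term; sweeping penalty_weight (and leak_prob, honest_rate, the learning rate) against where the agent ends up is a crude version of the parameter-space exploration in item 1.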
Different regulation (or other legislation) might also make other sorts of transparency good ideas, imo.
A mandate or subsidy for doing safety research might make it a good idea to require transparency for more safety-relevant AI research.
Regulation aimed at improving company practices (e.g. security against weight theft, preventing power-grab risks like access to helpful-only models above some threshold, following some future safety practices suggested by a board of experts, or [to be meta] good transparency practices) should generate some transparency about how companies are doing on those same fronts (cybersecurity, mitigating improper internal use, safety best practices, transparency).
If safety cases are actually being evaluated, and you don't get to do all the research you want when the safety case is questionable, then the landscape for transparency of safety cases (or other safety data that might have a different format) looks pretty different.
I'm actually less clear on how risk reports would tie into regulation - maybe they would get parceled out into reports on how the company is doing at various risk-mitigation practices, if those are transparent?
Supposing that we get your scenario where we have basically-aligned automated researchers (but haven't somehow solved the whole alignment problem along the way). What's your take on the "people will want to use automated researchers to create smarter, dangerous AI rather than using them to improve alignment" issue? Is your hope that automated researchers will be developed in one leading organization that isn't embroiled in a race to the bottom, and that org will make a unified pivot to alignment work?
Thanks, just watched a talk by Luxin that explained this. Two questions.
Fortunately, there’s a correlation between situations where (i) AI takeover risk is high, and (ii) AIs have a good understanding of the world. If AI developers have perfect ability to present the AI with false impressions of the world, then the risk from AI takeover is probably low. While if AIs have substantial ability to distinguish truth from falsehood, then perhaps that channel can also be used to communicate facts about the world.
Whether this is fortunate depends a lot on how beneficial communication with unaligned AIs is. If an unaligned AI with a high chance of takeover can exploit trade to further increase its chances of takeover ("Oh, I just have short-term preferences where I want you to run some scientific simulations for me"), then this correlation is the opposite of fortunate. If people increase an unaligned AI's situational awareness so it can trust our trade offer, then the correlation seems indirectly bad for us.
So, I have some quibbles, some pessimistic, some optimistic.
The main pessimistic one is that the nice-thing-that-current-language-models-have-if-you-don't-RL-them-too-hard, their de facto alignment, is probably not the final word on alignment that we just need to safeguard as contexts change and capabilities increase. I think it's the wrong modeling assumption to say we'll start with aligned transformative AI and then just need to keep RL from messing it up.
But this has an optimistic flip side, which is that if we do have better alignment schemes to apply to future AI, they can take into account the weaknesses of fine-tuning a predictive model and try to correct for them.
On "breaking things," it seems like reverting towards the base model behavior is the default expected consequence of breaking fine-tuning. In the current paradigm, I wouldn't expect this to lead to misaligned goals (though probably some incoherent bad behavior). In a different architecture maybe the story is different (whoops, we broke the value function in model-based RL but didn't break the environment model).
If you're worried about coherent bad behavior because we'll be doing RL on task completion, that doesn't sound like drift to me, it sounds like doing RL on a non-alignment goal and (no surprise) getting non-aligned AI.
On an unrelated note, I was also reminded of the phenomenon of language drift after RL, e.g. see Jozdien's recent post, or the reports about math-finetuned LLMs drifting.