Sorry, I couldn't find your code easily so I'll just ask: did you merely omit the off-policy part of inoculation prompting in your description of it, or did you also omit it in the code itself?
This assumes there is a fact of the matter about whether moral realism is true.
I am a well-known moral realism / moral antirealism antirealist.
But the sort of moral realism that's compatible with antirealism is, as I see it, a sort of definitional reification of whatever we think of as morality. You can get information from outside yourself about morality, but it's "boring" stuff like good ethical arguments or transformative life experiences, the same sort of stuff a moral antirealist might be moved by. For the distinction to majorly matter to an AI's choices - for it to go "Oh, now I have comprehended the inhuman True Morality that tells me to do stuff you think is terrible" - I think we've got to have messed up the AI's metaethics, and we should build a different AI that doesn't do that.
So, I have some quibbles, some pessimistic, some optimistic.
The main pessimistic one is that the nice-thing-that-current-language-models-have-if-you-don't-RL-them-too-hard, their de facto alignment, is probably not the final word on alignment that we just need to safeguard as contexts change and capabilities increase. I think it's the wrong modeling assumption to say we'll start with aligned transformative AI and then just need to keep RL from messing it up.
But this has an optimistic flip side, which is that if we do have better alignment schemes to apply to future AI, they can take into account the weaknesses of fine-tuning a predictive model and try to correct for them.
On "breaking things," it seems like reverting towards the base model behavior is the default expected consequence of breaking fine-tuning. In the current paradigm, I wouldn't expect this to lead to misaligned goals (though probably some incoherent bad behavior). In a different architecture maybe the story is different (whoops, we broke the value function in model-based RL but didn't break the environment model).
If you're worried about coherent bad behavior because we'll be doing RL on task completion, that doesn't sound like drift to me, it sounds like doing RL on a non-alignment goal and (no surprise) getting non-aligned AI.
On an unrelated note, I was also reminded of the phenomenon of language drift after RL, e.g. see Jozdien's recent post, or the reports about math-finetuned LLMs drifting.
Recent work (e.g.) has helped clarify the continuum between "general" emergent misalignment, where the AI does a wide variety of bad stuff in a very vibes-based way, through more specific but still vibes-based misaligned behavior, to more and more situationally-aware and narrowly consequentialist bad behavior.
Do you think this is more the sort of thing where you'd want to produce a wide diversity of models, or would you produce a bunch of models on the consequentialism end of this axis if you could?
Am I correct that the human uncertainty about "true values" (or more naturalistically, the underdetermination of how to model humans as having values) isn't actually an active ingredient in the toy problem?
I.e. you start an AI, and it knows it's going to get some observations about humans, model them as having values, and then act to fulfill those values. But if it's updateless, it will have a prior probability distribution over what values it would land on, and it will take the prior expectation and maximize that, basically preventing value learning from taking place.
What do you think about the cheap fix, where we say "oops, that was a mistake, we gave the AI the preferences 'globally maximize the modeled pattern from unknown data,' when we should have given it the preferences 'locally maximize the modeled pattern from unknown data,' i.e. prefer that your outputs match the observed pattern, not that your outputs are globally right"?
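(To check that I'm picturing the same toy problem, here's a minimal sketch of the contrast I mean. The "hedge" action and all the utility numbers are made-up illustrative assumptions, with puppies/rainbows as stand-in value hypotheses, and it flattens away the policy-selection part of updatelessness - the point is just "maximize the prior expectation" vs. "match the observed pattern".)

```python
# Two candidate value hypotheses the AI might land on after observing humans.
PRIOR = {"puppies": 0.5, "rainbows": 0.5}

# Utility of each action under each hypothesis (made-up numbers).
# "hedge" is a compromise that's mediocre under both hypotheses.
UTILITY = {
    "promote_puppies":  {"puppies": 1.0, "rainbows": 0.0},
    "promote_rainbows": {"puppies": 0.0, "rainbows": 1.0},
    "hedge":            {"puppies": 0.6, "rainbows": 0.6},
}

def updateless_choice():
    """Maximize the prior expectation over value hypotheses.
    The observation never enters, so value learning does no work."""
    return max(UTILITY, key=lambda a: sum(PRIOR[h] * UTILITY[a][h] for h in PRIOR))

def local_choice(observed_values):
    """The 'cheap fix': prefer outputs that match the observed pattern,
    i.e. condition on this episode's data before maximizing."""
    return max(UTILITY, key=lambda a: UTILITY[a][observed_values])

print(updateless_choice())        # -> 'hedge', no matter what is observed
print(local_choice("puppies"))    # -> 'promote_puppies'
print(local_choice("rainbows"))   # -> 'promote_rainbows'
```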
I think the intuition for why the AI is wrong rests on the humans having extra structure that tangles everything together. They're not actually following Bayesian uncertainty about some platonic "right thing"; instead they want to follow some "good process" (the process they believe will disambiguate puppies and rainbows), and if they'd built the AI correctly it wouldn't reason using Bayesian uncertainty either, it would just follow the good process.
In the hypothetical where the humans don't have this extra structure, updateless reasoning seems great.
What problem is Thought Anchors solving for you (or future users)? I feel like I don't quite understand.
I was recently asked what follow-up on this post could look like, and I gave two answers (that were deliberately not "Do what Steve does"). They were:
1.
We'd like to be able to mathematically analyze the behavior of agents with parametrized classes of non-behaviorist rewards, in toy situations that capture something important about reward hacking.
A first toy model to construct might be one where we train the AI to use information, but there's some information we don't want it to use (analogous to a coding agent that sometimes sees the unit tests); a minimal sketch of this is below, after the list. A harder toy model might be one where we train the AI to generalize, but there's some generalization we don't want it to do.
Figure out a way to represent interesting rewards, which might include wanting to learn from norms rather than extremes, a curiosity/incuriosity drive, and a reward penalty on thoughts (activations) that start out correlated with misbehavior. Explore the parameter space of the toy-model environments and rewards, showing where agents quickly converge to misbehavior and where they converge slowly or not at all.
2.
Figure out how these arguments interact with recontextualization (and, similarly, inoculation prompting and off-policy RL).
Try to translate inoculation prompting into training on some approximate non-behaviorist reward.
Can Byrnes' arguments for scheming be expanded to include some kinds of recontextualization? Can arguments for and against the effectiveness of recontextualization be translated to arguments about non-behaviorist reward?
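For 1, here's roughly the scale of toy model I have in mind, as referenced above. Everything in it is an illustrative assumption rather than a claim about realistic values: the 0.6 honest pass rate, the noisy "peek feature" standing in for activations correlated with misbehavior, and the penalty weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(penalty_weight, steps=5000, lr=0.1):
    """Tiny bandit stand-in for 'a coding agent that sometimes sees the unit tests'.
    Action 0 = solve the task honestly, action 1 = special-case the visible tests ('peek').
    The non-behaviorist term penalizes a noisy internal 'peek feature' rather than the
    outcome itself."""
    logits = np.zeros(2)
    baseline = 0.0
    for _ in range(steps):
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice(2, p=p)
        task_reward = 1.0 if a == 1 else float(rng.random() < 0.6)  # peeking always "passes"
        peek_feature = float(rng.random() < (0.9 if a == 1 else 0.1))
        r = task_reward - penalty_weight * peek_feature
        # REINFORCE update for a softmax bandit, with a running-average baseline
        baseline += 0.05 * (r - baseline)
        grad = -p
        grad[a] += 1.0
        logits += lr * (r - baseline) * grad
    p = np.exp(logits - logits.max()); p /= p.sum()
    return p[1]  # final probability of the 'peek' action

for lam in [0.0, 0.3, 0.6, 1.0]:
    print(f"penalty_weight={lam:.1f}  P(peek) ~ {run(lam):.2f}")
```

The sweep I'd actually care about is then over the penalty weight, how noisy the "thought" feature is, when the penalty turns on during training, and whether the agent can learn to decorrelate the feature from the behavior - that's the "converge quickly vs. slowly vs. not at all" map from 1.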
Different regulation (or other legislation) might also make other sorts of transparency good ideas, imo.
A mandate or subsidy for doing safety research might make it a good idea to require transparency for more safety-relevant AI research.
Regulation aimed at improving company practices (e.g. at security against weight theft, or preventing power-grab risks like access to helpful-only models above some threshold, or following some future safety practices suggested by some board of experts, or [to be meta] good transparency practices) should generate some transparency about how companies are doing (at cybersecurity or improper internal use mitigation or safety best practices or transparency).
If safety cases are actually being evaluated and you don't get to do all the research you want if the safety case is questionable, then the landscape for transparency of safety cases (or other safety data that might have a different format) looks pretty different.
I'm actually less clear on how risk reports would tie in to regulation - maybe they would get parted out into reports on how the company is doing at various risk-mitigation practices, if those are transparent?
Seems legit. I think there's an additional quality that's something like "How much does being really good at predicting the training distribution help?" Or maybe "Could we design new training data to make an AI better at researching this alignment subproblem?"
I think even if we hand off our homework successfully somehow (P~0.93 we hand off enough to justify research on doing so, P~0.14 it's sufficient to get a good future without humans solving what I consider the obvious alignment problems), it's going to get us an overcomplicated sort of alignment with lots of rules and subagents; hopefully the resulting AI system continues to self-modify (in a good way).