LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.
Inspired by a recent comment: a potential AI movie or TV show that might introduce good ideas to society is one where there are already uploads, LLM agents, and biohumans who are beginning to get intelligence-enhanced, but there is a global moratorium on making any individual much smarter.
There's an explicit plan for gradually ramping up intelligence, running on tech that doesn't require ASI (i.e. datacenters are centralized, monitored, and controlled via international agreement, and studying bioenhancement or AI development requires approval from your country's FDA equivalent). There is some illegal research, but it's much less common. I.e. the Controlled Takeoff is working a'ight.
If it were a TV show, the first season would mostly be exploring how uploads, ambiguously-sentient-LLMs, enhanced humans and regular humans coexist.
Main character is an enhanced human, worried about uploads gaining more political power because there are starting to be more of them, and research to speed them up or improve them is easier.
Main character has parents and a sibling or friend who are choosing to remain unenhanced, and there is some conflict about it.
By the end of season 1, there's a subplot about illegal research into rapid superintelligence.
I think this sort of world could actually just support a pretty reasonable set of stories that mainstream people would be interested in, and I think it would be great to get the meme of "rapidly increasing intelligence is dangerous (but increasing intelligence can be good)" into the water.
I think I'm imagining "Game of Thrones" vibes but it could support other vibes.
Since I think this post will get referenced and showcased a fair amount, I'd like to complain that abstracts should be either Actually Short™, or broken into paragraphs. (I know it's the industry standard to do the opposite; this is just my pet hill to die on.)
Here's my guess of how to break it up into paragraphs that are IMO easier to read.
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training.
First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users.
Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.
Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking.
Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training.
We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference—as in this case—or not.
Notes on editing the podcast with Claude Code
I've been wondering whether Lightcone should try to make something kinda-like-this for our podcast room. To what degree did you feel like this was on-track to be something that could be streamlined and pretty automated?
Apologies, I hadn't actually read the post at the time I commented here.
In an earlier draft of the comment I did include a line that was "but, also, we're not even really at the point where this was supposed to be happening, the AIs are too dumb", but I removed it in a pass that was trying to simplify the whole comment.
But as of the last time I checked (maybe not in the past month), models are just nowhere near the level of worldmodeling/planning competence where scheming behavior should be expected.
(Also, as models get smart enough that this starts to matter: the way this often works in humans is that their conscious verbal planning loop ISN'T aware of their impending treachery; they earnestly believe themselves when they tell the boss "I'll get it done", and then later they just find themselves goofing off instead, or changing their mind.)
Unless you're posing a non-smooth model where we're keeping them at bay now but they'll increase later on?
This is what the "alignment is hard" people have been saying for a long time. (Some search terms here include "treacherous turn" and "sharp left turn")
https://www.lesswrong.com/w/treacherous-turn
A central AI alignment problem: capabilities generalization, and the sharp left turn
(My bad, I hadn't read the post at the time I commented, so this presumably came across as cluelessly patronizing.)
To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way".
I think I had missed this, but it doesn't resolve the confusion in my #2 note. (Like, it still seems like something is weird about saying "solve metaphilosophy such that everyone can agree it is correct" is more worth considering than "solve metaethics such that everyone can agree it is correct". I can totally buy that they're qualitatively different, and I maybe have some guesses for why you think that. But I don't think the post spells out why, and it doesn't seem that obvious to me.)
Hmm, I like #1.
#2 feels like it's injecting some frame that's a bit weird to inject here (don't roll your own metaethics... but rolling your own metaphilosophy is okay?)
But also, I'm suddenly confused about who this post is trying to warn. Is it more like labs, or more like EA-ish people doing a wider variety of meta-work?
What are you supposed to do other than roll your own metaethics?
Mostly this has only been a sidequest I periodically mull over in the background. (I expect to someday focus more explicitly on it, although it might be more in the form of making sure someone else is tackling the problem intelligently).
But I did previously pose this as a kind of open question in "What are important UI-shaped problems that Lightcone could tackle?" and "JargonBot Beta Test" (the latter notably didn't really work; I have hopes of trying again with a different tack). Thane Ruthenis replied with some ideas that were in this space (about making it easier to move between representations-of-a-problem).
https://www.lesswrong.com/posts/t46PYSvHHtJLxmrxn/what-are-important-ui-shaped-problems-that-lightcone-could
I think of many Wentworth posts as relevant background:
My personal work so far has been building a mix of exobrain tools that are more for rapid prototyping of complex prompts in general. (This has mostly been a side project I'm not primarily focused on atm.)
Yeah, I went to try to write some stuff and felt bottlenecked on figuring out how to generate a character I connect with. I used to write fiction, but that was like 20 years ago and I'm out of touch.
I think a good approach here would be to start with some serial webfiction since that's just easier to iterate on.