Zvi - AI Alignment Forum

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Zvi3mo32

That seems rather loaded in the other direction. How about “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”?

RSPs are pauses done right

Zvi6mo32

Is evaluation of capabilities, which as you note requires fine-tuning and other such techniques, a realistic thing to properly do continuously during model training, without that being prohibitively slow or expensive? Would doing this be part of the intended RSP?

Evaluating the historical value misspecification argument

Zvi7mo1816

As an experimental format, here is the first draft of what I wrote for next week's newsletter about this post:

Matthew Barnett argues that GPT-4 exhibiting common sense morality, and being able to follow it, should update us towards alignment being easier than we thought, and MIRI-style people refusing to do so are being dense. The claim is that the AI is not going to maximize the utility function you gave it at the expense of all common sense.

As usual, this logically has to be more than zero evidence for that position, given how we would react if GPT-4 indeed lacked such common sense or was unable to give answers that pleased humans at all. Thus, we should update a non-zero amount in that direction, at least if we ignore the danger of being led down the wrong alignment path.

However, I think this misunderstands what is going on. GPT-4 is trained on human feedback, so it is choosing responses that maximize the probability of positive user response in the contexts where it gets feedback. If that is functionally your utility function, you want to respond with answers that appear, to humans similar to the ones who provided you with feedback, to reflect common sense and seem to avoid violating various other concerns. That will be more important than maximizing the request made, especially if strong negative feedback was given for violations of various principles including common sense.

Thus, I think GPT-4 is indeed doing a decent job of extracting human preferences, but only in the sense that it is predicting what preferences we would consciously choose to express in response under strong compute limitations. For now, that looks a lot like having common sense morality, and mostly works out fine. I do not think this has much bearing on the question of what it would take to make something work out fine in the future, under much stronger optimization pressure. I think you metaphorically do indeed get to the literal genie problem from a different angle. I would say that the misspecification problems remain highly relevant, and that yes, as you gain in optimization power your need to correctly specify the exact objective increases, and if you are exerting far-above-human levels of optimization pressure based only on the level of value alignment that humans consciously express under highly limited compute, you are going to have a bad time.

I believe MIRI folks have a directionally similar position to mine, only far stronger.

The Waluigi Effect (mega-post)

Zvi1y718

This is great. I notice I very much want a version that is aimed at someone with essentially no technical knowledge of AI and no prior experience with LW - and this seems like it's much better at that than par, but still not where I'd want it to be. Whether or not I manage to take a shot, I'm wondering if anyone else is willing to take a crack at that?

chinchilla's wild implications

Zvi2y30

Scott Alexander asked things related to this, but it still seems worth being more explicit: what would this perfect 1.69 loss model be like in practice if we got there?

Biology-Inspired AGI Timelines: The Trick That Never Works

Zvi2y300

Things I instinctively observed slash that my model believes that I got while reading that seem relevant, not attempting to justify them at this time:

There is a core thing that Eliezer is trying to communicate. It's not actually about timeline estimates, that's an output of the thing. Its core message length is short, but all attempts to find short ways of expressing it, so far, have failed.
Mostly so have very long attempts to communicate it and its prerequisites, which to some extent at least includes the Sequences. Partial success in some cases, full success in almost none.
This post, and this whole series of posts, feels like its primary function is training data to use to produce an Inner Eliezer that has access to the core thing, or even better to know the core thing in a fully integrated way. And maybe a lot of Eliezer's other communication is kind of also trying to be similar training data, no matter the superficial domain it is in or how deliberate that is.
The condescension is important information to help a reader figure out what is producing the outputs, and hiding it would make the task of 'extract the key insights' harder.
Similarly, the repetition of the same points is also potentially important information that points towards the core message.
That doesn't mean all that isn't super annoying to read and deal with, especially when he's telling you in particular that you're wrong. Cause it's totally that.
There are those for whom this makes it easier to read, especially given it is very long, and I notice both effects.
My Inner Eliezer says that writing this post without the condescension, or making it shorter, would be much much more effort for Eliezer to write. To the extent such a thing can be written, someone else has to write that version. Also, it's kind of text in several places.
The core message is what matters and the rest mostly doesn't?
I am arrogant enough to think I have a non-zero chance that I know enough of the core thing and have enough skill that with enough work I could perhaps find an improved way to communicate it given the new training data, and I have the urge to try this impossible-level problem if I could find the time and focus (and help) to make a serious attempt.

A Critique of Functional Decision Theory

Zvi5y60

I am deeply confused how someone who is taking decision theory seriously can accept Guaranteed Payoffs as correct. I'm even more confused how it can seem so obvious that anyone violating it has a fatal problem.

Under certainty, this is assuming CDT is correct, when CDT seems to have many problems other than certainty. We can use Vaniver's examples above, or use a reliable insurance agent to remove any uncertainty, or we also can use any number of classic problems without any uncertainty (or remove it), and see that such an agent loses - e.g. Parfit's Hitchhiker in the case where he has 100% accuracy.
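
To make the Parfit's Hitchhiker point concrete, here is a minimal sketch of the 100% accuracy case. The payoff numbers and the names `driver_rescues` and `outcome` are assumptions made up for illustration, not anything from the original post; the only structure taken from the argument is that a perfect predictor rescues exactly the agents whose policy is to pay once in town.

```python
# Hypothetical payoffs, chosen only for illustration (not from the original post).
DEATH_IN_DESERT = -1_000_000
COST_OF_PAYING = -1_000


def driver_rescues(policy_pays: bool) -> bool:
    """A 100%-accurate predictor: the driver rescues exactly those who will pay."""
    return policy_pays


def outcome(policy_pays: bool) -> int:
    """Total payoff for an agent whose policy, once in town, is to pay or not."""
    if not driver_rescues(policy_pays):
        return DEATH_IN_DESERT
    # Rescued: the agent now acts on its policy. With a perfect predictor,
    # a rescued agent is always one that pays, so the else branch is unreachable.
    return COST_OF_PAYING if policy_pays else 0


payer = outcome(policy_pays=True)
refuser = outcome(policy_pays=False)
print(f"pays in town: {payer}, refuses in town: {refuser}")
```

Under these assumed numbers, the agent whose policy is to refuse payment once safely in town never gets the ride, so it ends up far worse off than the agent who pays, with no uncertainty involved anywhere.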
