All of Charlie Steiner's Comments + Replies

Bah! :D It's sad to hear he's updated away from ambitious value learning towards corrigibility-like targets. Eliezer's second-hand argument sounds circular to me; suppose that corrigibility as we'd recognize it isn't a natural abstraction - then generic AIs wouldn't use it to align child agents (instead doing something like value learning, or something even more direct), and so there wouldn't be a bunch of human-independent examples, so it wouldn't show up as a natural abstraction to those AIs.

I think this post was potentially too long :P

To some extent, I think it's easy to pooh-pooh finding a robust reward function (not maximally robust, merely way better than the state of the art) when you're not proposing a specific design for building an AI that does good things and not bad things. Not in the tone of "how dare you not talk about specifics," but more like "I bet this research direction would have to look more orthodox when you get down to brass tacks."


Yes, I too agree that planning using a model of the world does a pretty good job of capturing what we mean when we say "caring about things."

Of course, AIs with bad goals can also use model-based planning.

Some other salient features:

  • Local search rather than global. Alternatively could be framed as regularization on plans to be close to some starting distribution. This isn't about low impact, because we still want the AI to search well enough to find clever and novel plans; instead it's about avoiding extrema that are really far from the starting distribution
... (read more)
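The local-search-as-regularization framing in the bullet above can be sketched as hill-climbing on a score minus a proximity penalty. This is a toy sketch, not anyone's actual proposal; the function names, the penalty form, and the numbers are all made up:

```python
import numpy as np

def regularized_search(score, start, lam=1.0, step=0.05, iters=300, seed=1):
    """Hill-climb on score(x) - lam * ||x - start||^2: the search can still
    find clever nearby improvements, but plans far from the starting
    distribution are penalized, so distant extrema get avoided."""
    obj = lambda y: score(y) - lam * np.sum((y - start) ** 2)
    x = start.copy()
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        cand = x + step * rng.normal(size=x.shape)
        if obj(cand) > obj(x):
            x = cand
    return x

start = np.zeros(2)
# Made-up score: a mild optimum near the start, plus a huge narrow spike far away.
score = lambda y: -np.sum((y - 1.0) ** 2) + 100.0 * np.exp(-np.sum((y - 8.0) ** 2))
plan = regularized_search(score, start)
print(plan)  # settles near the mild nearby optimum, never reaches the far spike
```

The point of the sketch is just that the penalty term doesn't stop local improvement, it stops the search from wandering off to extrema far from the starting distribution.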
2Alex Turner3d
I don't think it's naturally framed in terms of distance metrics I can think of. I think a values-agent can also end up considering some crazy impressive plans (as you might agree). I both agree and disagree. I think that reasoning about mechanisms and not words is vastly underused in AI alignment, and endorse your pushback in that sense. Maybe I should write future essays with exhortations to track mechanisms and examples while following along. But also I do perceive a natural category here, and I want to label it. I think the main difference between "grader optimizers" and "value executors" is that grader optimizers are optimizing plans to get high evaluations, whereas value executors find high-evaluating plans as a side effect of cognition. That does feel pretty natural to me, although I don't have a good intensional definition of "value-executors" yet.

One thing might be that I'd rather have an AI design that's more naturally self-reflective, i.e. using its whole model to reason about itself, rather than having pieces that we've manually retargeted to think about some other pieces. This reduces how much Cartesian doubt is happening on the object level all at the same time, which sorta takes the AI farther away from the spec. But this maybe isn't that great an example, because maybe it's more about not endorsing the "retargeting the search" agenda.

I'm somewhat confused, but it does seem like there are two AIs when you talk about doing automated anomaly detection for deceptive alignment. If I attempt to read your mind, I get a lot of disjoint possibilities. Some of them are:

  • We probably agree but you don't quite know what I'm talking about either, or
  • You don't think anomaly detection counts as "an AI," maybe because you expect it to not involve much learning or model-building (where I would expect it to involve model-building), or
  • You expect anomaly detection to require cleverness, but think that cleverness will all be located in one place, so that we're really talking about one AI reflecting on itself.
3Paul Christiano5d
The general strategy I'm describing for anomaly detection is:

  • Search for an explanation of a model behavior (like "answers questions coherently") on the training set.
  • Given a new input, take a sub-explanation that explains almost all of the training set behavior but doesn't explain the behavior on the new input.
  • If you can't find one, then call the behavior potentially anomalous (e.g. because on the training set coherence follows from the logical structure of the world, but on the deceptive alignment point it follows from a desire to avoid looking incoherent).

The first two steps are solving problems in NP, so I think it's reasonable to expect them to be easier from an alignment perspective. (We also don't imagine solving them using the same AI techniques applied on the object level, and if we did then I would agree that we have further deceptive alignment problems, but still it seems helpful to ask your AI to do something that is both tractable and formally verifiable.) My sense is that you probably don't think this kind of strategy can work, and that instead anomaly detection requires something more like training a second AI to tell us if something is weird. I'd agree that this doesn't sound like progress.

Fun exercise, but I'm not a fan of the total cartesian doubt phase - I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness.

1Thane Ruthenis4d
Do you have anything specific in mind?

While noting down a highly lukewarm hot take about ELK, I thought of a plan for a "heist:"

Create a copy of your diamond, then forge evidence both of swapping my forgery with the diamond in your vault, and of you covering up that swap. Use PR to damage your reputation and convince the public that I in fact hold the real diamond. Then sell my new original for fat stacks of cash. This could make a fun heist movie, where the question of whether the filmed heist is staged or actually happened is left with room for doubt by the audience.

Anyhow, I feel like there's ... (read more)

2Paul Christiano6d
There isn't supposed to be a second AI. In the object-level diamond situation, we have a predictor of "does the diamond appear to remain in the vault," we have a proposed action and predict that if we take it the diamond will appear to remain in the vault, and we want to know whether the diamond appears to remain in the vault for the normal reason. For simplicity, when talking about ELK in this post or in the report, we are imagining literally selecting actions by looping over each possible action and predicting its consequences, or doing some kind of more clever search (but where the alignment difficulty comes from the search). You could also try to apply this to a model-free RL agent. I think that's probably not very different. My best guess for how to do it is to train a question-answering head to talk about the possible consequences of its plan, and then use this machinery to keep that honest. But I don't discuss it in this post and haven't thought about it as much.

I'm thinking about the paper Ewert 1987, which I know about because it spurred Dennett's great essay Eliminate the Middletoad, but I don't really know the gory details of, sorry.

I agree the analogy is weak, and there can be disanalogies even between different ANN architectures. I think my intuition is based more on some general factor of "human science being able to find something interesting in situations kinda like this," which is less dependent on facts of the systems themselves and more about, like, do we have a paradigm for interpreting signals in a bi... (read more)

But I do take this as evidence in favour of it working really well.

I'd argue that most of the updating should already have been done, not even based on Chris Olah's work, but on neuroscientists working out things like the toad's prey-detection circuits.

Not sure what you mean by this

You seem pretty motivated by understanding in detail why and how NNs do particular things. But I think this doesn't scale to interpreting complicated world-modeling, and think that what we'll want is methods that tell us abstract properties without us needing to u... (read more)

1Neel Nanda24d
Thanks for clarifying your position, that all makes sense. Huh, can you say more about this? I'm not familiar with that example (though have a fairly strong prior on there being at best a weak association between specific neuroscience results + specific AI interp results)

Fun, informative, definitely rambling.

I think this is the sort of thing you should expect to work fine even if you can't tell if a future AI is deceiving you, so I basically agree with the authors' prognostication more than yours. I think for more complicated questions like deception, mechanistic understanding and human interpretability will start to come apart. Good methods might not excite you much in terms of the mechanistic clarity they provide, and vice versa.

1Neel Nanda24d
Idk, I definitely agree that all data so far is equally consistent with 'mechanistic interp will scale up to identifying whether GPT-N is deceiving us' and with 'MI will work on easy problems but totally fail on hard stuff'. But I do take this as evidence in favour of it working really well. What kind of evidence could you imagine seeing that mechanistic understanding is actually sufficient for understanding deception? Not sure what you mean by this

Nice, now I can ask you some questions :D

It seemed like the watermaze environment was actually reasonably big, with observation space of a small colored picture, and policy needing to navigate to a specific point in the maze from a random starting point. Is that right? Given context window limitations, wouldn't this change how many transitions it expects to see, or how informative they are?

Ah, wait, I'm looking through the paper now - did they embed all tasks into the same 64d space? This sounds about right for the watermaze and massive overkill for l... (read more)

3Sam Marks1mo
The paper is frustratingly vague about what their context lengths are for the various experiments, but based off of comparing figures 7 and 4, I would guess that the context length for Watermaze was 1-2 times as long as an episode length (= 50 steps). (It does indeed look like they were embedding the 2d dark room observations into a 64-dimensional space, which is hilarious.) I'm not sure I understand your second question. Are you asking about figure 4 in the paper (the same one I copied into this post)? There's no reward conditioning going on. They're also not really comparing like to like, since the AD and ED agents were trained on different data (RL learning trajectories vs. expert demonstrations). Like I mentioned in the post, my story about this is that the AD agents can get good performance by, when the previous episode ends with reward 1, navigating to the position that the previous episode ended in. (Remember, the goal position doesn't change from episode to episode -- these "tasks" are insanely narrow!) On the other hand, the ED agent probably just picks some goal position and repeatedly navigates there, never adjusting to the fact that it's not getting reward.

It feels like the humans in the examples aren't really playing for anything besides perpetuating the status quo. Even the fictional people trying to cure cancer don't seem to be actually leveraging their superintelligent AI to steer the world to a state where cancer is cured, instead they seem to be using their superintelligent AI to answer certain biology questions, which they, the people wearing white lab coats, can then use to make decisions that affect the world.

I think three things about this:

It's implausible to me. It seems like a sizeable fraction o... (read more)

3Steve Byrnes1mo
One thing is, I think you’re sorta assuming that the AI is omniscient, aligned, and completely trusted by the human. With those assumptions, I would hope that the person just lets the AI loose onto the internet to usher in utopia! (I.e. Section 3.5.2) Rather than omniscience, I’m assuming that we’re coming in at a stage where the early AIs are insightful, systematic, fast-thinking, patient, etc., maybe moreso than humans along these dimensions, plus they have the ability to spin off copies and so on. But they still need to figure things out, their plans may have flaws (especially given a relative lack of real-world experience), and they can’t magic up solutions to every problem. I claim that already at this stage, the AI can probably start a deadly pandemic if it wanted to. (Or ten deadly pandemics at once.) But at the same stage, if the employee asks the AI “What do I do now? We need to deal with the out-of-control-AI problem, right? Any ideas?” then it might not have any, or at least any that the employee would endorse. (E.g. maybe the only plans likely to succeed that it can think of are very illegal.) Maybe you’ll say “The AI will convince the person to do aggressive illegal actions rather than twiddle their thumbs until the apocalypse.” I’m open to that, but it entails rejecting corrigibility, right? So really, this is Section 3.5.2 territory. If we’re talking about an AGI that’s willing and able to convince its (so-called) supervisor to do actions that the (so-called) supervisor initially doesn’t want to do, because the AGI thinks they’re in the (so-called) supervisor’s long-term best interest, then we are NOT talking about a corrigible AGI under human control, rather we’re talking about a non-corrigible, out-of-control AGI. So we better hope that it’s a friendly out-of-control AGI!! I think it’s more like, I’m trying to question the (I think) common belief that there’s a path to a good future involving things like “corrigibility [

interested to see what's next.

One notable absence is the Solomonoff prior, where you weight predictions (of prefix-free TMs) by 2^-length to get a probability distribution. Related would be approximations like MML prediction.

Another nitpick would be that Shannon entropy is defined for distributions, not just raw strings of data, so you also have to fix the inference process you're using to extract probabilities from data.
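Both nitpicks can be made concrete in one toy sketch (the "programs" and code lengths here are made up, and this is nothing like real Solomonoff induction): normalize 2^-length weights into a distribution, and note that the Shannon entropy is then a property of that distribution, not of any raw string of data.

```python
import math

# Hypothetical prefix-free code lengths for three "programs."
lengths = {"p1": 3, "p2": 5, "p3": 5}
weights = {p: 2.0 ** -l for p, l in lengths.items()}
Z = sum(weights.values())
dist = {p: w / Z for p, w in weights.items()}  # p1 -> 2/3, p2 and p3 -> 1/6 each

# Shannon entropy is defined for the distribution we just fixed,
# not for the underlying data itself.
entropy = -sum(q * math.log2(q) for q in dist.values())
print(entropy)  # ≈ 1.25 bits
```

If you changed the inference process that assigns these probabilities, the entropy would change too, which is the point of the nitpick.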

Did you notice any qualitative trends in responses as you optimized harder for the models of the gold RM? Like, anything aside from just "sounding kind of like instruct-GPT"?

There's an example in the appendix but we didn't do a lot of qualitative analysis.

I would say that there's a logical object that a large chunk of human moral discourse is trying to point at — something like "the rules of the logical game Morality", analogous to "the rules of the logical game Chess". Two people can both be discussing the same logical object "the rules of Chess", but have different beliefs about what that logical object's properties are. And just as someone can be mistaken or uncertain about the rules of chess — or about their interaction in a specific case — someone can be uncertain about morality.

When I think about the r... (read more)

The present is "good on its own terms", rather than "good on Ancient Romans' terms", because the Ancient Romans weren't able to lock in their values. If you think this makes sense (and is a good thing) in the absence of an Inherent Essence Of Goodness, then there's no reason to posit an Inherent Essence Of Goodness when we switch from discussing "moral progress after Ancient Rome" to "moral progress after circa-2022 civilization".

The present is certainly good on my terms (relative to ancient Rome). But the present itself doesn't care. It's not the type of ... (read more)

e.g. by trying to apply standards of epistemic uncertainty to the state of this essence? 

I would say that there's a logical object that a large chunk of human moral discourse is trying to point at — something like "the rules of the logical game Morality", analogous to "the rules of the logical game Chess". Two people can both be discussing the same logical object "the rules of Chess", but have different beliefs about what that logical object's properties are. And just as someone can be mistaken or uncertain about the rules of chess — or about their int... (read more)

The goal should be to cause the future to be great on its own terms

What the heck is this supposed to mean? Great according to the Inherent Essence Of Goodness that lives inside futures, rather than as part of human evaluations? Because I've got bad news for that plan.

Honestly, I'm disappointed by this post.

You say you've found yourself making this argument a lot recently. That's fair. I think it's totally reasonable that there are some situations where this argument could move people in the right direction - maybe the audience is considering defecting... (read more)

"The goal should be to cause the future to be great on its own terms"

What the heck is this supposed to mean? Great according to the Inherent Essence Of Goodness that lives inside futures, rather than as part of human evaluations?

The rest of the quote explains what this means:

The goal should be to cause the future to be great on its own terms, without locking in the particular moral opinions of humanity today — and without locking in the moral opinions of any subset of humans, whether that’s a corporation, a government, or a nation.

(If you can't s

... (read more)

Does Ryan have an agenda somewhere? I see this post, but I don't think that's it.

2Ryan Greenblatt2mo
I don't have an agenda posted anywhere.

How would one use this to inform decomposition?

What I want are some human-meaningful features that can get combined in human-meaningful ways.

E.g. you take a photo of a duck, you take a feature that means "this photo was taken on a sunny day," and then you do some operation to smush these together and you get a photo of a duck taken on a sunny day.

If features are vectors of fixed direction with size drawn from a distribution, which is my takeaway from the superposition paper, then the smushing-together operation is addition (maybe conditional on the dot pro... (read more)
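That smushing-together operation can be sketched as follows (a toy illustration: the feature names, dimensions, and sizes are all made up, and it assumes near-orthogonal feature directions as in the superposition setting):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
# Two hypothetical fixed unit feature directions, e.g. "duck" and "sunny day."
duck = rng.normal(size=d); duck /= np.linalg.norm(duck)
sunny = rng.normal(size=d); sunny /= np.linalg.norm(sunny)

# Smushing together = addition, with feature sizes drawn from some distribution.
activation = 2.0 * duck + 1.5 * sunny

# Reading a feature back out with a dot product roughly recovers its size,
# because random high-dimensional directions are nearly orthogonal.
print(activation @ duck, activation @ sunny)
```

The dot-product readout is exactly where the "maybe conditional on the dot product" caveat bites: interference between non-orthogonal features is what makes the readout only approximate.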

2Lee Sharkey2mo
This is one of the major research questions that will be important to answer before polytopes can be really useful in mechanistic descriptions. By choosing to use clustering rather than dimensionality reduction methods, we took a non-decompositional approach here. Clustering was motivated primarily by wanting to capture the monosemanticity of local regions in neural networks. But the ‘monosemanticity’ that I’m talking about here refers to the fact that small regions of activation mean one thing on one level of abstraction; this ‘one thing’ could be a combination of features. This therefore isn’t to say that small regions of activation space represent only one feature on a lower level of abstraction. Small regions of activation space (e.g. a group of nearby polytopes) might therefore exhibit multiple features on a particular level of abstraction, and clustering isn’t going to help us break apart that level of abstraction into its composite features. Instead of clustering, it seems like it should be possible to find directions in spline code space, rather than directions in activation space. Spline codes can incorporate information about the pathway taken by activations through multiple layers, which means that spline-code-directions roughly correspond to ‘directions in pathway-space’. If directions in pathway-space don’t interact with each other (i.e. a neuron that’s involved in one direction in pathway-space isn’t involved in other directions in pathway-space), then I think we’d be able to understand how the network decomposes its function simply by adding different spline code directions together. But I strongly expect that spline-code-directions would interact with each other, in which case straightforward addition of spline-code-directions probably won’t always work. I’m not yet sure how best to get around this problem.

It strikes me that the kind of self-supervision you describe is suspiciously similar to trying to incorporate meta-preferences in the outer objective by self-modeling. When the model understands humans differently, it changes its notion of what it is to be misaligned or deceptive, which gets used to form a loss term against that kind of behavior, which then impacts how it understands humans.

This analogy:

  • makes me significantly more optimistic about interpretability inside the training loop.
  • makes me slightly more pessimistic about meta-preferences incorpor
... (read more)

I think this really incentivizes things like network dissection over "interpret this neuron" approaches.

What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"?

Mimicking adult behavior even when the adult isn't paying any attention to the child (and children with different genes having slightly different sorts of mimicry). Automatically changing purity norms in response to disease and perceived disease risk. Having a different outlook on the world if you always had plenty of food growing up. Children of athletic parents often being athletic too, which changes how they... (read more)

5Alex Turner2mo
I think these are great counterpoints. Thanks for making them. I still buy "the helicopter parent 'outer alignment' training regime is unwise for 'aligning' kids" and that deliberate parenting is better than chance. But possibly/probably not the primary factor. I haven't yet read much data here so my views feel relatively unconstrained, beyond my "common sense." I think there's an additional consideration with AI, though: We control the reward circuitry. If lots of variance in kid-alignment is due to genetic variation in reward circuitry or learning hyperparameters or whatever, then we also control that with AI; that is also part of understanding AI inductive biases.

pg 6 "there exist" -> "there exists"

pg 13 maybe specify that you mean a linear functional that cannot be written as an integral (I quickly jumped ahead after thinking of one where you don't need to take any integrals to evaluate it)

The reason babies grow up into people that share our values has very little to do with our understanding of their inductive biases (i.e. most of the work is done by gene-environment interactions with parts of the environment that aren't good at predicting the child's moral development). The primary meaning of this comment is pointing out that a particular statement about children is wrong in a kind-of-funny way.

I have this sort of humorous image of someone raising a child, saying "Phew, thank goodness I had a good understanding of my child's inductive bias... (read more)

I agree that similar environments are important, but I don't see why you think they explain most of the outcomes. What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"? 

Like, what it feels like to understand human inductive biases isn't to think "Gee, I understand inductive biases!". It's more like: "I see that my son just scowled after agreeing to clean his room. This provides evidence about his internal motivational composition, even though I can't do interpre... (read more)

I'd be interested :) I think my two core concerns are that our rules/norms are meant for humans, and that even then, actors often have bad impacts that would only be avoided with a pretty broad perspective about their responsibilities. So an AI that follows rules/norms well can't just understand them on the object level, it has to have a really good understanding of what it's like to be a human navigating these rules/norms, and use that understanding to make things go well from a pretty broad perspective.

That first one means that not only do I not want the... (read more)

2Xuan (Tan Zhi Xuan)3mo
I think there's something to this, but I think perhaps it only applies strongly if and when most of the economy is run by or delegated to AI services? My intuition is that for the near-to-medium term, AI systems will mostly be used to aid / augment humans in existing tasks and services (e.g. the list in the section on Designing roles and norms), for which we can either use existing laws and norms, or extensions of them. If we are successful in applying that alignment approach in the near-to-medium term, as well as the associated governance problems, then it seems to me that we can much more carefully control the transition to a mostly-automated economy as well, giving us leeway to gradually adjust our norms and laws. No doubt, that's a big "if". If the transition to a mostly/fully-automated economy is sharper than laid out above, then I think your concerns about norm/contract learning are very relevant (but also that the preference-based alternative is more difficult still). And if we did end up with a single actor like OpenAI building transformative AI before everyone else, my recommendation would still be to adopt something like the pluralistic approach outlined here, perhaps by gradually introducing AI systems into well-understood and well-governed social and institutional roles, rather than initiating a sharp shift to a fully-automated economy. Yes, it seems like a number of AI policy people at least noticed the tweet I made about this talk! If you have suggestions for who in particular I should get the attention of, do let me know.

Upvoted, and even agree with everything about enlightened compliance, but I think this framing of the problem is bad because everything short of enlightened compliance is so awful. The active ingredients in making things go well are not the norms, which if interpreted literally by AI will just result in blind rules-lawyering in a way that alienates humans - the active ingredients are precisely the things that separate enlightened compliance from everything else. You can call it learning human values or learning to follow the spirit of the law, it's basically the same computations, with basically the same research avenues and social/political potential pitfalls.

2Xuan (Tan Zhi Xuan)3mo
Agreed that interpreting law is hard, and the "literal" interpretation is not enough! Hence the need to represent normative uncertainty (e.g. a distribution over multiple formal interpretations of a natural language statement + having uncertainty over what terms in the contract are missing), which I see the section on "Inferring roles and norms" as addressing in ways that go beyond existing "reward modeling" approaches. Let's call the above "wilful compliance", and the fully-fledged reverse engineering approach as "enlightened compliance". It seems like where we might disagree is how far "wilful compliance" alone will take us. My intuition is that essentially all uses of AI will have role-specific / use-specific restrictions on power-seeking associated with them, and these restrictions can be learned (from eg human behavior and normative judgements, incl. universalization reasoning) as implied terms in the contracts that govern those uses. This would avoid the computational complexity of literally learning everyone's preferences / values, and instead leverage the simpler and more politically feasible mechanisms that humans use to cooperate with each other and govern the commons. I can link to a few papers later that make me more optimistic about something like the approach above!

I also responded to Capybasilisk below, but I want to chime in here and use your own post against you, contra point 2 :P

It's not so easy to get "latent knowledge" out of a simulator - it's the simulands who have the knowledge, and they have to be somehow specified before you can step forward the simulation of them. When you get a text model to output a cure for Alzheimer's in one step, without playing out the text of some chain of thought, it's still simulating something to produce that output, and that something might be an optimization process that is go... (read more)

Ah, the good old days post-GPT-2 when "GPT-3" was the future example :P

I think back then I still thoroughly underestimated how useful natural-language "simulation" of human reasoning would be. I agree with janus that we have plenty of information telling us that yes, you can ride this same training procedure to very general problem solving (though I think including more modalities, active learning, etc. will be incorporated before anyone really pushes brute force "GPT-N go brrr" to the extreme).

This is somewhat of a concern for alignment. I more or less stan... (read more)

We successfully chisel out aligned kids because we understand their inductive biases well


Interpretation of this emoji: "Press X to doubt."

2Alex Turner3mo
I'm interested in why you doubt this? I can imagine various interpretations of the quote which I doubt, and some which are less doubtful-to-me.

Could you clarify what you mean by values not being "hack after evolutionary hack"?

What this sounds like, but I think you don't mean: "Human values are all emergent from a simple and highly general bit of our genetic blueprint, which was simple for evolution to find and has therefore been unchanged more or less since the invention of within-lifetime learning. Evolution never developed a lot of elaborate machinery to influence our values."

What I think you do mean: "Human values are emergent from a simple and general bit of our genetic blueprint (our general... (read more)

6Alex Turner3mo
This is an excellent guess and correct (AFAICT). Thanks for supplying so much interpretive labor! I'd say our position contrasts with "A substantial portion of human value formation is genetically pre-determined in a complicated way, such that values are more like adaptations and less like exaptations—more like contextually-activated genetic machinery and influences than learned artifacts of simple learning-process-signals." In terms of past literature, I disagree with the psychological nativism I've read thus far. I also have not yet read much evolutionary psychology, but expect to deem most of it implausible due to information inaccessibility of the learned world model.

This is outstanding. I'll have other comments later, but first I wanted to praise how this is acting as a synthesis of lots of previous ideas that weren't ever at the front of my mind.

I'd especially like to hear your thoughts on the above proposal of loss-minimizing a language model all the way to AGI.

I hope you won't mind me quoting your earlier self as I strongly agree with your previous take on the matter:

If you train GPT-3 on a bunch of medical textbooks and prompt it to tell you a cure for Alzheimer's, it won't tell you a cure, it will tell you what humans have said about curing Alzheimer's ... It would just tell you a plausible story about a situation related to the prompt about curing Alzheimer's, based on its training data. Ra

... (read more)

I still think there's still approximately only one of those, though, since you have to get the objective to exactly match onto what you want.

Once you're trying to extrapolate me rather than just copy me as-is, there are multiple ways to do the extrapolation. But I'd agree it's still way less entropy than deceptive alignment.

A) Is there video? Is it going up on Rob Miles' third channel?

B) I'm not sure I agree with you about where the Christ analogy goes. By the definition, AI "Christ" makes the same value decisions as me for the same reasons. That's the thing that there's just one way to do (more or less). But that's not what I want, because I want an AI that can tackle complicated situations that I'd have trouble understanding, and I want an AI that will sometimes make decisions the way I want them made, not the way I actually make them.

2Evan Hubinger3mo
A) There is a video, but it's not super high quality and I think the transcript is better. If you really want to listen to it, though, you can take a look here. B) Yeah, I agree with that. Perhaps the thing I said in the talk was too strong—the thing I mean is a model where the objective is essentially the same as what you want, but the optimization process and world model are potentially quite superior. I still think there's still approximately only one of those, though, since you have to get the objective to exactly match onto what you want.

I kind of agree, with two classes of caveats:

One class is procedural / functional stuff like it should be able to use "consciousness" correctly when talking about things other than itself. I don't see much point in asking it if it's "token X," where it's never seen token X before. Another caveat would be that it should give good faith answers when we ask it hard or confusing questions about itself, but it should also often say "I don't know," and overall have a low positivity bias and low tendency to fall back on answers copied from humans.

The second class... (read more)

To give my own take despite it not being much different from Rohin's: The point of an inside view is to generalize, and the flaw of just copying people you respect is that it fails to generalize.

So for all the parts that don't need to generalize - that don't need to be producing thoughts that nobody has ever thought before - deferring to people you respect works fine. For this part I'm totally on board with you - I too think the inside view is overrated.

But I think its overratedness is circumscribed. It's overrated when you're duplicating other peoples' c... (read more)

You can't zoom infinitely far in on the causal chain between values and actions, because values (and to a large extent actions) are abstractions that we use when modeling agents like ourselves. They are emergent. To talk about my values at all is to use a model of me where I use my models in a certain agenty way and you don't sweat the details too hard.

I might steal the exorcism metaphor for the post I probably will write about the complexity prior.

Yeah I just meant the upper bound of "within 2 OOM." :) If we could somehow beat the lower bound and get aligned AI with just a few minutes of human feedback, I'd be all for it.

I think aiming for under a few hundred hours of feedback is a good goal because we want to keep the alignment tax low, and that's the kind of tax I see as being easily payable. An unstated assumption I made is that I expect we can use unlabeled data to do a lot of the work of alignment, making labeled data somewhat superfluous, but I still think the amount of feedback is important.... (read more)

1 · Jacob Hilton · 4mo
I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:

* Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so 2 orders of magnitude more than that is much more than a few hundred hours of human feedback.
* I think it's pretty likely that we'll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don't want to die), in which case we could afford to spend significantly more than an additional 2 orders of magnitude on alignment data, if that did in fact turn out to be required.

I think my big disagreement is with point one - yes, if you fix the architecture as something with bad alignment properties, then there is probably some dataset / reward signal that still gives you a good outcome. But this doesn't work in real life, and it's not something I see people working on such that there needs to be a word for it.

What deserves a word is people starting by thinking about both what we want the AI to learn and how, and picking datasets and architectures in tandem based on a theoretical story of how the AI is going to learn what we want it to.

2 · Jacob Hilton · 4mo
A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.

Yes, I expect us to need some trusted data from humans. The cleverer we are the less we need. I think it's reasonable to aim for quantity within 2 OOM of RLHF.

But... no, outer alignment is not a data quality problem, any more than outer alignment is a cosmic ray problem because if only the right cosmic rays hit my processor, it would be outer aligned.

You're probably not the right target for this rant, but I typed it so oh well, sorry.

Yes, you could "just" obtain perfect labeled data about human actions, perfectly on-distribution, until a large NN converges... (read more)

1 · Jacob Hilton · 4mo
Do you mean that on-paper solutions should aim to succeed with no more than 1/100 as much human data as RLHF, or no more than 100 times as much? And are you referring the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?
2 · Jacob Hilton · 4mo
I think that data quality is a helpful framing of outer alignment for a few reasons:

* Under the assumption of a generic objective such as reinforcement learning, outer alignment is definitionally equivalent to having high enough data quality. (More precisely, if the objective is generic enough that it is possible for it to produce an aligned policy, then outer alignment is equivalent to the data distribution being such that an aligned policy is preferred to any unaligned policy.)
* If we had the perfect alignment solution on paper, we would still need to implement it. Since we don't yet have the perfect alignment solution on paper, we should entertain the possibility that implementing it involves paying attention to data quality (whether in the sense of scalable oversight or in a more mundane sense).
* It's not a framing I've seen before, and I think it's helpful to have different framings for things. I do think that the framing is less helpful if the answer to my question is "not much", but that's currently still unclear to me, for the reasons I give in the post.

I agree that data quality doesn't guarantee robustness, but that's a general argument about how helpful it is to decompose alignment into outer alignment and robustness. I have some sympathy for that, but it seems distinct from the question of whether data quality is a helpful framing of outer alignment.

But Agent 57 (or its successor) would go mash the button once it figured out how to do it. Kinda like the salt-starved rats from that one Steve Byrnes post. Put another way, my claim is that the architectural tweaks that let you beat Montezuma's Revenge with RL are very similar to the architectural tweaks that make your agent act like it really is motivated by reward, across a broader domain.

2 · Alex Turner · 4mo
(Haven't checked out Agent 57 in particular, but expect it to not have the "actually optimizes reward" property in the cases I argue against in the post.)

You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post, 

Gradients are magical?

Gradients through the entire AI are a pretty bad way to do credit assignment. For a functioning AGI I suspect you'd have to do something better, but I don't know what it is (hence "magic").

if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm

... (read more)
3 · Alex Turner · 4mo
These all sound somewhat like predictions I would make? My intended point is that if the button is out of the agent's easy reach, and the agent doesn't explore into the button early in training, by the time it's smart enough to model the effects of the distant reward button, the agent won't want to go mash the button as fast as possible.

I think there are some subtleties here regarding the distinction between RL as a type of reward signal, and RL as a specific algorithm. You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post, or you can use it to update a reward prediction model in a model-based RL agent that acts a lot more like a maximizer.
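To make the distinction concrete, here is a toy sketch (my own illustration; the bandit setup, numbers, and names are made up): the same stream of reward samples can either directly reinforce whatever computation produced the action (policy gradient, "RL as a specific algorithm"), or train a reward *model* that a planner then maximizes against ("RL as a type of reward signal" feeding a model-based maximizer).

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([0.1, 0.5, 0.9])  # hypothetical per-action rewards

def sample_reward(a):
    # Noisy reward signal; identical for both uses below.
    return true_reward[a] + 0.1 * rng.normal()

# Use (a): the reward updates the acting computation itself
# (REINFORCE-style policy gradient on softmax logits).
logits = np.zeros(3)
for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(3, p=probs)
    grad = -probs
    grad[a] += 1.0                      # d log pi(a) / d logits
    logits += 0.1 * sample_reward(a) * grad

# Use (b): the same reward signal fits a reward model, and the agent
# then acts like a maximizer by planning against that model.
reward_model = np.zeros(3)
counts = np.zeros(3)
for _ in range(2000):
    a = rng.integers(3)
    counts[a] += 1
    reward_model[a] += (sample_reward(a) - reward_model[a]) / counts[a]
planned_action = int(np.argmax(reward_model))

print(int(np.argmax(logits)), planned_action)
```

The point of the sketch: (b) is structurally a maximizer of its learned reward model, while (a) only reinforces contextual computations, which is why the two can generalize differently even with an identical reward signal.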

I'd also like to hear your opinion on the effect of information leakage. For example, if reward only correlates with getting to the g... (read more)

3 · Alex Turner · 4mo
Gradients are magical? The arguments apply in this case as well. Yeah, what if half of the time, getting to the goal doesn't give a reward? I think the arguments go through just fine, just training might be slower. Rewarding non-goal completions probably train other contextual computations / "values" into the agent. If reward is always given by hitting the button, I think it doesn't affect the analysis, unless the agent is exploring into the button early in training, in which case it "values" hitting the button, or some correlate thereof (i.e. develops contextually activated cognition which reliably steers it into a world where the button has been pressed).

I was confident that on this very site there would be an example of someone writing an essay with the framing device that it was a blog post from 5 years in the future. Sadly, I only had enough attention span to google " from the future" and click the first link. It was a writing game called Wikipedia Articles from the Future.

My point with this is I'm real pessimistic about generating the AI alignment textbook from 100 years in the future with prompt engineering. Why expect that you're going to get something far outside the training distr... (read more)

1 · Johannes Treutlein · 4mo
Thanks for your comment! I agree that we probably won't be able to get a textbook from the future just by prompting a language model trained on human-generated texts. As mentioned in the post, maybe one could train a model to also condition on observations. If the model is very powerful, and it really believes the observations, one could make it work. I do think sometimes it would be beneficial for a model to attain superhuman reasoning skills, even if it is only modeling human-written text. Though of course, this might still not happen in practice. Overall I'm more optimistic about using the model in an IDA-like scheme. One way this might fail on capability grounds is if solving alignment is blocked by a lack of genius-level insights [] , and if it is hard to get a model to come up with/speed up such insights (e.g. due to a lack of training data containing such insights).

Re: prompting: So when you talk about "simulating a world," or "describing some property of a world," I interpreted that as conditionalizing on a feature of the AI's latent model of the world, rather than just giving it a prompt like "You are a very smart and human-aligned researcher." This latter deviates from the former in some pretty important ways, which should probably be considered when evaluating the safety of outputs from generative models.

Re: prophecies: I mean that your training procedure doesn't give an AI an incentive to make self-fulfilling pr... (read more)

1 · Arun Jose · 2mo
Sorry for the (very) late reply! I'm not very familiar with the phrasing of that kind of conditioning - are you describing finetuning, with the divide mentioned here [] ? If so, I have a comment [] there about why I think it might not really be qualitatively different. I think my picture is slightly different for how self-fulfilling prophecies could occur. For one, I'm not using "inner alignment failure" here to refer to a mesa-optimizer in the traditional sense of the AI trying to achieve optimal loss (I agree that in that case it'd probably be the outcome you describe), but to a case where it's still just a generative model, but needs some way to resolve the problem of predicting in recursive cases (for example, asking GPT to predict whether the price of a stock would rise or fall). Even for just predicting the next token with high accuracy, it'd need to solve this problem at some point. My prediction is that it's more likely for it to just model this via modelling increasingly low-fidelity versions of itself in a stack, but it's also possible for it do fixed-point reasoning (like in the Predict-O-Matic story [] ).

I do think artificial sandwiching is missing something. What makes original-flavor sandwiches interesting to me is not misalignment per se, but rather the problem of having to get goal information out of the human and into the AI to get it to do a good job (e.g. at generating the text you actually want), even for present-day non-agential AI, and even when the AI is more competent than the human along some relevant dimensions.

Like, the debate example seems to me to have a totally different spirit - sure, the small AI (hopefully) ends up better at the task w... (read more)

I'll be interested in the results! First-principles reasoning being kinda hard, I'm curious how much people are going to try to chew bite-sized pieces vs. try to absorb a ball of energy bigger than their head.

3 · Adam Shimi · 4mo
Yeah, I will be posting updates, and probably the participants themselves will post some notes and related ideas. Excited too about how it's going to pan out!

Based on this, my general sense is that quantilizers don't make generative models much more useful for alignment

Right, the point of quantilizers is not to make generative models safer. It's to be safer than non-generative models (in cases where the training distribution is in fact safe and you don't need to filter very hard to succeed at the task).
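As a concrete illustration of that point (my own minimal sketch, with made-up names and numbers, not an implementation from any particular paper): a quantilizer samples candidate actions from a trusted base distribution and picks randomly among the top q-quantile by utility, instead of argmaxing. This caps how hard it filters the base distribution — each returned action has probability at most 1/q times its base probability — which is exactly why it can be safer than a maximizer when the base distribution is itself safe.

```python
import random

def quantilize(base_sampler, utility, q=0.1, n_samples=1000, rng=None):
    """Sample from a trusted base distribution, then choose uniformly
    among the top q-quantile of samples by utility.

    Smaller q means harder filtering: more optimization pressure, but
    also more risk of leaving the region where the base distribution's
    safety guarantees apply.
    """
    rng = rng or random.Random()
    candidates = [base_sampler(rng) for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * n_samples))]
    return rng.choice(top)

# Toy example: integer "actions" drawn uniformly, utility favoring
# large values; q controls the safety/filtering tradeoff.
rng = random.Random(0)
action = quantilize(lambda r: r.randint(0, 100), utility=lambda a: a,
                    q=0.1, n_samples=1000, rng=rng)
print(action)  # an action from roughly the top 10% by utility
```

Note that the guarantee degrades exactly as the comment says: if the base distribution already contains unsafe actions, no choice of q filters them out.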

I expect the purely statistical safety/filtering tradeoff to actually be pretty unimportant. More important are the vulnerabilities that come from the training distribution actually not being safe in the first... (read more)


I suspect that there's a ton of room to get more detailed here, and some of the claims or conclusions you reach in this post feel too tenuous, barring that more detailed work. I will give some feedback that probably contains some mistakes of my own:

  • Conditioning is hard, and the picture you paint sometimes doesn't distinguish between conditioning on input data (prompting) and conditioning on latent states of the learned model. But the latter is quite tricky, and if it's necessary, then questions about whether or not this whole approach makes the prob
... (read more)
3 · Arun Jose · 4mo
Thanks for the feedback! I agree that there's lots of room for more detail - originally I'd planned for this to be even longer, but it started to get too bloated. Some of the claims I make here unfortunately do lean on some of that shared context, yeah, although I'm definitely not ruling out the possibility that I just made mistakes at certain points.

* I think when I talk about conditioning in the post I'm referring to prompting, unless I'm misunderstanding what you mean by conditioning on latent states for language models (which is entirely possible).
* That's a very interesting question, and I think it comes down to the specifics of the model itself. For the most part in this post I'm talking about true generative models (or problems associated with trying to train true generative models), in the sense of models that are powerful enough at modelling the world that they can actually be thought of as depending on the physics prior for most practical purposes. In that theoretical limit, I think it would be robust, if prompts that seem similar to us actually represent similar world states. For more practical models though (especially when we're trying to get some use out of sooner models), I think our best guess would be extrapolating the robustness of current models. From my (admittedly not very large) experience working with GPT-3, my understanding is that LLMs get less fragile with scale - in other words, they depend less on stuff like phrasing and process prompts more "object-level" in some sense as they get more powerful. If the problem you're pointing to is generally that the textual distribution fails in ways that the reality prior wouldn't given a sufficiently strong context switch - then I agree that's possible. My guess is that this wouldn't be a very hard problem though, mainly because of reasons I briefly mention in the Problems with Outer Alignment section: that the divergenc

Suppose I have a self-driving car planning a route, and a superintelligent traffic controller "trying to help the car." The superintelligent traffic controller knows what route my car will pick, and so it tweaks the timing of the light slightly so that if the car takes the route it will in fact pick, everything is smoother and safer, but if it takes any other route, it will hit red lights and go slower.

Is this the sort of loss of freedom you mean?

What if, if my car tried to deviate from the route that's best according to its normal routing algorithm, the s... (read more)

Hmm, looks like I should add an examples section and more background on what I mean related to freedom. What you are describing sounds like a traffic system that values ergodic efficiency of its managed network, and you are showing a way that a participant can have very non-ergodic results. It sounds like that is more of an engineering problem than what I'm imagining. Examples off the top of my head of what I mean with respect to loss of freedom resulting from a powerful agent's value system include things like:

* A paperclip maximizer terraforming the earth prevents any value systems other than paperclip maximization from sharing the earth's environment.
* Humans' value for cheap foodstuffs results in monoculture crop fields, which cuts off the forest and grassland ecosystems' values (hiding places, alternating foodstuffs which last through the seasons, etc.).
* A drug-dependent parent changes a child's environment, preventing freedom for a reliable schedule, security, etc.
* Or, riffing off your example: a superintelligent traffic controller starts city-planning, bulldozing blocks of car-free neighborhoods because they stood in the way of a 5% city-wide traffic flow improvement.

Essentially what I'm trying to describe is that freedoms need to be a value unto themselves with certain characteristics that are functionally different from the common utility-function terminology that revolves around metric maximization (like gradient descent). Freedoms describe boundary conditions within which metric maximization is allowed, but impose steep penalties for surpassing their bounds. Their general mathematical form is a manifold surrounding some state space, whereas the general form of most utility-function talk is finding a minimum/maximum of some state space.