Wiki Contributions


generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine

The complete generated story here is glorious, and I think might deserve explicit inclusion in another post or something.  Though I think that of the other stories you've generated as well, so maybe my take here is just to have more deranged meta GPT posting.

it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount optimization from the RLHF training process which significantly changes out-of-distribution generalization.


text-davinci-002 is not an engine for rendering consistent worlds anymore. Often, it will assign infinitesimal probability to the vast majority of continuations that are perfectly consistent by our standards, and even which conform to the values OpenAI has attempted to instill in it like accuracy and harmlessness, instead concentrating almost all its probability mass on some highly specific outcome. What is it instead, then? For instance, does it even still make sense to think of its outputs as “probabilities”? 

It was impossible not to note that the type signature of text-davinci-002’s behavior, in response to prompts that elicit mode collapse, resembles that of a coherent goal-directed agent more than a simulator.

I feel like I'm missing something here, because in my model most of the observations in this post seem like they can be explained under the same paradigm that we view the base davinci model.  Specifically, that the reward model RLHF is using "represents" in an information-theoretic sense a signal for the worlds represented by the fine-tuning data.  So what RLHF seems to be doing to me is shifting the world prior that GPT learned during pre-training, to one where whatever the reward signal represents is just much more common than in our world - like if GPT's pre-training data inherently contained a hugely disproportionate amount of equivocation and plausible deniability statements, it would just simulate worlds where that's much more likely to occur.

(To be clear, I agree that RLHF can probably induce agency in some form in GPTs, I just don't think that's what's happening here).

The attractor states seem like they're highly likely properties of these resultant worlds, like adversarial/unhinged/whatever interactions are just unlikely (because they were downweighted in the reward model) and so you get anon leaving as soon as he can because that's more likely on the high prior conditional of low adversarial content than the conversation suddenly becoming placid, and some questions actually are just shallowly matching to controversial and the likely response in those worlds is just to equivocate.  In that latter example in particular, I don't see the results being that different from what we would expect if GPT's training data was from a world slightly different to ours - injecting input that's pretty unlikely for that world should still lead back to states that are likely for that world.  In my view, that's like if we introduced a random segue in the middle of a wedding toast prompt of the form "you are a murderer", and it still bounces back to being wholesome (this works when I tested).

Regarding ending a story to start a new one - I can see the case for why this is framed as the simulator dynamics becoming more agentic, but it doesn't feel all that qualitatively different from what happens in current models - the interesting part seems to be the stronger tendency toward the new worlds the RLHF'd model finds likely, which seems like it's just expected behaviour as a simulator becomes more sure of the world it's in / has a more restricted worldspace.  I would definitely expect that if we could come up with a story that was sufficiently OOD of our world (although I think this is pretty hard by definition), it would figure out some similar mechanism to oscillate back to ours as soon as possible (although this would also be much harder with base GPT because it has less confidence of the world it's in) - that is, that the story ending is just one of many levers a simulator can pull, like a slow transition, only here the story was such that ending it was the easiest way to get into its "right" worldspace.  I think that this is slight evidence for how malign worlds might arise from strong RLHF (like with superintelligent simulacra), but it doesn't feel like it's that surprising from within the simulator framing.

The RNGs seem like the hardest part of this to explain, but I think can be seen as the outcome of making the model more confident about the world it's simulating, because of the worldspace restriction from the fine-tuning - it's plausible that the abstractions that build up RNG contexts in most of the instances we would try are affected by this (it not being universal seems like it can be explained under this - there's no reason why all potential abstractions would be affected).

Separate thought: this would explain why increasing the temperate doesn't affect it much, and why I think the space of plausible / consistent worlds has shrunk tremendously while still leaving the most likely continuations as being reasonable - it starts from the current world prior, and selectively amplifies the continuations that are more likely under the reward model's worlds.  Its definition of "plausible" has shifted; and it doesn't really have cause to shift around any unamplified continuations all that much.

Broadly,  my take is that these results are interesting because they show how RLHF affects simulators, their reward signal shrinking the world prior / making the model more confident of the world it should be simulating, and how this affects what it does.  A priori, I don't see why this framing doesn't hold, but it's definitely possible that it's just saying the same things you are and I'm reading too much into the algorithmic difference bit, or that it simply explains too much, in which case I'd love to hear what I'm missing.

Sorry for the (very) late reply!

I'm not very familiar with the phrasing of that kind of conditioning - are you describing finetuning, with the divide mentioned here?  If so, I have a comment there about why I think it might not really be qualitatively different.

I think my picture is slightly different for how self-fulfilling prophecies could occur.  For one, I'm not using "inner alignment failure" here to refer to a mesa-optimizer in the traditional sense of the AI trying to achieve optimal loss (I agree that in that case it'd probably be the outcome you describe), but to a case where it's still just a generative model, but needs some way to resolve the problem of predicting in recursive cases (for example, asking GPT to predict whether the price of a stock would rise or fall).  Even for just predicting the next token with high accuracy, it'd need to solve this problem at some point.  My prediction is that it's more likely for it to just model this via modelling increasingly low-fidelity versions of itself in a stack, but it's also possible for it do fixed-point reasoning (like in the Predict-O-Matic story).

Sorry for the (very) late reply!

Do you have a link to the ELK proposal you're referring to here?

Yep, here.  I linked to it in a footnote, didn't want redundancy in links, but probably should have anyway.

"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.

Hmm, I was thinking of that under the frame of the future point where we'd worry about mesa-optimizers, I think.  In that situation, I think mesa-optimizers would be more likely than not because the task is much harder to achieve good performance on (although on further thought I'm holding less strongly to this belief because of ambiguity around distance in model space between optimizers and generative models).  I agree that trying to do this right now would probably just result in bad performance.

That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.

I agree that we'll need a strong constant gradient to prevent this (and other things), but while I think this is definitely something to fix, I'm not very worried about this possibility.  Both because the model would have to be simultaneously deceptive in the brief period it's agentic, and because this might not be a very good avenue of attack - it might be very hard to do this in a few timesteps, the world model might forget this, and simulations may operate in a way that only really "agentifies" whatever is being directly observed / amplified.

I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, than that might suffice for this.

I agree almost entirely - I was mainly trying to break down the exact capabilities we'd need the interpretability tools to have there.  What would detecting mesa-optimizers entail mechanistically, etc.

But I hadn't considered how reliable mesa-optimization alone might be enough, because I wasn't considering generative models in that post.

I think this is very promising as a strategy yeah, especially because of the tilt against optimization by default - I think my main worries are getting it to work before RL reaches AGI-level.

I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.

I think they have a strong bias (in a very conceptual sense) against something like agency, but larger models could end up being optimizers because that achieves greater performance past a certain scale like you said, because of different training paths - or even if it's just pretty easy to make one an optimizer if you push it hard enough (with RL or something), that could still reduce the time we have.

Sorry for the (very) late reply!

I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.

Hmm, I was thinking more of a problem with text available in the training datasets not being representative of the real world we live in (either because it isn't enough information to pick out our world from a universal prior, or because it actually describes a different world better), not whether its capabilities or abstractive reasoning don't help with time-separated prediction.

Predicting that the agent notices an inconsistency requires the generative model to know that there's an inconsistency, at which point the better solution (from a 'drawing likely trajectories' perspective) is to just make the world consistent.

I think I'm picturing different reasons for a simulacra agent to conclude that they're in a simulation than noticing inconsistencies.  Some specifics include worlds that are just unlikely enough anthropically (because of a conditional we apply, for example) to push up credence in a simulation hypothesis, or they notice the effects of gradient descent (behavioural characteristics of the world deviating from "normal" behaviour tend to affect the world state), or other channels that may be available by some quirk of the simulation / training process, but I'm not holding to any particular one very strongly.  All of which to say that I agree it'd be weird for them to notice inconsistencies like that.

For instance there can be agents that act as if they're in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world).

Yep, I think this could be a problem, although recent thinking has updated me slightly away from non-observed parts of the simulation having consistent agentic behaviour across time. 

Thanks for the feedback!

I agree that there's lots of room for more detail - originally I'd planned for this to be even longer, but it started to get too bloated. Some of the claims I make here unfortunately do lean on some of that shared context yeah, although I'm definitely not ruling out the possibility that I just made mistakes at certain points.

  • I think when I talk about conditioning in post I'm referring to prompting, unless I'm misunderstanding what you mean by conditioning on latent states for language models (which is entirely possible).
  • That's a very interesting question, and I think it comes down to the specifics of the model itself. For the most part in this post I'm talking about true generative models (or problems associated while trying to train true generative models) in the sense of models that are powerful enough at modelling the world that they can actually be thought of as depending on the physics prior for most practical purposes. In that theoretical limit, I think it would be robust, if prompts that seem similar to us actually represent similar world states.

    For more practical models though (especially when we're trying to get some use out of sooner models), I think our best guess would be extrapolating the robustness of current models. From my (admittedly not very large) experience working with GPT-3, my understanding is that LLMs gets less fragile with scale - in other words, that they depend less on stuff like phrasing and process the prompts more "object-level" in some sense as they get more powerful.

    If the problem you're pointing to is generally that the textual distribution fails in ways that the reality prior wouldn't given a sufficiently strong context switch - then I agree that's possible. My guess is that this wouldn't be a very hard problem though, mainly because of reasons I briefly mention in the Problems with Outer Alignment section: that the divergence can't be strong enough to have a qualitative difference or we'd have noticed it in current models, and that future models would have the requisite "parts" to simulate (at least a good) alignment researcher, so it becomes a prompt engineering problem. That said, I think it's still a potential problem whose depths we could understand with more extraction work.
  • Re the self-supervised comment - oops yeah, that's right. I've edited the post, thanks for the correction. I wrote that line mainly to contrast it with RL and emphasize the "it's learning to model a distribution", so I didn't pay too close attention - I'll try to going forward.
  • Re the self-fulfilling prophecies comment - could you elaborate on that? I'm afraid I don't fully get your argument.

While reading through the report I made a lot of notes about stuff that wasn't clear to me, so I'm copying here the ones that weren't resolved after finishing it.  Since they were written while reading, a lot of these may be either obvious or nitpick-y.

Footnote 14, page 15:

Though we do believe that messiness may quantitatively change when problems occur. As a caricature, if we had a method that worked as long as the predictor's Bayes net had fewer than 109 parameters, it might end up working for a realistic messy AI until it had 1012 parameters, since most of those parameters do not specify a single monolithic model in which inference is performed.

Can we make the assumption that defeating the method allows the AI to get better loss since it's effectively wireheading at that point?  If so, then wouldn't a realistic messy AI learn a Bayes net once it had >= 109 parameters?  In other words, are there reasons beyond performance that preclude an AI from learning a single monolithic model?

Footnote 33, page 30 (under the heading "Strategy: have AI help humans improve our understanding"):

Most likely this would involve some kind of joint training, where our AI helps humans understand the world better in parallel with using gradient descent to develop its own understanding. To reiterate, we are leaving details vague because we don’t think that our counterexample depends on those details.

I realize this is only a possible example of how we might implement this, but wouldn't a training procedure that explicitly involves humans be very anti-competitive?  The strategy described in the actual text sounds like it's describing an AI assistant that automates science well enough to impart us with all the predictor's knowledge, which wouldn't run into this issue.

Footnote 48 to this paragraph on page 36:

The paradigmatic example of an ontology mismatch is a deep change in our understanding of the physical world. For example, you might imagine humans who think about the world in terms of rigid bodies and Newtonian fluids and “complicated stuff we don’t quite understand,” while an AI thinks of the world in terms of atoms and the void. Or we might imagine humans who think in terms of the standard model of physics, while an AI understands reality as vibrations of strings. We think that this kind of deep physical mismatch is a useful mental picture, and it can be a fruitful source of simplified examples, but we don’t think it’s very likely.


And if it did occur it seems like an unusually good candidate for a case where doing science (and in particular tracking how the new structures implement the old structures) outcompetes gradient descent, and on top of that a case where translation is likely to be relatively easy to pick out with suitable regularization.

I might be reading too much into this, but I don't understand the basis of this claim.  Is it that the correspondence differs only at the low-level?  If so, I still don't see how science outcompetes gradient descent.

Page 51, under the heading "[ELK] may be sufficient for building a worst-case solution to outer alignment:

Use imitative generalization combined with amplification to search over some space of instructions we could give an amplified human that would let them make cakes just as delicious as Cakey’s would have been.

I haven't thoroughly read the article on amplification, so this question may be trivial, but my understanding is that amplified humans are more or less equivalent to humans with AI-trained Bayes nets.  If true, then doesn't this require the assumption that tasks will always have a clean divide between the qualitative (taste of cakes) which we can match with an amplified human, and the quantitative (number of cakes produced per hour) which we can't?  That feels like it's a safe assumption to make, but I'm not entirely sure.

Page 58, in the list of features suggesting that M(x) knew that A' was the better answer:

  • That real world referent Z has observable effects and the human approximately understands those effects (though there may be other things that also affect observations which the human doesn’t understand)
  • ...
  • The referent Z is also relevant to minimizing the loss function ℒ. That is, there is a coherent sense in which the optimal behavior “depends on” Z, and the relative loss of different outputs would be very different if Z “had been different.”
  • There is a feature of the computation done by the AI which is robustly correlated with Z, and for which that correlation is causally responsible for M achieving a lower loss.

First, why is the first point necessary to suggest that M(x) knew that A' was the better answer?  Second, how are the last two points different?

Page 69, under "Can your AI model this crazy sequence of delegation?":

We hope that this reasoning is feasible because it is closely analogous to a problem that the unaligned AI must solve: it needs to reason about acquiring resources that will be used by future copies of itself, who will themselves acquire resources to be used by further future copies and so on.

We need the AI to have a much smaller margin of error when it comes to modelling this sequence of delegation than needed for the AI to reason about acquiring resources for future copies - in other words, for a limited amount of computation, the AI will still try to reason about acquiring resources for future copies and could succeed in the absence of other superintelligences because of the lack of serious opposition, but modelling the delegation with that limited computation might be dangerous because of the tendency for value drift.

Page 71:

... we want to use a proposal that decouples “the human we are asking to evaluate a world” from “the humans in that world”---this ensures that manipulating the humans to be easily satisfied can’t improve the evaluation of a world.

Is it possible for the AI to manipulate the human in world i to be easily satisfied in order to improve the evaluation of world i+1?

Page 73:

def loss(θ):
before, action, after = dataset.sample()
z_prior = prediction(before, action, θ)
z_posterior = posterior(before, action, after, θ)
kl = z_prior.kl_divergence(z_posterior)
logprob = observation(z_prior.sample(), θ).logp(after)
return kl - logprob

As I understand this, z_prior is what the model expects to happen when it sees "action" and "before", z_posterior is what it thinks actually happened after it sees "after", and kl is the difference between the two that we're penalizing it on.  What is logprob doing?

I think I'm missing something with the Löb's theorem example.

If  can be proved under the theorem, then can't  also be proved?  What's the cause of the asymmetry that privileges taking $5 in all scenarios where you're allowed to search for proofs for a long time?