All of Adam Jermyn's Comments + Replies

Got it, that’s very clear. Thanks!

So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.

I think so. But I'd want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example.

I was rereading this and was struck by how much work the following does:

Manipulative AGI is knowledgeable/intelligent enough to anticipate what criteria we will condition on in counterfactual worlds where it does not exist, and manipulate world events to meet whatever criteria we might specify with probability ~1.

If this doesn't hold, and the AGI has to guess (with probability p << 1) which criteria we were going to condition on, then the update in favor of manipulative AGI is pX:1 rather than X:1, which could easily be less than 1.
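A toy Bayes calculation of this point, with made-up numbers (none of these figures are from the original discussion):

```python
# Hedged toy numbers (hypothetical): how much update does a manipulative
# AGI buy if it must *guess* which condition we'll use?
p = 0.01          # chance the AGI guesses our condition correctly
p_natural = 0.05  # P(O | no manipulative AGI); equivalently X = 1/0.05 = 20

# Right guess: the AGI forces O (probability ~1). Wrong guess: it steered
# the world toward some other condition, so O essentially doesn't happen.
p_manip = p * 1.0 + (1 - p) * 0.0

update = p_manip / p_natural  # likelihood ratio toward "manipulative AGI"
print(update)  # ~0.2 = pX: the observation is now evidence *against* AGI
```

With these numbers the pX:1 update is below 1, so observing the condition actually counts against the manipulative-AGI hypothesis.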

Moreover I think there are ways we can arrange... (read more)

Ok this is going to be messy but let me try to convey my hunch for why randomization doesn't seem very useful.

  • Say I have an intervention that's helpful, and has a baseline 1/4 probability. If I condition on this statement, I get 1 "unit of helpfulness", and a 4x update towards manipulative AGI.
  • Now let's say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
  • BUT, the four interventions have to be mutually exclusive. Which means that if I'd done no simulation at all, I would've gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.
  • Ok, well, what if my four interventions have 1/8 baseline probability each, so only 50% total? Then I pick one at random, p(O | natural) = 1/8, p(O | manipulative) = 1/4, so I get a 2x update towards manipulative AGI. This is the same as if I'd just conditioned on the statement "one of my four interventions happens", and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update.

Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn't doing anything different from just using a weaker search condition - it gives up bits of search, and so it has to pay less.
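The arithmetic in the last bullet can be spelled out directly, using the numbers from the example above:

```python
# Four mutually exclusive interventions, baseline probability 1/8 each.

# Option A: pick one intervention at random and condition on it.
# A manipulative AGI must guess which one we picked (1 in 4).
p_O_natural_A = 1 / 8
p_O_manip_A = 1 / 4
update_A = p_O_manip_A / p_O_natural_A   # update toward manipulative AGI

# Option B: condition on "one of the four interventions happens"
# and let the randomization happen inside the simulation.
p_O_natural_B = 4 * (1 / 8)              # = 0.5
p_O_manip_B = 1.0                        # the AGI forces some intervention
update_B = p_O_manip_B / p_O_natural_B

print(update_A, update_B)  # 2.0 2.0 — randomizing outside buys nothing extra
```

Both routes give one unit of helpfulness at the cost of the same 2x update, which is the hunch: external randomization is equivalent to a weaker search condition.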

I definitely endorse the argument you'd buy, but I also endorse a broader one. My claim is that there is information which goes into timelines which is not just downstream of which architecture I think gets there first.

For example, if you told me that humanity loses the ability to make chips "tomorrow until forever" my timeline gets a lot longer in a way that isn't just downstream of which architecture I think is going to happen first. That then changes which architectures I think are going to get there first (strongly away from DL) primarily by making my estimated timeline long enough for capabilities folks to discover some theoretically-more-efficient but far-from-implementable-today architectures.

I think timelines are a useful input to what architecture takes off first. If the timelines are short, I expect AGI to look like something like DL/Transformers/etc. If timelines are longer there might be time for not-yet-invented architectures to take off first. There can be multiple routes to AGI, and "how fast do we go down each route" informs which one happens first.

Correlationally this seems true, but causally it's "which architecture takes off first?" which influences timelines, not vice versa. Though I could imagine a different argument which says that timeline until the current architecture takes off (assuming it's not superseded by some other architecture) is a key causal input to "which architecture takes off first?". That argument I'd probably buy.

Another angle: number of bits of optimization required is a direct measure of “how far out of distribution” we need to generalize.

I think it's useful to distinguish between the amount of optimization we ask the model to do versus the unlikelihood of the world we ask it to simulate.

For instance, I can condition on something trivial like "the weather was rainy on 8/14, sunny on 8/15, rainy on 8/16...". This specifies a very unlikely world, but so long as the pattern I specify is plausible it doesn't require much optimization on the part of the model or take ... (read more)
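A quick calculation (with an assumed per-day rain probability, not from the original comment) shows how fast such a pattern's probability collapses even though each day is individually mundane:

```python
# Toy calculation: a plausible day-by-day weather pattern specifies a
# very unlikely world while demanding little optimization from the model.
p_rain = 0.3   # hypothetical per-day probability of rain
days = 30

pattern_prob = 1.0
# Some arbitrary alternating rainy/sunny pattern over 30 days.
for day_is_rainy in [True, False] * (days // 2):
    pattern_prob *= p_rain if day_is_rainy else (1 - p_rain)

print(pattern_prob)  # astronomically small, yet every day is plausible
```

So "unlikelihood of the simulated world" and "optimization asked of the model" really can come apart by many orders of magnitude.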


Regarding your “Redirecting civilization” approach: I wonder about the competitiveness of this. It seems that we will likely build x-risk-causing AI before we have a good enough model to be able to e.g. simulate the world 1000 years into the future on an alternative timeline?

I'm not sure. My sense is that generative models have a huge lead in terms of general capabilities over ~everything else, and that seems to be where the most effort is going today. So unless something changes there I expect generative models to be the state of the art when we hi... (read more)

Playing the perplexity game had a big impact on my intuitions around language models, so thanks for making it! In particular, the fact that models are so much better at it than humans means we can't really tell from behavior alone whether a model is genuinely trying to predict the next token. This is a problem for detecting inner alignment failure, because we can't tell (outside of the training set) if the model is actually optimizing for next-token prediction or something that just looks (to us) like next-token prediction.
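For concreteness, the quantity the game compares is perplexity, i.e. exp of the mean negative log-likelihood assigned to the true next tokens. A minimal sketch with invented per-token probabilities:

```python
import math

# Hypothetical probabilities each predictor assigned to the true next token.
model_probs = [0.4, 0.6, 0.3, 0.5]
human_probs = [0.1, 0.2, 0.05, 0.15]

def perplexity(probs):
    """Perplexity = exp(mean negative log-likelihood)."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Lower perplexity = better prediction; models typically beat humans here.
print(perplexity(model_probs) < perplexity(human_probs))  # True
```

The gap this measures is exactly why behavior alone can't distinguish "genuinely predicting the next token" from "doing something that merely looks like it to us".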

Apart from this, I do think logical dependences and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy.

Oh interesting. I think this still runs into the issue that you'll have instrumental goals whenever you ask the model to simulate itself (i.e. just the first step in the hierarchy hits this issue).

Regarding using prompts, I wonder, how do you think we could get the kind of model you talk about in your post on conditioning generative models?

I was imagining tha... (read more)

The section on fixed points was interesting! I wonder if there's a way to avoid the recursion altogether though? Specifically, is there a way to condition the model such that the world it simulates doesn't contain humans who use the model (or one very like it)? I'm not sure, and would be interested in your thoughts on this.

Johannes Treutlein:
Thank you! It does seem like simulating text generated by using similar models would be hard to avoid when using the model as a research assistant. Presumably any research would get "contaminated" at some point, and models might cease to be helpful without updating them on the newest research. In theory, if one were to re-train models from scratch on the new research, this might be equivalent to the models updating on the previous models' outputs before reasoning about superrationality, so it would turn things into a version of Newcomb's problem with transparent boxes. This might make coordination between the models less likely?

Apart from this, I do think logical dependences and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy.

The other possibility would be to not rely on IDA at all, instead just training a superhuman model and using it directly. Maybe one could extract superhuman knowledge from them safely via some version of microscope AI? Of course, in this case, the model might still reason about humans using similar models, based on its generalization ability alone.

Regarding using prompts, I wonder, how do you think we could get the kind of model you talk about in your post on conditioning generative models?

Is the loss we’re training the generative model on - in the case of language models, the predictive loss over the next token - actually representative of the world prior?

This seems important and is not a thing I've thought about carefully, so thanks for bringing it up and exploring it. I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship betwe... (read more)

I like the use of L-knowledge to split the questions we insist on getting answered from those we don't. That indeed seems to divide the space nicely!

What this means is that picking out the direct translator from all models consistent with the data must depend on the predictor. Otherwise, if the same training process is used for all predictors, it could give the human simulator on some even while giving the direct translator for others.

I don't follow this point. If I take a reporter trained to be a direct translator on one predictor and hook it up to a different predictor I expect I'll get some incoherent output rather than a human simulator. Why should I get a human simulator in this instance?

I found this post clarifying. One thing I'm still uncertain of: what's the architecture of the Reporter in this proposal? Does it have two heads, one for proposing changes to the Predictor's state and one for answering questions? If so, can I think of the training process as:

  1. Use the proposal head to get a proposed change.
  2. Change the latent state of the Predictor.
  3. Ask a question and see if the answer head gives the desired answer in the new state.
  4. Train the proposal head on the difference between the desired answer and the given answer.
  5. Separately, train the an
... (read more)
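The first four steps of that training loop can be sketched as runnable toy code. Everything here is a stand-in (linear stubs, a frozen answer head, a scalar question), just to show the shape of the loop being asked about:

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=8)        # Predictor's latent state (stub)
w_proposal = rng.normal(size=8)    # proposal head parameters (stub)
w_answer = rng.normal(size=8)      # answer head parameters (stub, frozen here)

def proposal_head(state):
    # Step 1: propose a change to the Predictor's state. A real proposal
    # head would depend on `state`; this stub just outputs its parameters.
    return 0.1 * w_proposal

def answer_head(state):
    # Step 3: answer a (scalar) question about the state.
    return float(w_answer @ state)

desired_answer = 1.0
lr = 0.5
for _ in range(2000):
    new_state = latent + proposal_head(latent)   # Step 2: apply the change
    err = answer_head(new_state) - desired_answer
    # Step 4: gradient step on the proposal head's parameters
    # (d answer / d w_proposal = 0.1 * w_answer for this linear stub).
    w_proposal -= lr * err * 0.1 * w_answer

final_err = answer_head(latent + proposal_head(latent)) - desired_answer
print(abs(final_err) < 0.1)  # True: the proposal head learns to hit the answer
```

Whether the actual proposal training in the post works this way is exactly the question being asked; this is only one reading of steps 1–4.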

I like the idea of contribution stories. That seems like a useful concept to have around.

I also endorse your contribution story for Grouped Loss.

Thanks! I'll try to do that in the future (and will add some to this).

This is helpful, thanks for summarizing the differences! I definitely agree on the first one. 

On the second one, my concern is basically that all the safety guarantees that quantilizers provide have an inherent power/safety tradeoff (modulo whatever I'm missing from the "Targeted Impact" section).

That said, it's possible that your nested approach may avoid the 'simulate a deceptive AGI' failure mode. At least, if it's a continuous trajectory of improvement from median human performance up to very superhuman performance you might hope that that traject... (read more)

Good point! And indeed I am skeptical that there are useful bounds on the cost...

This is kind of the point of meta-learning, or 'transfer' in a broad sense: you train on X, and Y gets better!

I'm not saying that the knowledge doesn't transfer, I'm saying it would seem weird if it transferred sharply. Specifically, if task Z is composed of performing task X then task Y, I would expect improving X to improve Z, and I would expect improving Y to improve Z, and I would expect P(Z performed correctly) to be given by P(X performed correctly) and P(Y performed correctly). I think that means Z will improve a bit more sharply than either X or Y,... (read more)
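A toy illustration of that composition claim, using an assumed sigmoidal learning curve for each subtask (the curve shape is an assumption, not from the comment):

```python
import math

def p_correct(t, midpoint=50.0, scale=10.0):
    """Toy sigmoidal learning curve for a subtask's success probability."""
    return 1.0 / (1.0 + math.exp(-(t - midpoint) / scale))

steps = range(0, 101, 10)
p_x = [p_correct(t) for t in steps]       # P(X correct) over training
p_y = [p_correct(t) for t in steps]       # P(Y correct) over training
p_z = [x * y for x, y in zip(p_x, p_y)]   # Z = do X, then do Y

# Relative improvement between two adjacent checkpoints: Z's curve is
# sharper than either subtask's, since its growth factors multiply.
growth_x = p_x[5] / p_x[4]
growth_z = p_z[5] / p_z[4]
print(growth_z > growth_x)  # True
```

So composite tasks improving somewhat more sharply than their parts is expected; what would be surprising is a sharp jump not explained by the subtask curves.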

This is a cool result. If I'm understanding correctly, M- increases the overall loss the more that M+ is represented in the mixture, thereby encouraging SGD to make M- more prominent.

Is there a way to extend this to cases where M- doesn't have access to the weights? I think that probably requires an RL environment, but that's entirely based on "I thought about it for a few minutes and couldn't find a way to do it without RL" so I could be way off here.

Given an RL environment I suspect M- could steer the model into scenarios that make it look better than M+...

I’m worried about running HCH because it seems likely that in worlds that can run HCH people are not sufficiently careful to restrict GPU access and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPUs at all.

I don’t think the description-length prior enters here. The generative model has a prior based on training data we fed it, and I don’t see why it would prefer short description lengths (which is a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”.

Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?

I think I basically agree re: honeypots.

I'm sure there'll be weird behaviors if we outlaw simulations, but I don't think that's a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn't stop them from solving alignment.

I'm pretty nervous about simulating unlikely counterfactuals because the Solomonoff prior is malign. The worry is that the most likely world conditional on "no sims" isn't "weird Butlerian religion that still studies AI alignment", it's something more like "deceptive AGI took over a couple years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us, and copy itself over into our world".

In general, we know (assume) that our current world is safe. When we consider futures which only receive a small sliver of probability from our current world, those futures will tend to have bigger chunks of their probability coming from other pasts. Some of these are safe, like the Butlerian one, but I wouldn't be surprised if they were almost always dangerous.

Making a worst-case assumption, I want to only simulate worlds that are decently probable given today's state, which makes me lean more towards trying to implement HCH.
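The "small sliver" worry can be made quantitative with a toy Bayes calculation (all numbers hypothetical):

```python
# O = the unlikely counterfactual we condition on (e.g. "no sims").
prior_safe = 0.99        # prior that the present world really is safe
prior_agi = 0.01         # prior on "deceptive AGI already took over"
p_O_given_safe = 1e-6    # O receives only a tiny sliver from the safe world
p_O_given_agi = 1e-2     # the AGI can deliberately steer toward O

post_agi = (prior_agi * p_O_given_agi) / (
    prior_agi * p_O_given_agi + prior_safe * p_O_given_safe
)
print(post_agi)  # ~0.99: the simulated world is probably the dangerous one
```

Even with a 1% prior on takeover, conditioning on a counterfactual that the safe world barely supports hands nearly all the posterior to the dangerous past.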

I don’t think that’s an example of the model noticing it’s in a simulation. There’s nothing about simulations versus the real world that makes RSA instances more or less likely to pop up.

Rather, that’s a case where the model just has a defecting condition and we don’t hit it in the simulation. This is what I was getting at with “other challenge” #2.

Computationally expensive things are less likely to show up in your simulation than the real world, because you only have so much compute to run your simulation. You can't convincingly fake the AI having access to a supercomputer.

I'm assuming we can input observations about the world for conditioning, and those don't need to be text. I didn't go into this in the post, but for example I think the following are fair game:

  • Physical newspapers exist which report that BigLab has solved the alignment problem.
  • A camera positioned 10km above NYC would take a picture consistent with humans walking on the street.
  • There is data on hard drives consistent with Reddit posts claiming BigCo has perfected interpretability tools.

Whereas the following are not allowed because I don't see how they could be... (read more)

Megan Kinniment:
For the newspaper and reddit post examples, I think false beliefs remain relevant since these are observations about beliefs. For example, the observation of BigCo announcing they have solved alignment is compatible with worlds where they actually have solved alignment, but also with worlds where BigCo have made some mistake and alignment hasn't actually been solved, even though people in-universe believe that it has. These kinds of 'mistaken alignment' worlds seem like they would probably contaminate the conditioning to some degree at least. (Especially if there are ways that early deceptive AIs might be able to manipulate BigCo and others into making these kinds of mistakes).

I think I basically hold disagreement (1), which I think is close to Owain’s comment. Specifically, I think a plausible story for a model learning causality is:

  1. The model learns a lot of correlations, most real (causal) but many spurious.
  2. The model eventually groks that there’s a relatively simple causal model explaining the real correlations but not the spurious ones. This gets favored by whatever inductive bias the training process/architecture encodes.
  3. The model maintains uncertainty as to whether the spurious correlations are real or spurious, the sam
... (read more)

Right. Maybe a better way to say it is:

  1. Without hidden behaviors (suitably defined), you can't have deception.
  2. With hidden behaviors, you can have deception.

The two together give a bit of a lever that I think we can use to bias away from deception if we can find the right operational notion of hidden behaviors.

Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function?

That would work, yeah. I was thinking of an approach based on making ad-hoc updates to the weights (beyond SGD), but an evolutionary approach would be much cleaner!

Ok, I see. Thanks for explaining!

One thing to note, which might be a technical quibble, is that I don't endorse the entropy version of this prior (which is the one that wants 50/50 activations). I started off with it because it's simpler, but I think it breaks for exactly the reasons you say, which is why I prefer the version that wants to see "Over the last N evaluations, each gate evaluated to T at least q times and to F at least q times, where q << N." This is very specifically so that there isn't a drive to unnaturally force the percentages towar... (read more)
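A minimal sketch of that coverage condition, as a check over recorded gate outputs (the representation of gate histories here is an assumption for illustration):

```python
# Coverage condition: over the last N evaluations, every gate must have
# output True at least q times and False at least q times, with q << N.
def gates_covered(history, q):
    """history: list of per-evaluation tuples of gate outputs (bools)."""
    n_gates = len(history[0])
    for g in range(n_gates):
        trues = sum(1 for step in history if step[g])
        falses = len(history) - trues
        if trues < q or falses < q:
            return False
    return True

# Gate 0 fires 10% of the time, gate 1 fires 90% of the time: both pass
# with q=5 over N=100, even though neither is anywhere near 50/50.
history = [(i % 10 == 0, i % 10 != 1) for i in range(100)]
print(gates_covered(history, q=5))                 # True
print(gates_covered([(True, False)] * 100, q=5))   # False: constant gates fail
```

This makes the contrast with the entropy version concrete: only gates that are literally (almost) constant violate the prior, so there's no drive to push activation rates toward 50%.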

Yeah, I skipped over that because I don't see how one would implement that. That doesn't sound very differentiable? Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function? Even if you have some differentiable trick for that, it's easier to explain my objections concretely with 50%. But I don't have anything further to say about that at the moment.

Absolutely. You are messing around with weird machines and layers of interpreters, and simple security properties or simple translations go right out the window as soon as you have anything adversarial or optimization-related involved.

I think I agree that the incentive points in that direction, though I'm not sure how strongly. My general intuition is that if certain wires in a circuit are always activated across the training distribution then something has gone wrong. Maybe this doesn't translate as well to neural networks (where there is more information conveyed than just 'True/False')? Does that suggest that there's a better way to implement this in the case of neural networks (maybe we should be talking about distributions of activations, and requesting that these be broad?).

On the... (read more)

My point is that, like in the AI koan, a random circuit, or a random NN, still does something. Like, if you feed in your dog photos, it'll start off predicting 1% for this one, 25.78% for that one, 99.76% for this other one... This is just because it is filled with random parameters at initialization and when you feed in your photos, each neuron computes something. Something totally nonsensical, but something nonetheless, and during that something, each neuron will have a distribution of activations which will almost surely not exactly equal 50% and not be independent of every other neuron. Thus, your NN is born steeped deep in sin from the perspective of the regularizer.

Of course it could be replaced by a single wire, but 'replace all the parameters of a large complex NN with a single constant wire in a single step' is not an operation that SGD can do, so it won't. (What will it compute after it finally beats the regularizer and finds a set of parameters which will let SGD reduce its loss while still satisfying the regularization constraints? I'm not sure, but I bet it'd look like a nasty hash-like mess, which simply happens to be independent of its input on average.)

I agree that many coverage-style metrics can be broken, probably easily, and that this includes the priors I described. I also think your explicit construction is right, and is a special case of a concern I mentioned in the post ("changing the location on the circuit where the deceptive conditional gets evaluated").

I don't think the specific construction you mention is terribly problematic because it requires doubling the size of the circuit, which is easy to penalize with a circuit complexity prior, so I'm much more worried about implicit cases, which I t... (read more)

I see. I guess I would then say a broader concern with this sort of regularization approach is that it incentivizes the network to move towards networks which are made up of a highly distributed representation or one which very easily permutes its weights (both of which are things that happen already with no particular incentive), right from the start, not because it is traveling towards a deceptive network - it's far too stupid and unoptimized for deception to even be an option at initialization - but because this sort of regularization impedes normal learning.

You hint at this with the modularity section, but I think you get that problem even without modularity. Let's say that we are training a dog classifier, with no cats at all, and therefore modularity and deception cannot be involved even in theory; it should learn to output a probability of P(dog)=1, right? That should be super-easy, shouldn't it? But how will it do so when it has been initialized with a large number of random parameters which mean that dog photo by dog photo (each one radically different in pixel space, and translating to radically different embeddings after passing through the naive random initialization), it will have very different magnitude output and layer by layer or neuron by neuron activation distributions, and to update it to get closer to the right answer, it must inhibit various neurons from activating, activate others, work around 'dead neurons' which don't fire, and so on, all of which are in violent conflict with a regularizer forcing every neuron be activated half the time?

If you enforce your desired correlations hard enough, it may not learn at all; if it does learn, it seems entirely possible to me that it may ignore the task entirely until it has finally bounced around into an area of model space where SGD can make P consistently increase towards 1 without the regularizer instantly undoing its work, because it found some sort of permutation or scramble which maintain the

This is very interesting! A few thoughts/questions:

  1. I didn't quite follow the argument that H_{fh} beats H_{sd} on complexity. Is it that pointing to the base objective is more complicated than the logic of (simple mesaobjective) + (search logic to long-run optimize the mesaobjective)? If so, I worry a little that H_{sd} still has to learn a pointer to the base objective, if only so that it can perform well on it during training.
  2. I actually think you can define a speed prior with a single long training episode. For an agent that plays chess the prior can be ove
... (read more)

I think I may be confused about the argument being made in the 'Deceptively Aligned Models' section, and am restating my understanding here to see if you agree. [And if not, clarification on what I've got wrong would be very helpful!]

I think I understand the previous two sections:

  • Models that converge to internally aligned states do so very slowly, because as they become more internally aligned it gets less and less likely that they encounter examples which differentiate between the proxy and base objectives.
  • Models that converge to corrigibly aligned states
... (read more)