Got it, that’s very clear. Thanks!
So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.
I was rereading this and was struck by how much work the following does:
Manipulative AGI is knowledgeable/intelligent enough to anticipate what criteria we will condition on in counterfactual worlds where it does not exist, and manipulate world events to meet whatever criteria we might specify with probability ~1.
If this doesn't hold, and the AGI instead has to guess (with success probability p << 1) what criteria we were going to condition on, then the update in favor of AGI is roughly pX:1, which could easily be less than 1.
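To put illustrative numbers on that (made up, just to show the direction of the effect): suppose the condition is X = 100 times more likely given a manipulative AGI that anticipates it perfectly, but the AGI only guesses our criteria correctly with probability p = 10^-3. Then

$$\frac{P(\text{condition holds} \mid \text{AGI})}{P(\text{condition holds} \mid \text{no AGI})} \approx \frac{p \cdot 1}{1/X} = pX = 0.1,$$

i.e. seeing the condition hold is actually mild evidence against a manipulative AGI.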
Moreover I think there are ways we can arrange... (read more)
I definitely endorse the argument you'd buy, but I also endorse a broader one. My claim is that there is information which goes into timelines which is not just downstream of which architecture I think gets there first.
For example, if you told me that humanity loses the ability to make chips "from tomorrow until forever", my timeline gets a lot longer in a way that isn't just downstream of which architecture I think is going to get there first. That then changes which architectures I think are going to get there first (strongly away from DL), primarily by making my estimated timeline long enough for capabilities folks to discover some theoretically-more-efficient but far-from-implementable-today architectures.
I think timelines are a useful input to what architecture takes off first. If the timelines are short, I expect AGI to look like something like DL/Transformers/etc. If timelines are longer there might be time for not-yet-invented architectures to take off first. There can be multiple routes to AGI, and "how fast do we go down each route" informs which one happens first.
Another angle: number of bits of optimization required is a direct measure of “how far out of distribution” we need to generalize.
I think it's useful to distinguish between the amount of optimization we ask the model to do versus the unlikelihood of the world we ask it to simulate.
For instance, I can condition on something trivial like "the weather was rainy on 8/14, sunny on 8/15, rainy on 8/16...". This specifies a very unlikely world, but so long as the pattern I specify is plausible it doesn't require much optimization on the part of the model or take ... (read more)
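To put a rough number on the gap (treating each day's weather as roughly a coin flip): conditioning on N days of the pattern picks out a world with prior probability

$$P(\text{pattern}) \sim 2^{-N},$$

so the unlikelihood grows linearly in N (in bits), while the optimization asked of the model, continuing a pattern that is already written out for it, stays essentially constant.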
Regarding your “Redirecting civilization” approach: I wonder about the competitiveness of this. It seems that we will likely build x-risk-causing AI before we have a good enough model to be able to e.g. simulate the world 1000 years into the future on an alternative timeline?
I'm not sure. My sense is that generative models have a huge lead in terms of general capabilities over ~everything else, and that seems to be where the most effort is going today. So unless something changes there I expect generative models to be the state of the art when we hi... (read more)
Playing the perplexity game had a big impact on my intuitions around language models, so thanks for making it! In particular, the fact that models are so much better at it than humans means we can't really tell from behavior alone whether a model is genuinely trying to predict the next token. This is a problem for detecting inner alignment failure, because we can't tell (outside of the training set) if the model is actually optimizing for next-token prediction or something that just looks (to us) like next-token prediction.
Apart from this, I do think logical dependences and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy.
Oh interesting. I think this still runs into the issue that you'll have instrumental goals whenever you ask the model to simulate itself (i.e. just the first step in the hierarchy hits this issue).
Regarding using prompts, I wonder, how do you think we could get the kind of model you talk about in your post on conditioning generative models?
I was imagining tha... (read more)
The section on fixed points was interesting! I wonder if there's a way to avoid the recursion altogether though? Specifically, is there a way to condition the model such that the world it simulates doesn't contain humans who use the model (or one very like it)? I'm not sure, and would be interested in your thoughts on this.
Is the loss we’re training the generative model on - in the case of language models, the predictive loss over the next token - actually representative of the world prior?
This seems important and is not a thing I've thought about carefully, so thanks for bringing it up and exploring it. I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship betwe... (read more)
I like the use of L-knowledge to split the questions we insist on getting answered from those we don't. That indeed seems to divide the space nicely!
What this means is that picking out the direct translator from all models consistent with the data must depend on the predictor. Otherwise, if the same training process is used for all predictors, it could give the human simulator for some predictors even while giving the direct translator for others.
I don't follow this point. If I take a reporter trained to be a direct translator on one predictor and hook it up to a different predictor I expect I'll get some incoherent output rather than a human simulator. Why should I get a human simulator in this instance?
I found this post clarifying. One thing I'm still uncertain of: what's the architecture of the Reporter in this proposal? Does it have two heads, one for proposing changes to the Predictor's state and one for answering questions? If so, can I think of the training process as:
I like the idea of contribution stories. That seems like a useful concept to have around.
I also endorse your contribution story for Grouped Loss.
Thanks! I'll try to do that in the future (and will add some to this).
This is helpful, thanks for summarizing the differences! I definitely agree on the first one.
On the second one, my concern is basically that all the safety guarantees that quantilizers provide have an inherent power/safety tradeoff (modulo whatever I'm missing from the "Targeted Impact" section).
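For reference, the tradeoff I have in mind is the usual q-quantilizer guarantee. A minimal sketch (the uniform base distribution over a finite action list and the `utility` argument are illustrative choices of mine, not anything from the post):

```python
import numpy as np

def quantilize(actions, utility, q, rng=None):
    """Sample uniformly from the top q-fraction of `actions` under `utility`.

    The usual guarantee: the quantilizer puts at most 1/q times as much
    probability on any action as the base (here uniform) distribution does,
    so its expected harm under any cost function is at most 1/q times the
    base distribution's expected harm.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([utility(a) for a in actions])
    cutoff = np.quantile(scores, 1 - q)  # score threshold for the top q-fraction
    top = [a for a, s in zip(actions, scores) if s >= cutoff]
    return top[rng.integers(len(top))]
```

Pushing q toward 0 recovers something close to argmax (more power) while the 1/q bound becomes vacuous (less safety), which is the tradeoff I mean.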
That said, it's possible that your nested approach may avoid the 'simulate a deceptive AGI' failure mode. At least, if it's a continuous trajectory of improvement from median human performance up to very superhuman performance you might hope that that traject... (read more)
Good point! And indeed I am skeptical that there are useful bounds on the cost...
This is kind of the point of meta-learning, or 'transfer' in a broad sense: you train on X, and Y gets better!
I'm not saying that the knowledge doesn't transfer, I'm saying it would seem weird if it transferred sharply. Specifically, if task Z is composed of performing task X then task Y, I would expect improving X to improve Z, I would expect improving Y to improve Z, and I would expect P(Z performed correctly) to be given by the product P(X performed correctly) × P(Y performed correctly). I think that means Z will improve a bit more sharply than either X or Y,... (read more)
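One toy way to see the direction of the effect (the decomposition is mine): if P_Z = P_X P_Y then

$$\frac{d}{dt}\ln P_Z = \frac{d}{dt}\ln P_X + \frac{d}{dt}\ln P_Y,$$

so Z's fractional rate of improvement is the sum of the components' rates: somewhat sharper than either alone, but not a discontinuous jump.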
This is a cool result. If I'm understanding correctly, M- increases its loss the more that M+ is represented in the mixture, thereby encouraging SGD to make M- more prominent.
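Writing my reading of the mechanism out in symbols (the notation is mine): with mixture weight $\alpha$ on M+,

$$L(\alpha) = \alpha L_+ + (1-\alpha) L_-(\alpha), \qquad \frac{dL}{d\alpha} = L_+ - L_- + (1-\alpha)\frac{dL_-}{d\alpha},$$

so if M- arranges for $dL_-/d\alpha$ to be large and positive, the total gradient can point toward shrinking $\alpha$ even when $L_+ < L_-$, i.e. even when M+ is the genuinely better predictor.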
Is there a way to extend this to cases where M- doesn't have access to the weights? I think that probably requires an RL environment, but that's entirely based on "I thought about it for a few minutes and couldn't find a way to do it without RL" so I could be way off here.
Given an RL environment I suspect M- could steer the model into scenarios that make it look better than M+...
I’m worried about running HCH because it seems likely that, in worlds that can run HCH, people are not sufficiently careful to restrict GPU access, and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPUs at all.
I don’t think the description-length prior enters here. The generative model has a prior based on the training data we fed it, and I don’t see why it would prefer short description lengths (a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”.
Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?
I think I basically agree re: honeypots.
I'm sure there'll be weird behaviors if we outlaw simulations, but I don't think that's a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn't stop the people in that world from solving alignment.
I don’t think that’s an example of the model noticing it’s in a simulation. There’s nothing about simulations versus the real world that makes RSA instances more or less likely to pop up.
Rather, that’s a case where the model just has a defecting condition and we don’t hit it in the simulation. This is what I was getting at with “other challenge” #2.
I'm assuming we can input observations about the world for conditioning, and those don't need to be text. I didn't go into this in the post, but for example I think the following are fair game:
Whereas the following are not allowed because I don't see how they could be... (read more)
I think I basically hold disagreement (1), which I think is close to Owain’s comment. Specifically, I think a plausible story for a model learning causality is:
Right. Maybe a better way to say it is:
The two together give a bit of a lever that I think we can use to bias away from deception if we can find the right operational notion of hidden behaviors.
Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function?
That would work, yeah. I was thinking of an approach based on making ad-hoc updates to the weights (beyond SGD), but an evolutionary approach would be much cleaner!
Ok, I see. Thanks for explaining!
One thing to note, which might be a technical quibble, is that I don't endorse the entropy version of this prior (which is the one that wants 50/50 activations). I started off with it because it's simpler, but I think it breaks for exactly the reasons you say, which is why I prefer the version that wants to see "Over the last N evaluations, each gate evaluated to T at least q times and to F at least q times, where q << N." This is very specifically so that there isn't a drive to unnaturally force the percentages towar... (read more)
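To make that version concrete, here's a rough sketch of the check I have in mind (the variable names and the exact penalty form are mine; the post doesn't commit to an implementation):

```python
import numpy as np

def coverage_penalty(gate_evals: np.ndarray, q: int) -> float:
    """Penalty for the 'each gate must evaluate both ways' prior (sketch only).

    gate_evals: boolean array of shape (N, num_gates) holding each gate's
    output over the last N evaluations.  A gate is penalized only if it came
    out True fewer than q times or False fewer than q times; once both
    counts reach q there is no further pressure, so nothing pushes the
    activation rate toward 50/50.
    """
    true_counts = gate_evals.sum(axis=0)
    false_counts = gate_evals.shape[0] - true_counts
    shortfall = np.maximum(q - true_counts, 0) + np.maximum(q - false_counts, 0)
    return float(shortfall.sum())
```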
I think I agree that the incentive points in that direction, though I'm not sure how strongly. My general intuition is that if certain wires in a circuit are always activated across the training distribution then something has gone wrong. Maybe this doesn't translate as well to neural networks (where there is more information conveyed than just 'True/False')? Does that suggest that there's a better way to implement this in the case of neural networks (maybe we should be talking about distributions of activations, and requesting that these be broad?).
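A continuous analogue in that spirit might look like the sketch below (the bin count and the entropy form are arbitrary choices of mine; since the histogram isn't differentiable, this would pair better with ad-hoc weight updates or an evolutionary fitness term than with a standard SGD loss):

```python
import numpy as np

def broadness_penalty(activations: np.ndarray, num_bins: int = 16) -> float:
    """Penalize units whose activation distribution is narrow (sketch only).

    activations: array of shape (num_samples, num_units).  Each unit's
    activations are histogrammed over the samples; a low-entropy (narrow)
    histogram pays a penalty and a maximally broad one pays nothing.
    """
    penalties = []
    for unit in activations.T:
        counts, _ = np.histogram(unit, bins=num_bins)
        probs = counts / counts.sum()
        probs = probs[probs > 0]
        entropy = -(probs * np.log(probs)).sum()
        penalties.append(np.log(num_bins) - entropy)
    return float(np.mean(penalties))
```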
On the... (read more)
I agree that many coverage-style metrics can be broken, probably easily, and that this includes the priors I described. I also think your explicit construction is right, and is a special case of a concern I mentioned in the post ("changing the location on the circuit where the deceptive conditional gets evaluated").
I don't think the specific construction you mention is terribly problematic because it requires doubling the size of the circuit, which is easy to penalize with a circuit complexity prior, so I'm much more worried about implicit cases, which I t... (read more)
This is very interesting! A few thoughts/questions:
I think I may be confused about the argument being made in the 'Deceptively Aligned Models' section, and am restating my understanding here to see if you agree. [And if not, clarification on what I've got wrong would be very helpful!]
I think I understand the previous two sections: