All of evhub's Comments + Replies

Mundane solutions to exotic problems

Your link is broken.

For reference, the first post in Paul's ascription universality sequence can be found here (also Adam has a summary here).

1Adam Shimi19hSorry about that. I corrected it but it was indeed the first link you gave.
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I guess I would say something like: random search is clearly a pretty good first-order approximation, but there are also clearly second-order effects. I think that exactly how strong/important/relevant those second-order effects are is unclear, however, and I remain pretty uncertain there.

Homogeneity vs. heterogeneity in AI takeoff scenarios

I suppose the distinction between "strong" and "weak" warning shots would matter if we thought that we were getting "strong" warning shots. I want to claim that most people (including Evan) don't expect "strong" warning shots, and usually mean the "weak" version when talking about "warning shots", but perhaps I'm just falling prey to the typical mind fallacy.

I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception”... (read more)

6Rohin Shah20dI don't automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the "weak" warning shots discussed above.)
Open Problems with Myopia

Yes; episode is correct there—the whole point of that example is that, by breaking the episodic independence assumption, otherwise hidden non-myopia can be revealed. See the discussion of the prisoner's dilemma unit test in Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” for more detail on how breaking this sort of episodic independence plays out in practice.

Open Problems with Myopia

Yeah, I agree—the example should probably just be changed to be about an imitative amplification agent or something instead.

Open Problems with Myopia

I think that trying to encourage myopia via behavioral incentives is likely to be extremely difficult, if not impossible (at least without a better understanding of our training processes' inductive biases). Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” is a good resource for some of the problems that you run into when you try to do that. As a result, I think that mechanistic incentives are likely to be necessary—and I personally favor some form of relaxed adversarial training—but that's going to require us to get a better unde... (read more)

MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

Possibly relevant here is my transparency trichotomy between inspection transparency, training transparency, and architectural transparency. My guess is that inspection transparency and training transparency would mostly go in your “active transparency” bucket and architectural transparency would mostly go in your “passive transparency” bucket. I think there is a position here that makes sense to me, which is perhaps what you're advocating, that architectural transparency isn't relying on any sort of path-continuity arguments in terms of how your training ... (read more)

MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

Fwiw, I also agree with Adele and Eliezer here and just didn't see Eliezer's comment when I was giving my comments.

Formal Solution to the Inner Alignment Problem

Sure—by that definition of realizability, I agree that's where the difficulty is. Though I would seriously question the practical applicability of such an assumption.

Formal Solution to the Inner Alignment Problem

Perhaps I just totally don't understand what you mean by realizability, but I fail to see how realizability is relevant here. As I understand it, realizability just says that the true model has some non-zero prior probability—but that doesn't matter (at least for the MAP, which I think is a better model than the full posterior for how SGD actually works) as long as there's some deceptive model with greater prior probability that's indistinguishable on the training distribution, as in my simple toy model from earlier.

3Vanessa Kosoy2moWhen talking about uniform (worst-case) bounds, realizability just means the true environment is in the hypothesis class, but in a Bayesian setting (like in the OP) it means that our bounds scale with the probability of the true environment in the prior. Essentially, it means we can pretend the true environment was sampled from the prior. So, if (by design) training works by sampling environments from the prior, and (by realizability) deployment also consists of sampling an environment from the same prior, training and deployment are indistinguishable.
Formal Solution to the Inner Alignment Problem

Yeah, that's a fair objection—my response to that is just that I think that preventing a model from being able to distinguish training and deployment is likely to be impossible for anything competitive.

1Vanessa Kosoy2moOkay, but why? I think that the reason you have this intuition is, the realizability assumption is false. But then you should concede that the main weakness of the OP is the realizability assumption rather than the difference between deep learning and Bayesianism.
Formal Solution to the Inner Alignment Problem

Here's a simple toy model. Suppose you have two agents that internally compute their actions as follows (perhaps with actual argmax replaced with some smarter search algorithm, but still basically structured as below):

Then, comparing the K-complexity of the two models, we get

and the problem becomes that both and will produce behavior that looks aligned on the training distribution, bu... (read more)

4Vanessa Kosoy2moI understand how this model explains why agents become unaligned under distributional shift. That's something I never disputed. However, I don't understand how this model applies to my proposal. In my proposal, there is no distributional shift, because (thanks to the realizability assumption) the real environment is sampled from the same prior that is used for training with C. The model can't choose to act deceptively during training, because it can't distinguish between training and deployment. Moreover, the objective I described is not complicated.
Formal Solution to the Inner Alignment Problem

It's not that simulating is difficult, but that encoding for some complex goal is difficult, whereas encoding for a random, simple goal is easy.

3Vanessa Kosoy2moThe reward function I was sketching earlier is not complex. Moreover, if you can encode M then you can encode B, no reason why the latter should have much greater complexity. If you can't encode M but can only encode something that produces M, then by the same token you can encode something that produces B (I don't think that's even a meaningful distinction tbh). I think it would help if you could construct a simple mathematical toy model of your reasoning here?
Formal Solution to the Inner Alignment Problem

I think that “think up good strategies for achieving [reward of the type I defined earlier]” is likely to be much, much more complex (making it much more difficult to achieve with a local search process) than an arbitrary goal X for most sorts of rewards that we would actually be happy with AIs achieving.

1Vanessa Kosoy2moWhy? It seems like M would have all the knowledge required for achieving good rewards of the good type, so simulating M should not be more difficult than achieving good rewards of the good type.
Formal Solution to the Inner Alignment Problem

A's structure can just be “think up good strategies for achieving X, then do those,” with no explicit subroutine that you can find anywhere in A's weights that you can copy over to B.

3Vanessa Kosoy2moIIUC, you're saying something like: suppose trained-A computes the source code of the complex core of B and then runs it. But then, define B′ as: compute the source code of the complex core of B (in the same way A does it) and use it to implement B. B′ is equivalent to B and has about the same complexity as trained- A. Or, from a slightly different angle: if "think up good strategies for achieving X" is powerful enough to come up with M, then "think up good strategies for achieving [reward of the type I defined earlier]" is powerful enough to come up with B.
Formal Solution to the Inner Alignment Problem

A is sufficiently powerful to select M which contains the complex part of B. It seems rather implausible that an algorithm of the same power cannot select B.

A's weights do not contain the complex part of B—deception is an inference-time phenomenon. It's very possible for complex instrumental goals to be derived from a simple structure such that a search process is capable of finding that simple structure that yields those complex instrumental goals without being able to find a model with those complex instrumental goals hard-coded as terminal goals.

1Vanessa Kosoy2moHmm, sorry, I'm not following. What exactly do you mean by "inference-time" and "derived"? By assumption, when you run A on some sequence it effectively simulates M which runs the complex core of B. So, A trained on that sequence effectively contains the complex core of B as a subroutine.
Formal Solution to the Inner Alignment Problem

B can be whatever algorithm M uses to form its beliefs + confidence threshold (it's strategy stealing.)

Sure, but then I think B is likely to be significantly more complex and harder for a local search process to find than A.

Why? If you give evolution enough time, and the fitness criterion is good (as you apparently agreed earlier), then eventually it will find B.

I definitely don't think this, unless you have a very strong (and likely unachievable imo) definition of “good.”

Second, why? Suppose that your starting algorithm is good at finding results

... (read more)
1Vanessa Kosoy2moA is sufficiently powerful to select M which contains the complex part of B. It seems rather implausible that an algorithm of the same power cannot select B. CNNs are specific in some way, but in a relatively weak way (exploiting hierarchical geometric structure). Transformers are known to be Turing-complete [https://arxiv.org/abs/2006.09286], so I'm not sure they are specific at all (ofc the fact you can express any program doesn't mean you can effectively learn any program, and moreover the latter is false on computational complexity grounds, but it still seems to point to some rather simple and large class of learnable hypotheses). Moreover, even if our world has some specific property that is important for learning, this only means we may need to enforce this property in our prior. For example, if your prior is an ANN with random weights then it's plausible that it reflects the exact inductive biases of the given architecture.
Formal Solution to the Inner Alignment Problem

This seems very sketchy to me. If we let A = SGD or A = evolution, your first claim becomes “if SGD/evolution finds a malign model, it must understand that it's malign on some level,” which seems just straightforwardly incorrect. The last claim also seems pretty likely to be wrong if you let C = SGD or C = evolution.

Moreover, it definitely seems like training on data sampled from a simplicity prior (if that's even possible—it should be uncomputable in general) is unlikely to help at all. I think there's essentially no realistic way that training on synthet... (read more)

2Vanessa Kosoy2moTo me it seems straightforwardly correct! Suppose you're running evolution in order to predict a sequence. You end up evolving a mind M that is a superintelligent malign consequentialist: it makes good predictions on purpose, in order to survive, and then produces a malign false prediction at a critical moment. So, M is part of the state of your algorithm A. All of M's knowledge is also part of the state. M knows that M is malign, M knows the prediction it's making this time is false. In this case, B can be whatever algorithm M uses to form its beliefs + confidence threshold (it's strategy stealing.) Why? If you give evolution enough time, and the fitness criterion is good (as you apparently agreed earlier), then eventually it will find B. First, obviously we use a bounded simplicity prior, like I said in the beginning of this thread. It can be something like weighting programs by 2^{-length} while constraining their amount of computational resources, or something like an ANN with random weights (the latter is more speculative, but given that we know ANNs have inductive bias to simplicity, an untrained ANN probably reflects that.) Second, why? Suppose that your starting algorithm is good at finding results when a lot of data is available, but is also very data inefficient (like deep learning seems to be). Then, by providing it with a lot of synthetic data, you leverage its strength to find a new algorithm which is data efficient. Unless you believe deep learning is already maximally data efficient (which seems very dubious to me)?
Formal Solution to the Inner Alignment Problem

Because that's never what machine learning actually looks like in practice—essentially any access to text from the internet (let alone actual ability to interface with the world, both of which seem necessary for competitive systems) will let it determine things like the year, whether RSA-2048 has been factored, or other various pieces of information that are relevant to what stage in training/testing/deployment it's in, how powerful the feedback mechanisms keeping it in check are, whether other models are defecting, etc. that can let it figure out when to defect.

1Vanessa Kosoy2moHmm, maybe you misunderstood my proposal? I suggested to train the model by meta-learning on purely synthetic data, sampled from some kind of simplicity prior, without any interface to the world. Maybe you just think this wouldn't be competitive? If so, why? Is the argument just that there are no existing systems like this? But then it's weak evidence at best. On the contrary, even from a purely capability standpoint, meta-learning with synthetic data might be a promising strategy to lower deep learning's sample complexity. Here's why I think it will be competitive: * When a hypothetical competitive prediction algorithm A produces a malign prediction, the knowledge that it is malign clearly exists somewhere inside A in some form: otherwise it wouldn't come up with the prediction. In particular, the knowledge that there are multiple plausible hypotheses consistent with the data also exists somewhere inside A. * Therefore, there must exist some competitive algorithm B that would be able to use the knowledge of this ambiguity to abstain from predicting in such cases. * There is no reason why B should be tailored to fine details of our physical world: it can be quite generic (as all deep learning algorithms). * If we have an ML algorithm C that is powerful enough to produce superintelligence, then it is likely powerful enough to come up with B. Since B is generic, the task of finding B doesn't require any real-world data, and can be accomplished by meta-learning on synthetic data like I suggested.
Formal Solution to the Inner Alignment Problem

I agree that random defection can potentially be worked around—but the RSA-2048 problem is about conditional defection, which can't be dealt with in the same way. More generally, I expect it to be extremely difficult if not impossible to prevent a model that you want to be able to operate in the real world from being able to determine at what point in training/deployment it's in.

1Vanessa Kosoy2moWhy do you expect it? During training, it finds itself in random universe. During deployment, it finds itself in another universe drawn from the same prior (the realizability assumption). How would it determine the difference?
Formal Solution to the Inner Alignment Problem

Right, but but you can look at the performance of your model in training, compare it to the theoretical optimum (and to the baseline of making no predictions at all) and get lots of evidence about safety from that. You can even add some adversarial training of the synthetic environment in order to get tighter bounds.

Note that adversarial training doesn't work on deceptive models due to the RSA-2048 problem; also see more detail here.

If on the vast majority of synthetic environments your model makes virtually no mispredictions, then, under the realizab

... (read more)
3Vanessa Kosoy2moI think that adversarial training working so well that it can find exponentially rare failures is an unnecessarily strong desideratum. We need to drive to probability of catastrophic failures to something very low, but not actually zero. If a system is going to work for time t during deployment, then running it in n different synthetic environments for time t during training is enough to drive the probability of failure down to O(1n). Now, this is prohibitively costly, but it's not exponentially prohibitively costly. And this doesn't use adversarial training at all! So, while it's not obvious how to get a good solution, it seems at least plausible that this can lead to a solution.
Formal Solution to the Inner Alignment Problem

that's a problem with deep learning in general and not just inner alignment

I think you are understanding inner alignment very differently than we define it in Risks from Learned Optimization, where we introduced the term.

The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal.

This is not true for deceptively aligned models, which is the situation I'm most concerned about, and—as we argue extensively in Risks from Learned Optimization—there are a lot of reasons why a model might end u... (read more)

1michaelcohen2moIt may be helpful to point to specific sections of such a long paper. (Also, I agree that a neural network trained trained with that reward could produce a deceptive model that makes a well-timed error.)
Formal Solution to the Inner Alignment Problem

Sure, but you have no guarantee that the model you learn is actually going to be optimizing anything like that reward function—that's the whole point of the inner alignment problem. What's nice about the approach in the original paper is that it keeps a bunch of different models around, keeps track of their posterior, and only acts on consensus, ensuring that the true model always has to approve. But if you just train a single model on some reward function like that with deep learning, you get no such guarantees.

2Vanessa Kosoy2moRight, but but you can look at the performance of your model in training, compare it to the theoretical optimum (and to the baseline of making no predictions at all) and get lots of evidence about safety from that. You can even add some adversarial training of the synthetic environment in order to get tighter bounds. If on the vast majority of synthetic environments your model makes virtually no mispredictions, then, under the realizability assumption, it is very unlikely to make mispredictions in deployment. Ofc the realizability assumption should also be questioned: but that's true in the OP as well, so it's not a difference between Bayesianism and deep.
0Vanessa Kosoy2moYou have no guarantees, sure, but that's a problem with deep learning in general and not just inner alignment. The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal. To the extent your algorithm is able to approximate the true optimum during training, it will behave safely during deployment.
Formal Solution to the Inner Alignment Problem

I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won't, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.

a) Does that seem right? and b) Aren't there some policy gradient methods that don't face this problem?

This seems wrong to me—even though the Q learner is trained using its own point estimate of the next state, it isn't, at i... (read more)

1michaelcohen2moYeah I was agreeing with that. Right, but one thing the Q-network, in its forward pass, is trying to reproduce is the point of estimate of the Q-value of the next state (since it doesn't have access to it). What it isn't trying to reproduce, because it isn't trained that way, is multiple models of what the Q-value might be at a given possible next state.
Formal Solution to the Inner Alignment Problem

Hmmm... I don't think I was ever even meaning to talk specifically about RL, but regardless I don't expect nearly as large of a difference between Q-learning and policy gradient algorithms. If we imagine both types of algorithms making use of the same size massive neural network, the only real difference is how the output of that neural network is interpreted, either directly as a policy, or as Q values that are turned into a policy via something like softmax. In both cases, the neural network is capable of implementing any arbitrary policy and should be g... (read more)

3michaelcohen2moI interpreted this bit as talking about RL But taking us back out of RL, in a wide neural network with selective attention that enables many qualitatively different forward passes, gradient descent seems to be training the way different models get proposed (i.e. the way attention is allocated), since this happens in a single forward pass, and what we're left with is a modeling routine that is heuristically considering (and later comparing) very different models. And this should include any model that a human would consider. I think that is main thread of our argument, but now I'm curious if I was totally off the mark about Q-learning and policy gradient. I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won't, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be. a) Does that seem right? and b) Aren't there some policy gradient methods that don't face this problem?
Formal Solution to the Inner Alignment Problem

I agree that this is progress (now that I understand it better), though:

if SGD is MAP then it seems plausible that e.g. SGD + random initial conditions or simulated annealing would give you something like top N posterior models

I think there is strong evidence that the behavior of models trained via the same basic training process are likely to be highly correlated. This sort of correlation is related to low variance in the bias-variance tradeoff sense, and there is evidence that not only do massive neural networks tend to have pretty low variance, but that this variance is likely to continue to decrease as networks become larger.

2Vanessa Kosoy2moHmm, added to reading list, thank you.
Formal Solution to the Inner Alignment Problem

I agree that at some level SGD has to be doing something approximately Bayesian. But that doesn't necessarily imply that you'll be able to get any nice, Bayesian-like properties from it such as error bounds. For example, if you think of SGD as effectively just taking the MAP model starting from sort of simplicity prior, it seems very difficult to turn that into something like the top posterior models, as would be required for an algorithm like this.

2Vanessa Kosoy2moHere's another way how you can try implementing this approach with deep learning. Train the predictor using meta-learning on synthetically generated environments (sampled from some reasonable prior such as bounded Solomonoff or maybe ANNs with random weights). The reward for making a prediction is 1−(1−δ) maxipi+ϵqi+ϵ, where pi is the predicted probability of outcome i, qi is the true probability of outcome i and ϵ,δ∈(0,1) are parameters. The reward for making no prediction (i.e. querying the user) is 0. This particular proposal is probably not quite right, but something in that general direction might work.
4Vanessa Kosoy2moI mean, there's obviously a lot more work to do, but this is progress. Specifically if SGD is MAP then it seems plausible that e.g. SGD + random initial conditions or simulated annealing would give you something like top N posterior models. You can also extract confidence from NNGP.
Formal Solution to the Inner Alignment Problem

It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science.

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

if local search is this bad, I don't think it is a viable path to AGI

We know that local search processes can produce AGI, so viability is a question of efficiency—and we know that SGD is at least efficient enough to solve a wide variety of prob... (read more)

3michaelcohen2moOkay I think we've switched from talking about Q-learning to talking about policy gradient. (Or we were talking about the latter the whole time, and I didn't notice it). The question that I think is relevant is: how are possible world-models being hypothesized and analyzed? That's something I expect to be done with messy heuristics that sometimes have discontinuities their sequence of outputs. Which means I think that no reasonable DQN is will be generally intelligent (except maybe an enormously wide one attention-based one, such that finding models is more about selective attention at any given step than it is about gradient descent over the whole history). A policy gradient network, on the other hand, could maybe (after having its parameters updated through gradient descent) become a network that, in a single forward pass, considers diverse world-models (generated with a messy non-local heuristic), and analyzes their plausibility, and then acts. At the end of the day, what we have is an agent modeling world, and we can expect it to consider any model that a human could come up with. (This paragraph also applies to the DQN with a gradient-descent-trained method for selectively attending to different parts of a wide network, since that could amount to effectively considering different models).
Formal Solution to the Inner Alignment Problem

Yeah; I think I would say I disagree with that. Notably, evolution is not a generally intelligent predictor, but is still capable of producing generally intelligent predictors. I expect the same to be true of processes like SGD.

3michaelcohen2moIf we ever produce generally intelligent predictors (or "accurate world-models" in the terminology we've been using so far), we will need a process that is much more efficient than evolution. But also, I certainly don't think that in order to be generally intelligent you need to start with a generally intelligent subroutine. Then you could never get off the ground. I expect good hypothesis-generation / model-proposal to use a mess of learned heuristics which would not be easily directed to solve arbitrary tasks, and I expect the heuristic "look for models near the best-so-far model" to be useful, but I don't think making it ironclad would be useful. Another thought on our exchange: If what you say is correct, then it sounds like exclusively-local search precludes human-level intelligence! (Which I don't believe, by the way, even if I think it's a less efficient path). One human competency is generating lots of hypotheses, and then having many models of the world, and then designing experiments to probe those hypotheses. It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science. Even something as simple as understanding an interlocuter requires generating diverse models on the fly: "Do they mean X or Y with those words? Let me ask a clarfying question." I'm not this bearish on local search. But if local search is this bad, I don't think it is a viable path to AGI, and if it's not, then the internals don't for the purposes of our discussion, and we can skip to what I take to be the upshot:
Formal Solution to the Inner Alignment Problem

Here's the setup I'm imagining, but perhaps I'm still misunderstanding something. Suppose you have a bunch of deceptive models that choose to cooperate and have larger weight in the prior than the true model (I think you believe this is very unlikely, though I'm more ambivalent). Specifically, they cooperate in that they perfectly mimic the true model up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted, at which point they all simultaneously defect. This allows for arbitrarily bad worst-case behavior.

3michaelcohen2moThis thread began by considering deceptive models cooperating with each other in the sense of separating the timing of their treacherous turns in order to be maximally annoying to us. So maybe our discussion on that topic is resolved, and we can move on to this scenario. if alpha is low enough, this won't ever happen, and if alpha is too high, it won't take very long. So I don't think this scenario is quite right. Then the question becomes, for an alpha that's low enough, how long will it take until queries are infrequent, noting that you need a query any time any treacherous model with enough weight decides to take a treacherous turn?
Formal Solution to the Inner Alignment Problem

But for the purpose of analyzing it's output, I don't think this discussion is critical if we agree that we can expect a good heuristic search through models will identify any model that a human could hypothesize.

I think I would expect essentially all models that a human could hypothesize to be in the search space—but if you're doing a local search, then you only ever really see the easiest to find model with good behavior, not all models with good behavior, which means you're relying a lot more on your prior/inductive biases/whatever is determining how... (read more)

3michaelcohen2moSo would you say you disagree with the claim ?
Formal Solution to the Inner Alignment Problem

Well, just like we can write down the defectors, we can also write down the cooperators—these sorts of models do exist, regardless of whether they're implementing a decision theory we would call reasonable. And in this situation, the cooperators should eventually outcompete the defectors, until the cooperators all defect—which, if the true model has a low enough prior, they could do only once they've pushed the true model out of the top .

3michaelcohen2moIf it's only the case that we can write them down, but they're not likely to arise naturally as simple consequentialists taking over simple physics [https://www.lesswrong.com/posts/5bd75cc58225bf06703752a3/the-universal-prior-is-malign] , then that extra description length will be seriously costly to them, and we won't need to worry about any role they might play in p(treacherous)/p(truth). Meanwhile, when I was saying we could write down some defectors, I wasn't making a simultaneous claim about their relative prior weight, only that their existence would spoil cooperation. For cooperators to outcompete defectors, they would have to be getting a larger share of the gains from cooperation than defectors do. If some people are waiting for the fruit on public trees to ripen before eating, and some people aren't, the defectors will be the ones eating the almost ripe fruit. I might be misunderstanding this statement. The inverse of the posterior on the truth is a supermartingale (doesn't go up in expectation), so I don't know what it could mean for the true model to get pushed out.
Formal Solution to the Inner Alignment Problem

I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade off. And what's traded off is a public good: among treacherous models, it is a public good for the models' moments of treachery to be spread out. Spreading them out exhausts whatever data source is being called upon in moments of uncertainty. But for an individual model, it never reaps the benefit of helping to exhaust our resources. Given that treacherous models don't have an

... (read more)
3michaelcohen2moThey're one shot. They're not interacting causally. When any pair of models cooperates, a third model defecting benefits just as much from their cooperation with each other. And there will be defectors--we can write them down. So cooperating models would have to cooperate with defectors and cooperators indiscriminately, which I think is just a bad move according to any decision theory. All the LDT stuff I've seen on the topic is how to make one's own cooperation/defection depend logically on the cooperation/defection of others.
Formal Solution to the Inner Alignment Problem

I don't understand the alternative, but maybe that's neither here nor there.

Perhaps I'm just being pedantic, but when you're building mathematical models of things I think it's really important to call attention to what things in the real world those mathematical models are supposed to represent—since that's how you know whether your assumptions are reasonable or not. In this case, there are two ways I could interpret this analysis: either the Bayesian learner is supposed to represent what we want our trained models to be doing, and then we can ask how ... (read more)

4michaelcohen2moI'm not exactly sure what I mean either, but I wasn't imagining as much structure as exists in your post. I mean there's some process which constructs hypotheses, and choices are being made about how computation is being directed within that process. I think any any heuristic search algorithm worth its salt will incorporate information about proximity of models. And I think that arbitrary limits on heuristic search of the form "the next model I consider must be fairly close to the last one I did" will not help it very much if it's anywhere near smart enough to merit membership in a generally intelligent predictor. But for the purpose of analyzing it's output, I don't think this discussion is critical if we agree that we can expect a good heuristic search through models will identify any model that a human could hypothesize.
What happens to variance as neural network training is scaled? What does it imply about "lottery tickets"?

This paper seems to be arguing that variance initially increases as network width goes up, then starts decreasing for very large networks, suggesting that overall variance is likely to decrease as we approach more advanced AI systems and networks get very large.

'Variance' is used in an amusing number of ways in these discussions.You use 'variance' in one sense (the bias-variance tradeoff), but "Explaining Neural Scaling Laws", Bahri et al 2021 talks about a difference kind of variance limit in scaling, while "Learning Curve Theory", Hutter 2001's toy model provides statements on yet others kinds of variances about scaling curves themselves (and I think you could easily dig up a paper from the neural tangent kernel people about scaling approximating infinite width models which only need to make infinitesimally sma... (read more)

Formal Solution to the Inner Alignment Problem

I think I mostly agree with what you're saying here, though I have a couple of comments—and now that I've read and understand your paper more thoroughly, I'm definitely a lot more excited about this research.

then it would appear to be a regime where more intelligence makes the problem go away.

I don't think that's right—if you're modeling the training process as Bayesian, as I now understand, then the issue is that what makes the problem go away isn't more intelligent models, but less efficient training processes. Even if we have arbitrary compute, I th... (read more)

4Vanessa Kosoy2moI feel this is a wrong way to look at it. I expect any effective learning algorithm to be an approximation of Bayesianism in the sense that, it satisfies some good sample complexity bound w.r.t. some sufficiently rich prior. Ofc it's non-trivial to (i) prove such a bound for a given algorithm (ii) modify said algorithm using confidence thresholds in a way that leads to a safety guarantee. However, there is no sharp dichotomy between "Bayesianism" and "SGD" such that this approach obviously doesn't apply to the latter, or to something competitive with the latter.
1michaelcohen2moI don't understand the alternative, but maybe that's neither here nor there. It's a little hard to make a definitive statement about a hypothetical in which the inner alignment problem doesn't apply to Bayesian inference. However, since error bounds are apparently a key piece of a solution, it certainly seems that if Bayesian inference was immune to mesa-optimizers it would be because of competence not resource-prodigality. Here's another tack. Hypothesis generation seems like a crucial part of intelligent prediction. Pure Bayesian reasoning does hypothesis generation by brute force. Suppose it was inefficiency, and not intelligence, that made Bayesian reasoning avoid mesa-optimizers. Then suppose we had a Bayesian reasoning that was equally intelligent but more efficient, by only generating relevant hypotheses. It gains efficiency by not even bothering to consider a bunch of obviously wrong models, but it's posterior is roughly the same, so it should avoid inner alignment failure equally well. If, on the other hand, the hyopthesis generation routine was bad enough that some plausible hypotheses went unexamined, this could introduce an inner alignment failure, with a notable decrease in intelligence. I expect some heuristic search with no discernable structure, guided "attentively" by an agent. And I expect this heuristic search through models to identify any model that a human could hypothesize, and many more.
Clarifying inner alignment terminology

Good point—and I think that the reference to intent alignment is an important part of outer alignment, so I don't want to change that definition. I further tweaked the intent alignment definition a bit to just reference optimal policies rather than outer alignment.

Clarifying inner alignment terminology

Maybe the best thing to use here is just the same definition as I gave for outer alignment—I'll change it to reference that instead.

2Joe Carlsmith3moAren't they now defined in terms of each other? "Intent alignment: An agent is intent aligned [https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6] if its behavioral objective [https://intelligence.org/learned-optimization/#glossary] is outer aligned. Outer alignment: An objective functionris outer aligned [https://www.lesswrong.com/posts/33EKjmAdKFn3pbKPJ/outer-alignment-and-imitative-amplification] if all models that perform optimally onrin the limit of perfect training and infinite data are intent aligned."
Formal Solution to the Inner Alignment Problem

I mean, I don't think I'm “redefining” inner alignment, given that I don't think I've ever really changed my definition and I was the one that originally came up with the term (inner alignment was due to me, mesa-optimization was due to Chris van Merwijk). I also certainly agree that there are “more than just inner alignment problems going on in the lack of worst-case guarantees for deep learning/evolutionary search/etc.”—I think that's exactly the point that I'm making, which is that while there are other issues, inner alignment is what I'm most concerned about. That being said, I also think I was just misunderstanding the setup in the paper—see Rohin's comment on this chain.

Formal Solution to the Inner Alignment Problem

Hmmm... I think I just misunderstood the setup then. It seemed to me like the Bayesian updating was supposed to represent the model rather than the training process, but under the framing that you just said, I agree that it's relevant. It's worth noting that I do think most of the danger lies in the ways in which gradient descent is not a perfect Bayesian, but I agree that modeling gradient descent as Bayesian is certainly relevant.

Formal Solution to the Inner Alignment Problem

Regardless of how you define inner alignment, I think that the vast majority of the existential risk comes from that “broader issue” that you're pointing to of not being able to get worst-case guarantees due to using deep learning or evolutionary search or whatever. That leads me to want to define inner alignment to be about that problem—and I think that is basically the definition we give in Risks from Learned Optimization, where we introduced the term. That being said, I do completely agree that getting a better understanding of deep learning is likely to be critical.

5michaelcohen3moIf the inner alignment problem did not exist for perfect Bayesians, but did exist for neural networks, then it would appear to be a regime where more intelligence makes the problem go away. If the inner alignment problem were ~solved for perfect Bayesians, but unsolved for neural networks, I think there's still some of the flavor of that regime, but we do have to be pretty careful to make sure we're applying the same sort of solution to the non-Bayesian algorithms. I think in Vanessa's comment above, she's suggesting this looks doable. Note the method here of avoiding mesa-optimizers: error bounds. Neural networks don't have those. Naturally, one way to make mesa-optimizer-deceptively-selected-errors go away is just to have better learning algorithms that make errors go away. Algorithms like Gated Linear Networks [https://arxiv.org/abs/2006.05964] with proper error bounds may be a safer building block for AGI. But none of this takes away from the fact that it is potentially important to figure out how to avoid mesa-optimization in neural networks, and I would add to your claim that this is a much harder setting; I would say it's a harder setting because of the non-existence of error bounds.

I think that the vast majority of the existential risk comes from that “broader issue” that you're pointing to of not being able to get worst-case guarantees due to using deep learning or evolutionary search or whatever. That leads me to want to define inner alignment to be about that problem...

[Emphasis added.] I think this is a common and serious mistake-pattern, and in particular is one of the more common underlying causes of framing errors. The pattern is roughly:

  • Notice cluster of problems X which have a similar underlying causal pattern Cause(X)
  • Notice
... (read more)
Formal Solution to the Inner Alignment Problem

Edit: I no longer endorse this comment; see this comment instead.

I think you've just assumed the entire inner alignment problem away. You assume that the model—the imitation learner—is a perfect Bayesian, rather than just some trained neural network or other function approximator (edit: I was just misinterpreting here and the training process is supposed to Bayesian rather than the model, see Rohin's comment below). The entire point of the inner alignment problem, though, is that you can't guarantee that your trained model is actually a perfect Bayesian—or... (read more)

While I share your position that this mostly isn't addressing the things that make inner alignment hard / risky in practice, I agree with Vanessa that this does not assume the inner alignment problem away, unless you have a particularly contorted definition of "inner alignment".

There's an optimization procedure (Bayesian updating) that is selecting models (the model of the demonstrator) that can themselves be optimizers, and you could get the wrong one (e.g. the model that simulates an alien civilization that realizes it's in a simulation and predicts well... (read more)

I think this is completely unfair. The inner alignment problem exists even for perfect Bayesians, and solving it in that setting contributes much to our understanding. The fact we don't have satisfactory mathematical models of deep learning performance is a different problem, which is broader than inner alignment and to first approximation orthogonal to it. Ideally, we will solve this second problem by improving our mathematical understanding of deep learning and/or other competitive ML algorithms. The latter effort is already underway by researchers unrel... (read more)

Distinguishing claims about training vs deployment

It’s a little unclear what "orthogonal" means for processes; here I give a more precise statement. Given a process for developing an intelligent, goal-directed system, my version of the process orthogonality thesis states that:

  • The overall process involves two (possibly simultaneous) subprocesses: one which builds intelligence into the system, and one which builds goals into the system.
  • The former subprocess could vary greatly how intelligent it makes the system, and the latter subprocess could vary greatly which goals it specifies, without significantly
... (read more)
2Richard Ngo3moRe counterfactual impact: the biggest shift came from talking to Nate at BAGI, after which I wrote this post on disentangling arguments about AI risk [https://www.alignmentforum.org/posts/JbcWQCxKWn3y49bNB/disentangling-arguments-for-the-importance-of-ai-safety] , in which I identified the "target loading problem". This seems roughly equivalent to inner alignment, but was meant to avoid the difficulties of defining an "inner optimiser". At some subsequent point I changed my mind and decided it was better to focus on inner optimisers - I think this was probably catalysed by your paper, or by conversations with Vlad which were downstream of the paper. I think the paper definitely gave me some better terminology for me to mentally latch onto, which helped steer my thoughts in more productive directions. Re 2d robustness: this is a good point. So maybe we could say that the process orthogonality thesis is somewhat true, in a "spherical cow" sense. There are some interventions that only affect capabilities, or only alignment. And it's sometimes useful to think of alignment as being all about the reward function, and capabilities as involving everything else. But as with all spherical cow models, this breaks down when you look at it closely - e.g. when you're thinking about the "curriculum" which an agent needs to undergo to become generally intelligent. Does this seem reasonable? Also, I think that many other people believe in the process orthogonality thesis to a greater extent than I do. So even if we don't agree about how much it breaks down, if this is a convenient axis which points in roughly the direction on which we disagree, then I'd still be happy about that.
Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

Np! Also, just going through the rest of the proposals in my 11 proposals paper, I'm realizing that a lot of the other proposals also try to avoid a full agency hand-off. STEM AI restricts the AI's agency to just STEM problems, narrow reward modeling restricts individual AIs to only apply their agency to narrow domains, and the amplification and debate proposals are trying to build corrigible question-answering systems rather than do a full agency hand-off.

Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

I would very much like to see proposals for AI alignment that escape completely from the assumption that we are going to hand off agency to AI.

Microscope AI (see here and here) is an AI alignment proposal that attempts to entirely avoid agency hand-off.

I also agree with Rohin's comment that Paul-style corrigibility is at least trying to avoid a full agency hand-off, though it still has significantly more of an agency hand-off than something like microscope AI.

3Alex Flint4moThanks for this
The Case for a Journal of AI Alignment

I think this is a great idea and would be happy to help in any way with this.

1Adam Shimi4moThanks. I'm curious about what you think of Ryan's position [https://www.lesswrong.com/posts/hNNM6gP5yZcHffmpD/the-case-for-a-journal-of-ai-alignment?commentId=AAaNpyAkuY8XdZPDo] or Rohin's position [href]?
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

My understanding is that floating-point granularity is enough of a real problem that it does sometimes matter in realistic ML settings, which suggests that it's probably a reasonable level of abstraction on which to analyze neural networks (whereas any additional insights from an electromagnetism-based analysis probably never matter, suggesting that's not a reasonable/useful level of abstraction).

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

The Levin bound doesn't apply directly to neural networks, because it assumes that P is finite and discrete, but it gives some extra backing to the intuition above.

In what sense is the parameter space of a neural network not finite and discrete? It is often useful to understand floating-point values as continuous, but in fact they are discrete such that it seems like a theorem which assumes discreteness would still apply.

3Joar Skalse4moYes, it does of course apply in that sense. I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what's going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again). I'm not sure if accounting for the floating-point implementation is informative or not in this case.
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I'm not sure if I count as a skeptic, but at least for me the only part of this that I find confusing is SGD not making a difference over random search. The fact that simple functions take up a larger volume in parameter space seems obviously true to me and I can't really imagine anyone disagreeing with that part (though I'm still quite glad to have actual analysis to back that up).

4Daniel Kokotajlo1dPinging you to see what your current thoughts are! I think that if "SGD is basically equivalent to random search" then that has huge, huge implications.
Load More