I don't follow. Can't races to the bottom destroy all value for the agents involved?
You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?
This doesn't seem analogous to producers driving down profits to zero, because those firms had no other opportunity to make a profit with their machines. It's as if you're saying: there are tons of countries where firms could use their machines to make stuff and sell it at a profit (more countries than firms), but some of the places are more attractive than others, so probably everyone will sell in those places and drive profits to zero. And I'm saying: but then aren't those incredibly congested countries actually worse places to sell? This scenario is only possible if firms are making so much stuff that they can drive profit down to zero in every country, since any country with remaining profit would necessarily be the best place to sell.
Yeah, okay. This is my mainline opinion, and I just think about inner alignment in case it's wrong.
It seems to me like this is clearly wrong in the limit (since simple consequentialists would take over simple physics). It also seems probably wrong to me for smaller models (i.e. without huge amounts of internal selection) but it would be good for someone to think about that case more seriously.
It's occurring to me that this question doesn't matter to our present discussion. What makes the linear regime linear rather than logarithmic is that if p(treachery)/p(honest model) is high, that allows a large number of treacherous models to have greater posterior weight than the truth. But if a single model has n times the posterior weight of the truth, it still only takes one query to the demonstrator to interrupt its treacherous turn, not n queries.
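A toy sketch of that point (my own illustrative setup, not the actual algorithm under discussion): a treacherous model can start with n times the truth's posterior weight, but a single query at its moment of deviation zeroes it out, so the cost of the extra weight is one query, regardless of n.

```python
def queries_to_stop(treacherous_weight, defect_step=50, horizon=100):
    # Posterior weights over two deterministic models; the treacherous
    # model mimics the truth until defect_step, then deviates.
    weights = {"truth": 1.0, "treacherous": float(treacherous_weight)}
    queries = 0
    for t in range(horizon):
        demo_action = 0  # the demonstrator always plays action 0 here
        actions = {"truth": 0,
                   "treacherous": 0 if t < defect_step else 1}
        live = [m for m, w in weights.items() if w > 0]
        if len({actions[m] for m in live}) > 1:
            queries += 1  # surviving models disagree: ask the demonstrator
            for m in live:
                if actions[m] != demo_action:
                    weights[m] = 0.0  # deterministic mismatch -> weight 0
    return queries

# one query suffices whether the treacherous model has weight 1 or 2^100
assert queries_to_stop(1) == queries_to_stop(2**100) == 1
```

The posterior ratio affects how long the treacherous model can dominate predictions, but not how many demonstrator queries it takes to eliminate it once it deviates.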
if the treacherous model is 100 bits smaller, then it feels like there must be around 2^100 treacherous models that are all simpler than the intended one. If nothing else, it seems like you could literally add garbage bits to the treacherous models (or useful bits!).
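The counting argument can be made concrete with back-of-the-envelope arithmetic (illustrative description lengths and an assumed 2^-length prior, not real measurements):

```python
# Illustrative description lengths (not real models): the intended model is
# 100 bits longer than the shortest treacherous one.
L_intended, L_treacherous = 1000, 900
prior = lambda bits: 2.0 ** -bits  # assumed 2^-length prior over programs

# the lone short treacherous model already has 2^100 times the truth's prior
assert prior(L_treacherous) / prior(L_intended) == 2.0 ** 100

# and there are 2^100 distinct 100-bit paddings, each yielding a program
# exactly as long as the intended model, so the padded family alone matches
# the shortest treacherous model's total prior mass
n_padded = 2 ** (L_intended - L_treacherous)
assert n_padded * prior(L_intended) == prior(L_treacherous)
```

So a 100-bit gap doesn't just mean one model beats the truth by a factor of 2^100; it means roughly 2^100 distinct variants are each at least as simple as the truth.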
I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade-off. And what's traded off is a public good: among treacherous models, it is a public good for their moments of treachery to be spread out.
Trying to defect at time T is only a good idea if it's plausible that your mechanism isn't going to notice the uncertainty at time T and then query the human. So it seems to me like this argument can never drive P(successful treachery) super low, or else it would be self-defeating (for competent treacherous agents).
The treacherous model has to compute the truth, and then also decide when and how to execute treachery. So the subroutine they run to compute the truth, considered as a model in its own right, seems to me like it must be simpler.
Subroutines of functions aren't always simpler. Even without treachery this seems like it's basically guaranteed. If "the truth" is just a simulation of a human, then the truth is a subroutine of the simplest model. But instead you could have a model of the form "Run program X, treat its output as a program, then run that." Since humans are compressible that will clearly be smaller.
Now you want to say that whatever reasoning a treacherous agent does in order to compress the human, you can just promote that to the outside model as well. But that argument seems to be precisely where the meat is (and it's the kind of thing I spend most of my time on). If that works, then it seems like you don't even need a further solution to inner alignment, just use the simplest model.
(Maybe there is a hope that the intended model won't be "much" more complex than the treacherous models, without literally saying that it's the simplest, but at that point I'm back to wondering whether "much" is like 0.01% or 0.00000001%.)
There are a bunch of things that differ between part I and part II; I believe they are correlated with each other, but not at all perfectly. In the post I intend to illustrate what I believe some plausible failures look like, in a way meant to capture a bunch of the probability space. I'm illustrating these kinds of bad generalizations and ways in which the resulting failures could be catastrophic. I don't really know what "making the claim" means, but I would say that any ways in which the story isn't realistic are interesting to me (and we've already discussed many, and my views have---unsurprisingly!---changed considerably in the details over the last 2 years), whether they are about the generalizations or the impacts.
I do think that the "going out with a whimper" scenario may ultimately transition into something abrupt, unless people don't have their act together enough to even put up a fight (which I do think is fairly likely conditioned on catastrophe, and may be the most likely failure mode).
It seems like you at least need to explain why, in that situation, we can't continue to work on the alignment problem and replace the agents with better-aligned AI systems in the future.
We can continue to work on the alignment problem and continue to fail to solve it, e.g. because the problem is very challenging or impossible or because we don't end up putting in a giant excellent effort (e.g. if we spent a billion dollars a year on alignment right now it seems plausible it would be a catastrophic mess of people working on irrelevant stuff, generating lots of noise while we continue to make important progress at a very slow rate).
The most important reason this is possible is that change is accelerating radically, e.g. I believe that it's quite plausible we will not have massive investment in these problems until we are 5-10 years away from a singularity and so just don't have much time.
If you are saying "Well why not wait until after the singularity?" then yes, I do think that eventually it doesn't look like this. But that can just look like people failing to get their act together, and then eventually when they try to replace deployed AI systems they fail. Depending on how generalization works that may look like a failure (as in scenario 2) or everything may just look dandy from the human perspective because they are now permanently unable to effectively perceive or act in the real world (especially off of earth). I basically think that all bets are off if humans just try to sit tight while an incomprehensible AI world-outside-the-gates goes through a growth explosion.
I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. Moreover, the only reason it looks that way to me is that I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in growth is quite likely), and the story will justifiably sound very different to someone who isn't coming in with those assumptions.
I think this is doable with this approach, but I haven't proven it can be done, let alone said anything about a dependence on epsilon. The closest bound I show not only has a constant factor of like 40; it depends on the prior on the truth too. I think (75% confidence) this is a weakness of the proof technique, not a weakness of the algorithm.
I just meant the dependence on epsilon; it seems like there are unavoidable additional factors (especially the linear dependence on p(treachery)). I guess it's not obvious whether you can make these additive or whether they are inherently multiplicative.
But your bound scales in some way, right? How much training data do I need to get the KL divergence between distributions over trajectories down to epsilon?
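For intuition on that scaling question, here is the standard regret bound for a Bayesian mixture (a background folklore fact, not the paper's actual theorem): cumulative log-loss regret against the truth is at most log(1/prior(truth)), so driving the average per-step KL below epsilon takes on the order of log(1/prior(truth))/epsilon steps.

```python
import math

def steps_for_avg_kl(prior_truth, eps):
    # Bayesian mixture: cumulative KL regret <= log(1/prior(truth)),
    # so after T steps the average per-step KL is <= log(1/prior)/T;
    # solving for the first T with average KL <= eps gives:
    return math.ceil(math.log(1.0 / prior_truth) / eps)

# with a 2^-100 prior on the truth, pushing average per-step KL below 0.01
# takes roughly 100*ln(2)/0.01 ~ 7000 steps; the dependence on eps itself
# is the linear 1/eps factor
print(steps_for_avg_kl(2.0 ** -100, 0.01))
```

This is only the generic prediction bound; the paper's query-complexity question layers the p(treachery) factors discussed above on top of it.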
I understand that the practical bound is going to be logarithmic "for a while" but it seems like the theorem about runtime doesn't help as much if that's what we are (mostly) relying on, and there's some additional analysis we need to do. That seems worth formalizing if we want to have a theorem, since that's the step that we need to be correct.
There is at most a linear cost to this ratio, which I don't think screws us.
If our models are a trillion bits, then it doesn't seem that surprising to me if it takes 100 bits extra to specify an intended model relative to an effective treacherous model, and if you have a linear dependence that would be unworkable. In some sense it's actively surprising if the very shortest intended vs treacherous models have description lengths within 0.0000001% of each other unless you have a very strong skills vs values separation. Overall I feel much more agnostic about this than you are.
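The arithmetic behind those figures (using the illustrative numbers from the discussion): a 100-bit gap on a trillion-bit model is a vanishing relative difference in description length, yet decisive in prior odds under a 2^-length prior.

```python
model_bits = 10 ** 12   # "a trillion bits"
gap_bits = 100          # extra bits to specify the intended model

relative_gap = gap_bits / model_bits
assert relative_gap == 1e-10   # i.e. 0.00000001% of the description length
# yet under an (assumed) 2^-length prior, the same 100 bits is a 2^100
# factor in prior odds against the intended model -- tiny relative gaps in
# description length are not tiny in posterior terms
```
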
There are two reasons why I expect that to hold.
This doesn't seem like it works once you are selecting on competent treacherous models. Any competent model will err very rarely (with probability less than 1 / (feasible training time), probably much less). I don't think that (say) 99% of smart treacherous models would make this particular kind of incompetent error?
I'm not convinced this logarithmic regime ever ends.
It seems like it must end if there are any treacherous models (i.e. if there was any legitimate concern about inner alignment at all).
I haven't read the paper yet, looking forward to it. Using something along these lines to run a sufficiently-faithful simulation of HCH seems like a plausible path to producing an aligned AI with a halting oracle. (I don't think that even solves the problem given a halting oracle, since HCH is probably not aligned, but I still think this would be noteworthy.)
First I'm curious to understand this main result so I know what to look for and how surprised to be. In particular, I have two questions about the quantitative behavior described here:
if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.
It seems like you have the following tough case even if the human is deterministic:
In this setting, it seems like I have no hope other than to query the human on all N decisions (since all days and hypotheses are symmetrical), so I assume that this is what your algorithm would do.
That strongly suggests that the number of queries to the human goes as 1/p(correct demonstrator), unless you use some other feature of the hypothesis class. But p(correct demonstrator) is probably less than 2^(-10^14), so this constant might not be acceptable. Usually we try to have a logarithmic dependence on p(correct demonstrator), but this doesn't seem possible here.
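The tough case can be made concrete with a toy simulation (my own sketch of the symmetric setup, with an assumed hypothesis class, not the actual algorithm): under a uniform 1/N prior, every day looks the same ex ante, so the imitator has to query on essentially every day, and the query count grows like 1/p(correct demonstrator) rather than log(1/p).

```python
# Toy version of the symmetric case: hypothesis h follows a default policy
# except on day h, where it deviates. The truth is one of N such hypotheses,
# so p(correct demonstrator) is the uniform prior 1/N.
def queries_needed(N, truth_idx):
    live = set(range(N))
    queries = 0
    for day in range(N):
        predictions = {h: int(h == day) for h in live}  # 1 = deviate today
        if len(set(predictions.values())) > 1:
            queries += 1  # live hypotheses disagree about today: must query
            demo = int(day == truth_idx)  # the true demonstrator's action
            live = {h for h in live if predictions[h] == demo}
    return queries

N = 64  # so p(correct demonstrator) = 1/64
assert queries_needed(N, N - 1) == N - 1            # worst case: ~N queries
avg = sum(queries_needed(N, i) for i in range(N)) / N
assert avg > N / 2 - 1                              # average ~N/2: Theta(1/p)
```

Since the hypotheses are exchangeable, no algorithm can do better than linear in N here without exploiting extra structure in the class.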
So as you say, we'd want to condition on some relevant facts to get up to the point where that probability might be acceptably high. So then it seems like we have two problems:
Does that all seem right?
Suppose that I want to bound the probability of catastrophe as (1+ϵ) times the demonstrator probability of catastrophe. It seems like the number of human queries must scale at least like 1/ϵ. Is that right, and if so what's the actual dependence on epsilon?
I mostly ask about this because in the context of HCH we may need to push epsilon down to 1/N. But maybe there's some way to avoid that by considering predictors that update on counterfactual demonstrator behavior in the rest of the tree (even though the true demonstrator does not), to get a full bound on the relative probability of a tree under the true demonstrator vs model. I haven't thought about this in years and am curious if you have a take on the feasibility of that or whether you think the entire project is plausible.
This doesn't seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals.
But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for "does what we care about" goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are likely to generalize, but it's not clear that's an important part of the default plan (whereas I think we will clearly extensively leverage "try several strategies and see what works").
But if that's the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.
"Do things that look to a human like you are achieving X" is closely related to X, but that doesn't mean that learning to do the one implies that you will learn to do the other.
Maybe it's helpful to imagine a world where type 1 feedback is "human evals after a 1 week horizon," type 2 feedback is "human evals after a 1 year horizon," and "what we really care about" is "human evals after a 100 year horizon." I think that's much better than the actual situation, but even in that case I'd put significant probability on getting systems that work on the 1 year horizon without working indefinitely (especially if we select for working on 2 years and are able to use a small amount of 2 year data). Do you feel pretty confident that something that generalizes from 1 week to 1 year will go on indefinitely, or is your intuition predicated on something about the nature of "be helpful" and how that's a natural motivation for a mind? (Or maybe that we will be able to identify some other similarly "natural" motivation and design our training process to be aligned with that?) In the former case, it seems like we can have an empirical discussion about how generalization tends to work. In the latter case, it seems like we need to get into more detail about why "be helpful" is particularly natural (or else why we should be able to pick out something else like that). Otherwise, I think I haven't fully internalized your view.
I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedly, it seems like really all you need is to train predictors for type 2 feedback (in order to use those predictions for training/planning), and that the relevant prediction problems often seem much easier than the actual sophisticated behaviors we are interested in.
Another important part of my view about type 1 ~ type 2 is that if gradient descent handles the scale from [1 second, 1 month] then it's not actually very far to get from [1 month, 2 years]. It seems like we've already come 6 orders of magnitude and now we are talking about generalizing 1 more order of magnitude.
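As a sanity check on the orders-of-magnitude arithmetic (taking the horizons at face value): from 1 second to 1 month is about 6.4 orders of magnitude, while from 1 month to 2 years is about 1.4 more.

```python
import math

second = 1
month = 60 * 60 * 24 * 30            # ~2.59e6 seconds
two_years = 60 * 60 * 24 * 365 * 2   # ~6.31e7 seconds

spanned = math.log10(month / second)       # orders already handled: ~6.4
remaining = math.log10(two_years / month)  # orders still to go: ~1.4
print(round(spanned, 1), round(remaining, 1))
```

So the claim is roughly that gradient descent has already bridged about six orders of magnitude of horizon length, and the remaining extrapolation is a bit over one more.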
At a higher level, I feel like the important thing is that type 1 and type 2 feedback are going to be basically the same kind of thing but with a quantitative difference (or at least we can set up type 1 feedback so that this is true). On the other hand "what we really want" is a completely different thing (that we basically can't even define cleanly). So prima facie it feels to me like if models generalize "well" then we can get them to generalize from type 1 to type 2, whereas no such thing is true for "what we really care about."
I like the following example:
This seems like a nice relatable example to me---it's not uncommon for someone to offer to bet on a rock paper scissors game, or to offer slightly favorable odds, and it's not uncommon for them to have a slight edge.
Are there features of the boxes case that don't apply in this case, or is it basically equivalent?
Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that needs to be at least as useful as what ML does with D (and needs to not introduce alignment problems in the learned model). In the post I'm kind of vague about that and just wrapping it up into the philosophical assumption that HCH is good, but really we'd want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think even for HCH competitiveness matters because it's needed for HCH to be stable/aligned against internal optimization pressure).