Obviously I think it's worth being careful, but I think in general it's actually relatively hard to accidentally advance capabilities too much by working specifically on alignment. Some reasons:
I think it's worth disentangling LLMs and Transformers and so on in discussions like this one--they are not one and the same. For instance, the following are distinct positions that have quite different implications:
Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when it's definitely not heavy tailed it's monotonic, for Regressional Goodhart (https://arxiv.org/pdf/2210.10760.pdf#page=17). Jacob probably has more detailed takes on this than me.
In any event my intuition is this seems unlikely to be the main reason for overoptimization - I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent.
Pointing at some of the same things: https://www.lesswrong.com/posts/ktJ9rCsotdqEoBtof/asot-some-thoughts-on-human-abstractions
re:1, yeah that seems plausible, I'm thinking in the limit of really superhuman systems here and specifically pushing back against a claim that these human abstractions being somehow inside a superhuman AI is sufficient for things to go well.
re:2, one thing is that there are ways of drifting that we would endorse using our meta-ethics, and ways that we wouldn't endorse. More broadly, the thing I'm focusing on in this post is not really about drift over time or self improvement; in the setup I'm describing, the thing that goes wrong is it does the classical ...
one man's modus tollens is another man's modus ponens:
"making progress without empirical feedback loops is really hard, so we should get feedback loops where possible" "in some cases (i.e close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard"
Yeah something in this space seems like a central crux to me.
I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused"), that it's not crazy to look at the situation and say "okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops."
I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")
Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.
This seems wrong. I think the mistake you're making is when you argue that because there's some chance X happens at each step and X is an absorbing state, therefore you have to end up at X eventually. However, this is only true if you assume the conclusion and claim that the prior probability of luigis is zero. If there is some prior probability of a luigi, each non-waluigi step incre...
Agreed. To give a concrete toy example: Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}. If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi. The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).
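The toy example above can be checked directly. This is a minimal sketch using exact fractions; the function names are mine, and it assumes a perfect Bayesian predictor with exactly the 50/50 prior and output distributions stated in the example.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def posterior_luigi(n_a: int, prior_luigi: Fraction = HALF) -> Fraction:
    """Posterior probability of Luigi after observing n_a consecutive "A" outputs."""
    like_luigi = Fraction(1)       # Luigi always outputs "A"
    like_waluigi = HALF ** n_a     # Waluigi outputs "A" half the time
    numerator = prior_luigi * like_luigi
    return numerator / (numerator + (1 - prior_luigi) * like_waluigi)

def prob_b_within(n_steps: int, prior_luigi: Fraction = HALF) -> Fraction:
    """Probability that at least one "B" shows up in the first n_steps outputs."""
    # A "B" can only ever come from Waluigi, who avoids it n_steps times w.p. (1/2)^n_steps.
    return (1 - prior_luigi) * (1 - HALF ** n_steps)

for n in (0, 1, 2, 5, 10, 20):
    p_luigi = posterior_luigi(n)
    p_next_b = (1 - p_luigi) * HALF  # chance the next output is "B" given n "A"s so far
    print(n, float(p_luigi), float(p_next_b), float(prob_b_within(n)))
```

The posterior odds double with each "A", the chance of the next output being "B" shrinks towards zero, and the cumulative chance of ever seeing a "B" approaches but never exceeds the 50% prior on Waluigi.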
This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
However, this trick won't solve the problem. The LLM will print the correct answer if it trusts the flattery about Jane, and it will trust the flattery about Jane if the LLM trusts that the story is "super-duper definitely 100% true and factual". But why would the LLM trust that sentence?
There's a fun connection to ELK here. Suppose you see this and decide: "ok forget trying to describe in natural language that it's definitely 100% true and factual. What if we just add a special token that I prepend to indicate '100% true and factual, for...
But like, why?
I think maybe the crux is the part about the strength of the incentives towards doing capabilities. From my perspective it generally seems like this incentive gradient is pretty real: getting funded for capabilities is a lot easier, it's a lot more prestigious and high status in the mainstream, etc. I also myself viscerally feel the pull of wishful thinking (I really want to be wrong about high P(doom)!) and spend a lot of willpower trying to combat it (but also not so much that I fail to update where things genuinely are not as bad as I would expect, but also not allowing that to be an excuse for wishful thinking, etc...).
"This [model] is zero evidence for the claim" is a roughly accurate view of my opinion. I think you're right that epistemically it would have been much better for me to have said something along those lines. Will edit something into my original comment.
Exponentials are memoryless. If you advance an exponential to where it would be one year from now, then some future milestone (like "level of capability required for doom") appears exactly one year earlier. [...]
Errr, I feel like we already agree on this point? Like I'm saying almost exactly the same thing you're saying; sorry if I didn't make it prominent enough:
...
It happens to be false in the specific model of moving an exponential up (if you instantaneously double the progress at some point in time, the deadline moves one doubling-time closer, but the tot
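For concreteness, here is a rough numeric check of the pure-exponential model being discussed. The doubling time, starting level and threshold are made-up numbers chosen only so the arithmetic is easy to follow; nothing here is an estimate of anything real.

```python
import math

def time_to_threshold(current_level: float, threshold: float, doubling_time: float) -> float:
    """Years until an exponential starting at current_level reaches threshold."""
    return doubling_time * math.log2(threshold / current_level)

# Hypothetical parameters, for illustration only.
DOUBLING_TIME = 2.0   # years per doubling
START = 1.0           # capability level today
THRESHOLD = 1024.0    # milestone level (10 doublings away)

baseline = time_to_threshold(START, THRESHOLD, DOUBLING_TIME)

# "Advance the exponential to where it would be one year from now."
one_year_ahead = START * 2 ** (1.0 / DOUBLING_TIME)
advanced = time_to_threshold(one_year_ahead, THRESHOLD, DOUBLING_TIME)

# "Instantaneously double the progress."
doubled = time_to_threshold(2 * START, THRESHOLD, DOUBLING_TIME)

print(baseline - advanced)  # the milestone arrives about one year earlier
print(baseline - doubled)   # the milestone arrives about one doubling time earlier
```

In this model the size of the shift does not depend on when the jump happens, which is the sense in which the exponential is memoryless.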
Not OP, just some personal takes:
That's not small!
To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold. (To be clear, I don't think the exponential model presented provides evidence for this latter claim)
I think the first claim is obviously true. The second claim could be true in practice, though I feel quite uncertain about this. It happens to be false in the specific model of moving an ex...
I agree with the general point here but I think there's an important consideration that makes the application to RL algorithms less clear: wireheading is an artifact of embeddedness, and most RL work is in the non-embedded setting. Thus, it seems plausible that the development of better RL algorithms does in fact lead to the development of algorithms that would, if they were deployed in an embedded setting, wirehead.
I think of mesaoptimization as primarily being concerning because it would mean models (selected using amortized optimization) doing their own direct optimization, and the extent to which the model is itself doing its own "direct" optimization vs just being "amortized" is what I would call the optimizer-controller spectrum (see this post also).
Also, it seems kind of inaccurate to declare that (non-RL) ML systems are fundamentally amortized optimization and then to say things like "more computation and better algorithms should improve safety and the pr...
I expect that the key externalities will be borne by society. The main reason for this is I expect deceptive alignment to be a big deal. It will at some point be very easy to make AI appear safe, by making it pretend to be aligned, and very hard to make it actually aligned. Then, I expect something like the following to play out (this is already an optimistic rollout intended to isolate the externality aspect, not a representative one):
We start observing alignment failures in models. Maybe a bunch of AIs do things analogous to shoddy accounting practices. ...
A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well.
Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like "well yes but this is just in a toy environment, and it's a big leap to it taking over the world", but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem ...
It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)
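A toy sketch of the distinction, under heavy assumptions: the "language model" here is just an i.i.d. unigram model with an end-of-sequence probability, and `quality` is a stand-in scoring function I made up (number of distinct words). It says nothing about real LMs; it only shows that the max-likelihood sample and the max-quality sample can come apart.

```python
import itertools
import math

# Hypothetical unigram model: each token is drawn i.i.d., then the sequence ends.
TOKEN_PROBS = {"the": 0.4, "cat": 0.2, "sat": 0.2}
EOS_PROB = 0.2
MAX_LEN = 3  # stand-in for the context length

def likelihood(seq):
    """Probability of emitting exactly this sequence and then stopping."""
    return math.prod(TOKEN_PROBS[tok] for tok in seq) * EOS_PROB

def per_token_likelihood(seq):
    """Length-normalized likelihood (geometric mean over tokens plus EOS)."""
    return likelihood(seq) ** (1.0 / (len(seq) + 1))

def quality(seq):
    """Made-up quality function: number of distinct words used."""
    return len(set(seq))

candidates = [
    seq
    for length in range(1, MAX_LEN + 1)
    for seq in itertools.product(TOKEN_PROBS, repeat=length)
]

most_likely = max(candidates, key=likelihood)
most_likely_normalized = max(candidates, key=per_token_likelihood)
best_quality = max(candidates, key=quality)

print(most_likely)             # a single word
print(most_likely_normalized)  # the same word repeated up to MAX_LEN
print(best_quality)            # something else entirely
```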
I generally agree that coupling is the main thing necessary for gradient hacking. However, from trying to construct gradient hackers by hand, my intuition is that gradient descent is just really good at credit assignment. For instance, in most reasonable architectures I don't think it's possible to have separate subnetworks for figuring out the correct answer and then just adding the coupling by gating it to save negentropy. To me, it seems the only kinds of strategies that could work are ones where the circuits implementing the cognition that decides to s...
I don't think we can even conclude for certain that a lack of measured loglikelihood improvement implies that it won't, though it is evidence. Maybe the data used to measure the behavior doesn't successfully prompt the model to do the behavior, maybe it's phrased in a way the model recognizes as unlikely and so at some scale the model stops increasing likelihood on that sample, etc; as you would say, prompting can show presence but not absence.
Seems like there are multiple possibilities here:
(Mostly just stating my understanding of your take back at you to see if I correctly got what you're saying:)
I agree this argument is obviously true in the limit, with the transistor case as an existence proof. I think things get weird at the in-between scales. The smaller the network of aligned components, the more likely it is to be aligned (obviously, in the limit if you have only one aligned thing, the entire system of that one thing is aligned); and also the more modular each component is (or I guess you would say the better the interfaces between the...
I agree that in practice you would want to point mild optimization at it, though my preferred resolution (for purely aesthetic reasons) is to figure out how to make utility maximizers that care about latent variables, and then make it try to optimize the latent variable corresponding to whatever the reflection converges to (by doing something vaguely like logical induction). Of course the main obstacles are how the hell we actually do this, and how we make sure the reflection process doesn't just oscillate forever.
(Transcribed in part from Eleuther discussion and DMs.)
My understanding of the argument here is that you're using the fact that you care about diamonds as evidence that whatever the brain is doing is worth studying, with the hope that it might help us with alignment. I agree with that part. However, I disagree with the part where you claim that things like CIRL and ontology identification aren't as worthy of being elevated to consideration. I think there exist lines of reasoning from which these fall out naturally as subproblems, and the fact that they fall out ...
(Partly transcribed from a correspondence on Eleuther.)
I disagree about concepts in the human world model being inaccessible in theory to the genome. I think lots of concepts could be accessed, and that (2) is true in the trilemma.
Consider: As a dumb example that I don't expect to actually be the case but which gives useful intuition, suppose the genome really wants to wire something up to the tree neuron. Then the genome could encode a handful of images of trees and then once the brain is fully formed it can go through and search for whichever neuron acti...
Computationally expensive things are less likely to show up in your simulation than the real world, because you only have so much compute to run your simulation. You can't convincingly fake the AI having access to a supercomputer.
The possibility is that Alice might always be able to tell that she’s in a simulation no matter what we condition on. I think this is pretty much precluded by the assumption that the generative model is a good model of the world, but if that fails then it’s possible Alice can tell she’s in a simulation no matter what we do. So a lot rides on the statement that the generative model remains a good model of the world regardless of what we condition on.
Paul's RSA-2048 counterexample is an example of a way our generative model can fail to be good enough no matter ...
Liked this post a lot. In particular I think I strongly agree with "Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument" as the general vibe of how I feel about Eliezer's arguments.
A few comments on the disagreements:
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.”
An in-between posit...
Fwiw, I interpreted this as saying that it doesn't work as a safety proposal (see also: my earlier comment). Also seems related to his arguments about ML systems having squiggles.
Yup. You can definitely train powerful systems on imitation of human thoughts, and in the limit this just gets you a powerful mesa-optimizer that figures out how to imitate them.
I agree that the SW/HW analogy is not a good analogy for AGI safety (I think security is actually a better analogy), but I would like to present a defence of the idea that normal systems reliability engineering is not enough for alignment (this is not necessarily a defence of any of the analogies/claims in the OP).
Systems safety engineering leans heavily on the idea that failures happen randomly and (mostly) independently, so that enough failures happening together by coincidence to break the guarantees of the system is rare. That is:
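As a rough numeric illustration of the independence assumption (the failure rates and the common-cause term below are made-up numbers, and the common-cause model is my own simplification, not something from the OP):

```python
def p_all_fail_independent(p_fail: float, n_components: int) -> float:
    """Chance that every redundant component fails, if failures are independent."""
    return p_fail ** n_components

def p_all_fail_common_cause(p_fail: float, n_components: int, p_common: float) -> float:
    """Same, but with probability p_common a single shared cause breaks everything at once."""
    return p_common + (1 - p_common) * p_fail ** n_components

# Hypothetical numbers: three redundant components, each failing 1% of the time.
independent = p_all_fail_independent(0.01, 3)
correlated = p_all_fail_common_cause(0.01, 3, p_common=0.001)

print(independent)  # on the order of one in a million
print(correlated)   # dominated by the shared cause, roughly one in a thousand
```

Redundancy buys a lot when failures are coincidences, and very little once something makes the failures happen together.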
Some quick thoughts on these points:
A few axes along which to classify optimizers:
Some observations: it feels l...
One possible model of AI development is as follows: there exists some threshold beyond which capabilities are powerful enough to cause an x-risk, and such that we need alignment progress to be at the level needed to align that system before it comes into existence. I find it informative to think of this as a race where for capabilities the finish line is x-risk-capable AGI, and for alignment this is the ability to align x-risk-capable AGI. In this model, it is necessary but not sufficient for alignment to be ahead by the time it's at the fini...
Looking forward to seeing the survey results!
By the way, if you're an alignment researcher and compute is your bottleneck, please send me a DM. EleutherAI already has a lot of compute resources (as well as a great community for discussing alignment and ML!), and we're very interested in providing compute for alignment researchers with minimal bureaucracy required.
I agree that there will be cases where we have ontological crises where it's not clear what the answer is, i.e. whether the mirrored dog counts as "healthy". However, I feel like the thing I'm pointing at is that there is some sort of closure of any given set of training examples where, for some fairly weak assumptions, we can know that everything in this expanded set is "definitely not going too far". As a trivial example, anything that is a direct logical consequence of anything in the training set would be part of the completion. I expect any ELK solutions to look something like that. This corresponds directly to the case where the ontology identification process converges to some set smaller than the entire set of all cases.
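To make the "closure under direct logical consequence" example concrete, here is a minimal sketch using forward chaining over simple if-then rules. The facts and rules are placeholder symbols I invented; this only illustrates a closure that converges to a set smaller than everything, and is not a proposal for how an ELK solution would actually work.

```python
def closure(facts, rules):
    """Smallest superset of facts closed under the rules.

    Each rule is a (premises, conclusion) pair: if every premise is known,
    the conclusion is added. Iterates until nothing new can be derived.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known

# Placeholder labelled cases and inference rules, purely for illustration.
training_set = {"a", "b"}
rules = [
    (frozenset({"a"}), "c"),       # c follows from a
    (frozenset({"b", "c"}), "d"),  # d follows from b and c
    (frozenset({"e"}), "f"),       # f would follow from e, but e is never established
]

completion = closure(training_set, rules)
print(sorted(completion))
```

Here `f` stays outside the completion: the process reaches a fixed point without labelling every case.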
My understanding of the argument: if we can always come up with a conservative reporter (one that answers yes only when the true answer is yes), and this reporter can label at least one additional data point that we couldn't label before, we can use this newly expanded dataset to pick a new reporter, feed this process back into itself ad infinitum to label more and more data, and the fixed point of iterating this process is the perfect oracle. This would imply an ability to solve arbitrary model splintering problems, which seems like it would need to eithe...
A GLUT can have constant time complexity using a hash table, which makes it a lot less clear that metalearning can be faster
From a zoomed-out perspective, the model is not modifying the loss landscape. This frame, however, does not give us a useful way of thinking about how gradient hacking might occur and how to avoid it.
I think that the main value of the frame is to separate out the different potential ways gradient hacking can occur. I've noticed that in discussions without this distinction, it's very easy to equivocate between the types, which leads to frustrating conversations where people fundamentally disagree without realizing (i.e. someone might be talking about s...
My attempt at a one sentence summary of the core intuition behind this proposal: if you can be sure your model isn’t optimizing for deceiving you, you can relatively easily tell if it’s trying to optimize for something you don’t want by just observing whether your model seems to be trying to do something obviously different from what you want during training, because it's much harder to slip under the radar by getting really lucky than by intentionally trying to.
Here's a hand crafted way of doing gradient protection in this case I can think of: since these models are blocks of linear->bn(affine)->relu, if you make the beta in the affine really small, you can completely zero out the output of that block and then the rest of the model can only learn a constant function. You can also get around L2: just set e.g. gamma to 0.00001 and beta to -0.01; this lets you have both really small parameter magnitudes and also still saturate the relu. As this model is trained on the base objective it should converge to a cons...
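A minimal sketch of the gamma/beta trick, in plain Python rather than a real framework. It assumes the inputs to the affine step are already batch-normalized values (so they are typically within a few units of zero), and the helper names are mine.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def bn_affine_relu(normalized_inputs, gamma: float, beta: float):
    """Affine step of batchnorm followed by relu, applied to already-normalized values."""
    return [relu(gamma * x_hat + beta) for x_hat in normalized_inputs]

# The values from the comment: tiny magnitudes, but the relu stays saturated at zero.
gamma, beta = 0.00001, -0.01

# Normalized activations are usually within a few standard deviations of zero;
# 100.0 is included as an implausibly extreme case.
x_hats = [-3.0, -1.0, 0.0, 1.0, 3.0, 100.0]
outputs = bn_affine_relu(x_hats, gamma, beta)

l2_penalty = gamma ** 2 + beta ** 2

print(outputs)      # every output is zero, so downstream layers see a constant
print(l2_penalty)   # tiny, so weight decay has almost nothing to shrink
```

With these values the pre-activation only turns positive once the normalized input exceeds 1000, which a batch-normalized activation essentially never does. Since the relu is inactive on every input, its local derivative is zero there too, so in this toy picture no gradient flows back through the block.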
The ultimate goal of this project is to exhibit a handcrafted set of model weights for a reasonably noncontrived model architecture which, when tuned with SGD, results in some malicious subnetwork in the model learning some mesaobjective that we specified via some section of the model weights completely different from the base objective, without the entire mesaoptimizer falling apart in the process. We haven't succeeded at this goal yet but I would say this goal is very much to exhibit gradient hacking.
I don't think redundancy will work. Suppos...
I think this is something I and many others at EleutherAI would be very interested in working on, since it seems like something that we'd have a uniquely big comparative advantage at.
One very relevant piece of infrastructure we've built is our evaluation framework, which we use for all of our evaluation since it makes it really easy to evaluate your task on GPT-2/3/Neo/NeoX/J etc. We also have a bunch of other useful LM related resources, like intermediate checkpoints for GPT-J-6B that we are looking to use in our interpretability work, for example. ...
Awesome work! I like the autoencoder approach a lot.