Here are some views, often held in a cluster:
I'm not sure exactly which clusters you're referring to, but I'll just assume that you're pointing to something like "people who aren't very into the sharp left turn and think that iterative, carefully bootstrapped alignment is a plausible strategy." If this isn't what you were trying to highlight, I apologize. The rest of this comment might not be very relevant in that case.
To me, the views you listed here feel like a straw man or weak man of this perspective.
Furthermore, I think the actual crux is more often "prior to having to align systems that are collectively much more powerful than humans, we'll only have to align systems that are somewhat more powerful than humans." This is essentially the crux you highlight in A Case for the Least Forgiving Take On Alignment. I believe disagreements about hands-on experience are quite downstream of this crux: I don't think people with reasonable views (not weak men) believe that "without prior access to powerful AIs, humans will need to align AIs that are vastly, vastly superhuman, but this will be fine because these AIs will need lots of slow, hands-on experience in the world to do powerful stuff (like nanotech)."
So, discussing how well superintelligent AIs can operate from first principles seems mostly irrelevant to this discussion (if by superintelligent AI, you mean something much, much smarter than the human range).
Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.
If training works well, then they can't collude on average during training - only rarely, or in some sustained burst before training crushes these failures.
In particular, in the purely supervised case with gradient descent, performing poorly on average during training requires gradient hacking (or more benign failures of gradient descent, but it's unclear why the goals of the AIs would be particularly relevant in this case).
In the RL case, it requires exploration hacking (or benign failures as in the gradient case).
The only way to prevent this is for the decision maker to assign some probability to all possible actions, regardless of how bad the predicted outcome is. This necessarily means bad outcomes will occur more frequently than they would if they could make deterministic decisions based on honest conditional predictions. We might reasonably say we don’t want to ever randomly take an action that leads to the extinction of humanity with high probability, but if this is true then a predictor can lie about that to dissuade us from any action. Even if we would be willing to take such an action with very small probability in order to get honest conditional predictions, we likely cannot commit to following through on such an action if our randomizer lands on it.
Thinking about this in terms of precommitment seems to me like it's presupposing that the AI perfectly optimizes the training objective in some deep sense (which seems implausible to me). The reason why this exploration procedure works is presumably that you end up selecting such actions frequently during training, which in turn selects for AIs that perform well. Epsilon exploration only works if you actually sample the epsilon, so it doesn't work if you set the epsilon to 1e-40 or something.
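To make the "you have to actually sample the epsilon" point concrete, here's a minimal epsilon-greedy sketch (in Python, with made-up numbers; this is just the textbook mechanism, not anything specific to the proposal above):

```python
import random

def epsilon_greedy(predicted_values, epsilon):
    """Take the argmax action, except explore uniformly with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(predicted_values))
    return max(range(len(predicted_values)), key=lambda a: predicted_values[a])

# With epsilon = 0.05, exploratory actions come up often enough during training
# that a predictor which lies about them gets corrected on average.
# With epsilon = 1e-40, the exploration branch essentially never fires over any
# realistic number of training steps, so no such correction ever happens.
```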
We can't be confident enough that it won't happen to safely rely on that assumption.
I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?
Overall, I think I disagree.
This will depend on the exact bar for safety. This sort of scenario feels 0.1% to 3% likely to me, which is immensely catastrophic in expectation, but there is lower-hanging fruit for danger avoidance elsewhere.
(And for this exact failure mode, I think that improvements upon various relatively straightforward capability evals are likely to be quite compelling as the most leveraged current interventions, but I'm not confident.)
[Sorry for late reply]
Analogously, conditional on things like gradient hacking being an issue at all, I'd expect the "hacker" to treat potential-training-objective-improvement as a scarce resource, which it generally avoids "spending" unless the expenditure will strengthen its own structure. Concretely, this probably looks like mostly keeping itself decoupled from the network's output, except when it can predict the next backprop update and is trying to leverage that update for something.
So it's not a question of performing badly on the training metric so much as a question of not performing as well as it could.
Perhaps I'm just misunderstanding things, but the argument here is that gradient hacking to intentionally perform poorly is easier than gradient hacking to do things like steer the overall training process in some more complex direction.
I agree that this is easier, but it still seems very difficult. It sounds like you're willing to defend the further claim that this is 'reasonably likely'.
I don't expect that to actually be too hard; the subsystem can spend some prediction-power in order to set up the structure needed to decouple the rest of its prediction-power from the output most of the time. (Just like a bacteria expends some negentropy to set up a structure which mostly decouples the rest of its negentropy from the local entropy gradient, so that it does not immediately burst into flame.)
I understand this as meaning "the AI/hacker/subsystem can spend a small/modest 'gradient tax' to permanently and consistently perform poorly on a given task without paying further tax or risking being removed" and "this is pretty plausible".
This still sounds very difficult to me: the AI has to predict how to set up this barrier in a way which is robust to noise from sources like variable minibatches (or even just adding random noise to the weights, which is a countermeasure we could deploy). I think there are some other difficulty considerations as well.
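As a concrete illustration of the weight-noise countermeasure mentioned above, here's a minimal PyTorch-style sketch (the noise scale is a made-up hyperparameter; this is just the obvious version of the idea, not a tested intervention):

```python
import torch

def perturb_weights(model, noise_std=1e-3):
    """Add small Gaussian noise to every parameter in place.

    If applied periodically during training, any gradient-hacking structure
    that relies on precisely tuned weights would have to be robust to this
    perturbation on top of the ordinary noise from variable minibatches.
    """
    with torch.no_grad():
        for param in model.parameters():
            param.add_(torch.randn_like(param) * noise_std)
```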
(Note: this comment is rambly and repetitive, but I decided not to spend time cleaning it up)
It sounds like you believe something like: "There are autonomous-learning-style approaches which are considerably more efficient than next token prediction."
And more broadly, you're making a claim like 'current learning efficiency is very low'.
I agree - brains imply that it's possible to learn vastly more efficiently than deep nets, and my guess would be that performance can be far, far better than brains.
Suppose we instantly went from 'current status quo' to 'AI systems learn like humans learn and with the same efficiency, but with vastly larger memories than humans (current LLMs seem to have vastly better memory, at least for facts and technical minutiae), and vastly longer lifespans than humans (if you think 1 token corresponds to 1 second, then 10 trillion tokens is about 317,000 years!)'. Then, we certainly get an extremely hard FOOM if anyone runs this training!
But this hypothetical just isn't what I expect.
Currently, SOTA deep learning is deeply inefficient in a bunch of different ways. Failing to do open ended autonomous learning to advance a field and then distilling these insights down to allow for future progress is probably one such failure, but I don't think it seems particularly special. Nor do I see a particular reason to expect that advances in open ended flexible autonomous learning will be considerably more jumpy than advances in other domains.
Right now, both supervised next token prediction and fully flexible autonomous learning are far less efficient than theoretical limits and worse than brains. But currently, next token prediction is more efficient than fully flexible autonomous learning (next token prediction plus some other stuff is what's typically used as the main way to train an AI).
Suppose I ask you to spend a week trying to come up with a new good experiment to try in AI. I give you two options.
In this hypothetical, I obviously would pick option B.
But suppose instead that we asked "How would you try to get current AIs (without technical advances) to most efficiently come up with new good experiments to try?"
Then, my guess is that most of the flops go toward next token prediction or a similar objective on a huge corpus of data.
You'd then do some RL(HF) and/or amplification to try and improve further, but this would be a small fraction of overall training.
As AIs get smarter, clever techniques to improve their capabilities further via 'self improvement' will continue to work better and better, but I don't think this clearly will end up being where you spend most of the flops (it's certainly possible, but I don't see a particular reason to expect this - it could go either way).
I agree that 'RL on thoughts' might prove important, but we already have shitty versions today. Current SOTA is probably like 'process based feedback' + 'some outcomes' + 'amplification' + 'etc'. Notably, this is how humans do things: we reflect on which cognitive strategies and thoughts were good and then try to do more of that. 'Thoughts' isn't really doing that much work here - this is just standard stuff. I expect continued progress on these techniques, and that the techniques will work better and better for smarter models. But I don't expect massive sharp left turn advancements, for the reasons given above.
So I propose “somebody gets autonomous learning to work stably for LLMs (or similarly-general systems)” as a possible future fast-takeoff scenario.
Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations. For instance, suppose that data doesn't run out despite scaling and autonomous learning is moderately to considerably less efficient than supervised learning. Then, you'd just do supervised learning. Now, we can imagine fast takeoff scenarios where:
But this was just a standard fast takeoff argument. Here's a different version which doesn't refer to autonomous learning but is isomorphic:
The reason you got fast takeoff in both cases is just sudden large algorithmic improvement. I don't see a particular reason to expect this in the autonomous learning case and I think the current evidence points to this being unlikely for capabilities in general. (This is of course a quantitative question: how big will leaps be exactly?)
Sure, people do all sorts of cool tricks with the context window, but people don’t know how to iteratively make the weights better and better without limit, in a way that’s analogous to AlphaZero doing self-play or human mathematicians doing math.
I don't think this is a key bottleneck. For instance, it wouldn't be too hard to set up LLMs such that they would improve at some types of mathematics without clear limits (just set them up in a theorem proving self play type setting much like the mathematicians). This improvement rate would be slower than the corresponding rate in humans (by a lot) and would probably be considerably slower than the improvement rate for high quality supervised data. Another minimal baseline is just doing some sort of noisy student setup on entirely model generated data (like here https://arxiv.org/abs/1911.04252).
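For concreteness, the noisy student baseline is roughly the following loop (a schematic sketch in the spirit of the linked paper; `train_fn` and `pseudolabel_fn` are stand-ins for whatever training and labeling procedures you'd actually use):

```python
def noisy_student(teacher, train_fn, pseudolabel_fn, labeled_data, unlabeled_data, rounds=3):
    """Schematic noisy-student loop: the teacher pseudo-labels (model-generated)
    data, a noised student is trained on the combined data, and the student
    becomes the next round's teacher."""
    for _ in range(rounds):
        pseudo = pseudolabel_fn(teacher, unlabeled_data)
        # Noise (dropout, data augmentation, etc.) pushes the student to
        # generalize rather than just copy the teacher.
        student = train_fn(labeled_data + pseudo, noise=True)
        teacher = student
    return teacher
```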
Capabilities people have tons of ideas here, so if data is an actual limitation, I think they'll figure this out (as far as I know, there are already versions in use at scaling labs). No one has (publicly) bothered to work hard on autonomous learning because getting a lot of tokens is way easier, and autonomous learning is probably just worse than working on data curation if you don't run out of data.
My guess is that achieving reasonably efficient things which have good scaling laws is 'just' a moderately large capabilities research project at OpenAI - nothing that special.
You probably take some sort of hit from autonomous learning instead of supervised learning, but it seems not too hard to keep the hit under 100x in compute efficiency (I'm very unsure here). Naively, I would have thought that getting within a factor of 5 or so should be pretty viable.
Perhaps you think there are autonomous-learning-style approaches which are considerably more efficient than next token prediction?
comment TLDR: Adversarial examples are a weapon against the AIs that we can use for good, and solving adversarial robustness would let the AIs harden themselves.
I haven't read this yet (I will later : ) ), so it's possible this is mentioned, but I'd note that exploiting the lack of adversarial robustness could also be used to improve safety. For instance, AI systems might have a hard time keeping secrets if they also need to interact with humans trying to test for verifiable secrets. E.g., trying to jailbreak AIs to get them to reveal that they've already hacked the SSH server.
Ofc, there are also clownier versions like trying to fight the robot armies with adversarial examples (naively seems implausible, but who knows). I'd guess that reasonably active militaries probably already should be thinking about making their camo also be a multi-system adversarial example if possible. (People are already using vision models for targeting etc.)
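For reference, the simplest version of the kind of exploit being gestured at here is a white-box gradient attack like FGSM; a minimal sketch below (the camo/targeting scenario would additionally require the perturbation to transfer across models and survive the physical world, which is much harder):

```python
import torch

def fgsm_perturbation(model, x, y, loss_fn, epsilon=8 / 255):
    """Fast Gradient Sign Method: one signed gradient step in input space that
    increases the model's loss on (x, y), with per-pixel budget epsilon."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).clamp(0, 1).detach()
```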
I don't think this lack of adversarial robustness gets you particularly strong alignment properties, but it's worth noting upsides as well as downsides.
Perhaps we'd ideally train our AIs to be robust to most classes of things but just ensure they remain non-robust in a few (ideally secret) edge cases. However, the most important use cases for AI exploits are in situations after AIs have already started to be able to apply training procedures to themselves without our consent.
My overall guess is that the upsides of solving adversarial robustness outweigh the downsides.
Simulations are not the most efficient way for A and B to reach their agreement
Are you claiming that the marginal returns to simulation are never worth the costs? I'm skeptical. I think it's quite likely that some number of acausal trade simulations are run even if that isn't where most of the information comes from. I think there are probably diminishing returns to various approaches, and thus you both do simulations and other approaches. There's a further benefit to sims, which is that credence about sims affects the behavior of CDT agents, but it's unclear how much this matters.
Additionally, you don't need to nest sims at all; you can simply stub out the results of the sub-simulations with other sims (I'm not sure whether you claim the sub-sims cost anything). It's also conceivable that you do fusions between reasoning and sims to further reduce compute (and there are a variety of other possible optimizations).
It is indeed pretty weird to see these behaviors appear in pure LMs. It's especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
By 'pure LMs' do you mean 'pure next-token-predicting LLMs trained on a standard internet corpus'? If so, I'd be very surprised if they're miscalibrated, assuming this prompt isn't that improbable (which it probably isn't). I'd guess this output is the 'right' output for this corpus (so long as you don't sample enough tokens to make the sequence detectably very weird to the model). Note that t=0 (or low temperature in general) may induce all kinds of strange behavior in addition to making the generation detectably weird.
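On the low-temperature point, here's a minimal sketch of temperature sampling to show why t → 0 takes you off the model's own predictive distribution (it collapses to greedy/argmax decoding):

```python
import numpy as np

def sample_token(logits, temperature=1.0):
    """Sample a token index from softmax(logits / temperature).

    At temperature 1 this samples from the model's actual predictive
    distribution; as temperature -> 0 it collapses to argmax (greedy decoding),
    which can produce atypical text the model itself would flag as weird
    (e.g. repetition loops)."""
    if temperature <= 0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))
```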
I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on future dangerous systems instead of working on theory or trying to carefully construct analogous situations today by thinking in detail about alignment difficulties in the future.