The data wall discussion in the podcast applies Chinchilla's 20-tokens-per-parameter rule of thumb too broadly and doesn't account for repetition of data in training. These issues partially cancel out, but new information on these ingredients would affect the amended argument differently. I wrote up the argument as a new post.
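For reference, the rule of thumb in question pairs about 20 training tokens with each parameter of a compute-optimal model; combined with the standard $C \approx 6ND$ approximation for training compute, that gives (back-of-the-envelope, not part of the amended argument):

$$D \approx 20N, \qquad C \approx 6ND \approx 120N^2,$$

so, for example, $N = 70$B parameters suggests $D \approx 1.4$T tokens and $C \approx 6 \times 10^{23}$ FLOPs.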
These posts might be relevant:
The details of Constitutional AI seem highly contingent, while the general idea is simply automation of data for post-training, so that the remaining external input is the "constitution". In the original paper there are recipes both for instruction tuning data and for preference data. RLAIF is essentially RLHF that runs on synthetic preference data, maybe together with a recipe for generating it. But preference data could also be used to run DPO or something else, in which case RLAIF becomes a misnomer for describing automation of that preference data.
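As a minimal sketch of what "automation of preference data" amounts to here (hypothetical `generate` and `judge` interfaces standing in for LLM calls, not the paper's actual recipe):

```python
# Minimal sketch of RLAIF-style preference-data automation (hypothetical
# interfaces; not the recipe from the Constitutional AI paper).

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less likely to encourage harm.",
]

def make_preference_example(generate, judge, prompt):
    """Produce one synthetic preference triple from a single prompt.

    `generate(prompt)` samples a response from the policy model, and
    `judge(prompt, responses, principles)` returns the index of the
    preferred response; both are stand-ins for real LLM calls, and the
    constitution is the only external input.
    """
    responses = [generate(prompt), generate(prompt)]
    preferred = judge(prompt, responses, CONSTITUTION)
    chosen, rejected = responses[preferred], responses[1 - preferred]
    # The same triples can feed a reward model for RLHF-style training,
    # or be used directly for DPO or similar methods.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```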
The Llama 3 report suggests that instruction tuning data can be largely automated, but human preference data is still better. And the data foundry business is still alive, so a lot of human data is at least not widely recognized as useless. But it's unclear whether future models won't soon do better than humans at labeling, or whether models at some leading labs already do. Meta didn't have a GPT-4 level model as a starting point before Llama 3, and then there are the upcoming 5e26 FLOPs models and o1-like reasoning models.
For counterlogical mugging, it's unclear if it should be possible to correctly discover the parity of the relevant digit of pi. I would expect that in the counterfactual where it's even, it will eventually be discovered to be even, and in the counterfactual where it's odd, that same digit will eventually be discovered to be odd.
ASP and Transparent Newcomb might be closer to test cases for formulating updateless policies that have the character of getting better as they grow more powerful. These problems ask the agent to use a decision procedure that intentionally doesn't take certain information into account, whether the agent as a whole has access to that information or not. But they lack future steps that would let that decision procedure benefit from eventually getting stronger than the agent that initially formulated it, so these aren't quite the thought experiments needed here.
Updatelessness is about coordination between possible versions of an agent. Coordination with more distant versions of an agent gets more difficult or less informative, and a stronger version of an agent can reach further. This results in many local commitments, each coordinating more closely related versions of an agent.
These local commitments, as agents in their own right, can grow stronger and should themselves coordinate with each other, where their parents failed to reach. Commitment to a strategy that won't itself engage in future rounds of coordination with its alternative possible forms (and other things) is a bad commitment.
I think the FDT dictum of treating an agent as an abstract algorithm rather than any given physical instance of it ("I am an algorithm") extends to goals: goals are about the collective abstract consequences of the behavior of abstract algorithms (other algorithms, not necessarily the agent), rather than about any given incarnation of those algorithms, or about consequences within any given incarnation, such as the physical consequences of running algorithms on computers in a physical world.
In this ontology, goals are not about optimizing configurations of the world; they are about optimizing behaviors of abstract algorithms or properties of mathematical structures. Physically, this predicts computronium (to run acausal interactions with all the abstract things, in order to influence their properties and behaviors) and anti-predicts squiggles or any such focus on the physical form of what's going on, other than efficiency at accessing more computation.
if you assign an extremely low credence to that scenario, then whatever
I don't assign low credence to the scenario where LLMs don't scale to AGI (and my point doesn't depend on this). I assign low credence to the scenario where it's knowable today that LLMs very likely won't scale to AGI, that is, where there is a thing I could study that should change my mind on this. This is more of a crux than the question as a whole; studying that thing would be actionable if I knew what it was.
whether or not LLMs will scale to AGI
This wording mostly answers one of my questions; I'm now guessing that you would say that LLMs are (in hindsight) "the right kind of algorithm" if the scenario I described comes to pass, which wasn't clear to me from the post.
expecting LLMs to not be the right kind of algorithm for future powerful AGI—the kind that can ... do innovative science
I don't know what could serve as a crux for this. When I don't rule out LLMs, what I mean is that I can't find an argument with the potential to convince me to become mostly confident that scaling LLMs to 1e29 FLOPs in the next few years won't produce something clunky and unsuitable for many purposes, yet still barely sufficient to then develop a more reasonable AI architecture within 1-2 more years. And by an LLM that does this I mean the overall system: the LLM's scaffolding environment creating and deploying newly tuned models, using new preference data that lets the new LLM variant do better on particular tasks as the old variant encounters them, or even pre-training models on datasets with heavy doses of LLM-generated problem sets with solutions, to distill the topics that the previous generation of models needed extensive search to stumble through. This takes a lot of time and compute, retraining models in a particular stilted way where a more reasonable algorithm would do it much more efficiently.
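A caricature of that overall system, as a purely illustrative sketch (all component names here are placeholders, not real systems):

```python
# Purely illustrative sketch of the scaffolding loop described above;
# every component name here is a placeholder, not a real system.

def improvement_cycle(model, scaffold, tasks):
    preference_data, synthetic_corpus = [], []
    for task in tasks:
        # The old model variant stumbles through the task with extensive search.
        attempts = scaffold.search(model, task)
        # Its own judgments over attempts become new preference data.
        preference_data += scaffold.rank_attempts(model, task, attempts)
        # Solved problems get written up as problem sets with solutions,
        # distilling what previously required search into plain training data.
        synthetic_corpus += scaffold.write_up_solutions(model, task, attempts)
    # Deploy a newly tuned variant (or pre-train on the synthetic-heavy corpus),
    # then repeat: slow and stilted, but it only has to work at all.
    new_model = scaffold.retrain(model, preference_data, synthetic_corpus)
    return new_model
```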
Many traditionally non-LLM algorithms reduce to such a setup, at an unreasonable but possibly still affordable cost. So this quite fits the description of LLMs as not being "the right kind of algorithm", but the prediction is that the scaling experiment could go either way, that there is no legible way to be confident in either outcome before it's done.
I'd say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance; everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given modern AIs' lack of ability to coherently reason about complicated or long-term plans, or to carry them out. So properties of AIs that are already here don't work as evidence about this either way.
If the transcoders are used to predict next tokens, they may lose interpretability
Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are the thing that pre-training optimizes (as the "full-precision" master model), while transcoders only need to be maintained to remain in sync with the MLPs, whatever they are (according to the same local objective as before, which doesn't care at all about token prediction). The search is for MLPs such that their transcoders are good predictors, not directly for transcoders that are good predictors.
Substituting multiple transcoders at once is possible, but degrades model performance a lot compared to single-transcoder substitutions.
Unclear, given the extreme quantization results: there, post-training replacement would similarly degrade model performance a lot, yet quantization-aware pre-training somehow doesn't.
We don't really know how transcoders (or SAEs, to the best of my knowledge) behave when they're being trained to imitate a model component that's still updating
This seems to be the main technical hurdle for running the experiment: updating transcoders both efficiently and correctly as the underlying MLPs gradually change. (I'm guessing some discontinuous jumps in the choice of transcoders might be OK.)
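To make the intended setup concrete, here's a minimal PyTorch-style sketch of what transcoder-aware pre-training might look like, by analogy with quantization-aware training (my guess at plausible mechanics, not an established recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranscoderAwareMLP(nn.Module):
    """Sketch: the MLP is the 'master' that the token-prediction loss optimizes,
    while a wide sparse transcoder shadows it and is substituted in the forward
    pass, loosely analogous to quantization-aware training."""

    def __init__(self, d_model: int, d_mlp: int, d_transcoder: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
        )
        self.transcoder = nn.Sequential(
            nn.Linear(d_model, d_transcoder), nn.ReLU(), nn.Linear(d_transcoder, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mlp_out = self.mlp(x)
        tc_out = self.transcoder(x)
        # Straight-through-style substitution: downstream layers (and the token
        # prediction loss) see the transcoder's output, but gradients flow into
        # the MLP, so pre-training searches for MLPs whose transcoders predict well.
        return mlp_out + (tc_out - mlp_out).detach()

    def local_sync_loss(self, x: torch.Tensor) -> torch.Tensor:
        # The transcoder is trained only by this local objective, keeping it in
        # sync with whatever the MLP currently is; it never sees token prediction.
        # (A sparsity penalty on transcoder activations would be added here.)
        return F.mse_loss(self.transcoder(x), self.mlp(x).detach())
```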
Healthcare in this general sense is highly relevant to machines. Conversely, sufficient tech to upload/backup/instantiate humans makes biology-specific healthcare (including life extension) mostly superfluous.
The key property of machines is an initial advantage in scalability, which quickly makes anything human-specific tiny and easily ignorable in comparison, however you taxonomize the distinction. Humans persevere only if scalable machine sources of power (care to) lend us the benefits of their scale. Intent alignment, for example, would need to be able to harness a significant fraction of machine intent (rather than being centrally about human intent).