Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.
I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.
I have signed no contracts or agreements whose existence I cannot mention.
But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives.
You have to be a pretty committed scope-sensitive consequentialist to disagree with this. What if they actually risked torturing 1M or 1B people? That seems terrible and unacceptable, and by assumption AI suffering is equivalent to human suffering. I think our societal norms are such that unacceptable things regularly become acceptable when the stakes are clear so you may not even lose much utility from this emphasis on avoiding suffering.
It seems perfectly compatible with good decision-making that there are criteria A and B, A is much more important and therefore prioritized over B, and 2 out of 19 sections are focused on B. The real question is whether the organization's leadership is able to make difficult tradeoffs, reassessing and questioning requirements as new information comes in. For example, in the 1944 Norwegian sabotage of a Nazi German heavy water shipment, stopping the Nazi nuclear program was the first priority. The mission went ahead with reasonable effort to minimize casualties and 14 civilians died anyway, less than it could have been. It would not really have alarmed me to see a document discussing 19 efforts with 2 being avoidance of casualties, nor to know that the planners regularly talked with the vibe that 10-100 civilian casualties should be avoided, as long as someone had their eye on the ball.
I'm not thinking of a specific task here, but I think there are two sources of hope. One is that humans are agentic above and beyond what is required to do novel science, e.g. we have biological drives, goals other than doing the science, often the desire to use any means to achieve our goals rather than whitelisted means, and the ability and desire to stop people from interrupting us. Another is that learning how to safely operate agents at a slightly superhuman level will be progress towards safely operating nanotech-capable agents, which could also require control, oversight, steering, or some other technique. I don't think limiting agency will be sufficient unless the problem is easy, and then it would have other possible solutions.
I'm glad to see this post curated. It seems increasingly likely that we need it will be useful to carefully construct agents that have only what agency is required to accomplish a task, and the ideas here seem like the first steps.
I agree, there were some good papers, and mechinterp as a field is definitely more advanced. What I meant to say was that many of the mechinterp papers accepted to the conference weren't very good.
Quick takes from ICML 2024 in Vienna:
How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.
I am pro-corrigibility in general but there are parts of this post I think are unclear, not rigorous enough to make sense to me, or I disagree with. Hopefully this is a helpful critique, and maybe parts get answered in future posts.
You give an informal definition of "corrigible" as (C1):
an agent that robustly and cautiously reflects on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.
I have some basic questions about this.
We considered that "catastrophic" might have that connotation, but we couldn't think of a better name and I still feel okay about it. Our intention with "catastrophic" was to echo the standard ML term of "catastrophic forgetting", not a global catastrophe. In catastrophic forgetting the model completely forgets how to do task A after it is trained on task B, it doesn't do A much worse than random. So we think that "catastrophic Goodhart" gives the correct idea to people who come from ML.
The natural question is then: why didn't we study circumstances in which optimizing for a proxy gives you utility in the limit? Because it isn't true under the assumptions we are making. We wanted to study regressional Goodhart, and this naturally led us to the independence assumption. Previous work like Zhuang et al and Skalse et al has already formalized the extremal Goodhart / "use the atoms for something else" argument that optimizing for one goal would be bad for another goal, and we thought the more interesting part was showing that bad outcomes are possible even when error and utility are independent. Under the independence assumption, it isn't possible to get less than 0 utility.
To get utility in the frame where proxy = error + utility, you would need to assume something about the dependence between error and utility, and we couldn't think of a simple assumption to make that didn't have too many moving parts. I think extremal Goodhart is overall more important, but it's not what we were trying to model.
Lastly, I think you're imagining "average" outcome as a random policy, which is an agent incapable of doing significant harm. The utility of the universe is still positive because you can go about your life. But in a different frame, random is really bad. Right now we pretrain models and then apply RLHF (and hopefully soon, better alignment techniques). If our alignment techniques produce no more utility than the prior, this means the model is no more aligned than the base model, which is a bad outcome for OpenAI. Superintelligent models might be arbitrarily capable of doing things, so the prior might be better thought of as irreversibly putting the world in a random state, which is a global catastrophe.
I started a dialogue with @Alex_Altair a few months ago about the tractability of certain agent foundations problems, especially the agent-like structure problem. I saw it as insufficiently well-defined to make progress on anytime soon. I thought the lack of similar results in easy settings, the fuzziness of the "agent"/"robustly optimizes" concept, and the difficulty of proving things about a program's internals given its behavior all pointed against working on this. But it turned out that we maybe didn't disagree on tractability much, it's just that Alex had somewhat different research taste, plus thought fundamental problems in agent foundations must be figured out to make it to a good future, and therefore working on fairly intractable problems can still be necessary. This seemed pretty out of scope and so I likely won't publish.
Now that this post is out, I feel like I should at least make this known. I don't regret attempting the dialogue, I just wish we had something more interesting to disagree about.
This post and the remainder of the sequence were turned into a paper accepted to NeurIPS 2024. Thanks to LTFF for funding the retroactive grant that made the initial work possible, and further grants supporting its development into a published work including new theory and experiments. @Adrià Garriga-alonso was also very helpful in helping write the paper and interfacing with the review process.