This post starts out pretty gloomy but ends up with some points that I feel pretty positive about. Day to day, I'm more focused on the positive points, but awareness of the negative ones has been crucial to forming my priorities, so I'm going to start with those. It's mostly addressed to the EA community, but will hopefully be of some interest to LessWrong and the Alignment Forum as well.
I think AGI is going to be developed soon, and quickly. Possibly (20%) that's next year, and most likely (80%) before the end of 2029. These are not things you need to believe for yourself in order to understand my view, so no worries if you're not personally convinced of this.
(For what it's worth, I did arrive at...
In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.
I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.1 My short summary is:
Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?
Thanks to Evan Hubinger for funding this project and for introducing me to predictive models, Johannes Treutlein for many fruitful discussions on related topics, and Dan Valentine for providing valuable feedback on my code implementation.
In September 2023, I received four months of funding through Manifund to extend my initial results on avoiding self-fulfilling prophecies in predictive models. Eleven months later, the project was finished, and the results were submitted as a conference paper.
The project was largely successful, in that it showed both theoretically and experimentally how incentives for predictive accuracy can be structured so that maximizing them does not manipulate outcomes to be more predictable. The mechanism at play is essentially pitting predictive agents against each other in a zero-sum competition, which allows for the circumvention of impossibility results...
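To give a minimal sketch of the zero-sum idea (the toy setup and function names below are mine, purely for illustration, not the construction from the paper): when each predictor is scored only relative to the other, a manipulation that makes the outcome more predictable for both of them adds nothing to either payoff.

```python
import numpy as np

def log_score(prob_dist, outcome):
    """Proper scoring rule: log of the probability assigned to the realized outcome."""
    return np.log(prob_dist[outcome])

def zero_sum_payoffs(p1, p2, outcome):
    """Each predictor is rewarded only for out-predicting the other, not for raw accuracy."""
    s1, s2 = log_score(p1, outcome), log_score(p2, outcome)
    return s1 - s2, s2 - s1

# Two predictors with identical beliefs over three possible outcomes.
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.2, 0.5, 0.3])

# Whatever actually happens, identical predictions give both players a payoff of 0,
# so neither predictor gains by steering the world toward an outcome that the other
# predictor would foresee just as well.
for outcome in range(3):
    print(outcome, zero_sum_payoffs(p1, p2, outcome))
```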
Yes, if predictors can influence the world in addition to making a prediction, they can go make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions.
AI safety via market making, which Evan linked in another comment, touches on the analogous setting where agents make predictions but can also influence the outcome. You might be interested in reading through it.
Is this market really only at 63%? I think you should take the over.
Lately, I have been thinking of interpretability research as falling into five different tiers of rigor.
This is when researchers claim to have interpreted a model by definition, or by analyzing results and asserting hypotheses about them. Hypothesis generation is a key part of the scientific method, but by itself it is not good science. Previously in this sequence, I have argued that this standard is fairly pervasive.
This is when researchers develop an interpretation, use it to make some (usually simple) prediction, and then show that the prediction holds. This is at least doing science, but it...
Thanks for the comment. My probabilities sum to 165%, which translates to me expecting the next paper to do, on average, 1.65 things from the list, which doesn't seem too crazy to me. I think that this DOES match my expectations.
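Spelling out that arithmetic, treating each item on the list as an indicator variable (this holds by linearity of expectation, regardless of how correlated the items are):

$$
\mathbb{E}[\#\text{ of things the next paper does}] = \sum_i P(\text{paper does thing } i) = 1.65
$$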
But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much.
The high-level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams.
Thanks for writing the post!
One strategy for mitigating risk from schemers (that is, egregiously misaligned models that intentionally try to subvert your safety measures) is behavioral red-teaming (BRT). The basic version of this strategy is something like: Before you deploy your model, but after you train it, you search really hard for inputs on which the model takes actions that are very bad. Then you look at the scariest model actions you found, and if these contain examples that are strong evidence that the model is scheming (or has other behavioral problems), you don't deploy it.
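As a schematic sketch of that basic version (the helper functions here are placeholders I'm making up to show the shape of the procedure, not a real pipeline):

```python
def behavioral_red_team(model, search_for_input, badness, is_strong_evidence_of_scheming,
                        search_budget=10_000, n_to_inspect=10):
    """Schematic BRT loop: search hard for inputs that elicit bad actions,
    then decide whether to deploy. All helpers passed in are hypothetical."""
    found = []
    for _ in range(search_budget):
        candidate = search_for_input(model)        # adversarially chosen input
        action = model.act(candidate)              # what the model does on it
        found.append((candidate, action))

    # Look at the scariest actions the search turned up.
    scariest = sorted(found, key=lambda pair: badness(pair[1]), reverse=True)[:n_to_inspect]

    # If any of them is strong evidence of scheming (or other serious behavioral
    # problems), don't deploy the model.
    if any(is_strong_evidence_of_scheming(inp, act) for inp, act in scariest):
        return "do not deploy"
    return "deploy"
```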
BRT is likely to be a useful component of a good overall plan for handling risk from scheming. But while I think it's plausible that BRT is able to give us strong evidence that our models are scheming,...
Overall, making a safety case from these techniques seems really rough. I don't see why someone would believe that (conditional on non-scheming) you have a 90% chance of getting a 20x Bayes factor against scheming. And I don't think this is just me saying "it's very hard to get a 20x Bayes factor because that requires a lot of confidence"; I think all the sources of uncertainty I described above are much worse here than for safety cases that rest on, e.g., inability or control.
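For concreteness (the 25% prior below is purely illustrative, not a number from the post): a 20x Bayes factor against scheming moves the posterior odds like this:

$$
\frac{P(\text{scheming}\mid E)}{P(\text{not scheming}\mid E)}
= \frac{1}{20}\cdot\frac{P(\text{scheming})}{P(\text{not scheming})},
\qquad \text{e.g. } 25\% \;\Rightarrow\; \frac{1}{3}\cdot\frac{1}{20} = \frac{1}{60} \;\Rightarrow\; \approx 1.6\%.
$$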
In this post, when you say getting evidence against scheming, you mo...
I wonder if work on AI for epistemics could be great for mitigating the "gradually cede control of the Earth to AGI" threat model. A large majority of economic and political power is held by people who would strongly oppose human extinction, so I expect that "lack of political support for stopping human extinction" would be less of a bottleneck than "consensus that we're heading towards human extinction" and "consensus on what policy proposals will solve the problem". Both of these could be significantly accelerated by AI. Normally, one of my biggest conce...