Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Popular Comments

Recent Discussion

This post starts out pretty gloomy but ends up with some points that I feel pretty positive about.  Day to day, I'm more focussed on the positive points, but awareness of the negative has been crucial to forming my priorities, so I'm going to start with those.  It's mostly addressed to the EA community, but is hopefully somewhat of interest to LessWrong and the Alignment Forum as well.

My main concerns

I think AGI is going to be developed soon, and quickly.  Possibly (20%) that's next year, and most likely (80%) before the end of 2029. These are not things you need to believe for yourself in order to understand my view, so no worries if you're not personally convinced of this.

(For what it's worth, I did arrive at...

I wonder if work on AI for epistemics could be great for mitigating the "gradually cede control of the Earth to AGI" threat model. A large majority of economic and political power is held by people who would strongly oppose human extinction, so I expect that "lack of political support for stopping human extinction" would be less of a bottleneck than "consensus that we're heading towards human extinction" and "consensus on what policy proposals will solve the problem". Both of these could be significantly accelerated by AI. Normally, one of my biggest conce... (read more)

2Chris_Leong
I'd strongly bet that when you break this down in more concrete detail, a flaw in your plan will emerge.  The balance of industries serving humans vs. AI's is a suspiciously high level of abstraction.
9RobertM
Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios?  Most of my models for how we might go extinct in next decade from loss of control scenarios require the kinds of technological advancement which make "industrial dehumanization" redundant, with highly unfavorable offense/defense balances, so I don't see how industrial dehumanization itself ends up being the cause of human extinction if we (nominally) solve the control problem, rather than a deliberate or accidental use of technology that ends up killing all humans pretty quickly. Separately, I don't understand how encouraging human-specific industries is supposed to work in practice.  Do you have a model for maintaining "regulatory capture" in a sustained way, despite having no economic, political, or military power by which to enforce it?  (Also, even if we do succeed at that, it doesn't sound like we get more than the Earth as a retirement home, but I'm confused enough about the proposed equilibrium that I'm not sure that's the intended implication.)
4Vladimir Nesov
Healthcare in this general sense is highly relevant to machines. Conversely, sufficient tech to upload/backup/instantiate humans makes biology-specific healthcare (including life extension) mostly superfluous. The key property of machines is initial advantage in scalability, which quickly makes anything human-specific tiny and easily ignorable in comparison, however you taxonomize the distinction. Humans persevere only if scalable machine sources of power (care to) lend us the benefits of their scale. Intent alignment for example would need to be able to harness a significant fraction of machine intent (rather than being centrally about human intent).

In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.

I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.1 My short summary is:

  • Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of
...
3Vivek Hebbar
Do you want to try playing this game together sometime?

Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?

3Vivek Hebbar
Why are we doing RL if we just want imitation?  Why not SFT on expert demonstrations? Also, if 10 episodes suffices, why is so much post-training currently done on base models?

Thanks to Evan Hubinger for funding this project and for introducing me to predictive models, Johannes Treutlein for many fruitful discussions on related topics, and Dan Valentine for providing valuable feedback on my code implementation. 

In September 2023, I received four months of funding through Manifund to extend my initial results on avoiding self-fulfilling prophecies in predictive models. Eleven months later, the project was finished, and the results were submitted as a conference paper. 

The project was largely successful, in that it showed both theoretically and experimentally how incentives for predictive accuracy can be structed so that maximizing them does not manipulate outcomes to be more predictable. The mechanism at play is essentially pitting predictive agents against each other in a zero-sum competition, which allows for the circumvention of impossibility results...

0harsimony
Oh that makes sense! If the predictors can influence the world in addition to making a prediction, they would also have an incentive to change the world in ways that make their predictions more accurate than their opponents right? For example, if everyone else thinks Bob is going to win the presidency, one of the predictors can bribe Bob to drop out and then bet on Alice winning the presidency. Is there work on this? To be fair, it seems like every AI safety proposal has to deal with something like this.

Yes, if predictors can influence the world in addition to making a  prediction, they can go make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions. 

AI safety via market making, which Evan linked in another comment, touches on the analogy where agents are making predictions but can also influence the outcome. You might be interested in reading through it.

2Rubi Hudson
Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate, but I don't see how those changes would break anything else.
3Rubi Hudson
I think the tie-in to market-making, and other similar approaches like debate, is in interpreting the predictions. While the examples in this post were only for the two-outcome case, we would probably want predictions over orders of magnitude more outcomes for the higher informational density. Since evaluating distributions over a double digit number of outcomes already starts posing problems (sometimes even high single digits), a process to direct a decision maker's attention is necessary.  I've been thinking of a proposal like debate, where both sides go back and forth proposing clusters of outcomes based on shared characteristics. Ideally, in equilibrium, the first debater should propose the fewest number of clusters such that splitting them further doesn't change the decision maker's mind. This could also be thought of in terms of market-making, where rather than the adversary proposing a string, they propose a further subdivision of existing clusters.  I like the use case of understanding predictions for debate/market-making, because the prediction itself acts as a ground truth. Then, there's no need to ancitipate/reject a ton of counterarguments based on potential lies, rather arguments are limited to selectively revealing the truth. It is probably important that the predictors are separate models from the analyzer to avoid contamination of the objectives. The proof of Theorem 6, which skips to the end of the search process, needs to use a non-zero sum prediction for that result. As an aside, I also did some early work on decision markets, distinct from your post on market-making, since the Othman and Sandholm had an impossibility result for those too. However, but the results were ultimately trivial. Once you can use zero-sum competition to costlessly get honest conditional predictions, then as soon as you can pair off entrants to the market it becomes efficient. But the question then arises of why use a decision market in the first place instead of just q

Is this market really only at 63%? I think you should take the over. 

Only 63%? I think you should take the over.

Five tiers of rigor for safety-oriented interpretability work

Lately, I have been thinking of interpretability research as falling into five different tiers of rigor. 

1. Pontification

This is when researchers claim they have succeeded in interpreting a model by definition or based on analyzing results and asserting hypotheses about them. This is a key part of the scientific method. But by itself, it is not good science. Previously in this sequence, I have argued that this standard is fairly pervasive.

2. Basic Science

This is when researchers develop an interpretation, use it to make some (usually simple) prediction, and then show that this prediction validates. This is at least doing science, but it...

6Rohin Shah
Uhh... if we (GDM mech interp team) saw good results on any one of the eight things on your list, we'd probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn't about downstream uses (e.g. I'm also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn't one of your eight items, or a paper looking at downstream uses but not comparing against baselines. So just on this very basic outside view, I feel like the sum of your probabilities should be well under 100%, at least conditional on the next paper coming out of GDM. (I don't feel like it would be that different if the next paper comes from OpenAI / Anthropic.) The problem here is "next SAE paper to come out" is a really fragile resolution criterion that depends hugely on unimportant details like "what the team decided was a publishable unit of work". I'd recommend you instead make time-based predictions (i.e. how likely are each of those to happen by some specific date).

Thanks for the comment. My probabilities sum to 165%, which would translate to me saying that I would expect, on average, the next paper to do 1.65 things from the list, which to me doesn't seem too crazy. I think that this DOES match my expectations. 

But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much. 

The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams

Thanks for writing the post!

1Mark Xu
I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is percieved as that "corrupted", although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).
6Evan Hubinger
Some possible counterpoints: * Centralization might actually be good if you believe there are compounding returns to having lots of really strong safety researchers in one spot working together, e.g. in terms of having other really good people to work with, learn from, and give you feedback. * My guess would be that Anthropic resources its safety teams substantially more than GDM in terms of e.g. compute per researcher (though I'm not positive of this). * I think the object-level research productivity concerns probably dominate, but if you're thinking about influence instead, it's still not clear to me that GDM is better. GDM is a much larger, more bureaucratic organization, which makes it a lot harder to influence. So influencing Anthropic might just be much more tractable.
2Mark Xu
Basically (2), very small amounts of (1) (perhaps qualitatively similar to the amount of (1) you would apply to e.g. people joining US AISI or UK AISI)

One strategy for mitigating risk from schemers (that is, egregiously misaligned models that intentionally try to subvert your safety measures) is behavioral red-teaming (BRT). The basic version of this strategy is something like: Before you deploy your model, but after you train it, you search really hard for inputs on which the model takes actions that are very bad. Then you look at the scariest model actions you found, and if these contain examples that are strong evidence that the model is scheming (or has other behavioral problems), you don't deploy it.

BRT is likely to be a useful component of a good overall plan for handling risk from scheming. But though I think it’s plausible that BRT is able to give us strong evidence that our models are scheming,...

Overall, making a safety case from these techniques seems really rough. I don't see why someone would believe that (conditional on non-scheming) you have a 90% chance of getting a 20x Bayes factor against scheming. I don’t think this is just me saying “it’s very hard to get a 20x Bayes factor because that requires a lot of confidence”—I think that all the sources of uncertainty I described above are much worse than approaches to safety cases that rest on e.g. either inability or on control

In this post, when you say getting evidence against scheming, you mo... (read more)

Load More