Recent Discussion

In the post introducing mesa optimization, the authors defined an optimizer as

a system [that is] internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

The paper continues by defining a mesa optimizer as an optimizer that was selected by a base optimizer.
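Read literally, the quoted definition can be illustrated with a toy sketch (hypothetical, not from the paper): a system that explicitly represents an objective function and internally searches a space of possible outputs for high-scoring elements.

```python
# Toy "optimizer" per the quoted definition (illustrative only):
# an explicitly represented objective, plus an internal search over
# a space of possible outputs for elements that score high on it.

def objective(x):
    """Explicitly represented objective: peaks at x = 3."""
    return -(x - 3) ** 2

search_space = range(-10, 11)  # space of possible outputs

def optimizer():
    """Internal search: scan the space for the highest-scoring element."""
    return max(search_space, key=objective)

print(optimizer())  # 3
```

Whether humans, or trained neural networks, contain anything like this explicit search-plus-objective structure is exactly what the definitional debate below is about.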

However, there are a number of issues with this definition, as some have already pointed out.

First, I think by this definition humans are clearly not mes…

Here's a related post that came up on Alignment Forum a few months back: Does Agent-like Behavior Imply Agent-like Architecture?

The Effective Altruism Foundation (EAF) is focused on reducing risks of astronomical suffering, or s-risks, from transformative artificial intelligence (TAI). S-risks are defined as events that would bring about suffering on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far[1]. As has been discussed elsewhere, s-risks might arise by malevolence, by accident, or in the course of conflict.

We believe that s-risks arising from conflict are among the most important, tractable, and neglected of these. In particular, strategic threats by powerful AI agents or A

…

I find myself somewhat confused by s-risks as defined here; it's easy to generate clearly typical cases that very few would want, and hard to figure out where the boundaries are, and thus hard to figure out how much I should imagine this motivation impacting the research.

That is, consider the "1950s sci-fi prediction," where a slightly-more-competent version of humanity manages to colonize lots of different planets in ways that make them sort of duplicates of Earth. This seems like it would count as an s-risk if each planet has comparable levels of sufferin

…

One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy: there must be some other strategy that never does any worse than yours, and does strictly better in at least one situation. There’s a good explanation of these arguments here.

We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent,…
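The notion of a dominated strategy can be made concrete with the classic money-pump argument. Below is a toy sketch (hypothetical, not from the post): an agent with cyclic preferences A < B < C < A happily pays a small fee for each "upgrade" and ends up holding its original item with strictly less money, so refusing all trades dominates its strategy.

```python
# Money-pump sketch: cyclic (incoherent) preferences lead to a
# dominated strategy. Names and numbers are illustrative assumptions.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # cyclic preference relation

def run_money_pump(start_item, money, fee, trades):
    """Offer a sequence of trades; the agent pays `fee` for any preferred swap."""
    item = start_item
    for offered in trades:
        if (item, offered) in prefers and money >= fee:
            item, money = offered, money - fee  # agent gladly pays to "upgrade"
    return item, money

item, money = run_money_pump("A", money=10.0, fee=1.0, trades=["B", "C", "A"])
print(item, money)  # A 7.0 -- back where it started, strictly poorer
```

A coherent (transitive) preference ordering admits no such cycle, which is the sense in which the coherence theorems say EU maximization is the only non-dominated way to choose.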

8 · Vanessa Kosoy · 2d · Review

In this essay, Rohin sets out to debunk what ey perceive as a prevalent but erroneous idea in the AI alignment community, namely: "VNM and similar theorems imply goal-directed behavior". This is placed in the context of Rohin's thesis that solving AI alignment is best achieved by designing AI which is not goal-directed. The main argument is: "coherence arguments" imply expected utility maximization, but expected utility maximization does not imply goal-directed behavior. Instead, it is a vacuous constraint, since any agent policy can be regarded as maximizing the expectation of some utility function.

I have mixed feelings about this essay. On the one hand, the core argument that VNM and similar theorems do not imply goal-directed behavior is true. To the extent that some people believed the opposite, correcting this mistake is important. On the other hand, (i) I don't think the claim Rohin is debunking is the claim Eliezer had in mind in those sources Rohin cites, and (ii) I don't think the conclusions Rohin draws, or at least implies, are the right conclusions.

The actual claim that Eliezer was making (or at least my interpretation of it) is: coherence arguments imply that if we assume an agent is goal-directed, then it must be an expected utility maximizer, and therefore EU maximization is the correct mathematical model to apply to such agents. Why do we care about goal-directed agents in the first place? On the one hand, goal-directed agents are the main source of AI risk; on the other hand, goal-directed agents are also the most straightforward approach to solving AI risk. Indeed, if we could design powerful agents with the goals we want, these agents would protect us from unaligned AIs and solve all other problems as well (or at least solve them better than we can solve them ourselves).
Conversely, if we want to protect ourselves from unaligned AIs, we need to generate very sophisticated long-term plans of action in the physical world, possi…
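The vacuity point in the review above has a short constructive proof. The sketch below (a toy illustration, not from either post) shows that for any policy whatsoever, one can define an indicator utility function under which that policy is an expected utility maximizer, so "is an EU maximizer" alone constrains nothing.

```python
# Why bare EU maximization is vacuous: ANY policy maximizes the
# indicator utility u(h, a) = 1 iff a is the action the policy takes.
# All names here are illustrative assumptions.

def make_rationalizing_utility(policy):
    """Return a utility function under which `policy` maximizes expected utility."""
    def utility(history, action):
        return 1.0 if action == policy(history) else 0.0
    return utility

# Even a "twitching robot" policy that ignores its input is an EU maximizer:
twitch = lambda history: "twitch"
u = make_rationalizing_utility(twitch)

actions = ["twitch", "plan", "conquer"]
best = max(actions, key=lambda a: u((), a))
print(best)  # twitch -- the policy's own action maximizes this utility
```

The substantive content therefore has to come from extra assumptions about the utility function (e.g. that it is "simple", or defined over world-states), which is where goal-directedness enters as a separate hypothesis.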
5 · Rohin Shah · 2d

Hmm, perhaps I believed this when I wrote the sequence (I don't think so, but maybe?), but I certainly don't believe it now. I believe something more like:

* Humans have goals and want AI systems to help them achieve them; this implies that the human-AI system as a whole should be goal-directed.
* One particular way to do this is to create a goal-directed AI system, and plug in a goal that (we think) we want. Such AI systems are well-modeled as EU maximizers with "simple" utility functions.
* But there could plausibly be AI systems that are not themselves goal-directed, but nonetheless the resulting human-AI system is sufficiently goal-directed. For example, a "genie" that properly interprets your instructions based on what you mean and not what you say seems not particularly goal-directed, but when combined with a human giving instructions becomes goal-directed.
* One counterargument is that in order to be competitive, you must take the human out of the loop. I don't find this compelling, for a few reasons. First, you can interpolate between lots of human feedback (the human says "do X for a minute" every minute to the "genie") and not much human feedback (the human says "pursue my CEV forever") depending on how competitive you need to be. This allows you to trade off between competitiveness and how much of the goal-directedness remains in the human. Second, you can help the human to provide more efficient and effective feedback (see e.g. recursive reward modeling). Finally, laws and regulations can be effective at reducing competition.
* Nonetheless, it's not obvious how to create such non-goal-directed AI, and the AI community seems very focused on building goal-directed AI, and so there's a good chance we will build goal-directed AI and will need to focus on alignment of goal-directed AI syste…
4 · Vanessa Kosoy · 1d

I think that the discussion might be missing a distinction between different types or degrees of goal-directedness. For example, consider Dialogic Reinforcement Learning. Does it describe a goal-directed agent? On the one hand, you could argue it doesn't, because this agent doesn't have fixed preferences and doesn't have consistent beliefs over time. On the other hand, you could argue it does, because this agent is still doing long-term planning in the physical world.

So, I definitely agree that aligned AI systems will only be goal-directed in the weaker sense that I alluded to, rather than in the stronger sense, and this is because the user is only goal-directed in the weak sense emself. If we're aiming at "weak" goal-directedness (which might be consistent with your position?), does it mean studying strong goal-directedness is redundant? I think the answer is clearly no. Strong goal-directed systems are a simpler special case on which to hone our theories of intelligence. Trying to understand weak goal-directed agents without understanding strong goal-directed agents seems to me like trying to understand molecules without understanding atoms.

On the other hand, I am skeptical about solutions to AI safety that require the user doing a sizable fraction of the actual planning. I think that planning does not decompose into an easy part and a hard part (which is not essentially planning in itself) in a way which would enable such systems to be competitive with fully autonomous planners. The strongest counterargument to this position, IMO, is the proposal to use counterfactual oracles or recursively amplified versions thereof.
On the other hand, I am skeptical about solutions to AI safety that require the user doing a sizable fraction of the actual planning.

Agreed (though we may be using the word "planning" differently, see below).

If we're aiming at "weak" goal-directedness (which might be consistent with your position?)

I certainly agree that we will want AI systems that can find good actions, where "good" is based on long-term consequences. However, I think counterfactual oracles and recursive amplification also meet this criterion; I'm n…

8 · Rohin Shah · 2d · Review

A year later, I continue to agree with this post; I still think its primary argument is sound and important. I'm somewhat sad that I still think it is important; I thought this was an obvious-once-pointed-out point, but I do not think the community actually believes it yet.

I particularly agree with this sentence of Daniel's review: "Constraining the types of valid arguments" is exactly the right way to describe the post. Many responses to the post have been of the form "this is missing the point of EU maximization arguments", and yes, the post is deliberately missing that point. The post is not saying that arguments for AI risk are wrong, just that they are based on intuitions and not math.

While I do think that we are likely to build goal-directed agents, I do not think the VNM theorem and similar arguments support that claim: they simply describe how a goal-directed agent should think. However, talks like AI Alignment: Why It’s Hard, and Where to Start and posts like Coherent decisions imply consistent utilities seem to claim that "VNM and similar theorems" imply "goal-directed agents". While there has been some disagreement over whether this claim is actually present, it doesn't really matter -- readers come away with that impression. I see this post as correcting that claim; it would have been extremely useful for me to read this post a little over two years ago, and anecdotally I have heard that others have found it useful as well.

I am somewhat worried that readers who read this post in isolation will get the wrong impression, since it really was meant…
Acknowledgements & References
0 · 1d · 14 min read


As noted in the document, several sections of this agenda drew on writings by Lukas Gloor, Daniel Kokotajlo, Anni Leskelä, Caspar Oesterheld, and Johannes Treutlein. Thank you very much to David Althaus, Tobias Baumann, Alexis Carlier, Alex Cloud, Max Daniel, Michael Dennis, Lukas Gloor, Adrian Hutter, Daniel Kokotajlo, János Kramár, David Krueger, Anni Leskelä, Matthijs Maas, Linh Chi Nguyen, Richard Ngo, Caspar Oesterheld, Mahendra Prasad, Rohin Shah, Carl Shulman, Stefan Torges, Johannes Treutlein, and Jonas Vollmer for comments on drafts of this document. Thank you also to

…

I have previously advocated for finding a mathematically precise theory for formally approaching AI alignment. Most recently I couched this in terms of predictive coding, and longer ago I was thinking in terms of a formalized phenomenology, but further discussions have helped me realize that, while I consider those approaches useful and they helped me discover my position, they are not the heart of what I think is important. The heart, modulo additional paring down that may come as a result of discussions sparked by this post, is that human values are rooted in valence and thus if we want to build…

This was definitely an interesting and persuasive presentation of the idea. I think this goes to the same place as learning from behavior in the end, though.

For behavior: In the ancestral environment, we behaved like we wanted nourishing food and reproduction. In the modern environment we behave like we want tasty food and sex. Given a button that pumps heroin into our brain, we might behave like we want heroin pumped into our brains.

For valence, the set of preferences that optimizing valence cashes out to depends on the environment. We, in the modern envi

…

I’m working on a theory of abstraction suitable as a foundation for embedded agency and specifically multi-level world models. I want to use real-world examples to build a fast feedback loop for theory development, so a natural first step is to build a starting list of examples which capture various relevant aspects of the problem.

These are mainly focused on causal abstraction, in which both the concrete and abstract model are causal DAGs with some natural correspondence between counterfactuals on the two. (There are some exceptions, though.) The list isn’t very long; I’ve…
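The concrete/abstract correspondence can be sketched with a minimal circuit example in the spirit of the post (a hypothetical toy, not one of the post's own examples): two resistors in series form the concrete causal model, a single combined resistor forms the abstract one, and an abstraction map sends concrete settings to abstract ones so that interventions agree under both models.

```python
# Minimal causal-abstraction sketch (illustrative assumptions throughout):
# concrete DAG: (V, R1, R2) -> I;  abstract DAG: (V, R) -> I,
# with the abstraction map tau(R1, R2) = R1 + R2 for series resistors.

def concrete_model(v, r1, r2):
    """Concrete mechanism: Ohm's law with two series resistances."""
    return v / (r1 + r2)

def abstract_model(v, r_total):
    """Abstract mechanism: Ohm's law with one combined resistance."""
    return v / r_total

def tau(r1, r2):
    """Abstraction map from concrete variables to the abstract variable."""
    return r1 + r2

# Correspondence of interventions: setting (r1, r2) in the concrete model
# agrees with setting r_total = tau(r1, r2) in the abstract model.
v, r1, r2 = 12.0, 4.0, 2.0
assert concrete_model(v, r1, r2) == abstract_model(v, tau(r1, r2))
print(concrete_model(v, r1, r2))  # 2.0 amps under both models
```

The interesting cases in the post are exactly the ones where, unlike here, the correspondence only holds approximately or for a restricted class of interventions.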

A tangent:

It sounds like there's some close ties to logical inductors here, both in terms of the flavor of the problem, and some difficulties I expect in translating theory into practice.

A logical inductor is kinda like an approximation. But it's more accurate to call it lots and lots of approximations - it tries to keep track of every single approximation within some large class, which is essential to the proof that it only does finitely worse than any approximation within that class.

A hierarchical model doesn't naturally fall out of such a mixture, it se

…
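The "lots and lots of approximations" idea in the comment above resembles a Bayesian/multiplicative-weights mixture over a class of predictors, where the mixture provably loses at most log(class size) relative to the best member. The sketch below is a toy stand-in (my own illustrative construction, not the logical induction algorithm itself), tracking three hypothetical constant predictors on a bit stream.

```python
# Toy "mixture of approximations" with a bounded-regret guarantee:
# Bayesian updating over a small class of predictors keeps total log-loss
# within log(n) of the best predictor in the class. Illustrative only.
import math

predictors = [lambda t: 0.9, lambda t: 0.5, lambda t: 0.1]  # each gives P(bit=1)
weights = [1.0] * len(predictors)                            # uniform prior

total_loss, member_losses = 0.0, [0.0] * len(predictors)
for t in range(100):
    bit = 1  # the stream is all 1s, so the first predictor is best
    probs = [p(t) for p in predictors]
    mix = sum(w * p for w, p in zip(weights, probs)) / sum(weights)
    total_loss += -math.log(mix if bit else 1.0 - mix)
    for i, p in enumerate(probs):
        member_losses[i] += -math.log(p if bit else 1.0 - p)
        weights[i] *= p if bit else 1.0 - p  # multiplicative (Bayesian) update

regret = total_loss - min(member_losses)
print(regret <= math.log(len(predictors)))  # True: regret bounded by log(#predictors)
```

As the comment notes, the guarantee is about the whole tracked class at once; nothing in this construction hands you a single hierarchical model, which is the gap being pointed at.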
2 · G Gordon Worley III · 2d

Somewhat related to the electrical circuits example, there might be something similar in software engineering, with levels being something like (depending on the programming paradigm):

* CPU instructions
* byte code or op code or assembly
* AST
* programming language instructions
* statements
* functions
* modules and classes
* patterns and DSLs
* processes
* applications/products
1 · johnswentworth · 2d

Yes definitely. I've omitted examples from software and math because there's no "fuzziness" to them; that kind of abstraction is already better understood than the more probabilistically-flavored use cases I'm aiming for. But the theory should still apply to those cases, as the limiting case where probabilities are 0 or 1, so they're useful as a sanity check.