In the post introducing mesa optimization, the authors defined an optimizer as
a system [that is] internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
The paper continues by defining a mesa optimizer as an optimizer that was selected by a base optimizer.
However, there are a number of issues with this definition, as some have already pointed out.
First, I think by this definition humans are clearly not mes... (Read more)
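The quoted definition is easy to make concrete. A minimal sketch (my own illustration, not from the post): a system that internally searches a space of candidate outputs for elements scoring high on an explicitly represented objective function.

```python
def objective(x):
    # The objective function, explicitly represented within the system:
    # score candidates by closeness to 3.
    return -(x - 3) ** 2

def optimize(search_space, objective):
    # Internal search: evaluate every candidate output and keep the
    # highest-scoring element.
    return max(search_space, key=objective)

best = optimize(range(10), objective)  # → 3
```

By the post's definition, `optimize` is an optimizer; the open question is whether systems like humans, which lack such an explicitly represented objective, count.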
The Effective Altruism Foundation (EAF) is focused on reducing risks of astronomical suffering, or s-risks, from transformative artificial intelligence (TAI). S-risks are defined as events that would bring about suffering on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far. As has been discussed elsewhere, s-risks might arise by malevolence, by accident, or in the course of conflict.
We believe that s-risks arising from conflict are among the most important, tractable, and neglected of these. In particular, strategic threats by powerful AI agents or A... (Read more)
One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments suggesting that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy: there must be some other strategy that never does worse than yours and does strictly better in at least one situation. There’s a good explanation of these arguments here.
We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent,... (Read more)
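The classic concrete instance of a dominated strategy is a money pump. A toy sketch (my own, under the usual assumptions of the coherence arguments): an agent with cyclic preferences A > B > C > A will pay a small fee for each "upgrade" and can be drained indefinitely, ending where it started but strictly poorer.

```python
FEE = 1  # small fee the agent willingly pays to trade up to a preferred item

# Cyclic (hence incoherent) preferences: maps each held item to the item
# the agent strictly prefers over it.
prefers_over = {"B": "A", "A": "C", "C": "B"}

money = 0
item = "B"
for _ in range(3):
    # The agent accepts each offered trade, paying the fee every time.
    item = prefers_over[item]
    money -= FEE

# After one full cycle the agent holds its original item but is strictly
# poorer: a strategy that refused all trades dominates this one.
```

Any other strategy that simply declines the trades never does worse and does strictly better here, which is exactly the dominance the coherence arguments appeal to.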
As noted in the document, several sections of this agenda drew on writings by Lukas Gloor, Daniel Kokotajlo, Anni Leskelä, Caspar Oesterheld, and Johannes Treutlein. Thank you very much to David Althaus, Tobias Baumann, Alexis Carlier, Alex Cloud, Max Daniel, Michael Dennis, Lukas Gloor, Adrian Hutter, Daniel Kokotajlo, János Kramár, David Krueger, Anni Leskelä, Matthijs Maas, Linh Chi Nguyen, Richard Ngo, Caspar Oesterheld, Mahendra Prasad, Rohin Shah, Carl Shulman, Stefan Torges, Johannes Treutlein, and Jonas Vollmer for comments on drafts of this document. Thank you also to... (Read more)
I have previously advocated for finding a mathematically precise theory for formally approaching AI alignment. Most recently I couched this in terms of predictive coding, and longer ago I was thinking in terms of a formalized phenomenology, but further discussions have helped me realize that, while I consider those approaches useful and they helped me discover my position, they are not the heart of what I think is important. The heart, modulo additional paring down that may come as a result of discussions sparked by this post, is that human values are rooted in valence and thus if we want to build... (Read more)
I’m working on a theory of abstraction suitable as a foundation for embedded agency and specifically multi-level world models. I want to use real-world examples to build a fast feedback loop for theory development, so a natural first step is to build a starting list of examples which capture various relevant aspects of the problem.
These are mainly focused on causal abstraction, in which both the concrete and abstract model are causal DAGs with some natural correspondence between counterfactuals on the two. (There are some exceptions, though.) The list isn’t very long; I’ve... (Read more)
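A minimal sketch of the kind of correspondence the post describes (my own toy example, not from the list): a concrete causal DAG with variables X1, X2 → Y, and an abstract DAG Z → Y where Z summarizes X1 + X2. The natural correspondence is that intervening on Z in the abstract model matches any joint intervention on (X1, X2) with the same sum in the concrete model.

```python
def concrete(x1, x2):
    # Concrete causal model: X1, X2 -> Y, with mechanism Y = 2*(X1 + X2).
    return 2 * (x1 + x2)

def abstract(z):
    # Abstract causal model: Z -> Y, where Z abstracts the sum X1 + X2
    # and the mechanism for Y is preserved: Y = 2*Z.
    return 2 * z

# Counterfactual correspondence: do(X1=1, X2=2) in the concrete model
# corresponds to do(Z=3) in the abstract model, and both yield the same Y.
y_concrete = concrete(1, 2)
y_abstract = abstract(3)
```

The abstract model throws away which decomposition of Z into (X1, X2) occurred, but every counterfactual it can express agrees with the concrete model, which is the property the examples on the list are meant to probe.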