Alex Flint

Aspiring monastic and AI safety researcher

Focus: you are allowed to be bad at accomplishing your goals

Lastly, I pick a distance between policies. If the two policies are deterministic, a Hamming distance will do. If they are stochastic, maybe some vector distance based on the Kullback-Leibler divergence.

I think it might actually be very difficult to come up with a distance metric between policies that corresponds even reasonably well to behavioral similarity. I imagine that flipping the sign on a single crucial parameter in a neural net could completely change its behavior, or at least break it sufficiently that it goes from highly goal oriented behavior to random/chaotic behavior.
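To make the worry concrete, here is a minimal sketch. It assumes a toy linear policy rather than a real neural net, and the weights, sample states, and function names are all hypothetical. It compares Euclidean distance in parameter space with a Hamming-style distance over the actions chosen on sampled states.

```python
import math
import random

def act(weights, state):
    """Deterministic linear policy: action 1 if w . s > 0, else action 0."""
    return 1 if sum(w * s for w, s in zip(weights, state)) > 0 else 0

random.seed(0)
# Hypothetical sample of 8-dimensional states.
states = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]

base = [5.0] + [0.1] * 7             # one crucial parameter, seven minor ones
flipped = [-base[0]] + base[1:]      # flip the sign of the crucial parameter
rescaled = [8.0 * w for w in base]   # move every parameter a long way

def param_distance(u, v):
    """Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def behavioral_distance(u, v):
    """Fraction of sampled states on which the two policies act differently."""
    return sum(act(u, s) != act(v, s) for s in states) / len(states)

for name, other in [("flipped", flipped), ("rescaled", rescaled)]:
    print(name, param_distance(base, other), behavioral_distance(base, other))
```

In this toy setup the rescaled policy is further from the original in parameter space yet behaves identically, while the sign-flipped policy is closer in parameter space and disagrees on most states.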

By analogy, imagine trying to come up with a distance metric between Python source files in a way that captures behavioral similarity. Very subtle changes to source code can completely alter behavior, while drastic refactorings can leave behavior unchanged.
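A small illustration of the analogy, as a sketch only: the three source strings and the test inputs are made up, and `difflib.SequenceMatcher` is used as one stand-in for a textual distance, not as a proposal for the right metric.

```python
import difflib

original = "def f(xs):\n    return sorted(xs)[0]\n"
# One-character edit: now returns the maximum instead of the minimum.
subtle = "def f(xs):\n    return sorted(xs)[-1]\n"
# Drastic rewrite that still returns the minimum.
refactor = (
    "def f(values):\n"
    "    best = None\n"
    "    for v in values:\n"
    "        if best is None or v < best:\n"
    "            best = v\n"
    "    return best\n"
)

def load(src):
    """Execute a source string and return the function f it defines."""
    namespace = {}
    exec(src, namespace)
    return namespace["f"]

def text_similarity(a, b):
    """Textual similarity in [0, 1] based on matching character blocks."""
    return difflib.SequenceMatcher(None, a, b).ratio()

inputs = [[3, 1, 2], [10, -4, 7], [5]]  # hypothetical test inputs

def same_behavior(a, b):
    """True if both sources agree on every test input (not a proof of equivalence)."""
    fa, fb = load(a), load(b)
    return all(fa(x) == fb(x) for x in inputs)

for name, other in [("subtle", subtle), ("refactor", refactor)]:
    print(name, round(text_similarity(original, other), 3), same_behavior(original, other))
```

The subtle edit is textually almost identical but behaves differently, and the refactor is textually distant but agrees on every input tried. Agreement on a finite set of inputs is of course weaker than behavioral equivalence, which is the part that runs into undecidability in general.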

Ultimately we'd like to be able to handle cases where we're using network architectures that permit arbitrary Turing machines to emerge as policies, in which case determining behavioral similarity by comparing source code is equivalent to the halting problem.

The ground of optimization

My biggest objection to this definition is that it inherently requires time

Fascinating - but why is this an objection? Is it just the inelegance of not being able to look at a single time slice and answer the question of whether optimization is happening?

One class of cases which definitely seem like optimization but do not satisfy this property at all: one-shot non-iterative optimization.

Yes this is a fascinating case! I'd like to write a whole post about it. Here are my thoughts:

- First, just as a fun fact, note that it's actually extremely rare to see any non-iterative optimization in practical usage. When we solve linear equations, we could use Gaussian elimination but it's so unstable that in practice we use, most likely, the SVD, which is iterative. When we solve a system of polynomial equations we could use something like a Gröbner basis or the resultant, but it's so unstable that in practice we use something like a companion matrix method, which comes down to an eigenvalue decomposition, which is again iterative.
- Consider finding the roots of a simple quadratic equation (i.e. solving a cubic optimization problem). We can use the quadratic formula to do this. But ultimately this comes down to computing a square root, which is typically (though not necessarily) solved with an iterative method.
- That these methods (for solving linear systems, polynomial systems, and quadratic equations) have at their heart an iterative optimization algorithm is not accidental. The iterative methods involved are not some small or sideline part of what's going on. In fact when you solve a system of polynomial equations using a companion matrix, you spend a lot of energy rearranging the system into a form where it can be solved via an eigenvalue decomposition, and then the eigenvalue decomposition itself is very much operating on the full problem. It's not some unimportant side operation. I find this fascinating.
- Nevertheless it *is* possible to solve linear systems, polynomial systems etc with non-iterative methods.
- These methods are definitely considered "optimization" by any normal use of that term. So in this way my definition doesn't quite line up with the common language use of the word "optimization".
- But these non-iterative methods actually do not have the core property that I described in the square-root-of-two example. If I reach in and flip a bit while a Gaussian elimination is running, the algorithm does not in any sense recover. Since the algorithm is just performing a linear sequence of steps, the error just grows and grows as the computation unfolds. This is the opposite of what happens if I reach in and flip a bit while an SVD is being computed: in this case the error will be driven back to zero by the iterative optimization algorithm.
- You might say that my focus on error-correction simply doesn't capture the common language use of the term optimization, as demonstrated by the fact that non-iterative optimization algorithms do not have this error-correcting property. You would be correct!
- But perhaps my real response is that fundamentally I'm interested in these processes that somewhat mysteriously drive the state of the world towards a target configuration, and keep doing so despite perturbations. I think these are central to what AI and agency are. The term "optimizing system" might not be quite right, but it seems close enough to be compelling.
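The contrast between the two kinds of method can be sketched in a few lines. This is a toy under stated assumptions: the Babylonian iteration stands in for "an iterative method", a hand-written 2x2 elimination stands in for Gaussian elimination, and an additive bump stands in for a flipped bit. The function names and the particular system are hypothetical.

```python
def newton_sqrt2(steps=30, perturb_at=None, bump=0.0):
    """Babylonian iteration x <- (x + 2/x) / 2, optionally perturbed mid-run."""
    x = 1.0
    for i in range(steps):
        if i == perturb_at:
            x += bump  # stand-in for reaching in and flipping a bit
        x = 0.5 * (x + 2.0 / x)
    return x

def eliminate_2x2(perturb=0.0):
    """Solve 2x + y = 3, x + 3y = 5 by elimination, optionally corrupting
    one intermediate value. The exact solution is x = 0.8, y = 1.4."""
    a00, a01, b0 = 2.0, 1.0, 3.0
    a10, a11, b1 = 1.0, 3.0, 5.0
    m = a10 / a00
    a11 -= m * a01
    b1 -= m * b0
    b1 += perturb  # corrupt an intermediate value; nothing later corrects it
    y = b1 / a11
    x = (b0 - a01 * y) / a00
    return x, y

print(newton_sqrt2(), newton_sqrt2(perturb_at=5, bump=3.0))
print(eliminate_2x2(), eliminate_2x2(perturb=0.5))
```

The perturbed iteration still ends up at the square root of two, while the perturbed elimination carries the corruption straight through to its answer. The recovery only holds inside the basin of attraction: a bump that pushed `x` to zero or below would not converge back to the positive root.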

Thanks for the question - I clarified my own thinking while writing up this response.

The ground of optimization

Thank you Ben. Reading this really filled me with joy and gives me energy to write more. Thank you for your curation work - it's a huge part of why there is this place for such high quality discussion of topics like this, for which I'm very grateful.

The ground of optimization

suppose you have a box with a rock in it, in an otherwise empty universe [...]

Yes you're right, this system would be described by a constant utility function, and yes this is analogous to the case where the target configuration set contains all configurations, and yes this should not be considered optimization. In the target set formulation, we can measure the degree of optimization by the size of the target set relative to the size of the basin of attraction. In your rock example, the sets have the same size, so it would make sense to say that the degree of optimization is zero.
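As a sketch of what such a measure could look like: the log-ratio below is my assumption for illustration, since all that is fixed above is that the measure depends on the size of the target set relative to the basin and is zero when the two are equal. The function name is hypothetical and the sets are taken to be finite.

```python
import math

def degree_of_optimization(target_size, basin_size):
    """Bits of 'squeeze' from basin to target: log2(|basin| / |target|).

    Assumes finite sets with the target contained in the basin.
    """
    if not 0 < target_size <= basin_size:
        raise ValueError("need 0 < target_size <= basin_size")
    return math.log2(basin_size / target_size)

# Rock in a box: every configuration is a target, so no optimization.
print(degree_of_optimization(1000, 1000))
# A single target configuration inside a basin of 1024 configurations.
print(degree_of_optimization(1, 1024))
```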

This discussion is updating me in the direction that a preference ordering formulation is possible, but that we need some analogy for "degree of optimization" that captures how "tight" or "constrained" the system's evolution is relative to the size of the basin of attraction. We need a way to say that a constant utility function corresponds to a degree of optimization equal to zero. We also need a way to handle the case where our utility function assigns utility proportional to entropy, so again we can describe all physical systems as optimizing systems and thermodynamics ensures that we are correct. This utility function would be extremely flat and wide, with most configurations receiving near-identical utility (since the high entropy configurations constitute the vast majority of all possible configurations). I'm sure there is some way to quantify this - do you know of any appropriate measure?

The challenge here is that in order to actually deal with the case you mentioned originally -- the goal of moving as fast as possible -- we need a measure that is not based on the size or curvature of some local maximum of the utility function. If we are working with local maxima then we are really still working with systems that evolve towards a specific destination (although there still may be advantages to thinking this way rather than in terms of a binary set).

My preferred solution to this is just to stop trying to define optimisation in terms of *outcomes*, and start defining it in terms of *computation* done by systems

Nice - I'd love to hear more about this

The ground of optimization

Yeah I agree that duality is not a good measure of whether a system contains something like an AI. There is one kind of AI that we can build that is highly dualistic. Most present-day AI systems are quite dualistic, because they are predicated on having some robust compute infrastructure that is separate from and mostly unperturbed by the world around it. But there is every reason to go beyond these dualistic designs, for precisely the reason you point to: such systems do tend to be somewhat brittle.

I think it's quite feasible to build highly robust AI systems, although doing so will likely require more than just hardening (making it really unlikely for the system to be perturbed). What we really want is an AI system where the core AI itself tends to evolve back to a stable configuration despite perturbations to its core infrastructure. My sense is that this will actually require a significant shift in how we think about AI -- specifically moving from the agent model to something that captures what is good and helpful in the agent model but discards the dualistic view of things.

The ground of optimization

Great examples! Thank you.

- Consider adding a big black hole in the middle of a galaxy. Does this turn the galaxy into a system optimising for a really big black hole in the middle of the galaxy?

Yes this would qualify as an optimizing system by my definition. In fact just placing a large planet close to a bunch of smaller planets would qualify as an optimizing system if the eventual result is to collapse the mass of the smaller planets into the larger planet.

This seems to me to be a lot like a ball rolling down a hill: a black hole doesn't seem alive or agentic, and it doesn't really respond in any meaningful way to hurdles put in its way, but yes it does qualify as an optimizing system. For this reason my definition isn't yet a very good definition of what agency is, or what post-agency concept we should adopt. I like Rohin's comment on how we might view agency in this framework.

- Imagine that I have the goal of travelling as fast as possible. However, there is no set of states which you can point to as the "target states", since whatever state I'm in, I'll try to go even faster. This is another argument for, as I argue below, defining an optimising system in terms of increasing some utility function (rather than moving towards target states).

Yes it's true that using a set of target states rather than an ordering over states means that we can't handle cases where there is a direction of optimization but not a "destination". But if we use an ordering over states then we run into the following problem: how can we say whether a system is robust to perturbations? Is it just that the system continues to climb the preference gradient despite perturbations? But now every system is an optimizing system, because we can always come up with some preference ordering that explains a system as an optimizing system. So then we can say "well it should be an ordering over states with a compact representation" or "it should be more compact than competing explanations". This may be okay but it seems quite dicey to me.
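The "we can always come up with some preference ordering" step can be shown directly. This is a minimal sketch with a made-up trajectory and hypothetical function names, and it assumes the trajectory never revisits a state.

```python
def ordering_from_trajectory(trajectory):
    """Assign each state a utility equal to the time at which it was visited.

    Assumes no state is visited twice; with repeats the construction
    would need a richer notion of state (e.g. state plus time).
    """
    return {state: t for t, state in enumerate(trajectory)}

def climbs(trajectory, utility):
    """True if utility strictly increases along every step of the trajectory."""
    return all(utility[a] < utility[b] for a, b in zip(trajectory, trajectory[1:]))

trajectory = ["rock at rest", "rock nudged", "rock rolling", "rock at bottom"]
utility = ordering_from_trajectory(trajectory)
print(climbs(trajectory, utility))
```

The ordering is built from the trajectory itself, so the system "climbs" it by construction. It is also exactly as long as the trajectory it explains, which is why a compactness requirement looks like the natural patch.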

It actually seems quite important to me that the definition point to systems that "get back on track" even when you push them around. It may be possible to do this with an ordering over states and I'd love to discuss this more.

The ground of optimization

Well we could always just set the last digit to 0 as a post-processing step to ensure perfect repeatability. But point taken, you're right that most numerical algorithms are not quite as perfectly stable as I claimed.

The ground of optimization

Thank you for the pointer to this terminology. It seems relevant and I wasn't aware of the terminology before.

My take on CHAI’s research agenda in under 1500 words

While this is the agenda that Stuart talks most about, other work also happens at CHAI

Yes good point - I'll clarify and link to ARCHES.

The reason I'm excited about CIRL is because it provides a formalization of assistance games in the sequential decision-making setting ... There should soon be a paper that more directly explains the case for the formalism

Yeah this is a helpful perspective, and great to hear re upcoming paper. I have definitely spoken to some folks that think of CHAI as the "cooperative inverse reinforcement learning lab" so I wanted to make the point that CIRL != CHAI.

All models are wrong; some are useful

Well keep in mind that we're using the agent model twice: once in our own understanding of the AI systems we build, and then a second time in the AI system's understanding of what a human is. We can update the former as needed, but if we want the AI system to be able to update its understanding of what a human is then we need to work out how to make that assumption updateable in the algorithms we deploy.

So when I hear "X is misspecified, so it might misbehave"; I want to hear more about how exactly it will misbehave before I'm convinced I should care.

Very fair request. I will hopefully be writing more on this topic in the specific case of the agent assumption soon.

More generally, it seems like "help X" or "assist X" only means something when you view X as pursuing some goal

Well would you agree that it's possible to help a country? A country seems pretty far away from being an agent, although perhaps it could be said to have goals. Yet it does seem possible to provide e.g. economic advice or military assistance to a country in a way that helps the country without simply helping each of the separate individuals.

How about helping some primitive organism, such as a jellyfish or amoeba? I guess you could impute goals onto such organisms...

How about helping a tree? It actually seems pretty straightforward to me how to help a tree (bring water and nutrients to it, clean off parasites from the bark, cut away any dead branches), but does an individual tree really have goals?

Thanks for the very thoughtful comment Rohin. I was on retreat last week after I published the article and upon returning to computer usage I was delighted by the engagement from you and others.

I like this.

We'll presumably need to give O some information about the goal / target configuration set for each task. We could say that a robot capable of moving a vase around is a little bit general since we can have it solve the tasks of placing the vase at many different locations by inputting some latitude/longitude into some appropriate memory location. But this means we're actually pasting in a different object O for each task T -- each of the objects differs in those memory locations into which we're pasting the latitude/longitude. It might be helpful to think of an "agent schema" function that maps goals to objects, so we take the goal part of the task, compute the object O for that goal, then paste this object into the environment.
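A minimal sketch of the schema idea, with everything in it hypothetical: `VaseRobot`, its `step` rule, and `agent_schema` are illustrative names, and the "environment" is just a pair of coordinates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VaseRobot:
    """Hypothetical object O: the only goal-dependent parts are two memory cells."""
    target_lat: float
    target_lon: float

    def step(self, lat, lon, speed=1.0):
        """Move the vase toward the stored target by at most `speed` per axis."""
        def toward(x, target):
            return x + max(-speed, min(speed, target - x))
        return toward(lat, self.target_lat), toward(lon, self.target_lon)

def agent_schema(goal):
    """Map the goal part of a task to the concrete object O for that goal."""
    lat, lon = goal
    return VaseRobot(target_lat=lat, target_lon=lon)

# Compute O for one goal, "paste" it into the environment, and run it.
robot = agent_schema((3.0, -2.0))
position = (0.0, 0.0)
for _ in range(5):
    position = robot.step(*position)
print(position)
```

Two different goals give two different objects that differ only in the stored coordinates, which is the sense in which a different O is pasted in for each task.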

It's also important that O be able to solve the task for a reasonably broad range of environments.

Perhaps we could look at it this way: take a system containing a human that is trying to get something done. This is presumably an optimizing system as humans often robustly move their environment towards some desired target configuration set. Then an inner-aligned AI is an object O such that adding it to this environment does not change the target configuration set, but does change the speed and/or robustness of convergence to that target configuration set.
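A toy version of this picture, as a sketch only: the state is a single number, the human's effort is modelled as a fixed fractional pull toward the target, and adding O is modelled as a larger pull toward the same target. The rates and function name are assumptions for illustration.

```python
def steps_to_converge(rate, start=10.0, target=0.0, tol=1e-3, max_steps=10_000):
    """Count steps until the state is within `tol` of the target.

    Each step moves the state a fraction `rate` of the way to the target.
    Returns None if the state has not converged within `max_steps`.
    """
    x = start
    for n in range(max_steps):
        if abs(x - target) < tol:
            return n
        x += rate * (target - x)
    return None

human_alone = steps_to_converge(rate=0.1)
human_with_o = steps_to_converge(rate=0.1 + 0.4)  # O adds pull toward the same target
print(human_alone, human_with_o)
```

The target is the same in both runs and only the speed of convergence changes. A misaligned O would instead correspond to changing `target`, which this sketch deliberately leaves fixed.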

Yup very difficult to say much about intentions using the pure outside view approach of this framework. Perhaps we could say that an intent-aligned AI is an inner-aligned AI modulo less robustness. Or perhaps we could say that an intent-aligned AI is an AI that would achieve the goal in a large set of benign environments, but might not achieve it in the presence of unlikely mistakes, unlikely environmental conditions, or the presence of other powerful basins of attraction.

But this doesn't really get at the spirit of Paul's idea, which I think is about really looking inside the AI and understanding its goals.