Kaj Sotala


Fixing The Good Regulator Theorem

Appreciate this post! I had seen the good regulator theorem referenced every now and then, but wasn't sure what exactly the relevant claims were, and wouldn't have known how to go through the original proof myself. This is helpful.

(E.g. the result was cited by Frith & Metzinger as part of their argument that, as an agent seeks to avoid being punished by society, it is attempting to regulate society's behavior; for the regulation to be successful, the agent needs to internalize a model of society's preferences, which once internalized becomes something like a subagent that regulates the agent in turn, causing behaviors such as self-punishment. It sounds like the math of the theorem isn't very strongly relevant to that particular argument, though some form of the overall argument still sounds plausible to me regardless.)

The Case for a Journal of AI Alignment

IMO, a textbook would either overlook big chunks of the field or look more like an enumeration of approaches than a unified resource.

Textbooks that cover a number of different approaches without taking a position on which one is best are pretty much the standard in many fields. (I recall struggling with this in some undergraduate psychology courses, as my previous schooling hadn't prepared me for a textbook that would cover three mutually exclusive theories, present compelling evidence in favor of each, and then move on to three mutually exclusive theories about some other phenomenon on the very next page.)

Why Subagents?

The "many decisions can be thought of as a committee requiring unanimous agreement" model felt intuitively right to me, and afterwards I've observed myself behaving in ways which seem compatible with it, and thought of this post.
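The committee model can be made concrete with a small sketch. Everything below is illustrative (the subagent names and utilities are made up, not from the post): a trade between options is accepted only if every subagent weakly approves and at least one strictly approves, which yields incomplete preferences - the committee may endorse a move in neither direction.

```python
# Hypothetical sketch of the "committee requiring unanimous agreement" model.
# A trade from option_b to option_a is accepted only if every subagent weakly
# prefers option_a and at least one strictly prefers it.

def committee_prefers(utilities, option_a, option_b):
    """True iff every subagent weakly prefers a to b and one strictly does."""
    diffs = [u[option_a] - u[option_b] for u in utilities]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)

# Two subagents with opposed rankings over "apple" vs. "orange".
subagents = [
    {"apple": 1.0, "orange": 0.0},  # subagent 1 prefers apple
    {"apple": 0.0, "orange": 1.0},  # subagent 2 prefers orange
]

# The committee endorses neither trade: its preferences are incomplete,
# so it refuses to move in either direction.
print(committee_prefers(subagents, "apple", "orange"))  # False
print(committee_prefers(subagents, "orange", "apple"))  # False
```

The incompleteness is the interesting part: unlike a single expected-utility maximizer, the committee can decline both a trade and its reverse, which looks from the outside like status-quo bias.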

The ethics of AI for the Routledge Encyclopedia of Philosophy

You probably know of these already, but just in case: lukeprog wrote a couple of articles on the history of AI risk thought [1, 2] going back to 1863. There's also the recent AI ethics article in the Stanford Encyclopedia of Philosophy.

I'd also like to imagine that my paper on superintelligence and astronomical suffering might say something that someone might consider important, but that is of course a subjective question. :-)

AGI safety from first principles: Introduction

because who's talking about medium-size risks from AGI?

Well, I have talked about them... :-)

The capability claim is often formulated as the possibility of an AI achieving a decisive strategic advantage (DSA). While the notion of a DSA has been implicit in many previous works, the concept was first explicitly defined by Bostrom (2014, p. 78) as “a level of technological and other advantages sufficient to enable [an AI] to achieve complete world domination.”

However, assuming that an AI will achieve a DSA seems like an unnecessarily strong form of the capability claim, as an AI could cause a catastrophe regardless. For instance, consider a scenario where an AI launches an attack calculated to destroy human civilization. If the AI succeeded in destroying humanity or large parts of it, but were itself also destroyed in the process, this would not count as a DSA as originally defined. Yet it seems hard to deny that this outcome should nonetheless count as a catastrophe.

Because of this, this chapter focuses on situations where an AI achieves (at least) a major strategic advantage (MSA), which we will define as “a level of technological and other advantages sufficient to pose a catastrophic risk to human society.” A catastrophic risk is one that might inflict serious damage to human well-being on a global scale and cause 10 million or more fatalities (Bostrom & Ćirković 2008).

What Decision Theory is Implied By Predictive Processing?

I read you to be asking "what decision theory is implied by predictive processing as it's implemented in human brains". It's my understanding that while there are attempts to derive something like a "decision theory formulated entirely in PP terms", there are also serious arguments for the brain actually having systems that are just conventional decision theories and not directly derivable from PP.

Let's say you try, as some PP theorists do, to explain all behavior as free energy minimization rather than expected utility maximization. Ransom et al. (2020) note that this makes it hard to explain cases where the mind acts according to a prediction that has a low probability of being true but a high cost if it were true.

For example, the sound of rustling grass might be indicative either of the wind or of a lion; if wind is more likely, then predictive processing says that wind should become the predominant prediction. But for your own safety it can be better to predict that it's a lion, just in case. "Predict a lion" is also what standard Bayesian decision theory would recommend, and it seems like the correct solution... but to get that correct solution, you need to import Bayesian decision theory as an extra ingredient; it doesn't fall naturally out of the predictive processing framework.

That sounds to me like PP, or at least PP as it exists, is something that's compatible with implementing different decision theories, rather than one that implies a specific decision theory by itself.

Mesa-Search vs Mesa-Control

It sounds a bit absurd: you've already implemented a sophisticated RL algorithm, which keeps track of value estimates for states and actions, and propagates these value estimates to steer actions toward future value. Why would the learning process re-implement a scheme like that, nested inside of the one you implemented? Why wouldn't it just focus on filling in the values accurately?

I've thought of two possible reasons so far.

  1. Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.
  2. More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.
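The second point can be illustrated with a deliberately simple sketch. Nothing here models any specific RL algorithm: both loops just do exponential-average updates toward a target, but the "inner" learner adjusts its own step size (a crude stand-in for a learned inner algorithm), while the "outer" one is stuck with a fixed, too-slow rate.

```python
# Toy illustration: a fixed, too-slow learning rate vs. an adaptive one.
# The adaptation rule is invented for illustration only.

def track(target, lr, steps=50):
    """Run simple exponential-average updates toward a fixed target."""
    estimate = 0.0
    for _ in range(steps):
        estimate += lr * (target - estimate)
    return estimate

def track_adaptive(target, steps=50):
    """Same update, but grow the step size while the error stays large
    and shrink it once the error is small."""
    estimate, lr = 0.0, 0.01
    for _ in range(steps):
        error = target - estimate
        lr = min(0.5, lr * 1.5) if abs(error) > 0.1 else max(0.01, lr * 0.5)
        estimate += lr * error
    return estimate

# In the same 50 steps, the fixed slow learner is still far from the target,
# while the adaptive learner gets close.
print(track(10.0, lr=0.01))   # roughly 4, far from 10
print(track_adaptive(10.0))   # close to 10
```

Under this (admittedly cartoonish) framing, an inner process that tunes its own update sizes can outperform the fixed outer update rule even though both are "trying" to learn the same quantity.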

Possibly obvious, but just to point it out: both of these seem like they also describe the case of genetic evolution vs. brains.

Matt Botvinick on the spontaneous emergence of learning algorithms

Good point, I wasn't thinking of social effects changing the incentive landscape.

Matt Botvinick on the spontaneous emergence of learning algorithms
Or e.g. that it always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don't think you see these things, and I'm interested in figuring out how evolution prevented them.

As I understand it, Wang et al. found that their experimental setup trained an internal RL algorithm that was more specialized for the particular task, but was still optimizing for the same task that the RNN was being trained on - and it was selected exactly because it pursued that very goal better. If the circumstances changed so that the more specialized behavior was no longer appropriate, then (assuming the RNN's weights hadn't been frozen) the feedback to the outer network would gradually end up reconfiguring the internal algorithm as well. So I'm not sure how it could even end up with something that's "unrecognizably different" from the base objective - even after a distributional shift, the learned objective would probably still be recognizable as a special case of the base objective, until it updated to match the new situation.

The thing that I would expect to see from this description, is that humans who were e.g. practicing a particular skill might end up becoming overspecialized to the circumstances around that skill, and need to occasionally relearn things to fit a new environment. And that certainly does seem to happen. Likewise for more general/abstract skills, like "knowing how to navigate your culture/technological environment", where older people's strategies are often more adapted to how society used to be rather than how it is now - but still aren't incapable of updating.

Catastrophic misalignment seems more likely to happen in the case of something like evolution, where the two learning algorithms operate on vastly different timescales, and it takes a very long time for evolution to correct after a drastic distributional shift. But the examples in Wang et al. lead me to think that in the brain, even the slower process operates on a timescale that's on the order of days rather than years, allowing for reasonably rapid adjustments in response to distributional shifts. (Though it's plausible that the more structure there is in a need of readjustment, the slower the reconfiguration process will be - which would fit the behavioral calcification that we see in e.g. some older people.)
