Kaj Sotala


Why Subagents?

The "many decisions can be thought of as a committee requiring unanimous agreement" model felt intuitively right to me, and since then I've observed myself behaving in ways that seem compatible with it, which is what prompted this post.

The ethics of AI for the Routledge Encyclopedia of Philosophy

You probably know of these already, but just in case: lukeprog wrote a couple of articles on the history of AI risk thought [1, 2] going back to 1863. There's also the recent AI ethics article in the Stanford Encyclopedia of Philosophy.

I'd also like to imagine that my paper on superintelligence and astronomical suffering says something that someone might consider important, but that is of course a subjective question. :-)

AGI safety from first principles: Introduction

because who's talking about medium-size risks from AGI?

Well, I have talked about them... :-)

The capability claim is often formulated as the possibility of an AI achieving a decisive strategic advantage (DSA). While the notion of a DSA has been implicit in many previous works, the concept was first explicitly defined by Bostrom (2014, p. 78) as “a level of technological and other advantages sufficient to enable [an AI] to achieve complete world domination.”

However, assuming that an AI will achieve a DSA seems like an unnecessarily strong form of the capability claim, as an AI could cause a catastrophe regardless. For instance, consider a scenario where an AI launches an attack calculated to destroy human civilization. If the AI was successful in destroying humanity or large parts of it, but the AI itself was also destroyed in the process, this would not count as a DSA as originally defined. Yet, it seems hard to deny that this outcome should nonetheless count as a catastrophe.

Because of this, this chapter focuses on situations where an AI achieves (at least) a major strategic advantage (MSA), which we will define as “a level of technological and other advantages sufficient to pose a catastrophic risk to human society.” A catastrophic risk is one that might inflict serious damage to human well-being on a global scale and cause 10 million or more fatalities (Bostrom & Ćirković 2008).

What Decision Theory is Implied By Predictive Processing?

I read you as asking "what decision theory is implied by predictive processing as it's implemented in human brains". My understanding is that while there are attempts to derive something like a "decision theory formulated entirely in PP terms", there are also serious arguments that the brain actually has systems which just implement conventional decision theories and are not directly derivable from PP.

Let's say you try, as some PP theorists do, to explain all behavior as free energy minimization as opposed to expected utility maximization. Ransom et al. (2020) note that this makes it hard to explain cases where the mind acts according to a prediction that has a low probability of being true, but a high cost if true.

For example, the sound of rustling grass might be indicative either of the wind or of a lion; if wind is more likely, then predictive processing says that wind should become the predominant prediction. But for your own safety it can be better to predict that it's a lion, just in case. "Predict a lion" is also what standard Bayesian decision theory would recommend, and it seems like the correct solution... but to get that solution, you need to import Bayesian decision theory as an extra ingredient; it doesn't fall naturally out of the predictive processing framework.
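To make the lion/wind example concrete, here is a minimal sketch of the Bayesian decision theory calculation. All numbers (the probabilities and the loss table) are hypothetical, chosen only to illustrate why minimizing expected loss can recommend acting on the less probable hypothesis:

```python
# Toy illustration (hypothetical numbers): pure probability-matching
# picks "wind", but minimizing expected loss picks "act as if lion".

p = {"wind": 0.95, "lion": 0.05}  # posterior over causes of the rustling

# loss[action][true_cause]: cost of taking each action given the true cause
loss = {
    "act_as_if_wind": {"wind": 0, "lion": 1000},  # eaten if it was a lion
    "act_as_if_lion": {"wind": 1, "lion": 10},    # mild cost of fleeing needlessly
}

expected_loss = {
    action: sum(p[cause] * cost for cause, cost in costs.items())
    for action, costs in loss.items()
}

most_probable = max(p, key=p.get)
best_action = min(expected_loss, key=expected_loss.get)
print(most_probable)   # "wind" - the predominant prediction under pure PP
print(best_action)     # "act_as_if_lion" - the decision-theoretic choice
```

The point is that the recommendation comes from combining probabilities with a loss function, which is exactly the extra ingredient that doesn't fall out of prediction alone.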

That sounds to me like PP, at least as it currently exists, is compatible with implementing various decision theories, rather than something that implies a specific decision theory by itself.

Mesa-Search vs Mesa-Control

It sounds a bit absurd: you've already implemented a sophisticated RL algorithm, which keeps track of value estimates for states and actions, and propagates these value estimates to steer actions toward future value. Why would the learning process re-implement a scheme like that, nested inside of the one you implemented? Why wouldn't it just focus on filling in the values accurately?

I've thought of two possible reasons so far.

  1. Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.
  2. Perhaps more likely (based on my understanding), the outer RL algorithm's learning rate might be too slow, or insufficiently adaptive to the situation. The inner RL algorithm can adjust its effective learning rate to improve performance.

Possibly obvious, but just to point it out: both of these seem like they also describe the case of genetic evolution vs. brains.

Matt Botvinick on the spontaneous emergence of learning algorithms

Good point, I wasn't thinking of social effects changing the incentive landscape.

Matt Botvinick on the spontaneous emergence of learning algorithms
Or e.g. that it always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don't think you see these things, and I'm interested in figuring out how evolution prevented them.

As I understand it, Wang et al. found that their experimental setup trained an internal RL algorithm that was more specialized for the particular task, but was still optimizing for the same task that the RNN was being trained on - and it was selected precisely because it pursued that very goal better. If the circumstances changed so that the more specialized behavior was no longer appropriate, then (assuming the RNN's weights hadn't been frozen) the feedback to the outer network would gradually reconfigure the internal algorithm as well. So I'm not sure how it could even end up with something that's "unrecognizably different" from the base objective - even after a distributional shift, the learned objective would probably still be recognizable as a special case of the base objective, until it updated to match the new situation.

What I would expect to see from this description is that humans who are e.g. practicing a particular skill might become overspecialized to the circumstances surrounding that skill, and occasionally need to relearn things to fit a new environment. And that certainly does seem to happen. Likewise for more general/abstract skills, like "knowing how to navigate your culture/technological environment", where older people's strategies are often better adapted to how society used to be than to how it is now - but they're still not incapable of updating.

Catastrophic misalignment seems more likely to happen in the case of something like evolution, where the two learning algorithms operate on vastly different timescales, and it takes a very long time for evolution to correct after a drastic distributional shift. But the examples in Wang et al. lead me to think that in the brain, even the slower process operates on a timescale that's on the order of days rather than years, allowing for reasonably rapid adjustments in response to distributional shifts. (Though it's plausible that the more structure there is in a need of readjustment, the slower the reconfiguration process will be - which would fit the behavioral calcification that we see in e.g. some older people.)

Matt Botvinick on the spontaneous emergence of learning algorithms
That said, if mesa-optimization is a standard feature[4] of brain architecture, it seems notable that humans don’t regularly experience catastrophic inner alignment failures.

What would it look like if they did?

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is
I don't feel at all tempted to do that anthropomorphization, and I think it's weird that EY is acting as if this is a reasonable thing to do.

"It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart" seems obviously incorrect if it's explicitly phrased that way, but e.g. the "Giving GPT-3 a Turing Test" post seems to implicitly assume something like it:

This gives us a hint for how to stump the AI more consistently. We need to ask questions that no normal human would ever talk about.

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many eyes does my foot have?
A: Your foot has two eyes.

Q: How many eyes does a spider have?
A: A spider has eight eyes.

Q: How many eyes does the sun have?
A: The sun has one eye.

Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.
Now we’re getting into surreal territory. GPT-3 knows how to have a normal conversation. It doesn’t quite know how to say “Wait a moment… your question is nonsense.” It also doesn’t know how to say “I don’t know.”

The author says that this "stumps" GPT-3, which "doesn't know how to" say that it doesn't know. That's as if GPT-3 were doing its best to give "smart" answers and was simply incapable of doing so. But Nick Cammarata showed that if you give GPT-3 a prompt where nonsense questions are called out as such, it will do exactly that.

Building brain-inspired AGI is infinitely easier than understanding the brain
If some circuit in the brain is doing something useful, then it's humanly feasible to understand what that thing is and why it's useful, and to write our own CPU code that does the same useful thing.
In other words, the brain's implementation of that thing can be super-complicated, but the input-output relation cannot be that complicated—at least, the useful part of the input-output relation cannot be that complicated.

Robin Hanson makes a similar argument in "Signal Processors Decouple":

The bottom line is that to emulate a biological signal processor, one need only identify its key internal signal dimensions and their internal mappings – how input signals are mapped to output signals for each part of the system. These key dimensions are typically a tiny fraction of its physical degrees of freedom. Reproducing such dimensions and mappings with sufficient accuracy will reproduce the function of the system.
This is proven daily by the 200,000 people with artificial ears, and will be proven soon when artificial eyes are fielded. Artificial ears and eyes do not require a detailed weather-forecasting-like simulation of the vast complex physical systems that are our ears and eyes. Yes, such artificial organs do not exactly reproduce the input-output relations of their biological counterparts. I expect someone with one artificial ear and one real ear could tell the difference. But the reproduction is close enough to allow the artificial versions to perform most of the same practical functions.
We are confident that the number of relevant signal dimensions in a human brain is vastly smaller than its physical degrees of freedom. But we do not know just how many are those dimensions. The more dimensions, the harder it will be to emulate them. But the fact that human brains continue to function with nearly the same effectiveness when they are whacked on the side of the head, or when flooded with various odd chemicals, shows they have been designed to decouple from most other physical brain dimensions.
The brain still functions reasonably well even flooded with chemicals specifically designed to interfere with neurotransmitters, the key chemicals by which neurons send signals to each other! Yes people on “drugs” don’t function exactly the same, but with moderate drug levels people can still perform most of the functions required for most jobs.
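Hanson's core claim, that emulating a signal processor only requires reproducing its key signal dimensions and their mappings rather than all of its physical degrees of freedom, can be illustrated with a deliberately artificial sketch. The "biological" system below is entirely made up: its output nominally depends on a thousand state variables, but the useful input-output relation lives in just two of them:

```python
# Toy sketch (entirely hypothetical system): a processor with many
# physical degrees of freedom whose useful input-output relation
# depends on only a few "key signal dimensions" can be emulated by a
# far simpler model that reproduces just those dimensions.

import random

random.seed(1)
N_PHYSICAL = 1000  # nominal physical degrees of freedom

def biological_processor(state):
    """Output actually depends on 2 of the 1000 dimensions, plus a
    tiny irrelevant coupling to everything else."""
    signal = 3.0 * state[7] - 2.0 * state[42]  # the key signal dimensions
    jitter = 1e-6 * sum(state)                 # negligible physical coupling
    return signal + jitter

def emulator(state):
    """Reproduce only the identified key dimensions and their mapping."""
    return 3.0 * state[7] - 2.0 * state[42]

# The emulator matches the full system closely despite ignoring
# 998 of the 1000 physical dimensions.
worst = 0.0
for _ in range(100):
    s = [random.uniform(-1, 1) for _ in range(N_PHYSICAL)]
    worst = max(worst, abs(biological_processor(s) - emulator(s)))
print(worst)  # bounded by the tiny irrelevant coupling
```

Like Hanson's artificial-ear example, the emulation isn't exact - the residual error comes from the dimensions the emulator ignores - but it is close enough to perform the same function.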