johnswentworth

# Sequences

From Atoms To Agents
"Why Not Just..."
Basic Foundations for Agent Models

# Wiki Contributions

You can show that, in order for an agent to persist, it needs to have the capacity to observe and learn about its environment. The math is a more complex than I want to get into here...

Do you have a citation for this? I went looking for the supposed math behind that claim a couple years back, and found one section of one Friston paper which had an example system which did not obviously generalize particularly well, and also used a kinda-hand-wavy notion of "Markov blanket" that didn't make it clear what precisely was being conditioned on (a critique which I would extend to all of the examples you list). And that was it; hundreds of excited citations chained back to that one spot. If anybody's written an actual explanation and/or proof somewhere, that would be great.

This is particularly interesting if we take  and  to be two different models, and take the indices 1, 2 to be different values of another random variable  with distribution  given by . In that case, the above inequality becomes:

Note to self: this assumes P[Y] = Q[Y].

I wasn't imagining that the human knew the best answer to any given subproblem, but nonetheless that did flesh out a lot more of what it means (under your mental model) for a human to "understand a subproblem", so that was useful.

I'll try again:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are taking many actions solving many subproblems, where humans would eventually understand any particular action how well the AI's plan/action solves any particular subproblem if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any action subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee actions subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

(... and presumably an unstated piece here is that "understanding how well the AI's plan/action solves a particular subproblem" might include recursive steps like "here's a sub-sub-problem, assume the AI's actions do a decent job solving that one", where the human might not actually check the sub-sub-problem.)

Does that accurately express the intended message?

Based on this example and your other comment, it sounds like the intended claim of the post could be expressed as:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are taking many actions solving many subproblems, where humans would eventually understand any particular action any particular subproblem and its solution if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any action subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee actions subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

Does that accurately express the intended message?

... situations where the AIs are taking many actions, where humans would eventually understand any particular action if they spent a whole lot of time investigating it...

Can you give an example (toy example is fine) of:

• an action one might want to understand
• what plan/strategy/other context that action is a part of
• what it would look like for a human to understand the action

?

Mostly I'm confused what it would even mean to understand an action. Like, if I imagine a maze-solving AI, and I see it turn left at a particular spot (or plan to turn left), I'm not sure what it would even look like to "understand" that left-turn separate from understanding its whole maze-plan.

One example: you know that thing where I point at a cow and say "cow", and then the toddler next to me points at another cow and is like "cow?", and I nod and smile? That's the thing we want to understand. How the heck does the toddler manage to correctly point at a second cow, on their first try, with only one example of me saying "cow"? (Note that same question still applies if they take a few tries, or have heard me use the word a few times.)

The post basically says that the toddler does a bunch of unsupervised structure learning, and then has a relatively small set of candidate targets, so when they hear the word once they can assign the word to the appropriate structure. And then we're interested in questions like "what are those structures?", and interoperability helps narrow down the possibilities for what those structures could be.

... and I don't think I've yet fully articulated the general version of the problem here, but the cow example is at least one case where "just take the magic box to be the identity function" fails to answer our question.

But it seems pretty plausible that a major reason why humans arrive at these 'objective' 3rd-person world-models is because humans have a strong incentive to think about the world in ways that make communication possible.

This is an interesting point which I had not thought about before, thank you. Insofar as I have a response already, it's basically the same as this thread: it seems like understanding of interoperable concepts falls upstream of understanding non-interoperable concepts on the tech tree, and also there's nontrivial probability that non-interoperable concepts just aren't used much even by Solomonoff inductors (in a realistic environment).

I definitely have substantial probability on the possibility that AIs will use a bunch of alien (i.e. non-interoperable or hard-to-interoperate) concepts. And in worlds where that's true, I largely agree that those are the most important (i.e. hardest/rate-limiting) part of the technical problems of AI safety.

That said:

• I have substantial probability that AIs basically don't use a bunch of non-interoperable concepts (or converge to more interoperable concepts as capabilities grow, or ...). In those worlds, I expect that "how to understand human concepts" is the rate-limiting part of the problem.
• Even in worlds where AIs do use lots of alien concepts, it feels like understanding human concepts is "earlier on the tech tree" than figuring out what to do with those alien concepts. Like, it is a hell of a lot easier to understand those alien concepts by first understanding human concepts and then building on that understanding, than by trying to jump straight to alien concepts.

That particular paragraph was intended to be about two humans. The application to AI safety is less direct than "take Alice to be a human, and Bob to be an AI" or something like that.

First: if the random variables include latents which extend some distribution, then values of those latents are not necessarily representable as events over the underlying distribution. Events are less general. (Related: updates allowed under radical probabilism can be represented by assignments of values to latents.)

Second: I want formulations which feel like they track what's actually going on in my head (or other peoples' heads) relatively well. Insofar as a Bayesian model makes sense for the stuff going on in my head at all, it feels like there's a whole structure of latent variables, and semantics involves assignments of values to those variables. Events don't seem to match my mental structure as well. (See How We Picture Bayesian Agents for the picture in my head here.)