johnswentworth - AI Alignment Forum

Integrating Hidden Variables Improves Approximation

This is particularly interesting if we take $P$ and $Q$ to be two different models, and take the indices 1, 2 to be different values of another random variable $Y$ with distribution $P[Y]$ given by $(λ, 1 - λ)$. In that case, the above inequality becomes:

Note to self: this assumes P[Y] = Q[Y].
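A quick numeric sanity check may help here. The inequality itself did not survive in this excerpt, so the sketch below rests on an assumption: that it is the joint convexity of KL divergence, i.e. $D_{KL}(λP_1 + (1-λ)P_2 \,\|\, λQ_1 + (1-λ)Q_2) \le λ D_{KL}(P_1 \| Q_1) + (1-λ) D_{KL}(P_2 \| Q_2)$. The distributions and the helper names `kl` and `mix` are made up for illustration. Both models use the same weights $(λ, 1 - λ)$ on $Y$, which is where the $P[Y] = Q[Y]$ assumption enters.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) in nats, for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(lam, d1, d2):
    """Marginal over X after integrating out a binary Y with weights (lam, 1 - lam)."""
    return [lam * a + (1 - lam) * b for a, b in zip(d1, d2)]

# Hypothetical example: Y takes two values with weights (lam, 1 - lam),
# shared by both models (this is the P[Y] = Q[Y] assumption).
lam = 0.3
P1, P2 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]  # P[X | Y=1], P[X | Y=2]
Q1, Q2 = [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]  # Q[X | Y=1], Q[X | Y=2]

# Divergence after integrating out Y from both models.
marginal_kl = kl(mix(lam, P1, P2), mix(lam, Q1, Q2))

# Expected divergence with Y kept visible.
joint_kl = lam * kl(P1, Q1) + (1 - lam) * kl(P2, Q2)

print(marginal_kl <= joint_kl)  # True, by joint convexity of KL divergence
```

Under that reading, integrating out the hidden variable can only shrink the divergence between the two models, which matches the post title.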

Scalable oversight as a quantitative rather than qualitative problem

johnswentworth21d82

I wasn't imagining that the human knew the best answer to any given subproblem, but nonetheless that did flesh out a lot more of what it means (under your mental model) for a human to "understand a subproblem", so that was useful.

I'll try again:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs ~~are taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ how well the AI's plan/action solves any particular subproblem if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

(... and presumably an unstated piece here is that "understanding how well the AI's plan/action solves a particular subproblem" might include recursive steps like "here's a sub-sub-problem, assume the AI's actions do a decent job solving that one", where the human might not actually check the sub-sub-problem.)

Does that accurately express the intended message?

Scalable oversight as a quantitative rather than qualitative problem

johnswentworth21d72

Based on this example and your other comment, it sounds like the intended claim of the post could be expressed as:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs ~~are taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ any particular subproblem and its solution if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

Does that accurately express the intended message?

Scalable oversight as a quantitative rather than qualitative problem

johnswentworth21d86

... situations where the AIs are taking many actions, where humans would eventually understand any particular action if they spent a whole lot of time investigating it...

Can you give an example (toy example is fine) of:

- an action one might want to understand
- what plan/strategy/other context that action is a part of
- what it would look like for a human to understand the action

Mostly I'm confused what it would even mean to understand an action. Like, if I imagine a maze-solving AI, and I see it turn left at a particular spot (or plan to turn left), I'm not sure what it would even look like to "understand" that left-turn separate from understanding its whole maze-plan.

Towards a Less Bullshit Model of Semantics

johnswentworth25d32

One example: you know that thing where I point at a cow and say "cow", and then the toddler next to me points at another cow and is like "cow?", and I nod and smile? That's the thing we want to understand. How the heck does the toddler manage to correctly point at a second cow, on their first try, with only one example of me saying "cow"? (Note that same question still applies if they take a few tries, or have heard me use the word a few times.)

The post basically says that the toddler does a bunch of unsupervised structure learning, and then has a relatively small set of candidate targets, so when they hear the word once they can assign the word to the appropriate structure. And then we're interested in questions like "what are those structures?", and interoperability helps narrow down the possibilities for what those structures could be.

... and I don't think I've yet fully articulated the general version of the problem here, but the cow example is at least one case where "just take the magic box to be the identity function" fails to answer our question.

Towards a Less Bullshit Model of Semantics

johnswentworth25d20

But it seems pretty plausible that a major reason why humans arrive at these 'objective' 3rd-person world-models is because humans have a strong incentive to think about the world in ways that make communication possible.

This is an interesting point which I had not thought about before, thank you. Insofar as I have a response already, it's basically the same as this thread: it seems like understanding of interoperable concepts falls upstream of understanding non-interoperable concepts on the tech tree, and also there's nontrivial probability that non-interoperable concepts just aren't used much even by Solomonoff inductors (in a realistic environment).

Towards a Less Bullshit Model of Semantics

johnswentworth25d20

I definitely have substantial probability on the possibility that AIs will use a bunch of alien (i.e. non-interoperable or hard-to-interoperate) concepts. And in worlds where that's true, I largely agree that those are the most important (i.e. hardest/rate-limiting) part of the technical problems of AI safety.

That said:

- I have substantial probability that AIs basically don't use a bunch of non-interoperable concepts (or converge to more interoperable concepts as capabilities grow, or ...). In those worlds, I expect that "how to understand human concepts" is the rate-limiting part of the problem.
- Even in worlds where AIs do use lots of alien concepts, it feels like understanding human concepts is "earlier on the tech tree" than figuring out what to do with those alien concepts. Like, it is a hell of a lot easier to understand those alien concepts by first understanding human concepts and then building on that understanding, than by trying to jump straight to alien concepts.

Towards a Less Bullshit Model of Semantics

johnswentworth25d20

That particular paragraph was intended to be about two humans. The application to AI safety is less direct than "take Alice to be a human, and Bob to be an AI" or something like that.

Towards a Less Bullshit Model of Semantics

johnswentworth25d20

First: if the random variables include latents which extend some distribution, then values of those latents are not necessarily representable as events over the underlying distribution. Events are less general. (Related: updates allowed under radical probabilism can be represented by assignments of values to latents.)

Second: I want formulations which feel like they track what's actually going on in my head (or other people's heads) relatively well. Insofar as a Bayesian model makes sense for the stuff going on in my head at all, it feels like there's a whole structure of latent variables, and semantics involves assignments of values to those variables. Events don't seem to match my mental structure as well. (See How We Picture Bayesian Agents for the picture in my head here.)

Towards a Less Bullshit Model of Semantics

johnswentworth25d34

So really, rather than "the set of semantic targets is small", I should say something like "the set of semantic targets with significant prior probability is small", or something like that. Unclear exactly what the right operationalization is there, but I think I buy the basic point.
