johnswentworth

I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?

This is getting very meta, but I think my Real Answer is that there's an analogue of You Are Not Measuring What You Think You Are Measuring for plans. Like, the system just does not work any of the ways we're picturing it at all, so plans will just generally not at all do what we imagine they're going to do.

(Of course the plan could still in-principle have a high chance of "working", depending on the problem, insofar as the goal turns out to be easy to achieve, i.e. most plans work by default. But even in that case, the planner doesn't have counterfactual impact; just picking some random plan would have been about as likely to work.)

The general solution which You Are Not Measuring What You Think You Are Measuring suggested was "measure tons of stuff", so that hopefully you can figure out what you're actually measuring. The analogy of that technique for plans would be: plan for tons of different scenarios, failure modes, and/or goals. Find plans (or subplans) which generalize to tons of different cases, and there might be some hope that it generalizes to the real world. The plan can maybe be robust enough to work even though the system does not work at all the ways we imagine.

But if the plan doesn't even generalize to all the low-but-not-astronomically-low-probability possibilities we've thought of, then, man, it sure does seem a lot less likely to generalize to the real system. Like, that pretty strongly suggests that the plan will work only insofar as the system operates basically the way we imagined.

And for this exact failure mode, I think that improvements upon various relatively straightforward capability evals are likely to be quite compelling as the most leveraged current interventions, but I'm not confident.

Personally, my take on basically-all capabilities evals which at all resemble the evals developed to date is You Are Not Measuring What You Think You Are Measuring; I expect them to mostly just not measure whatever turns out to matter in practice.

Yes, there is a story for a canonical factorization of $\Lambda$; it's just separate from the story in this post.

Sounds like we need to unpack what "viewing $X^0$ as a latent which generates $X^T$" is supposed to mean.

I start with a distribution $P[X]$. Let's say $X$ is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what $X$ is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as $P[X] = \sum_\Lambda P[\Lambda] \prod_i P[X_i|\Lambda]$, where $\Lambda$ is the bias in this case. Then when reasoning/updating, we can usually just think about how an individual die-roll interacts with $\Lambda$, rather than all the other rolls, which is useful insofar as $\Lambda$ is much smaller than all the rolls.

Note that $P[X|\Lambda]$ is not supposed to match $P[X]$; then the representation would be useless. It's the marginal $\sum_\Lambda P[\Lambda] P[X|\Lambda]$ which is supposed to match $P[X]$.
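As a concrete sketch of that factorization (with made-up numbers: three flips of a coin whose unknown bias $\Lambda$ is 0.2 or 0.8 with equal probability), the marginal recovers $P[X]$ as a mixture, the flips are coupled unconditionally, and they are independent given $\Lambda$:

```python
import itertools

# Hypothetical numbers for illustration: the latent Lambda (the coin's bias)
# is 0.2 or 0.8 with equal probability, and X is three flips of that coin.
p_lambda = {0.2: 0.5, 0.8: 0.5}
n_flips = 3

def p_x_given_lambda(x, lam):
    """prod_i P[X_i | Lambda = lam] -- the flips are independent given the bias."""
    p = 1.0
    for xi in x:
        p *= lam if xi == 1 else (1 - lam)
    return p

# Marginal P[X], recovered as the mixture  sum_Lambda P[Lambda] prod_i P[X_i | Lambda].
p_x = {x: sum(pl * p_x_given_lambda(x, lam) for lam, pl in p_lambda.items())
       for x in itertools.product((0, 1), repeat=n_flips)}
assert abs(sum(p_x.values()) - 1.0) < 1e-12

# Unconditionally, the flips are coupled through the shared bias...
p_x1 = sum(p for x, p in p_x.items() if x[0] == 1)
p_x2 = sum(p for x, p in p_x.items() if x[1] == 1)
p_x2_given_x1 = sum(p for x, p in p_x.items() if x[0] == 1 and x[1] == 1) / p_x1
print(f"P[X_2=1] = {p_x2:.2f}  vs  P[X_2=1 | X_1=1] = {p_x2_given_x1:.2f}")
# ...so updating on one roll tells you about the others, but only via the
# (much smaller) latent: conditional on Lambda, the rolls are independent.
```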

The lightcone theorem lets us do something similar. Rather than all the $X_i$'s being independent given $\Lambda$, only those $X_i$'s sufficiently far apart are independent, but the concept is otherwise similar. We express $P[X]$ as $\sum_{X^0} P[X^0] P[X^T = X|X^0]$ (or, really, $\sum_\Lambda P[\Lambda] P[X^T = X|\Lambda]$, where $\Lambda$ summarizes info in $X^0$ relevant to $X^T$, which is hopefully much smaller than all of $X^0$).

$\Lambda$ is conceptually just the whole bag of abstractions (at a certain scale), unfactored.

If you have sets of variables that start with no mutual information (conditioning on $X^0$), and they are so far away that nothing other than $X^0$ could have affected both of them (distance of at least $2T$), then they continue to have no mutual information (independent).

Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance $\geq 2T$ implies that nothing other than $X^0$ could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.
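As a sanity check of the "nothing other than $X^0$ could have affected both of them" argument, here is a small simulation with a toy synchronous noisy local update on a short chain (not the post's actual resampler, so it only illustrates the lightcone/independence step, not the stationarity part; the chain length, update rule, and noise level are arbitrary):

```python
# Toy synchronous local-update process (NOT the post's resampler): each cell of a
# length-5 binary chain becomes a noisy function of its immediate neighborhood,
# with independent noise per cell per step. Conditional on the initial state X^0,
# cells at distance > 2T should have exactly zero mutual information after T steps,
# because no single noise variable lies in both of their past lightcones.
import itertools
from collections import defaultdict
from math import log2

N = 5          # number of cells in the chain
T = 1          # number of synchronous update steps
P_FLIP = 0.3   # per-cell noise: probability the updated cell gets flipped

def step_probability(x_prev, x_next):
    """P[next state | previous state]: each cell independently becomes the
    majority of itself and its neighbors, then flips with probability P_FLIP."""
    p = 1.0
    for i in range(N):
        neigh = [x_prev[j] for j in (i - 1, i, i + 1) if 0 <= j < N]
        maj = int(2 * sum(neigh) > len(neigh))
        p *= P_FLIP if x_next[i] != maj else (1 - P_FLIP)
    return p

states = list(itertools.product((0, 1), repeat=N))

for x0 in states:
    # Exact distribution over X^T, conditional on X^0 = x0.
    dist = {x0: 1.0}
    for _ in range(T):
        new_dist = defaultdict(float)
        for x_prev, p_prev in dist.items():
            for x_next in states:
                new_dist[x_next] += p_prev * step_probability(x_prev, x_next)
        dist = new_dist

    # Mutual information between the two end cells, conditional on X^0 = x0.
    # They are distance N-1 = 4 apart, which exceeds 2T = 2.
    joint, p_a, p_b = defaultdict(float), defaultdict(float), defaultdict(float)
    for x, p in dist.items():
        joint[(x[0], x[-1])] += p
    for (a, b), p in joint.items():
        p_a[a] += p
        p_b[b] += p
    mi = sum(p * log2(p / (p_a[a] * p_b[b])) for (a, b), p in joint.items() if p > 0)
    assert abs(mi) < 1e-9, (x0, mi)

print("Conditional on X^0, the far-apart cells carry zero mutual information.")
```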

I don't understand why the distribution of $X^0$ must be the same as the distribution of $X$. It seems like it should hold for arbitrary $P[X^0]$.

It does, but then $X^T$ doesn't have the same distribution as the original graphical model (unless we're running the sampler long enough to equilibrate). So we can't view $X^0$ as a latent generating that distribution.

But this theorem is only telling you that you can throw away information that could never possibly have been relevant.

Not quite - note that the resampler itself throws away a ton of information about $X^0$ while going from $X^0$ to $X^T$. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.

So the reason this is interesting (for the thing you're pointing to) is not that it lets us ignore information from far-away parts of $X^T$ which could not possibly have been relevant given $X^0$, but rather that we want to further throw away information from $X^0$ itself (while still maintaining conditional independence at a distance).

Ah, no, I suppose that part is supposed to be handled by whatever approximation process we define for $\Lambda$? That is, the "correct" definition of the "most minimal approximate summary" would implicitly constrain the possible choices of boundaries for which $\Lambda$ is equivalent to $X^0$?

Almost. The hope/expectation is that different choices yield approximately the same $\Lambda$, though still probably modulo some conditions (like e.g. sufficiently large $T$).

What's the $N$ here? Is it meant to be $|X|$?

System size, i.e. number of variables.

The new question is: what is the upper bound on bits of optimization gained from a bit of observation? What's the best-case asymptotic scaling? The counterexample suggests it's roughly exponential, i.e. one bit of observation can double the number of bits of optimization. On the other hand, it's not just multiplicative, because our xor example at the top of this post showed a jump from 0 bits of optimization to 1 bit from observing 1 bit.

Alright, I think we have an answer! The conjecture is false.

Counterexample: suppose I have a very-high-capacity information channel (N bit capacity), but it's guarded by a uniform random n-bit password. O is the password, A is an N-bit message and a guess at the n-bit password. Y is the N-bit message part of A if the password guess matches O; otherwise, Y is 0.

Let's say the password is 50 bits and the message is 1M bits. If A is independent of the password, then there's a $2^{-50}$ chance of guessing the password, so the bitrate will be about $2^{-50} \cdot 1M$ bits, or about one-billionth of a bit in expectation.

If A "knows" the password, then the capacity is about 1M bits. So, the delta from knowing the password is a lot more than 50 bits. It's a a multiplier of , rather than an addition of 50 bits.

This is really cool! It means that bits of observation can give a really ridiculously large boost to a system's optimization power. Making actions depend on observations is a potentially-very-big game, even with just a few bits of observation.

Credit to Yair Halberstadt in the comments for the attempted-counterexamples which provided stepping stones to this one.

Eliminating G

The standard definition of channel capacity makes no explicit reference to the original message $G$; it can be eliminated from the problem. We can do the same thing here, but it’s trickier. First, let’s walk through it for the standard channel capacity setup.

Standard Channel Capacity Setup

In the standard setup, $A$ cannot depend on $O$, so our graph looks like

$G \rightarrow A \rightarrow Y \leftarrow O$

… and we can further remove $O$ entirely by absorbing it into the stochasticity of $Y$.

Now, there are two key steps. First step: if $A$ is not a deterministic function of $G$, then we can make $A$ a deterministic function of $G$ without reducing $I(G;Y)$. Anywhere $A$ is stochastic, we just read the random bits from some independent part of $G$ instead; $A$ will have the same joint distribution with any parts of $G$ which $A$ was reading before, but $Y$ will also potentially get some information about the newly-read bits of $G$ as well.

Second step: note from the graphical structure that $A$ mediates between $G$ and $Y$. Since $A$ is a deterministic function of $G$ and $A$ mediates between $G$ and $Y$, we have $I(G;Y) = I(A;Y)$.
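Spelling that step out as two data processing inequalities (my reconstruction of the reasoning, not a quote from the original post):

```latex
% A mediates between G and Y, so G -> A -> Y is a Markov chain (data processing);
% A = f(G) is deterministic,  so Y -- G -- A is a Markov chain (data processing).
\begin{align*}
  I(G;Y) &\le I(A;Y) && \text{($A$ mediates between $G$ and $Y$)} \\
  I(A;Y) &\le I(G;Y) && \text{($A$ is a deterministic function of $G$)} \\
  \Rightarrow\quad I(G;Y) &= I(A;Y)
\end{align*}
```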

Furthermore, we can achieve any distribution $P[A]$ (to arbitrary precision) by choosing a suitable function $A = f(G)$.

So, for the standard channel capacity problem, we have $I(G;Y) = I(A;Y)$, and we can simplify the optimization problem:

$\max_f I(G;Y) = \max_{P[A]} I(A;Y)$

Note that this all applies directly to our conjecture, for the part where actions do not depend on observations.

That’s how we get the standard expression for channel capacity. It would be potentially helpful to do something similar in our problem, allowing for observation of $O$.

Our Problem

The step about determinism of $A$ carries over easily: if $A$ is not a deterministic function of $G$ and $O$, then we can change $A$ to read random bits from an independent part of $G$. That will make $A$ a deterministic function of $G$ and $O$ without reducing $I(G;Y)$.

The second step fails: $A$ does not mediate between $G$ and $Y$ (conditioning on $A$ correlates $G$ with $O$, and $O$ feeds directly into $Y$).

However, we can define a “Policy” variable

$\pi := (o \mapsto f(G, o))$

$\pi$ is also a deterministic function of $G$, and $\pi$ does mediate between $G$ and $Y$. And we can achieve any distribution over policies (to arbitrary precision) by choosing a suitable function $f$.

So, we can rewrite our problem as

$\max_{P[\pi]} I(\pi;Y)$

In the context of our toy example: $\pi$ has two possible values, $(o \mapsto o)$ and $(o \mapsto 1 \oplus o)$. If $\pi$ takes the first value, then $Y$ is deterministically 0; if $\pi$ takes the second value, then $Y$ is deterministically 1. So, taking the distribution $P[\pi]$ to be 50/50 over those two values, our generalized “channel capacity” is at least 1 bit. (Note that we haven’t shown that no $P[\pi]$ achieves higher value in the maximization problem, which is why I say “at least”.)
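For concreteness, here is a quick check of those toy-example numbers, under my reading of the xor example ($Y = A \oplus O$ with $O$ a fair coin; the policy names are mine):

```python
from math import log2

def mutual_information(joint):
    """Mutual information (in bits) between the two coordinates of {(u, v): prob}."""
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    return sum(p * log2(p / (pu[u] * pv[v])) for (u, v), p in joint.items() if p > 0)

# Assumed toy setup: O is a fair coin, the action A may depend on O, and Y = A xor O.
P_O = {0: 0.5, 1: 0.5}
policies = {"copy": lambda o: o, "negate": lambda o: 1 - o}

# With observation: 50/50 over the two policies. "copy" forces Y = 0, "negate" forces Y = 1.
joint_pi_y = {}
for name, p_pi in [("copy", 0.5), ("negate", 0.5)]:
    for o, p_o in P_O.items():
        y = policies[name](o) ^ o
        joint_pi_y[(name, y)] = joint_pi_y.get((name, y), 0.0) + p_pi * p_o
print("I(pi;Y) =", mutual_information(joint_pi_y))  # 1.0 bit

# Without observation: A is just a bit, and Y = A xor O is a fair coin whatever A is,
# so no distribution over A pushes any information through.
for p_a1 in (0.0, 0.5, 0.9):
    joint_a_y = {}
    for a, p_a in ((0, 1 - p_a1), (1, p_a1)):
        for o, p_o in P_O.items():
            joint_a_y[(a, a ^ o)] = joint_a_y.get((a, a ^ o), 0.0) + p_a * p_o
    print(f"I(A;Y) with P[A=1]={p_a1}:", mutual_information(joint_a_y))  # 0.0
```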

Back to the general case: our conjecture can be expressed as

$\max_{P[\pi]} I(\pi;Y) \le \max_{P[A]} I(A;Y) + H(O)$

where the first optimization problem uses the factorization

$P[Y, O, \pi] = P[\pi] \, P[O] \, P[Y | O, A = \pi(O)]$

and the second optimization problem uses the factorization

$P[Y, O, A] = P[A] \, P[O] \, P[Y | O, A]$