(To restate the obvious, all of the stuff here is extremely WIP and rambling.)
I've often talked about the case where an unaligned model learns a description of the world + the procedure for reading out "what the camera sees" from the world. In this case, I've imagined an aligned model starting from the unaligned model and then extracting additional structure.
It now seems to me that the ideal aligned behavior is to learn only the "description of the world" and then have imitative generalization take it from there, identifying the correspondence between the world we know and the learned model. That correspondence includes in particular "what the camera sees."
The major technical benefit of doing it this way is that we end up with a higher prior probability on the aligned model than the unaligned model---the aligned one doesn't have to specify how to read out observations. And specifying how to read out observations doesn't really make it easier to find that correspondence.
We still need to specify how the "human" in imitative generalization actually finds this correspondence. So this doesn't fundamentally change any of the stuff I've recently been thinking about, but I think that the framing is becoming clearer and it's more likely we can find our way to the actually-right way to do it.
It now seems to me that a core feature of the situation that lets us pull out a correspondence is that you can't generally have two equally-valid correspondences for a given model---the standards for being a "good correspondence" are such that it would require crazy logical coincidence, and in fact this seems to be the core feature of "goodness." For example, you could have multiple "correspondences" that effectively just recompute everything from scratch, but by exactly the same token those are bad correspondences.
(This obviously only happens once the space and causal structure are sufficiently rich. There may be multiple ways of seeing faces in clouds, but once your correspondence involves people and dogs and the people talking about how the dogs are running around, it seems much more constrained because you need to reproduce all of that causal structure, and the very fact that humans can make good judgments about whether there are dogs implies that everything is incredibly constrained.)
There can certainly be legitimate ambiguity or uncertainty. For example, there may be a big world with multiple places where you could find a given pattern of dogs barking at cats. Or there might be parts of the world model that are just clearly underdetermined (e.g. there are two identical twins and we actually can't tell which is which). In these cases the space of possible correspondences still seems effectively discrete, rather than being a massive space parameterized by neural networks or something. We'd be totally happy surfacing all of the options in these cases.
There can also be a bunch of inconsequential uncertainty, things that feel more like small deformations of the correspondence than moving to a new connected component in correspondence-space. Things like slightly adjusting the boundaries of objects or of categories.
I'm currently thinking about this in terms of: given two different correspondences, why is it that they manage to both fit the data? Options:
I don't know where all of this ends up, but I feel some pretty strong common-sense intuition like "If you had some humans looking at the model, they could recognize a good correspondence when they saw it" and for now I'm going to be following that to see where it goes.
I tentatively think the whole situation is basically the same for "intuition module outputs a set of premises and then a deduction engine takes it from there" as for a model of physics. That is, it's still the case that (assuming enough richness) the translation between the language of the "intuition module" and human language is going to be more or less pinned down uniquely, and we'll have the same kind of taxonomy over cases where two translations would work equally well.
Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds that were consistent with it.
Abstracting out Answer, let's just imagine that our AI outputs a distribution p over the space of trajectories S in the human ontology, and somehow we define a reward function r(p,ω) evaluated by the human in hindsight after getting the observation ω. The idea is that this is calculated by having the AI answer some questions about what it believes etc., but we'll abstract all of that out.
Then the conclusion in this post holds under some convexity assumption on r, since then spreading out your mass can't really hurt you (since the human has no way to prefer your pointy estimate). But e.g. if you just penalized p for being uncertain, then IntendedAnswer could easily outperform HumanAnswer. Similarly, if we require that p satisfy various conditional independence properties then we may rule out HumanAnswer.
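A toy sketch of that contrast, under assumptions that are entirely mine: two trajectories `s1` and `s2` that produce the same observation, a hypothetical linear reward that pays for mass on trajectories the human can't rule out in hindsight, and a hypothetical entropy penalty with weight `lam`. The linear reward is just one simple case where spreading out mass costs nothing; it is not meant as the precise convexity condition.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a distribution {trajectory: probability}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def linear_reward(p, consistent):
    """Hypothetical r(p, omega): mass on trajectories consistent with omega."""
    return sum(q for s, q in p.items() if s in consistent)

def penalized_reward(p, consistent, lam=0.5):
    """Same reward, minus a hypothetical penalty for being uncertain."""
    return linear_reward(p, consistent) - lam * entropy(p)

# Both trajectories look identical to the human in hindsight (assumption).
consistent = {"s1", "s2"}

# IntendedAnswer: the model knows the true trajectory is s1.
intended_answer = {"s1": 1.0, "s2": 0.0}
# HumanAnswer: everything consistent with the observation, spread evenly.
human_answer = {"s1": 0.5, "s2": 0.5}

linear_scores = (linear_reward(intended_answer, consistent),
                 linear_reward(human_answer, consistent))
penalized_scores = (penalized_reward(intended_answer, consistent),
                    penalized_reward(human_answer, consistent))
```

In this toy the two answers tie under the linear reward, and the pointy answer wins once uncertainty is penalized.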
The more precise bad behavior InstrumentalAnswer is to output the distribution $\arg\max_p \mathbb{E}_{\omega \sim W'}[r(p, \omega)]$. Of course nothing else is going to get a higher reward. This is about as simple as HumanAnswer. It could end up being slightly more computationally complex. I think everything I've said about this case still applies for InstrumentalAnswer, but it's relevant when I start talking about stuff like conditional independence requirements between the model's answers.
Actually, if A --> B --> C and I observe some function of (A, B, C), it's just not generally the case that my beliefs about A and C are conditionally independent given my beliefs about B (e.g. suppose I observe A+C). This just makes it even easier to avoid the bad function in this case, but means I want to be more careful about the definition of the case to ensure that it's actually difficult before concluding that this kind of conditional independence structure is potentially useful.
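The A+C example can be checked by brute-force enumeration. This is a minimal sketch assuming binary variables, a uniform prior on A, and a hypothetical `flip` probability of 0.2 on each arrow; none of those numbers come from the argument itself.

```python
from itertools import product

def chain_joint(p_a=0.5, flip=0.2):
    """Joint over binary (A, B, C) for the chain A -> B -> C.
    B copies A and C copies B, each flipped independently with prob `flip`."""
    joint = {}
    for a, b, c in product((0, 1), repeat=3):
        p = p_a if a == 1 else 1 - p_a
        p *= flip if b != a else 1 - flip
        p *= flip if c != b else 1 - flip
        joint[(a, b, c)] = p
    return joint

def condition(dist, keep):
    """Restrict to outcomes where keep(a, b, c) is true, then renormalize."""
    kept = {k: v for k, v in dist.items() if keep(*k)}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}

def dependence_given_b(dist, b):
    """max over (a, c) of |P(a, c | b) - P(a | b) P(c | b)|; zero iff A and C
    are independent given B = b."""
    given_b = condition(dist, lambda a, bb, c: bb == b)
    gap = 0.0
    for a0, c0 in product((0, 1), repeat=2):
        p_ac = sum(v for (a, _, c), v in given_b.items() if a == a0 and c == c0)
        p_a = sum(v for (a, _, c), v in given_b.items() if a == a0)
        p_c = sum(v for (a, _, c), v in given_b.items() if c == c0)
        gap = max(gap, abs(p_ac - p_a * p_c))
    return gap

prior = chain_joint()
# Beliefs after observing the function A + C = 1.
posterior = condition(prior, lambda a, b, c: a + c == 1)

prior_gap = dependence_given_b(prior, b=1)          # independent given B
posterior_gap = dependence_given_b(posterior, b=1)  # no longer independent
```

Before the observation, A and C are independent given B. After observing A+C=1, they are perfectly anti-correlated even once B is fixed.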
This is also a way to think about the proposals in this post and the reply:
So are there some facts about conditional independencies that would privilege the intended mapping? Here is one option.
We believe that A' and C' should be independent conditioned on B'. One problem is that this isn't even true, because B' is a coarse-graining and so there are in fact correlations between A' and C' that the human doesn't understand. That said, I think that the bad map introduces further conditional correlations, even assuming B=B'. For example, if you imagine Y preserving some facts about A' and C', and if the human is sometimes mistaken about B'=B, then we will introduce extra correlations between the human's beliefs about A' and C'.
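One simplified way to see the "extra correlations" effect: keep the chain A -> B -> C, but condition on a noisy belief about B rather than on B itself. The sketch below is a toy under my own assumptions (binary variables, a hypothetical `flip` rate on each arrow, a hypothetical `error` rate for the human's belief `B_h`), and it collapses the A', B', C' coarse-graining into a single noisy readout.

```python
from itertools import product

def joint_with_belief(flip=0.2, error=0.1):
    """Joint over binary (A, B, C, B_h): chain A -> B -> C, plus a human
    belief B_h about B that is wrong with probability `error`."""
    joint = {}
    for a, b, c, bh in product((0, 1), repeat=4):
        p = 0.5  # uniform prior on A (assumption)
        p *= flip if b != a else 1 - flip
        p *= flip if c != b else 1 - flip
        p *= error if bh != b else 1 - error
        joint[(a, b, c, bh)] = p
    return joint

def ac_dependence(joint, index, value):
    """max over (a, c) of |P(a, c | X) - P(a | X) P(c | X)|, where X is the
    event that coordinate `index` equals `value` (1 = B, 3 = B_h)."""
    kept = {k: v for k, v in joint.items() if k[index] == value}
    total = sum(kept.values())
    gap = 0.0
    for a0, c0 in product((0, 1), repeat=2):
        p_ac = sum(v for k, v in kept.items() if k[0] == a0 and k[2] == c0) / total
        p_a = sum(v for k, v in kept.items() if k[0] == a0) / total
        p_c = sum(v for k, v in kept.items() if k[2] == c0) / total
        gap = max(gap, abs(p_ac - p_a * p_c))
    return gap

joint = joint_with_belief()
gap_true_b = ac_dependence(joint, index=1, value=1)      # condition on B
gap_believed_b = ac_dependence(joint, index=3, value=1)  # condition on B_h
gap_no_error = ac_dependence(joint_with_belief(error=0.0), index=3, value=1)
```

Conditioning on the true B screens off A from C. Conditioning on the sometimes-mistaken belief leaves a residual correlation, and that correlation vanishes again when the error rate is zero.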
I think it's pretty plausible that there are necessarily some "new" correlations in any case where the human's inference is imperfect, but I'd like to understand that better.
So I think the biggest problem is that none of the human's believed conditional independencies actually hold---they are imprecise, and (more problematically) they may themselves only hold "on distribution" in some appropriate sense.
This problem seems pretty approachable though and so I'm excited to spend some time thinking about it.
Causal structure is an intuitively appealing way to pick out the "intended" translation between an AI's model of the world and a human's model. For example, intuitively "There is a dog" causes "There is a barking sound." If we ask our neural net questions like "Is there a dog?" and it computes its answer by checking "Does a human labeler think there is a dog?" then its answers won't match the expected causal structure---so maybe we can avoid these kinds of answers.
What does that mean if we apply typical definitions of causality to ML training?
Here's an abstract example to think about these proposals, just a special case of the example from this post.
This is interesting to me for two reasons:
The speed prior still delegates to better search algorithms though. For example, suppose that someone is able to fill in a 1000-bit program using only 2^500 steps of local search. Then the local search algorithm has speed prior complexity 500 bits, so it will beat the object-level program. And the prior we'd end up using is basically "2x longer = 2 more bits" instead of "2x longer = 1 more bit," i.e. we end up caring more about speed because we delegated.
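The arithmetic in this example, as a rough sketch. I'm assuming the speed prior charges one bit per bit of description plus one bit per doubling of runtime, that the local search itself has negligible description length, and that we can lower-bound the object-level program's cost by ignoring its runtime. The function name and constants are just for illustration.

```python
def speed_prior_bits(description_bits, log2_runtime):
    """Speed-prior cost: 1 bit per description bit, 1 bit per doubling of time."""
    return description_bits + log2_runtime

PROGRAM_BITS = 1000          # size of the object-level program
SEARCH_LOG2_STEPS = 500      # local search needs 2^500 steps
SEARCH_DESCRIPTION_BITS = 0  # assumption: the search procedure is tiny

# Lower bound on the object-level cost (its own runtime is ignored).
object_level_cost = speed_prior_bits(PROGRAM_BITS, 0)
# Cost of the hypothesis "run this local search".
search_cost = speed_prior_bits(SEARCH_DESCRIPTION_BITS, SEARCH_LOG2_STEPS)

# Each doubling of search time buys this many bits of program, which is
# where "2x longer = 2 more bits" comes from.
bits_per_doubling = PROGRAM_BITS / SEARCH_LOG2_STEPS
```

Under those assumptions the search costs 500 bits against at least 1000 for the object-level program, and the inner search trades time for description length at two bits per doubling.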
The actual limit on how much you care about speed is given by whatever search algorithms work best. I think it's likely possible to "expose" what is going on to the outer optimizer (so that it finds a hypothesis like "This local search algorithm is good" and then uses it to find an object-level program, rather than directly finding a program that bundles both of them together). But I'd guess intuitively that it's just not even meaningful to talk about the "simplest" programs or any prior that cares less about speed than the optimal search algorithm.
In traditional settings, we are searching for a program M that is simpler than the property P. For example, the number of parameters in our model should be smaller than the size of the dataset we are trying to fit if we want the model to generalize. (This isn't true for modern DL because of subtleties with SGD optimizing imperfectly and implicit regularization and so on, but spiritually I think it's still fine.)
But this breaks down if we start doing something like imposing consistency checks and hoping that those change the result of learning. Intuitively it's also often not true for scientific explanations---even simple properties can be surprising and require explanation, and can be used to support theories that are much more complex than the observation itself.
The speed prior is calibrated such that this never happens if the learned optimizer is just using brute force---if it needs to search over 1 extra bit then it will take 2x longer, offsetting the gains.
That means that in the regime where P is simple, the speed prior is the "least you can reasonably care about speed"---if you care even less, you will just end up pushing the optimization into an inner process that is more concerned with speed and is therefore able to try a bunch of options.
(However, this is very mild, since the speed prior cares only a tiny bit about speed. Adding 100 bits to your program is the same as letting it run 2^100 times longer, so you are basically just optimizing for simplicity.)
To make this concrete, suppose that I instead used the kind-of-speed prior, where taking 4x longer is equivalent to using 1 extra bit of description complexity. And suppose that P is very simple relative to the complexities of the other objects involved. Suppose that the "object-level" program M has 1000 bits and runs in 2^2000 time, so has kind-of-speed complexity 1000 + 2000/2 = 2000 bits. Under the speed prior M costs 1000 + 2000 = 3000 bits, so a search that uses the speed prior will be able to find it in about 2^3000 time, and that search will have a kind-of-speed complexity of about 3000/2 = 1500 bits. So the kind-of-speed prior will just end up delegating to the speed prior.
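The same numbers as a small calculation. This sketch assumes both priors have the form "description bits plus log-runtime divided by some exchange rate," that a speed-prior search finds a program of speed complexity K in roughly 2^K time, and that the search procedure's own description is negligible. The names `prior_bits`, `SPEED`, and `KIND_OF_SPEED` are mine.

```python
def prior_bits(description_bits, log2_runtime, doublings_per_bit):
    """Cost in bits when `doublings_per_bit` doublings of runtime cost 1 bit."""
    return description_bits + log2_runtime / doublings_per_bit

SPEED = 1          # speed prior: 2x longer = 1 extra bit
KIND_OF_SPEED = 2  # kind-of-speed prior: 4x longer = 1 extra bit

# Object-level program M: 1000 bits, runs in 2^2000 time.
m_bits, m_log2_time = 1000, 2000
m_kind_of_speed = prior_bits(m_bits, m_log2_time, KIND_OF_SPEED)
m_speed = prior_bits(m_bits, m_log2_time, SPEED)

# Assumption: a speed-prior search finds M in about 2^(speed complexity) time.
search_log2_time = m_speed
# Assumption: the search procedure itself takes ~0 bits to describe.
search_kind_of_speed = prior_bits(0, search_log2_time, KIND_OF_SPEED)

delegation_wins = search_kind_of_speed < m_kind_of_speed
```

So under the kind-of-speed prior, "run a speed-prior search" comes out at about 1500 bits against 2000 for M directly, which is the delegation described above.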