Paul Christiano

paulfchristiano's Shortform

The speed prior still delegates to better search algorithms though. For example, suppose that someone is able to fill in a 1000 bit program using only 2^500 steps of local search. Then the local search algorithm has speed prior complexity 500 bits, so will beat the object-level program. And the prior we'd end up using is basically "2x longer = 2 more bits" instead of "2x longer = 1 more bit," i.e. we end up caring more about speed because we delegated.
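The arithmetic above can be spelled out as a rough worked example. This is only a sketch under stated assumptions: the notation $K_{\text{speed}}$, $\lvert p \rvert$ and $t(p)$ is introduced here for illustration, the description length of the local search routine itself is treated as negligible, and the generalization to $n$-bit programs (found in about $2^{n/2}$ search steps, each of which runs a candidate for time $t$) is an assumption made to match the "2x longer = 2 more bits" reading.

```latex
\begin{align*}
K_{\text{speed}}(p) &= \lvert p \rvert + \log_2 t(p)
  && \text{(2x longer = 1 more bit)} \\
K_{\text{speed}}(\text{local search}) &\approx O(1) + \log_2 2^{500} = 500 \text{ bits}
  && \text{(beats the 1000-bit object-level program)} \\
\text{cost of an } n\text{-bit program via local search}
  &\approx \log_2\!\left(2^{n/2} \cdot t\right) = \tfrac{1}{2}\,n + \log_2 t
  && \text{(assumed scaling)} \\
  &= \tfrac{1}{2}\left(n + 2 \log_2 t\right)
  && \text{(2x longer = 2 more bits)}
\end{align*}
```

Under these assumptions, one doubling of runtime costs as much as two bits of object-level description, which is the sense in which delegation makes the effective prior care more about speed.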

The actual limit on how much you care about speed is given by whatever search algorithms work best. I think it's likely possible to "expose" what is going on to the outer optimizer (so that it finds a hypothesis like "This local search algorithm is good" and then uses it to find an object-level program, rather than directly finding a program that bundles both of them together). But I'd guess intuitively that it's just not even meaningful to talk about the "simplest" programs or any prior that cares less about speed than the optimal search algorithm.

paulfchristiano's Shortform

In traditional settings, we are searching for a program M that is simpler than the property P. For example, the number of parameters in our model should be smaller than the size of the dataset we are trying to fit if we want the model to generalize. (This isn't true for modern DL because of subtleties with SGD optimizing imperfectly and implicit regularization and so on, but spiritually I think it's still fine.)

But this breaks down if we start doing something like imposing consistency checks and hoping that those change the result of learning. Intuitively it's also often not true for scientific explanations---even simple properties can be surprising and require explanation, and can be used to support theories that are much more complex than the observation itself.

Some thoughts:

- It's quite plausible that in these cases we want to be doing something other than searching over programs. This is pretty clear in the "scientific explanation" case, and maybe it's the way to go for the kinds of alignment problems I've been thinking about recently.

A basic challenge with searching over programs is that we have to interpret the other data. For example, if "correspondence between two models of physics" is some kind of different object like a description in natural language, then some amplified human is going to have to be thinking about that correspondence to see if it explains the facts. If we search over correspondences, some of them will be "attacks" on the human that basically convince them to run a general computation in order to explain the data. So we have two options: (i) perfectly harden the evaluation process against such attacks, (ii) try to ensure that there is always some way to just directly do whatever the attacker convinced the human to do. But (i) seems quite hard, and (ii) basically requires us to put all of the generic programs in our search space.

- It's also quite plausible that we'll just give up on things like consistency conditions. But those come up frequently enough in intuitive alignment schemes that I at least want to give them a fair shake.

paulfchristiano's Shortform

The speed prior is calibrated such that this never happens if the learned optimizer is just using brute force---if it needs to search over 1 extra bit then it will take 2x longer, offsetting the gains.

That means that in the regime where P is simple, the speed prior is the "least you can reasonably care about speed"---if you care even less, you will just end up pushing the optimization into an inner process that is more concerned with speed and is therefore able to try a bunch of options.

(However, this is very mild, since the speed prior cares only a tiny bit about speed. Adding 100 bits to your program is the same as letting it run 2^100 times longer, so you are basically just optimizing for simplicity.)

To make this concrete, suppose that I instead used the kind-of-speed prior, where taking 4x longer is equivalent to using 1 extra bit of description complexity. And suppose that P is very simple relative to the complexities of the other objects involved. Suppose that the "object-level" program M has 1000 bits and runs in 2^2000 time, so has kind-of-speed complexity 2000 bits. A search that uses the speed prior will be able to find this algorithm in 2^3000 time, and so will have a kind-of-speed complexity of 1500 bits. So the kind-of-speed prior will just end up delegating to the speed prior.
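The numbers in this example can be checked with a short sketch. The function names and the `bits_per_doubling` parameter are hypothetical, introduced only for illustration: a value of 1 corresponds to the speed prior and 0.5 to the kind-of-speed prior (4x longer = 1 extra bit). The description length of the search wrapper is assumed to be negligible.

```python
def weighted_complexity(description_bits, log2_runtime, bits_per_doubling):
    """Description length plus a runtime penalty, measured in bits.

    bits_per_doubling = 1.0 -> speed prior (2x longer = 1 more bit)
    bits_per_doubling = 0.5 -> kind-of-speed prior (4x longer = 1 more bit)
    """
    return description_bits + bits_per_doubling * log2_runtime


def brute_force_net_change(extra_bits, bits_per_doubling):
    """Change in complexity from moving `extra_bits` of description into an
    inner brute-force search, which takes 2**extra_bits times longer."""
    return -extra_bits + bits_per_doubling * extra_bits


SPEED = 1.0
KIND_OF_SPEED = 0.5

# Object-level program M: 1000 bits, runs in 2^2000 time.
m_kind_of_speed = weighted_complexity(1000, 2000, KIND_OF_SPEED)

# A search using the speed prior finds M in about 2^3000 time
# (M's speed-prior complexity is 1000 + 2000 = 3000 bits).
m_speed = weighted_complexity(1000, 2000, SPEED)
# Assumption: the search wrapper itself costs ~0 bits to describe.
search_kind_of_speed = weighted_complexity(0, m_speed, KIND_OF_SPEED)

print(m_kind_of_speed, search_kind_of_speed)

# Under the speed prior, brute force exactly breaks even;
# under the kind-of-speed prior, delegating to brute force is a net win.
print(brute_force_net_change(100, SPEED), brute_force_net_change(100, KIND_OF_SPEED))
```

The search wrapper comes out cheaper than M under the kind-of-speed prior (1500 vs 2000 bits), while under the speed prior brute-force delegation gains nothing.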

paulfchristiano's Shortform

Suppose I am interested in finding a program M whose input-output behavior has some property P that I can probabilistically check relatively quickly (e.g. I want to check whether M implements a sparse cut of some large implicit graph). I believe there is some simple and fast program M that does the trick. But even this relatively simple M is much more complex than the specification of the property P.

Now suppose I search for the simplest program running in time T that has property P. If T is sufficiently large, then I will end up getting the program "Search for the simplest program running in time T' that has property P, then run that." (Or something even simpler, but the point is that it will make no reference to the intended program M since encoding P is cheaper.)
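A toy sketch of why the search itself is so short to write down: the searcher below never mentions the program it ends up finding, only the property check. Everything here is a stand-in chosen for illustration. "Programs" are plain bit strings rather than executable code, `toy_property` is a hypothetical placeholder for P, and `max_candidates` is a crude substitute for the time bound.

```python
from itertools import product


def simplest_with_property(has_property, max_bits, max_candidates):
    """Return the shortest bit string satisfying `has_property`, or None.

    Candidates are enumerated in order of length, so the first hit is a
    shortest one. `max_candidates` caps how many candidates are checked.
    """
    checked = 0
    for n in range(1, max_bits + 1):
        for bits in product("01", repeat=n):
            if checked >= max_candidates:
                return None  # budget exhausted before finding anything
            checked += 1
            candidate = "".join(bits)
            if has_property(candidate):
                return candidate
    return None


def toy_property(candidate):
    """Hypothetical stand-in for P: the string contains the pattern 1011."""
    return "1011" in candidate


found = simplest_with_property(toy_property, max_bits=8, max_candidates=1000)
gave_up = simplest_with_property(toy_property, max_bits=8, max_candidates=10)
print(found, gave_up)
```

The searcher's own description depends only on the property check and the budget, which mirrors the point that encoding P is cheaper than encoding M.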

I may be happy enough with this outcome, but there's some intuitive sense in which something weird and undesirable has happened here (and I may get in a distinctive kind of trouble if P is an approximate evaluation). I think this is likely to be a useful maximally-simplified example to think about.

paulfchristiano's Shortform

We might be able to get similar advantages with a more general proposal like:

Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.

Then the idea is that matching the conditional probabilities from the human's model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.

It's not clear how to actually formulate that consistency condition, but it seems like an improvement over the prior situation (which was just baking in the obviously-untenable requirement of exactly matching). It's also not clear what happens if this consistency condition is soft.

It's not clear what "verify that the consistency conditions are met" means. You can always do the same proposal as in the parent, though it's not really clear if that's a convincing verification. But I think that's a fundamental philosophical problem that both of these proposals need to confront.

It's not clear how to balance computational cost and the QA objective. But you are able to avoid most of the bad properties just by being on the Pareto frontier, and I don't think this is worse than the prior proposal.
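As a minimal sketch of what "being on the Pareto frontier" means here, the snippet below filters a set of candidate functions down to those not dominated on both axes. The candidate names and numbers are made up purely for illustration, and how the QA loss and the verification cost would actually be measured is left open, as in the text.

```python
def pareto_frontier(candidates):
    """Return names of candidates not dominated on (qa_loss, verify_cost).

    A candidate is dominated if some other candidate is at least as good on
    both axes and strictly better on at least one. Lower is better for both.
    """
    frontier = []
    for name, loss, cost in candidates:
        dominated = any(
            (other_loss <= loss and other_cost <= cost)
            and (other_loss < loss or other_cost < cost)
            for _, other_loss, other_cost in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (name, qa_loss, verification_cost) triples.
candidates = [
    ("f1", 0.10, 50.0),
    ("f2", 0.20, 10.0),
    ("f3", 0.25, 30.0),   # worse than f2 on both axes
    ("f4", 0.05, 200.0),
]

print(pareto_frontier(candidates))
```

Any particular weighting of the two terms picks out a point on this frontier, so restricting attention to the frontier defers the choice of weighting rather than making it.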

Overall this approach seems like it could avoid making such strong structural assumptions about the underlying model. It also helps a lot with the overlapping explanations + uniformity problem. And it generally seems to be inching towards feeling plausible.

paulfchristiano's Shortform

Here's another approach to "shortest circuit" that is designed to avoid this problem:

- Learn a circuit that outputs an entire set of beliefs. (Or maybe some different architecture, but with ~0 weight sharing so that computational complexity = description complexity.)
- Impose a consistency requirement on those beliefs, even in cases where a human can't tell the right answer.
- Require 's beliefs about to match . We hope that this makes an explication of "'s beliefs."
- Optimize some combination of (complexity) vs (usefulness), or chart the whole Pareto frontier, or whatever. I'm a bit confused about how this step would work but there are similar difficulties for the other posts in this genre so it's exciting if this proposal gets to that final step.

The "intended" circuit just follows along with the computation done by and then translates its internal state into natural language.

What about the problem case where computes some reasonable beliefs (e.g. using the instrumental policy, where the simplicity prior makes us skeptical about their generalization) that could just read off? I'll imagine those being written down somewhere on a slip of paper inside of 's model of the world.

- Suppose that the slip of paper is not relevant to predicting , i.e. it's a spandrel from the weight sharing. Then the simplest circuit just wants to cut it out. Whatever computation was done to write things down on the slip of paper can be done directly by , so it seems like we're in business.
- So suppose that the slip of paper is relevant for predicting , e.g. because someone looks at the slip of paper and then takes an action that affects . If (the correct) is itself depicted on the slip of paper, then we can again cut out the slip of paper itself and just run the same computation (that was done by whoever wrote something on the slip of paper). Otherwise, the answers produced by still have to contain both the items on the slip of paper as well as some facts that are causally downstream of the slip of paper (as well as hopefully some about the slip of paper itself). At that point it seems like we have a pretty good chance of getting a consistency violation out of .

Probably nothing like this can work, but I now feel like there are two live proposals for capturing the optimistic minimal circuits intuition---the one in this current comment, and in this other comment. I still feel like the aggressive speed penalization is doing something, and I feel like probably we can either find a working proposal in that space or else come up with some clearer counterexample.

paulfchristiano's Shortform

Recently I've been thinking about ML systems that generalize poorly (copying human errors) because of either re-using predictive models of humans or using human inference procedures to map between world models.

My initial focus was on preventing re-using predictive models of humans. But I'm feeling increasingly like there is going to be a single solution to the two problems, and that the world-model mismatch problem is a good domain to develop the kind of algorithm we need. I want to say a bit about why.

I'm currently thinking about dealing with world model mismatches by learning a correspondence between models using something other than a simplicity prior / training a neural network to answer questions. Intuitively we want to do something more like "lining up" the two models and seeing what parts correspond to which others. We have a lot of conditions/criteria for such alignments, so we don't necessarily have to just stick with simplicity. This comment fleshes out one possible approach a little bit.

If this approach succeeds, then it is also directly applicable to avoiding re-using human models---we want to be lining up the internal computation of our model with concepts like "There is a cat in the room" rather than just asking the model to predict whether there is a cat however it wants (which it may do by copying a human labeler). And on the flip side, I think that the "re-using human models" problem is a good constraint to have in mind when thinking about ways to do this correspondence. (Roughly speaking, because something like computational speed or "locality" seems like a really central constraint for matching up world models, and doing that approach naively can greatly exacerbate the problems with copying the training process.)

So for now I think it makes sense for me to focus on whether learning this correspondence is actually plausible. If that succeeds then I can step back and see how that changes my overall view of the landscape (I think it might be quite a significant change), and if it fails then I hope to at least know a bit more about the world model mismatch problem.

I think the best analogy in existing practice is probably doing interpretability work---mapping up the AI's model to my model is kind of like looking at neurons and trying to make sense of what they are computing (or looking for neurons that compute something). And giving up on a "simplicity prior" is very natural when doing interpretability, instead using other considerations to determine whether a correspondence is good. It still seems kind of plausible that in retrospect my current work will look like it was trying to get a solid theoretical picture on what interpretability should do (including in the regime where the correspondence is quite complex, and when the goal is a much more complete level of understanding). I swing back and forth on how strong the analogy to interpretability seems / whether or not this is how it will look in retrospect. (But at any rate, my research methodology feels like a very different approach to similar questions.)

paulfchristiano's Shortform

Here's a slightly more formal algorithm along these lines:

- Assume that both the human's model and the AI's model are Bayesian networks where you compute the probability distribution over a node 's value based on the values of its parents . I'll write for the set of values that a node can take on (in either model), and for the joint values of a set of nodes .
- A correspondence tells you how to compute the value of each node in the human's model. This consists of (i) a neighborhood in the AI's model which suffices to determine , (ii) a function .
- Both the AI's model and the human model contain some distinguished observation nodes. must be the identity on these nodes.
- An "explanation" of a correspondence consists of a set of nodes in the AI's model for each node in the human's model. The intuition is that we can run a simulation involving only these nodes in order to reproduce the probability distribution of given its parents' values.
- In particular, , and for all . In order to check whether reproduces the right distribution, we first sample values at random for all the nodes some of whose parents aren't in . Then we sample values for the remaining nodes. We can use to compute the corresponding values for and all of its parents. And then we can compute the conditional distributions for given each set of values for its parents.
- We require that the explanations exactly reproduce the conditional probability over given .
- The "cost" of the explanation of is the sum of the compute required to sample all the nodes in . The "cost" of the correspondence is the compute required to evaluate it.
- We search for the set of correspondences and explanations for which the total cost is minimized.
- (Maybe we also have some requirement where the correspondence agrees with some training data about . I'm not really sure about that.)
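The cost accounting in the steps above can be sketched roughly as follows. This is a simplified illustration, not the scheme itself: it only models the cost bookkeeping (compute per AI-model node, one set of AI nodes per human-model node, one evaluation cost per correspondence) and skips the sampling and conditional-probability checks entirely. All node names and numbers are hypothetical.

```python
def total_cost(node_compute, explanations, correspondence_cost):
    """Total cost of a candidate set of correspondences and explanations.

    node_compute:        AI-model node -> compute needed to sample it
    explanations:        human-model node -> set of AI-model nodes used
    correspondence_cost: human-model node -> compute to evaluate its mapping

    Each explanation is paid for separately, so an AI node that appears in
    several explanations is counted once per explanation.
    """
    total = 0.0
    for human_node, ai_nodes in explanations.items():
        explanation_cost = sum(node_compute[n] for n in ai_nodes)
        total += explanation_cost + correspondence_cost[human_node]
    return total


def cheapest(candidates, node_compute):
    """Pick the (explanations, correspondence_cost) pair with minimal cost."""
    return min(candidates, key=lambda c: total_cost(node_compute, c[0], c[1]))


# Hypothetical AI-model nodes and their sampling costs.
node_compute = {"a": 1.0, "b": 2.0, "c": 4.0}

# Two hypothetical human-model nodes whose explanations overlap on "b".
explanations = {"cat_in_room": {"a", "b"}, "door_open": {"b", "c"}}
correspondence_cost = {"cat_in_room": 0.5, "door_open": 0.5}

cost = total_cost(node_compute, explanations, correspondence_cost)
print(cost)  # "b" is paid for twice because the explanations overlap
```

Charging overlapping nodes once per explanation is the design choice that makes highly overlapping explanations expensive in this sketch.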

Reviewing how this behaves in each of the bad cases from the parent:

- It's very bad to define by computing the observation and then using the human's inference algorithm. The entire motivation for defining it this way was to save on description complexity, but is only being penalized for computational complexity. (This also forces every single to include the entire process that generates the observations, which seems like it should be an even bigger deal. But this feels less like the "right" reason and I think it might change for a future version of the scheme.)
- If there is a human in the simulation who knows the value of , it's extremely bad to define to be that human. This is because the explanation will need to include the entire process by which the human computes that value. This is guaranteed to be more expensive than just performing that compute in the correspondence itself.
- We *could* define to compute the value of from scratch using the initial values of the simulation, using some more-efficient simulation. But we can't do this for the observation node , since it is required to map up with the observation node in the AI's model. So in order to explain the observation node we will need to have include the AI's entire model, or at least the parts leading up to the observation. This means we are basically paying for two full copies of the computation, and so we're not really benefiting from the fact that the second copy is more efficient. (We are also potentially overlapping a lot of computation between the different 's, but again I'm a bit less sure how robust that is and whether it will hold up in different formalizations.)

There are a lot of problems and missing details in this proposal:

- This requires exactly reproducing the conditional probabilities in the human's model. But that's totally unrealistic for anything short of a full simulation---and even for the full simulation it wouldn't reproduce the probabilities since the human model isn't accurate. So we need some way to choose a good enough explanation, i.e. a way of balancing the computational complexity of the explanation against the quality of the conditional probabilities that come out.
- We're sampling the inputs to uniformly at random. This seems unlikely to work in general. We could easily sample each node from its marginal, but most of the action is in the correlation. Allowing arbitrary correlations causes problems (since you could just specify the "human is accurate" correlation and then read off the correct answers from there). So I think probably some more flexible system is needed here; there are a lot of options but it's tricky.
- There is something problematic about the overlapping explanations . If they overlap you need to pay for all of them, but for the intended mapping there will often be quite significant overlap. This isn't inherently a problem, but I'm scared that it's going to introduce a lot of pressure towards some different correspondence that is able to avoid that problem. We need to penalize overlap because of the case where the training data is embedded in the model---the *main* problem with that model is that you need to separately explain every way in which the human is correct with highly overlapping explanations. If you didn't penalize those then you may just end up with the embedded explanations (for which is extremely cheap).
- There is something tricky about uniformity in the model and in the implementations of .
- I'm still scared about the "recompute everything from scratch" failure mode. The model does need to have a single explanation that needs to include the whole model. But (i) it doesn't have to reproduce work, (ii) it can cut out all the stuff not on the path to the observation. So the obvious reason that this one loses is by the duplicated work in . Hopefully that's actually robust.
- We are making really strong structural assumptions on the models and the correspondence between them. We get *some* things for free (because humans actually do have extra structure in our beliefs about the world that is properly part of the problem statement, and the AI's model is constrained by its architecture) but not nearly this much.

Overall I'm becoming significantly more optimistic that something like this will work (though still less likely than not). Trying to step back and see the big picture, it seems like there are three key active ingredients:

- Using "speed" instead of "simplicity" as the ~only requirement for these correspondences.
- Having separate correspondences for separate properties and not allowing them to share tons of computation with each other (to prevent re-running the whole simulation).
- Forcing the model to explain correlations, so that using an "embedded" copy of the answers (like a simulation of the data-generating process) forces you to reproduce the computation that produced that answer.

My next step would probably be looking at cases where these high-level ingredients aren't sufficient (e.g. are there cases where "generate obs then do inference in the human model" is actually cheaper?). If they look pretty good, then I'll spend some more time trying to fill in the details in a more plausible way.

Answering questions honestly instead of predicting human answers: lots of problems and some solutions

- I don't think you actually want to use supervised training for training , you want to use feedback of the form "Is this answer much wronger than that answer?" and then train the model to not produce definitely-wrong answers.
- Likewise the constraint would really want to be something softer (e.g. forcing to give plausible-looking answers to questions as evaluated by ).
- I think that most questions about what is useful / tacitly assumed / etc. can be easily handled on top of the "raw" ability to elicit the model's knowledge (if you like you could imagine having a debate about which answer is better all things considered, using to assess the model's beliefs about closed questions).
- I do think there are a lot of problems along these lines that you'd want to think about a bunch in theory, and then later need to do a bunch of empirical work on. But unfortunately I also think there are a lot of "bigger fish to fry" that are very likely to sink this entire family of approaches. So the first order of business is understanding those and wandering our way to a general category of solution that might actually work.

This is interesting to me for two reasons: