# All of paulfchristiano's Comments + Replies

paulfchristiano's Shortform

(To restate the obvious, all of the stuff here is extremely WIP and rambling.)

I've often talked about the case where an unaligned model learns a description of the world + the procedure for reading out "what the camera sees" from the world. In this case, I've imagined an aligned model starting from the unaligned model and then extracting additional structure.

It now seems to me that the ideal aligned behavior is to learn only the "description of the world" and then have imitative generalization take it from there, identifying the correspondence between the ... (read more)

Answering questions honestly given world-model mismatches

Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds that were consistent with it.

Abstracting out Answer, let's just imagine that our AI outputs a distribution over the space of trajectories in the human ontology, and somehow we define a reward function evaluated by the human in hindsight after getting the observation. The idea is that this is calculated by having the A... (read more)

paulfchristiano's Shortform

Actually if A --> B --> C and I observe some function of (A, B, C) it's just not generally the case that my beliefs about A and C are conditionally independent given my beliefs about B (e.g. suppose I observe A+C). This just makes it even easier to avoid the bad function in this case, but means I want to be more careful about the definition of the case to ensure that it's actually difficult before concluding that this kind of conditional independence structure is potentially useful.
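The A+C example can be checked by brute force. The following is a minimal sketch, assuming a hypothetical binary chain in which each link flips its input with probability 1/4 (these numbers are illustrative and not from the comment). It compares the joint distribution of (A, C) given B against the product of its marginals, with and without also conditioning on an observed value of A+C.

```python
from fractions import Fraction
from itertools import product

FLIP = Fraction(1, 4)  # assumed noise level for each link of the chain


def joint(a, b, c):
    """P(A=a, B=b, C=c) for the hypothetical chain A --> B --> C."""
    p_a = Fraction(1, 2)
    p_b = FLIP if b != a else 1 - FLIP
    p_c = FLIP if c != b else 1 - FLIP
    return p_a * p_b * p_c


def is_independent_given(b, observed_sum=None):
    """True if A and C are independent given B=b (and, optionally, A+C)."""
    weights = {
        (a, c): joint(a, b, c)
        for a, c in product((0, 1), repeat=2)
        if observed_sum is None or a + c == observed_sum
    }
    total = sum(weights.values())
    p = {ac: w / total for ac, w in weights.items()}
    p_a = {a: sum(v for (x, _), v in p.items() if x == a) for a in (0, 1)}
    p_c = {c: sum(v for (_, y), v in p.items() if y == c) for c in (0, 1)}
    return all(
        p.get((a, c), 0) == p_a[a] * p_c[c]
        for a, c in product((0, 1), repeat=2)
    )


without_observation = is_independent_given(b=1)
with_observation = is_independent_given(b=1, observed_sum=1)
```

In this toy chain, conditioning on B alone leaves A and C independent, while additionally observing A+C=1 forces C = 1 - A, so the independence fails.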

paulfchristiano's Shortform

This is also a way to think about the proposals in this post and the reply:

• The human believes that A' and B' are related in a certain way for simple+fundamental reasons.
• On the training distribution, all of the functions we are considering reproduce the expected relationship. However, the reason that they reproduce the expected relationship is quite different.
• For the intended function, you can verify this relationship by looking at the link (A --> B) and the coarse-graining applied to A and B, and verify that the probabilities work out. (That is, I can r
paulfchristiano's Shortform

So are there some facts about conditional independencies that would privilege the intended mapping? Here is one option.

We believe that A' and C' should be independent conditioned on B'. One problem is that this isn't even true, because B' is a coarse-graining and so there are in fact correlations between A' and C' that the human doesn't understand. That said, I think that the bad map introduces further conditional correlations, even assuming B=B'. For example, if you imagine Y preserving some facts about A' and C', and if the human is sometimes mistaken ab... (read more)

paulfchristiano's Shortform

Causal structure is an intuitively appealing way to pick out the "intended" translation between an AI's model of the world and a human's model. For example, intuitively "There is a dog" causes "There is a barking sound." If we ask our neural net questions like "Is there a dog?" and it computes its answer by checking "Does a human labeler think there is a dog?" then its answers won't match the expected causal structure---so maybe we can avoid these kinds of answers.

What does that mean if we apply typical definitions of causality to ML training?

• If we define
2 · Paul Christiano · 2mo
This is also a way to think about the proposals in this post and the reply [https://www.alignmentforum.org/posts/GxzEnkSFL5DnQEAsZ/paulfchristiano-s-shortform?commentId=swxCRdj3amrQjYJZD]:

• The human believes that A' and B' are related in a certain way for simple+fundamental reasons.
• On the training distribution, all of the functions we are considering reproduce the expected relationship. However, the reason that they reproduce the expected relationship is quite different.
• For the intended function, you can verify this relationship by looking at the link (A --> B) and the coarse-graining applied to A and B, and verify that the probabilities work out. (That is, I can replace all of the rest of the computational graph with nonsense, or independent samples, and get the same relationship.)
• For the bad function, you have to look at basically the whole graph. That is, it's not the case that the human's beliefs about A' and B' have the right relationship for arbitrary Ys, they only have the right relationship for a very particular distribution of Ys. So to see that A' and B' have the right relationship, we need to simulate the actual underlying dynamics where A --> B, since that creates the correlations in Y that actually lead to the expected correlations between A' and B'.
• It seems like we believe not only that A' and B' are related in a certain way, but that the relationship should be for simple reasons, and so there's a real sense in which it's a bad sign if we need to do a ton of extra compute to verify that relationship. I still don't have a great handle on that kind of argument. I suspect it won't ultimately come down to "faster is better," though as a heuristic that seems to work surprisingly well. I think that this feels a bit more plausible to me as a story for why faster would be better (but only a bit).
• It's not always going to be quite this cut and dried---depending on the structu
2 · Paul Christiano · 2mo
So are there some facts about conditional independencies that would privilege the intended mapping? Here is one option.

We believe that A' and C' should be independent conditioned on B'. One problem is that this isn't even true, because B' is a coarse-graining and so there are in fact correlations between A' and C' that the human doesn't understand. That said, I think that the bad map introduces further conditional correlations, even assuming B=B'. For example, if you imagine Y preserving some facts about A' and C', and if the human is sometimes mistaken about B'=B, then we will introduce extra correlations between the human's beliefs about A' and C'. I think it's pretty plausible that there are necessarily some "new" correlations in any case where the human's inference is imperfect, but I'd like to understand that better.

So I think the biggest problem is that none of the human's believed conditional independencies actually hold---they are both imprecise, and (more problematically) they may themselves only hold "on distribution" in some appropriate sense. This problem seems pretty approachable though and so I'm excited to spend some time thinking about it.
paulfchristiano's Shortform

This is interesting to me for two reasons:

• [Mainly] Several proposals for avoiding the instrumental policy work by penalizing computation. But I have a really shaky philosophical grip on why that's a reasonable thing to do, and so all of those solutions end up feeling weird to me. I can still evaluate them based on what works on concrete examples, but things are slippery enough that plan A is getting a handle on why this is a good idea.
• In the long run I expect to have to handle learned optimizers by having the outer optimizer instead directly learn whatever
paulfchristiano's Shortform

The speed prior still delegates to better search algorithms though. For example, suppose that someone is able to fill in a 1000 bit program using only 2^500 steps of local search. Then the local search algorithm has speed prior complexity 500 bits, so will beat the object-level program. And the prior we'd end up using is basically "2x longer = 2 more bits" instead of "2x longer = 1 more bit," i.e. we end up caring more about speed because we delegated.

The actual limit on how much you care about speed is given by whatever search algorithms work best. I thin... (read more)

paulfchristiano's Shortform

In traditional settings, we are searching for a program M that is simpler than the property P. For example, the number of parameters in our model should be smaller than the size of the dataset we are trying to fit if we want the model to generalize. (This isn't true for modern DL because of subtleties with SGD optimizing imperfectly and implicit regularization and so on, but spiritually I think it's still fine.)

But this breaks down if we start doing something like imposing consistency checks and hoping that those change the result of learning. Intuitively... (read more)

paulfchristiano's Shortform

The speed prior is calibrated such that this never happens if the learned optimizer is just using brute force---if it needs to search over 1 extra bit then it will take 2x longer, offsetting the gains.

That means that in the regime where P is simple, the speed prior is the "least you can reasonably care about speed"---if you care even less, you will just end up pushing the optimization into an inner process that is more concerned with speed and is therefore able to try a bunch of options.

(However, this is very mild, since the speed prior cares only a tiny b... (read more)

2 · Paul Christiano · 2mo
The speed prior still delegates to better search algorithms though. For example, suppose that someone is able to fill in a 1000 bit program using only 2^500 steps of local search. Then the local search algorithm has speed prior complexity 500 bits, so will beat the object-level program. And the prior we'd end up using is basically "2x longer = 2 more bits" instead of "2x longer = 1 more bit," i.e. we end up caring more about speed because we delegated.

The actual limit on how much you care about speed is given by whatever search algorithms work best. I think it's likely possible to "expose" what is going on to the outer optimizer (so that it finds a hypothesis like "This local search algorithm is good" and then uses it to find an object-level program, rather than directly finding a program that bundles both of them together). But I'd guess intuitively that it's just not even meaningful to talk about the "simplest" programs or any prior that cares less about speed than the optimal search algorithm.
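The bit counting in the local search example can be written out as a small sketch. It uses the accounting from these comments (description length plus log2 of runtime) and assumes, hypothetically, that the search routine itself has negligible description length and that the object-level program's own runtime can be ignored as a lower bound.

```python
def speed_prior_bits(description_bits: int, log2_steps: int) -> int:
    """Cost in bits under the accounting "2x longer = 1 more bit"."""
    return description_bits + log2_steps


PROGRAM_BITS = 1000          # length of the object-level program (from the comment)
SEARCH_DESCRIPTION_BITS = 0  # assumption: the search routine is negligibly short

# Lower bound for writing down the object-level program directly
# (its own runtime is ignored here, which only favors the direct option).
direct = speed_prior_bits(PROGRAM_BITS, 0)

# Brute force tries all 2^1000 candidates, so delegating to it gains nothing.
brute_force = speed_prior_bits(SEARCH_DESCRIPTION_BITS, PROGRAM_BITS)

# Local search needs only 2^500 steps, so delegating to it is 500 bits cheaper.
local_search = speed_prior_bits(SEARCH_DESCRIPTION_BITS, 500)

print(direct, brute_force, local_search)  # prints "1000 1000 500"
```

Under these assumptions brute force only ties with the direct program, while any search that is better than brute force wins, which is the sense in which the prior delegates.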
paulfchristiano's Shortform

Suppose I am interested in finding a program M whose input-output behavior has some property P that I can probabilistically check relatively quickly (e.g. I want to check whether M implements a sparse cut of some large implicit graph). I believe there is some simple and fast program M that does the trick. But even this relatively simple M is much more complex than the specification of the property P.

Now suppose I search for the simplest program running in time T that has property P. If T is sufficiently large, then I will end up getting the program "Search... (read more)

2 · Paul Christiano · 2mo
This is interesting to me for two reasons:

• [Mainly] Several proposals for avoiding the instrumental policy work by penalizing computation. But I have a really shaky philosophical grip on why that's a reasonable thing to do, and so all of those solutions end up feeling weird to me. I can still evaluate them based on what works on concrete examples, but things are slippery enough that plan A is getting a handle on why this is a good idea.
• In the long run I expect to have to handle learned optimizers by having the outer optimizer instead directly learn whatever the inner optimizer would have learned. This is an interesting setting to look at how that works out. (For example, in this case the outer optimizer just needs to be able to represent the hypothesis "There is a program that has property P and runs in time T' " and then do its own search over that space of faster programs.)
2 · Paul Christiano · 2mo
In traditional settings, we are searching for a program M that is simpler than the property P. For example, the number of parameters in our model should be smaller than the size of the dataset we are trying to fit if we want the model to generalize. (This isn't true for modern DL because of subtleties with SGD optimizing imperfectly and implicit regularization and so on, but spiritually I think it's still fine.)

But this breaks down if we start doing something like imposing consistency checks and hoping that those change the result of learning. Intuitively it's also often not true for scientific explanations---even simple properties can be surprising and require explanation, and can be used to support theories that are much more complex than the observation itself.

Some thoughts:

1. It's quite plausible that in these cases we want to be doing something other than searching over programs. This is pretty clear in the "scientific explanation" case, and maybe it's the way to go for the kinds of alignment problems I've been thinking about recently. A basic challenge with searching over programs is that we have to interpret the other data. For example, if "correspondence between two models of physics" is some kind of different object like a description in natural language, then some amplified human is going to have to be thinking about that correspondence to see if it explains the facts. If we search over correspondences, some of them will be "attacks" on the human that basically convince them to run a general computation in order to explain the data. So we have two options: (i) perfectly harden the evaluation process against such attacks, (ii) try to ensure that there is always some way to just directly do whatever the attacker convinced the human to do. But (i) seems quite hard, and (ii) basically requires us to put all of the generic programs in our search space.
2. It's also quite plausible th
3 · Paul Christiano · 2mo
The speed prior [https://en.wikipedia.org/wiki/Speed_prior] is calibrated such that this never happens if the learned optimizer is just using brute force---if it needs to search over 1 extra bit then it will take 2x longer, offsetting the gains.

That means that in the regime where P is simple, the speed prior is the "least you can reasonably care about speed"---if you care even less, you will just end up pushing the optimization into an inner process that is more concerned with speed and is therefore able to try a bunch of options.

(However, this is very mild, since the speed prior cares only a tiny bit about speed. Adding 100 bits to your program is the same as letting it run 2^100 times longer, so you are basically just optimizing for simplicity.)

To make this concrete, suppose that I instead used the kind-of-speed prior, where taking 4x longer is equivalent to using 1 extra bit of description complexity. And suppose that P is very simple relative to the complexities of the other objects involved. Suppose that the "object-level" program M has 1000 bits and runs in 2^2000 time, so has kind-of-speed complexity 2000 bits. A search that uses the speed prior will be able to find this algorithm in 2^3000 time, and so will have a kind-of-speed complexity of 1500 bits. So the kind-of-speed prior will just end up delegating to the speed prior.
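The numbers in the kind-of-speed example can be reproduced with a short calculation. This is a sketch under the comment's own accounting; the function name and the assumption that the search routine has roughly zero description length (because P is very simple) are illustrative choices, not part of the original.

```python
def prior_bits(description_bits: float, log2_runtime: float,
               log2_slowdown_per_bit: float) -> float:
    """Complexity in bits when a 2**log2_slowdown_per_bit slowdown costs 1 bit."""
    return description_bits + log2_runtime / log2_slowdown_per_bit


SPEED = 1          # speed prior: 2x longer = 1 more bit
KIND_OF_SPEED = 2  # kind-of-speed prior: 4x longer = 1 more bit

M_BITS, M_LOG2_RUNTIME = 1000, 2000  # object-level program M from the comment

# Cost of naming M directly under the kind-of-speed prior.
m_kind_of_speed = prior_bits(M_BITS, M_LOG2_RUNTIME, KIND_OF_SPEED)

# A speed-prior search reaches M after about 2^(speed-prior complexity) steps.
search_log2_runtime = prior_bits(M_BITS, M_LOG2_RUNTIME, SPEED)

# Assumption: describing the search itself takes ~0 bits.
search_kind_of_speed = prior_bits(0, search_log2_runtime, KIND_OF_SPEED)

print(m_kind_of_speed, search_log2_runtime, search_kind_of_speed)
```

The delegated search comes out at 1500 bits against 2000 bits for M itself, matching the figures in the comment.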
paulfchristiano's Shortform

We might be able to get similar advantages with a more general proposal like:

Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.

Then the idea is that matching the conditional probabilities from the human's model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.

It's not clear how to actually formulate that consiste... (read more)

paulfchristiano's Shortform

Here's another approach to "shortest circuit" that is designed to avoid this problem:

• Learn a circuit  that outputs an entire set of beliefs. (Or maybe some different architecture, but with ~0 weight sharing so that computational complexity = description complexity.)
• Impose a consistency requirement on those beliefs, even in cases where a human can't tell the right answer.
• Require 's beliefs about  to match . We hope that this makes  an explication of "'s beliefs."
• Optimize some combination of (complexit
paulfchristiano's Shortform

Recently I've been thinking about ML systems that generalize poorly (copying human errors) because of either re-using predictive models of humans or using human inference procedures to map between world models.

My initial focus was on preventing re-using predictive models of humans. But I'm feeling increasingly like there is going to be a single solution to the two problems, and that the world-model mismatch problem is a good domain to develop the kind of algorithm we need. I want to say a bit about why.

paulfchristiano's Shortform

Here's a slightly more formal algorithm along these lines:

• Assume that both the human's model $W_H$ and the AI's model $W_{AI}$ are Bayesian networks where you compute the probability distribution over a node $v$'s value based on the values of its parents $\mathrm{pa}(v)$. I'll write $\mathrm{Values}(v)$ for the set of values that a node $v$ can take on (in either model), and $\mathrm{Values}(S)$ for the joint values of a set of nodes $S$.
• A correspondence tells you how to compute the value of each node $v$ in the human's model. Th
2 · Paul Christiano · 2mo
We might be able to get similar advantages with a more general proposal like:

Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.

Then the idea is that matching the conditional probabilities from the human's model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.

• It's not clear how to actually formulate that consistency condition, but it seems like an improvement over the prior situation (which was just baking in the obviously-untenable requirement of exactly matching). It's also not clear what happens if this consistency condition is soft.
• It's not clear what "verify that the consistency conditions are met" means. You can always do the same proposal as in the parent, though it's not really clear if that's a convincing verification. But I think that's a fundamental philosophical problem that both of these proposals need to confront.
• It's not clear how to balance computational cost and the QA objective. But you are able to avoid most of the bad properties just by being on the Pareto frontier, and I don't think this is worse than the prior proposal.

Overall this approach seems like it could avoid making such strong structural assumptions about the underlying model. It also helps a lot with the overlapping explanations + uniformity problem. And it generally seems to be inching towards feeling plausible.
• I don't think you actually want to use supervised training for training , you want to use feedback of the form "Is this answer much wronger than that answer?" and then train the model to not produce definitely-wrong answers.
• Likewise the  constraint would really want to be something softer (e.g. forcing  to give plausible-looking answers to questions as evaluated by ).
• I think that most questions about what is useful / tacitly assumed / etc. can be easily handled on top of the "raw" ability to elicit the model's knowle
1 · Joe_Collman · 2mo
Ok, the softer constraints make sense to me, thanks.

Using a debate with f+ assessing simple closed questions makes sense, but it seems to me that only moves much of the problem rather than solving it. We start with "answering honestly vs predicting human answers" and end up with "judging honestly vs predicting human judgments".

While "Which answer is better, Alice's or Bob's?" is a closed question, learning to answer the general case still requires applying a full model of human values - so it seems a judge-model is likely to be instrumental (or essentially equivalent: again, I'm not really sure what we'd mean by an intended model for the judge).

But perhaps I'm missing something here; is predicting-the-judge less of a problem than the original? Are there better approaches than using debate which wouldn't have analogous issues?
Experimentally evaluating whether honesty generalizes

I do expect "explanations of what's going on in this sentence" to be a lot weaker than translations.

For that task, I expect that the model trained on coherence + similar tasks will outperform a 10x larger pre-trained model. If the larger pre-trained model gets context stuffing on similar tasks, but no coherence training, then it's less clear to me.

But I guess the point is that the differences between various degrees of successful-generalization will be relatively small compared to model size effects. It doesn't matter so much how good the transfer model is... (read more)

Experimentally evaluating whether honesty generalizes

The issue, then, is that the "fine-tuning for correctness" and "fine-tuning for coherence" processes are not really equivalent--fine-tuning for correctness is in fact giving GPT-3 additional information about tone, which improves its capabilities. In addition, GPT-3 might not "know" exactly what humans mean by the word tone, and so fine-tuning for correctness also helps GPT-3 to better understand the question.

Part of my hope is that "coherence" can do quite a lot of the "telling you what humans mean about tone." For example, you can basically force the mod... (read more)

paulfchristiano's Shortform

We could try to exploit some further structural facts about the parts of  that are used by . For example, it feels like the intended model is going to be leveraging facts that are further "upstream." Suppose an attacker observes that there is a cat in the room, and so writes out "There is a cat in the room" as part of a natural-language description of what is going on that it hopes that  will eventually learn to copy. If  predicts the adversary's output, it must first predict that there is actu

2 · Paul Christiano · 2mo
Here's a slightly more formal algorithm along these lines:

• Assume that both the human's model $W_H$ and the AI's model $W_{AI}$ are Bayesian networks where you compute the probability distribution over a node $v$'s value based on the values of its parents $\mathrm{pa}(v)$. I'll write $\mathrm{Values}(v)$ for the set of values that a node $v$ can take on (in either model), and $\mathrm{Values}(S)$ for the joint values of a set of nodes $S$.
• A correspondence tells you how to compute the value of each node $v$ in the human's model. This consists of (i) a neighborhood in the AI's model $N_{AI}(v)$ which suffices to determine $v$, (ii) a function $f_v : \mathrm{Values}(N_{AI}(v)) \to \mathrm{Values}(v)$.
• Both the AI's model and the human model contain some distinguished observation nodes. $f$ must be the identity on these nodes.
• An "explanation" of a correspondence consists of a set of nodes $E_{AI}(v)$ in the AI's model for each node $v$ in the human's model. The intuition is that we can run a simulation involving only these nodes in order to reproduce the probability distribution of $v$ given its parents' values.
• In particular, $N_{AI}(v) \subseteq E_{AI}(v)$, and $N_{AI}(u) \subseteq E_{AI}(v)$ for all $u \in \mathrm{pa}(v)$. In order to check whether $E_{AI}(v)$ reproduces the right distribution, we first sample values at random for all the nodes some of whose parents aren't in $E_{AI}(v)$. Then we sample values for the remaining nodes. We can use $f$ to compute the corresponding values for $v$ and all of its parents. And then we can compute the conditional distributions for $v$ given each set of values for its parents.
• We require that the explanations exactly reproduce the conditional probability over $\mathrm{Values}(v)$ given $\mathrm{Values}(\mathrm{pa}(v))$.
• The "cost" of the explanation of $v$ is the sum of the compute required to sample all the nodes in $E_{AI}(v)$. The "cost" of the correspondence $f_v$ is the compute required to evaluate it.
• We search for the set of correspondences and explanations for which the total cost is minimized.
• (Maybe we also have some requirement where the correspondence $f_v$ agrees with some t
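The checking step in this algorithm can be illustrated on a toy pair of models. Everything below is a hypothetical sketch rather than the author's implementation: the four-node AI model, the two-node human model, the maps `f_u` and `f_v`, the reading of "sample values at random" as uniform sampling, and the use of node count as a stand-in for compute cost are all assumptions made for illustration. Exact enumeration replaces sampling so that the comparison is exact.

```python
from fractions import Fraction
from itertools import product

# Hypothetical AI model: node -> (parents, function giving P(node = 1 | parents)).
AI_MODEL = {
    "z": ((), lambda: Fraction(1, 3)),
    "x1": (("z",), lambda z: Fraction(3, 4) if z else Fraction(1, 4)),
    "x2": ((), lambda: Fraction(1, 2)),
    "y": (("x1", "x2"),
          lambda x1, x2: Fraction(9, 10) if (x1 or x2) else Fraction(1, 10)),
}
TOPOLOGICAL_ORDER = ["z", "x1", "x2", "y"]

# Hypothetical human model u --> v, stored as P(v = 1 | u).
HUMAN_CPT = {0: Fraction(1, 10), 1: Fraction(9, 10)}


def f_u(values):
    """Assumed correspondence for u, with neighborhood {x1, x2}."""
    return int(values["x1"] or values["x2"])


def f_v(values):
    """Assumed correspondence for v, with neighborhood {y}."""
    return values["y"]


EXPLANATION = {"x1", "x2", "y"}  # candidate explanation for v; z is left out


def explanation_joint(model, order, explanation):
    """Exact joint over the explanation's nodes (enumeration, not sampling)."""
    nodes = [n for n in order if n in explanation]
    outcomes = []
    for bits in product((0, 1), repeat=len(nodes)):
        values = dict(zip(nodes, bits))
        weight = Fraction(1)
        for node in nodes:
            parents, cpt = model[node]
            if any(p not in explanation for p in parents):
                p_one = Fraction(1, 2)  # boundary node: uniform (assumption)
            else:
                p_one = cpt(*(values[p] for p in parents))
            weight *= p_one if values[node] else 1 - p_one
        outcomes.append((values, weight))
    return outcomes


def induced_cpt(outcomes, parent_map, child_map):
    """P(child = 1 | parent) induced on the human's nodes by the correspondence."""
    mass, mass_child_one = {}, {}
    for values, weight in outcomes:
        u, v = parent_map(values), child_map(values)
        mass[u] = mass.get(u, Fraction(0)) + weight
        mass_child_one[u] = mass_child_one.get(u, Fraction(0)) + weight * v
    return {u: mass_child_one[u] / mass[u] for u in mass if mass[u] > 0}


induced = induced_cpt(
    explanation_joint(AI_MODEL, TOPOLOGICAL_ORDER, EXPLANATION), f_u, f_v
)
matches_human_model = induced == HUMAN_CPT
cost = len(EXPLANATION)  # crude proxy for the compute needed to sample the nodes
```

Here the explanation reproduces the human's conditional distribution even though `x1` is resampled uniformly rather than from its true parent `z`, which is the sense in which the rest of the graph can be replaced without changing the relationship.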

I like using Fritz.

It sounds like we are on basically the same page about what experiments would be interesting.

i) I'm interested in any good+scalable old engine. I think it's reasonable to focus on something easy, the most important constraint is that it is really state of the art and scales up pretty gracefully. I'd prefer 2000 or earlier.

ii) It would be great if there was at least a complete description (stuff like: these numbers were looked up from this source with links, the population was made of the following engines with implementations from this link, here's the big table of game results and the elo calculation, here was the code that was run to estimate no... (read more)

i) To pick a reference year, it seems reasonable to take the mid/late 1990s:
- Almost all chess engines before ~1996 either lacked multi-core support or used multiple cores very inefficiently (very lengthy discussion here).
- Chess protocols became available, so that the engine and the GUI separated. That makes it straightforward to automate games for benchmarking.
- Modern engines should work on machines of that age, considering RAM constraints.
- The most famous human-computer games took place in 1997: Kasparov-Deep Blue. That's almost a quarter of a century ago (nice round n... (read more)

To clarify my stance on prizes:

• I will probably offer Gwern a $100 small prize for the link.
• I will probably offer hippke a $1000 prize for the prior work.
• I would probably have offered hippke something like a $3000 prize if the experiment hadn't already been done.
• The main thing to make the prize bigger would have been (i) doing the other half, of evaluating old engines on new hardware, (ii) more clarity about the numbers including publishing the raw data and ideally sufficiently detailed instructions for reproducing, (iii) more careful controls for memory, e ... (read more)

5 · hippke · 2mo
Thank you for your interest: It's good to see people asking similar questions! Also, thank you for incentivizing research with rewards. Yes, I think closing the gaps will be straightforward. I still have the raw data, scripts, etc. to pick it up.
i) old engines on new hardware - can be done; needs definition of which engines/hardware
ii) raw data + reproduction - perhaps everything can be scripted and put on GitHub
iii) controls for memory + endgame tables - can be done, needs definition of requirements
iv) Perhaps the community can already agree on a set of experiments before they are performed, e.g. memory? I mean, I can look up "typical" values of past years, but I'm open for other values.

3 · Jaime Sevilla · 2mo
Very tangential to the discussion so feel free to ignore, but given that you have put some thought before on prize structures I am curious about the reasoning for why you would award a different prize for something done in the past versus something done in the future.

How much chess engine progress is about adapting to bigger computers?

Is your prediction that e.g. the behavior of chess will be unrelated to the behavior of SAT solving, or to factoring? Or that "those kinds of things" can be related to each other but not to image classification? Or is your prediction that the "new regime" for chess (now that ML is involved) will look qualitatively different than the old regime?

There are problems where one paper reduces the compute requirements by 20 orders of magnitude. Or gets us from couldn't do X at all, to able to do X easily.

I'm aware of very few examples of that occurring for p... (read more)

Measuring hardware overhang

From the graph it looks like stockfish is able to match the results of engines from ~2000 using ~1.5 orders of magnitude less compute.

• Is that the right way to read this graph?
• Do you have the numbers for SF8 evaluations so that I can use those directly rather than eyeballing from this graph? (I'm generally interested in whatever raw data you have.)

Measuring hardware overhang

Pulled from the wayback machine

1 · Oliver Habryka · 2mo
Replaced the image in the post with this image.

How much chess engine progress is about adapting to bigger computers?

Thanks for the link (and thanks to hippke for doing the experiments), that's great.
Measuring hardware overhang

Thanks for running these experiments!

I don't see the figure here, do you have a link to it?

3 · Paul Christiano · 2mo
Pulled from the wayback machine
Experimentally evaluating whether honesty generalizes

I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.

I'd still describe my optimistic take as "do imitative generalization." But when you really dig into what that means it seems very closely connected to generalization: (i) the reason why "just use this neural net" isn't a good hypothesis is that it generalizes poorly, (ii) for competitiveness reasons you still need to use hypotheses that look quite a lot like neural n... (read more)

1 · Jonathan Uesato · 2mo
• Is it fair to describe the reason you view imitative generalization as necessary at all (instead of just debate/amplification) as "direct oversight is not indefinitely scalable"? [ETA: It seems equally/more valid to frame imitative generalization as a way of scaling direct oversight to handle inaccessible info, so this isn't a good framing.]
• To check my understanding, you're saying that rather than rely on "some combination of scalable oversight + generalization of honesty OOD" you'd rather use something like imitative generalization (where if we can surface knowledge from neural networks to humans, then we don't need to rely on generalization of honesty OOD). Is this accurate?

I agree with this argument. But it seems "if the answer is a deterministic [human-known] function of the subanswers" is a very strong condition, such that "(passes consistency check) + (subanswers are correct) ==> (answers are correct)" rarely holds in practice. Maybe the most common case is that we have some subanswers, but they don't uniquely define the right answer / there are (infinitely many) other subquestions we could have asked which aren't there. Not sure this point is too important though (I'd definitely want to pick it up again if I wanted to push on a direction relying on something like generalization-of-honesty).

Got it, thanks! (I am slightly surprised, but happy to leave it here.)
Experimentally evaluating whether honesty generalizes

I think C->B is already quite hard for language models; maybe it's possible, but it is still very clearly hard enough that it overwhelms the possible simplicity benefits from E over B (before even adding in the hardness of steps E->D->C). I would update my view a lot if I saw language models doing anything even a little bit like the C->B link.

I agree that eventually A loses to any of {B, C, D, E}. I'm not sure if E is harder than B to fix, but at any rate my starting point is working on the reasons that A loses to any of the alternatives (e.g. here, h... (read more)

Experimentally evaluating whether honesty generalizes

I'm curious what factors point to a significant difference regarding generalization between "decisions" and "unsupervised translation". Perhaps there is a more natural concept of "honesty" / "truth" for unsupervised translation, which makes it more likely. But this is very fuzzy to me, and I'm curious why (or if) it's clear to you there is a big difference between the two cases.

For honest translation from your world-model to mine (or at least sufficiently small parts of it), there is a uniform intended behavior. But for decisions there isn't any intended u... (read more)

Experimentally evaluating whether honesty generalizes

I'm most interested in this point. IIUC, the viewpoint you allude to here is something along the lines "There will be very important decisions we can't train directly for, but we'll be able to directly apply ML to these decisions by generalizing from feedback on easier decisions."

That's basically right, although I think the view is less plausible for "decisions" than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisio... (read more)

paulfchristiano's Shortform

How much hope is there for jointly representing fθ1 and Hθ2?

The most obvious representation in this case is to first specify , and then actually model the process of gradient descent that produces . This runs into a few problems:

1. Actually running gradient descent to find  is too expensive to do at every datapoint---instead we learn a hypothesis that does a lot of its work "up front" (shared across all the datapoints). I don't know what that would look like. The naive ways of doing it (redoing the shared initial com
paulfchristiano's Shortform

The difficulty of jointly representing fθ1 and Hθ2 motivates my recent proposal, which avoids any such explicit representation. Instead it separately specifies θ1 and θ2, and then "gets back" bits by imposing a consistency condition that would have been satisfied only for a very small fraction of possible θ2's (roughly exp(−|θ1|) of them).
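
To make the "gets back bits" accounting concrete, here is a minimal sketch. It relies only on the standard fact that conditioning on an event of prior probability p shortens a description by log(1/p). The function name, the example lengths, and the choice to measure |θ1| and |θ2| in nats are all assumptions made for illustration, not part of the proposal.

```python
import math

def cost_after_consistency(len_theta1, len_theta2, consistent_fraction):
    """Description length (in nats) of the pair (theta1, theta2) after
    conditioning on the consistency check.

    Specifying both separately costs len_theta1 + len_theta2. Conditioning
    on an event that only `consistent_fraction` of theta2's satisfy gives
    back log(1 / consistent_fraction) nats.
    """
    savings = -math.log(consistent_fraction)
    return len_theta1 + len_theta2 - savings, savings

# Hypothetical lengths, chosen only to illustrate the bookkeeping.
len_theta1, len_theta2 = 50.0, 200.0
fraction = math.exp(-len_theta1)  # "roughly exp(-|theta1|)" of theta2's pass

total, savings = cost_after_consistency(len_theta1, len_theta2, fraction)
print(f"savings = {savings:.1f} nats, total = {total:.1f} nats")
```

Under these assumptions the savings are approximately |θ1|, so the pair costs about as much as θ2 alone. The computational worry is about evaluating `consistent_fraction`, which the sketch simply takes as given.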

But thinking about this neural network case also makes it easy to talk about why my recent proposal could run into severe computational problems:

• In order to calculate this
paulfchristiano's Shortform

Here's an example I've been thinking about today to investigate the phenomenon of re-using human models.

Suppose that the "right" way to answer questions is fθ1. And suppose that a human is a learned model Hθ2 trained by gradient descent to approximate fθ1 (subject to architectural and computational constraints). This model is very good on distribution, but we expect it to fail off distribution. We want to train a new neural network to approximate fθ1, without inheriting the human's off-distribution failures (though the new net... (read more)

3 Paul Christiano 2mo
The difficulty of jointly representing fθ1 and Hθ2 motivates my recent proposal [https://www.alignmentforum.org/posts/QqwZ7cwEA2cxFEAun/teaching-ml-to-answer-questions-honestly-instead-of], which avoids any such explicit representation. Instead it separately specifies θ1 and θ2, and then "gets back" bits by imposing a consistency condition that would have been satisfied only for a very small fraction of possible θ2's (roughly exp(−|θ1|) of them).

But thinking about this neural network case also makes it easy to talk about why my recent proposal could run into severe computational problems:

• In order to calculate this loss function we need to evaluate how "special" θ2 is, i.e. how small is the fraction of θ2's that are consistent with θ1.
• In order to evaluate how special θ2 is, we basically need to do the same process of SGD that produces θ2---then we can compare the actual iterates to all of the places that it could have gone in a different direction, and conclude that almost all of the different settings of the parameters would have been much less consistent with θ1.
• The implicit hope of my proposal is that the outer neural network is learning its human model using something like SGD, and so it can do this specialness-calculation for free---it will be considering lots of different human-models, and it can observe that almost all of them are much less consistent with θ1.
• But the outer neural network could learn to model humans in a very different way, which may not involve representing a series of iterates of "plausible alternative human models." For example, suppose that in each datapoint we observe a few of the bits of θ2 directly (e.g. by looking at a brain scan), and we fill in much of θ2 in this way before we ever start making good predictions about human behavior. Then we never need to consider any other plausible human-models.

So in order to salvage a proposal like this, it seems like (at a minimum) the "specialness eva
paulfchristiano's Shortform

You basically just need full universality / epistemic competitiveness locally. This is just getting around "what are values?" not the need for competitiveness. Then the global thing is also epistemically competitive, and it is able to talk about e.g. how our values interact with the alien concepts uncovered by our AI (which we want to reserve time for since we don't have any solution better than "actually figure everything out 'ourselves'").

Almost all of the time I'm thinking about how to get epistemic competitiveness for the local interaction. I think that's the meat of the safety problem.

paulfchristiano's Shortform

I'm mostly worried about parameter sharing between the human models in the environment and the QA procedure (which leads the QA to generalize like a human instead of correctly). You could call that deception but I think it's a somewhat simpler phenomenon.

paulfchristiano's Shortform

The most fundamental reason that I don't expect this to work is that it gives up on "sharing parameters" between the extractor and the human model. But in many cases it seems possible to do so, and giving up on that feels extremely unstable since it's trying to push against competitiveness (i.e. the model will want to find some way to save those parameters, and you don't want your intended solution to involve subverting that natural pressure).

Intuitively, I can imagine three kinds of approaches to doing this parameter sharing:

1. Introduce some latent struc
Experimentally evaluating whether honesty generalizes

I'm saying that if you e.g. reward your AI by having humans evaluate its answers, then the AI may build a predictive model of those human evaluations and then may pick actions that are good according to that model. And that predictive model will overlap substantially with predictive models of humans in other domains.

The "build a good predictive model of humans" is a step in all of your proposals A-D.

Then I'm saying that it's pretty simple to plan against it. It would be even simpler if you were doing supervised training, since then you are just outputting ... (read more)

3 Daniel Kokotajlo 2mo
Thanks! I like your breakdown of A-E, let's use that going forward.

It sounds like your view is: For "dumb" AIs that aren't good at reasoning, it's more likely that they'll just do B "directly" rather than do E-->D-->C-->B. Because the latter involves a lot of tricky reasoning which they are unable to do. But as we scale up our AIs and make them smarter, eventually the E-->D-->C-->B thing will be more likely than doing B "directly" because it works for approximately any long-term consequence (e.g. paperclips) and thus probably works for some extremely simple/easy-to-have goals, whereas doing B directly is an arbitrary/complex/specific goal that is thus unlikely.

(1) What I was getting at with the "Steps for Dummies" example is that maybe the kind of reasoning required is actually pretty basic/simple/easy and we are already in the regime where E-->D-->C-->B dominates doing B directly. One way it could be easy is if the training data spells it out nicely for the AI. I'd be interested to hear more about why you are confident that we aren't in this regime yet. Relatedly, what sorts of things would you expect to see AIs doing that would convince you that maybe we are in this regime?

(2) What about A? Doesn't the same argument for why E-->D-->C-->B dominates B eventually also work to show that it dominates A eventually?
Experimentally evaluating whether honesty generalizes

I'm willing to bet against that (very) strongly.

"do things that you've read are instrumentally convergent."

If it's going to be preferred, it really needs to be something simpler than that which leads it to deduce that heuristic (since that heuristic itself is not going to be simpler than directly trying to win at training). This is wildly out-of-domain generalization of much better reasoning than existing language models engage in.

Whereas there's nothing particularly exotic about building a model of the training process and using it to make predictions.

4 Daniel Kokotajlo 2mo
I'm not willing to bet yet, I feel pretty ignorant and confused about the issue. :) I'm trying to get more understanding of your model of how all this works. We've discussed:

A. "Do things you've read are instrumentally convergent"
B. "Tell the humans what they want to hear."
C. "Try to win at training."
D. "Build a model of the training process and use it to make predictions."

It sounds like you are saying A is the most complicated, followed by B and C, and then D is the least complicated. (And in this case the AI will know that winning at training means telling the humans what they want to hear. Though you also suggested the AI wouldn't necessarily understand the dynamics of the training process, so idk.)

To my fresh-on-this-problem eyes, all of these things seem equally likely to be the simplest. And I can tell a just-so story for why A would actually be the simplest; it'd be something like this: Suppose that somewhere in the training data there is a book titled "How to be a successful language model: A step-by-step guide for dummies." The AI has read this book many times, and understands it. In this case perhaps rather than having mental machinery that thinks "I should try to win at training. How do I do that in this case? Let's see... given what I know of the situation... by telling the humans what they want to hear!" it would instead have mental machinery that thinks "I should follow the Steps for Dummies. Let's see... given what I know of the situation... by telling the humans what they want to hear!" Because maybe "follow the steps for dummies" is a simpler, more natural concept for this dumb AI (given how prominent the book was in its training data) than "try to win at training."

The just-so story would be that maybe something analogous to this actually happens, even though there isn't literally a Steps for Dummies book in the training data.
Experimentally evaluating whether honesty generalizes

I do think the LM-only version seems easier and probably better to start with.

How are we imagining prompting the multimodal Go+English AI with questions like "is this group alive or dead?" And how are we imagining training it so that it forms intermodal connections rather than just letting them atrophy?

The hope is that you can fiddle with these things to get it to answer some questions and then see whether it generalizes.

My first guess for an architecture would be producing a 19 x 19 grid of embeddings from the CNN, and then letting a transformer att... (read more)

paulfchristiano's Shortform

You could imitate human answers, or you could ask a human "Is answer  much better than answer ?" Both of these only work for questions that humans can evaluate (in hindsight), and then the point of the scheme is to get an adequate generalization to (some) questions that humans can't answer.

1 Adam Shimi 3mo
Ok, so you optimize the circuit both for speed and for small loss on human answers/comparisons, hoping that it generalizes to more questions while not being complex enough to be deceptive. Is that what you mean?
paulfchristiano's Shortform

The hope is that a tampering large enough to corrupt the human's final judgment would get a score of ~0 in the local value learning. 0 is the "right" score since the tampered human by hypothesis has lost all of the actual correlation with value. (Note that at the end you don't need to "ask it to do simple stuff" you can just directly assign a score of 1.)

This hope does require the local oversight process to be epistemically competitive with the AI, in the sense that e.g. if the AI understands something subtle about the environment dynamics then the oversig... (read more)

1 Adam Shimi 3mo
So you want a sort of partial universality sufficient to bootstrap the process locally (while not requiring the understanding of our values in fine details), giving us enough time for a deliberation that would epistemically dominate the AI in a global sense (and get our values right)? If that's about right, then I agree that having this would make your proposal work, but I still don't know how to get it. I need to read your previous posts on reading questions honestly.
paulfchristiano's Shortform

I think the biggest problem is that  can compute the instrumental policy (or a different policy that works well, or a fragment of it). Some possible reasons:

• Maybe some people in the world are incidentally thinking about the instrumental policy and  makes predictions about them.
• Maybe an adversary who computes a policy that performs well in order to try to attack the learning process (since  may just copy the adversary's policy in order to be fast if it works well on training, resulting in bad generalization).
• Maybe
2 Paul Christiano 2mo
Here's another approach to "shortest circuit" that is designed to avoid this problem:

• Learn a circuit C(X) that outputs an entire set of beliefs. (Or maybe some different architecture, but with ~0 weight sharing so that computational complexity = description complexity.)
• Impose a consistency requirement on those beliefs, even in cases where a human can't tell the right answer.
• Require C(X)'s beliefs about Y to match Fθ(X). We hope that this makes C an explication of "Fθ's beliefs."
• Optimize some combination of (complexity) vs (usefulness), or chart the whole pareto frontier, or whatever. I'm a bit confused about how this step would work but there are similar difficulties for the other posts in this genre so it's exciting if this proposal gets to that final step.

The "intended" circuit C just follows along with the computation done by Fθ and then translates its internal state into natural language.

What about the problem case where Fθ computes some reasonable beliefs (e.g. using the instrumental policy, where the simplicity prior makes us skeptical about their generalization) that C could just read off? I'll imagine those being written down somewhere on a slip of paper inside of Fθ's model of the world.

• Suppose that the slip of paper is not relevant to predicting Fθ(X), i.e. it's a spandrel from the weight sharing. Then the simplest circuit C just wants to cut it out. Whatever computation was done to write things down on the slip of paper can be done directly by C, so it seems like we're in business.
• So suppose that the slip of paper is relevant for predicting Fθ(X), e.g. because someone looks at the slip of paper and then takes an action that affects Y. If (the correct) Y is itself depicted on the slip of paper, then we can again cut out the slip of paper itself and just run the same computation (that was done by whoever wrote something on the slip of paper). Otherwise, the answers produced by C still have to contain both the
2 Paul Christiano 2mo
The natural way to implement this is to penalize Eθ′ not for the computation it does, but for all the computation needed to compute its output (including within Fθ.). The basic problem with this approach is that it incentivizes Eθ′ to do all of the computation of Fθ from scratch in a way optimized for speed rather than complexity. I'd set this approach aside for a while because of this difficulty and the unnaturalness mentioned in the sibling (where we've given up on what seems to be an important form of parameter-sharing).

Today I was thinking about some apparently-totally-different angles of attack for the ontology identification problem [https://www.alignmentforum.org/posts/SRJ5J9Tnyq7bySxbt/answering-questions-honestly-given-world-model-mismatches], and this idea seems to have emerged again, with a potential strategy for fixing the "recompute Fθ problem". (In the context of ontology identification, the parameter-sharing objection no longer applies.)

Here's the idea:

• Train Fθ as before.
• Start with a bunch of facts and probabilistic relationships that the human knows, expressed in their own ontology. These might be facts like "Ice melts at 100 degrees" or "Dogs bark at cats" or whatever.
• We are going to try to jointly learn (i) a correspondence c between Fθ and the human's ontology, (ii) a set of "justifications" showing that c(Fθ) satisfies all of the relationships the human expects. I'm imagining justifications like simulating ice at 100 degrees and observing that it indeed melts, or sampling situations with dogs and cats and verifying that the dogs bark.
• The correspondence c is constrained to map "the observations" (a concept in the human's ontology) to the output of Fθ, but other than that there is no simplicity prior, it can be anything.
• Our goal is to make the justifications as computationally simple as possible. (Which indirectly incentivizes us to make c as computationally simple as possible.)

This still feels a bit weird, but
paulfchristiano's Shortform

It seems to me like "Reason about a perfect emulation of a human" is an extremely similar task to "reason about a human"; to me it does not feel closely related to X-and-only-X efficient imitation. For example, you can make calibrated predictions about what a human would do using vastly less computing power than a human (even using existing techniques), whereas perfect imitation likely requires vastly more computing power.

1 Vladimir Nesov 2mo
The point is that in order to be useful, a prediction/reasoning process should contain mesa-optimizers that perform decision making similar in a value-laden way to what the original humans would do. The results of the predictions should be determined by decisions of the people being predicted (or of people sufficiently similar to them), in the free-will-requires-determinism/you-are-part-of-physics sense. The actual cognitive labor of decision making needs to in some way be an aspect of the process of prediction/reasoning, or it's not going to be good enough. And in order to be safe, these mesa-optimizers shouldn't be systematically warped into something different (from a value-laden point of view), and there should be no other mesa-optimizers with meaningful influence in there. This just says that prediction/reasoning needs to be X-and-only-X in order to be safe. Thus the equivalence.

Prediction of exact imitation in particular is weird because in that case the similarity measure between prediction and exact imitation is hinted to not be value-laden, which it might have to be in order for the prediction to be both X-and-only-X and efficient. This is only unimportant if X-and-only-X is the likely default outcome of predictive generalization, so that not paying attention to this won't result in failure, but nobody understands if this is the case.

The mesa-optimizers in the prediction/reasoning similar to the original humans is what I mean by efficient imitations (whether X-and-only-X or not). They are not themselves the predictions of original humans (or of exact imitations), which might well not be present as explicit parts of the design of reasoning about the process of reflection as a whole, instead they are the implicit decision makers that determine what the conclusions of the reasoning say, and they are much more computationally efficient (as aspects of cheaper reasoning) than exact imitations.
At the same time, if they are similar enough in a value-laden way
paulfchristiano's Shortform

I think the biggest difference is between actual and hypothetical processes of reflection. I agree that an "actual" process of reflection would likely ultimately involve most humans migrating to emulations for the speed and other advantages. (I am not sure that a hypothetical process necessarily needs efficient imitations, rather than AI reasoning about what actual humans---or hypothetical slow-but-faithful imitations---might do.)

3 Vladimir Nesov 3mo
I see getting safe and useful reasoning about exact imitations as a weird special case or maybe a reformulation of X-and-only-X efficient imitation. Anchoring to exact imitations in particular makes accurate prediction more difficult than it needs to be, as it's not the thing we care about, there are many irrelevant details that influence outcomes that accurate predictions would need to take into account. So a good "prediction" is going to be value-laden, with concrete facts about actual outcomes of setups built out of exact imitations being unimportant, which is about the same as the problem statement of X-and-only-X efficient imitation.

If such "predictions" are not good enough by themselves, underlying actual process of reflection (people living in the world) won't save/survive this if there's too much agency guided by the predictions. Using an underlying hypothetical process of reflection (by which I understand running a specific program) is more robust, as AI might go very wrong initially, but will correct itself once it gets around to computing the outcomes of the hypothetical reflection with more precision, provided the hypothetical process of reflection is defined as isolated from the AI.

I'm not sure what difference between hypothetical and actual processes of reflection you are emphasizing (if I understood what the terms mean correctly), since the actual civilization might plausibly move into a substrate that is more like ML reasoning than concrete computation (let alone concrete physical incarnation), and thus become the same kind of thing as hypothetical reflection. The most striking distinction (for AI safety) seems to be the implication that an actual process of reflection can't be isolated from decisions of the AI taken based on insufficient reflection.
There's also the need to at least define exact imitations or better yet X-and-only-X efficient imitation in order to define a hypothetical process of reflection, which is not as absolutely necess
paulfchristiano's Shortform

2 Adam Shimi 3mo
If I follow correctly, the first step requires the humans to evaluate the output of narrow value learning, until this output becomes good enough to become universal with regard to the original AI and supervise it? I'm not sure I get why the AI wouldn't be incentivized to tamper with the narrow value learning, à la Predict-o-matic [https://www.alignmentforum.org/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic]? Depending on certain details (like maybe the indescribable hellworld hypothesis [https://www.alignmentforum.org/posts/rArsypGqq49bk4iRr/can-there-be-an-indescribable-hellworld]), maybe the AI can introduce changes to the partial imitations/deliberations that end up hidden and compounding until the imitations epistemically dominate the AI, and then ask it to do simple stuff.
2 Vladimir Nesov 3mo
The upside of humans in reality is that there is no need to figure out how to make efficient imitations that function correctly (as in X-and-only-X [https://intelligence.org/2018/05/19/challenges-to-christianos-capability-amplification-proposal/]). To be useful, imitations should be efficient, which exact imitations are not. Yet for the role of building blocks of alignment machinery, imitations shouldn't have important systematic tendencies not found in the originals, and their absence is only clear for exact imitations (if not put in very unusual environments [https://www.lesswrong.com/posts/FSmPtu7foXwNYpWiB/on-the-limits-of-idealized-values?commentId=eWW7uqTFcfbchngwC]).

Suppose you already have an AI that interacts with the world, protects it from dangerous AIs, and doesn't misalign people living in it. Then there's time to figure out how to perform X-and-only-X efficient imitation, which drastically expands the design space, makes it more plausible that the kinds of systems that you wrote about a lot relying on imitations actually work as intended. In particular, this might include the kind of long reflection [https://www.lesswrong.com/posts/8xomBzAcwZ6WTC8QB/steven0461-s-shortform-feed-1?commentId=hf2nrzEeja2aQGsyu] that has all the advantages of happening in reality without wasting time and resources on straightforwardly happening in reality, or letting the bad things that would happen in reality actually happen.

So figuring out object level values doesn't seem like a priority if you somehow got to the point of having an opportunity to figure out efficient imitation. (While getting to that point without figuring out object level values doesn't seem plausible, maybe there's a suggestion of a process that gets us there in the limit in here somewhere.)
paulfchristiano's Shortform

In the strategy stealing assumption I describe a policy we might want our AI to follow:

• Keep the humans safe, and let them deliberate(/mature) however they want.
• Maximize option value while the humans figure out what they want.
• When the humans figure out what they want, listen to them and do it.

Intuitively this is basically what I expect out of a corrigible AI, but I agree with Eliezer that this seems more realistic as a goal if we can see how it arises from a reasonable utility function.

So what does that utility function look like?

The most important complication is that the AI is no longer isolated from the deliberating humans. We don't care about what the humans "would have done" if the AI hadn't been there---we need our AI to keep us safe (e.g. from other AI-empowered actors), we will be trusting our AI not to mess with the process of deliberation, and we will likely be relying on our AI to provide "amenities" to the deliberating humans (filling the same role as the hypercomputer in the old proposal).

Going even further, I'd like to avoid defining values in terms of any kind of cou... (read more)

paulfchristiano's Shortform

Suppose that someone has trained a model  to predict  given , and I want to extend it to a question-answering model  that answers arbitrary questions in a way that reflects all of 's knowledge.

Two prototypical examples I am thinking of are:

•  runs a low-level model of physics. We want to extract high-level features of the world from the intermediate physical states, which requires e.g. a cat-classifier that operates directly on physical states rather than pixels.
•  performs logical deduct
3 Paul Christiano 2mo
The most fundamental reason that I don't expect this to work is that it gives up on "sharing parameters" between the extractor and the human model. But in many cases it seems possible to do so, and giving up on that feels extremely unstable since it's trying to push against competitiveness (i.e. the model will want to find some way to save those parameters, and you don't want your intended solution to involve subverting that natural pressure).

Intuitively, I can imagine three kinds of approaches to doing this parameter sharing:

1. Introduce some latent structure L (e.g. semantics of natural language, what a cat "actually is") that is used to represent both humans and the intended question-answering policy. This is the diagram H ← L → f+
2. Introduce some consistency check f? between H and L. This is the diagram H → f? ← f+
3. Somehow extract f+ from H or build it out of pieces derived from H. This is the diagram H → f+. This is kind of like a special case of 1, but it feels pretty different.

(You could imagine having slightly more general diagrams corresponding to any sort of d-connection between H and f+.)

Approach 1 is the most intuitive, and it seems appealing because we can basically leave it up to the model to introduce the factorization (and it feels like there is a good chance that it will happen completely automatically). There are basically two challenges with this approach:

• It's not clear that we can actually jointly compress H and f+. For example, what if we represent H in an extremely low level way as a bunch of neurons firing; the neurons are connected in a complicated and messy way that learned to implement something like f+, but need not have any simple representation in terms of f+. Even if such a factorization is possible, it's completely unclear how to argue about how hard it is to learn. This is a lot of what motivates the compression-based approaches---we can just say "H is some mess, but you can count on it basically computing f+"
1 Adam Shimi 3mo
One aspect of this proposal which I don't know how to do is evaluating the answers of the question-answerer. That looks to me very related to the deconfusion of universality that we discussed a few months ago, and without an answer to this, I feel like I don't even know how to run this silly approach.
Open question: are minimal circuits daemon-free?

I consider the argument in this post a reasonably convincing negative answer to this question---a minimal circuit may nevertheless end up doing learning internally and thereby generate deceptive learned optimizers.

This suggests a second informal clarification of the problem (in addition to Wei Dai's comment): can the search for minimal circuits itself be responsible for generating deceptive behavior? Or is it always the case that something else was the offender and the search for minimal circuits is an innocent bystander?

If the search for minimal circuits ... (read more)

Parameter counts in Machine Learning

Aside: I hadn't realized AlphaZero took 5 orders of magnitude more compute per parameter than AlexNet -- the horizon length concept would have predicted ~2 orders (since a full Go game is a couple hundred moves). I wonder what gets the extra 3 orders. Probably at least part of it comes from the difference between using a differentiable vs. non-differentiable objective function.

I think that in a forward pass, AlexNet uses about 10-15 flops per parameter (assuming 4 bytes per parameter and using this table), because it puts most of its parameters in the smal... (read more)

3 Rohin Shah 3mo
I also looked into number of training points very briefly. Googling suggests AlexNet used 90 epochs on ImageNet's 1.3 million train images, while AlphaZero played 44 million games for chess (I didn't quickly find a number for Go), suggesting that the number of images was roughly similar to the number of games. So I think probably the remaining orders of magnitude are coming from the tree search part of MCTS (which causes there to be > 200 forward passes per game).
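
The back-of-the-envelope arithmetic in this thread can be written out as a short script. This is only a sketch using the figures quoted above (90 epochs, 1.3 million images, 44 million chess games, a 5 order-of-magnitude gap); treating "a couple hundred moves" as exactly 200 is an assumption, and the script mixes chess game counts with Go move counts in the same way the comments do.

```python
import math

# Figures quoted in the comments above.
alexnet_epochs = 90
imagenet_train_images = 1.3e6
alphazero_chess_games = 44e6
observed_gap_orders = 5          # AlphaZero vs. AlexNet compute per parameter

# Assumption: "a couple hundred moves" per Go game taken as 200.
moves_per_go_game = 200

# Training points: images seen by AlexNet vs. games played by AlphaZero.
alexnet_examples = alexnet_epochs * imagenet_train_images
examples_per_game = alexnet_examples / alphazero_chess_games

# Orders of magnitude that horizon length alone would predict.
horizon_orders = math.log10(moves_per_go_game)
unexplained_orders = observed_gap_orders - horizon_orders

print(f"AlexNet examples seen:           {alexnet_examples:.3g}")
print(f"AlexNet examples per AZ game:    {examples_per_game:.2f}")
print(f"orders explained by horizon:     {horizon_orders:.2f}")
print(f"orders left to explain:          {unexplained_orders:.2f}")
```

The example and game counts come out within a small constant factor of each other, which fits the reading that the data counts are roughly similar and that the remaining gap has to come from somewhere else, such as the tree search.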