This is a special post for quick takes by Paul Christiano.

In the strategy stealing assumption I describe a policy we might want our AI to follow:

  • Keep the humans safe, and let them deliberate(/mature) however they want.
  • Maximize option value while the humans figure out what they want.
  • When the humans figure out what they want, listen to them and do it.

Intuitively this is basically what I expect out of a corrigible AI, but I agree with Eliezer that this seems more realistic as a goal if we can see how it arises from a reasonable utility function.

So what does that utility function look like?

A first pass answer is pretty similar to my proposal from A Formalization of Indirect Normativity: we imagine some humans who actually have the opportunity to deliberate however they want and are able to review all of our AI's inputs and outputs. After a very long time, they evaluate the AI's behavior on a scale from [-1, 1], where 0 is the point corresponding to "nothing morally relevant happens," and that evaluation is the AI's utility.

The big difference is that I'm now thinking about what would actually happen, in the real world, if the humans had the space and security to deliberate, rather than formally defining a hypothetical process. I think that is going to end up being both safer and easier to implement, though it introduces its own set of complications.

Our hope is that the policy "keep the humans safe, then listen to them about what to do" is a good strategy for getting a high utility in this game, even if our AI is very unsure about what the humans would ultimately want. Then if our AI is sufficiently competent we can expect it to find a strategy at least this good.

The most important complication is that the AI is no longer isolated from the deliberating humans. We don't care about what the humans "would have done" if the AI hadn't been there---we need our AI to keep us safe (e.g. from other AI-empowered actors), we will be trusting our AI not to mess with the process of deliberation, and we will likely be relying on our AI to provide "amenities" to the deliberating humans (filling the same role as the hypercomputer in the old proposal).

Going even further, I'd like to avoid defining values in terms of any kind of counterfactual like "what the humans would have said if they'd stayed safe" because I think those will run into many of the original proposal's problems.

Instead we're going to define values in terms of what the humans actually conclude here in the real world. Of course we can't just say "Values are whatever the human actually concludes" because that will lead our agent to deliberately compromise human deliberation rather than protecting it.

Instead, we are going to add in something like narrow value learning. Assume the human has some narrow preferences over what happens to them over the next hour. These aren't necessarily that wise. They don't understand what's happening in the "outside world" (e.g. "am I going to be safe five hours from now?" or "is my AI-run company acquiring a lot of money I can use when I figure out what I want?"). But they do assign low value to the human getting hurt, and assign high value to the human feeling safe and succeeding at their local tasks; they assign low value to the human tripping and breaking their neck, and high value to having the AI make them a hamburger if they ask for a hamburger; and so on. These preferences are basically dual to the actual process of deliberation that the human undergoes. There is a lot of subtlety about defining or extracting these local values, but for now I'm going to brush that aside and just ask how to extract the utility function from this whole process.

It's no good to simply use the local values, because we need our AI to do some lookahead (both to future timesteps when the human wants to remain safe, and to the far future when the human will evaluate how much option value the AI actually secured for them). It's no good to naively integrate local values over time, because a very low score during a brief period (where the human is killed and replaced by a robot accomplice) cannot be offset by any number of high scores in the future.

Here's my starting proposal:

  • We quantify the human's local preferences by asking "Look at the person you actually became. How happy are you with that person? Quantitatively, how much of your value was lost by replacing yourself with that person?" This gives us a loss on a scale from 0% (perfect idealization, losing nothing) to 100% (where all of the value is gone). Most of these losses will be exceptionally small, especially if we look at a short period like an hour.
  • Eventually, once the human becomes wise enough to totally epistemically dominate the original AI, they can assign a score to the AI's actions. To make life simple for now, let's ignore negative outcomes and just describe value as a scalar from 0% (barren universe) to 100% (all of the universe is used in an optimal way). Or we might use this "final scale" in a different way (e.g. to evaluate the AI's actions rather than actually assessing outcomes, assigning high scores to corrigible and efficient behavior and somehow quantifying deviations from that ideal).
  • The utility is the product of all of these numbers.
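
As a toy numerical illustration of this aggregation (my own framing, not part of the proposal): treat each period's judgment as a fraction of value lost and the final evaluation as a score in [0, 1], and take the product.

```python
def utility(period_losses, final_score):
    """Toy aggregation: multiply per-period "fraction of value retained" by the
    final score assigned once the human has become wise enough.

    period_losses: floats in [0, 1], fraction of value lost in each short period
        (typically exceptionally small).
    final_score: float in [0, 1], the eventual evaluation of outcomes (or of the
        AI's behavior)."""
    u = final_score
    for loss in period_losses:
        u *= 1.0 - loss   # a single catastrophic period (loss ~ 1) zeroes out the product
    return u

# A brief period where deliberation is fully corrupted can't be offset later:
print(utility([0.001, 1.0, 0.0, 0.0], final_score=1.0))   # 0.0
print(utility([0.001, 0.002, 0.001], final_score=0.9))    # ~0.896
```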

I think there are a lot of problems with this method of quantitative aggregation. But I think this direction is promising and I currently expect something along these lines will work.

If I follow correctly, the first step requires the humans to evaluate the output of narrow value learning, until this output becomes good enough to become universal with regard to the original AI and supervise it? I'm not sure I get why the AI wouldn't be incentivized to tamper with the narrow value learning, à la Predict-o-matic? Depending on certain details (like maybe the indescribable hellworld hypothesis), maybe the AI can introduce changes to the partial imitations/deliberations that end up hidden and compounding until the imitations epistemically dominate the AI, and then ask it to do simple stuff.

The hope is that a tampering large enough to corrupt the human's final judgment would get a score of ~0 in the local value learning. 0 is the "right" score, since the tampered human by hypothesis has lost all of the actual correlation with value. (Note that at the end you don't need to "ask it to do simple stuff"; you can just directly assign a score of 1.)

This hope does require the local oversight process to be epistemically competitive with the AI, in the sense that e.g. if the AI understands something subtle about the environment dynamics then the oversight process also needs to understand that. And that's what we are trying to do with all of this business about training AIs to answer questions honestly. The point is just that you don't have to clear up any of the ambiguity about what the human wants, you just have to be able to detect someone tampering with deliberation. (And the operationalization of tampering doesn't have to be so complex.)

(I'm not sure if this made too much sense, I have a draft of a related comment that I'll probably post soon but overall expect to just leave this as not-making-much-sense for now.)

So you want a sort of partial universality sufficient to bootstrap the process locally (while not requiring an understanding of our values in fine detail), giving us enough time for a deliberation that would epistemically dominate the AI in a global sense (and get our values right)?

If that's about right, then I agree that having this would make your proposal work, but I still don't know how to get it. I need to read your previous posts on answering questions honestly.

You basically just need full universality / epistemic competitiveness locally. This is just getting around "what are values?", not the need for competitiveness. Then the global thing is also epistemically competitive, and it is able to talk about e.g. how our values interact with the alien concepts uncovered by our AI (which we want to reserve time for, since we don't have any solution better than "actually figure everything out 'ourselves'").

Almost all of the time I'm thinking about how to get epistemic competitiveness for the local interaction. I think that's the meat of the safety problem.

The upside of humans in reality is that there is no need to figure out how to make efficient imitations that function correctly (as in X-and-only-X). To be useful, imitations should be efficient, which exact imitations are not. Yet for the role of building blocks of alignment machinery, imitations shouldn't have important systematic tendencies not found in the originals, and their absence is only clear for exact imitations (if not put in very unusual environments).

Suppose you already have an AI that interacts with the world, protects it from dangerous AIs, and doesn't misalign people living in it. Then there's time to figure out how to perform X-and-only-X efficient imitation, which drastically expands the design space and makes it more plausible that the kinds of systems you've written about, which rely heavily on imitations, actually work as intended. In particular, this might include the kind of long reflection that has all the advantages of happening in reality without wasting time and resources on straightforwardly happening in reality, or letting the bad things that would happen in reality actually happen.

So figuring out object level values doesn't seem like a priority if you somehow got to the point of having an opportunity to figure out efficient imitation. (While getting to that point without figuring out object level values doesn't seem plausible, maybe there's a suggestion in here somewhere of a process that gets us there in the limit.)

I think the biggest difference is between actual and hypothetical processes of reflection. I agree that an "actual" process of reflection would likely ultimately involve most humans migrating to emulations for the speed and other advantages. (I am not sure that a hypothetical process necessarily needs efficient imitations, rather than AI reasoning about what actual humans---or hypothetical slow-but-faithful imitations---might do.)

I see getting safe and useful reasoning about exact imitations as a weird special case or maybe a reformulation of X-and-only-X efficient imitation. Anchoring to exact imitations in particular makes accurate prediction more difficult than it needs to be, as it's not the thing we care about: there are many irrelevant details that influence outcomes that accurate predictions would need to take into account. So a good "prediction" is going to be value-laden, with concrete facts about actual outcomes of setups built out of exact imitations being unimportant, which is about the same as the problem statement of X-and-only-X efficient imitation.

If such "predictions" are not good enough by themselves, underlying actual process of reflection (people living in the world) won't save/survive this if there's too much agency guided by the predictions. Using an underlying hypothetical process of reflection (by which I understand running a specific program) is more robust, as AI might go very wrong initially, but will correct itself once it gets around to computing the outcomes of the hypothetical reflection with more precision, provided the hypothetical process of reflection is defined as isolated from the AI.

I'm not sure what difference between hypothetical and actual processes of reflection you are emphasizing (if I understood what the terms mean correctly), since the actual civilization might plausibly move into a substrate that is more like ML reasoning than concrete computation (let alone concrete physical incarnation), and thus become the same kind of thing as hypothetical reflection. The most striking distinction (for AI safety) seems to be the implication that an actual process of reflection can't be isolated from decisions of the AI taken based on insufficient reflection.

There's also the need to at least define exact imitations, or better yet X-and-only-X efficient imitation, in order to define a hypothetical process of reflection; this isn't as necessary for actual reflection. So getting hypothetical reflection at all might be more difficult than getting some sort of temporary stability with actual reflection, which can then be used to define hypothetical reflection and thereby guard against the consequences of overly agentic use of bad predictions of (or based on) actual reflection.

It seems to me like "reason about a perfect emulation of a human" is an extremely similar task to "reason about a human", and it does not feel closely related to X-and-only-X efficient imitation. For example, you can make calibrated predictions about what a human would do using vastly less computing power than a human (even using existing techniques), whereas perfect imitation likely requires vastly more computing power.

The point is that in order to be useful, a prediction/reasoning process should contain mesa-optimizers that perform decision making similar in a value-laden way to what the original humans would do. The results of the predictions should be determined by decisions of the people being predicted (or of people sufficiently similar to them), in the free-will-requires-determinism/you-are-part-of-physics sense. The actual cognitive labor of decision making needs to in some way be an aspect of the process of prediction/reasoning, or it's not going to be good enough. And in order to be safe, these mesa-optimizers shouldn't be systematically warped into something different (from a value-laden point of view), and there should be no other mesa-optimizers with meaningful influence in there. This just says that prediction/reasoning needs to be X-and-only-X in order to be safe. Thus the equivalence. Prediction of exact imitation in particular is weird because in that case the similarity measure between prediction and exact imitation is hinted to not be value-laden, which it might have to be in order for the prediction to be both X-and-only-X and efficient.

This is only unimportant if X-and-only-X is the likely default outcome of predictive generalization, so that not paying attention to this won't result in failure, but nobody understands if this is the case.

The mesa-optimizers in the prediction/reasoning that are similar to the original humans are what I mean by efficient imitations (whether X-and-only-X or not). They are not themselves the predictions of original humans (or of exact imitations), which might well not be present as explicit parts of the design of reasoning about the process of reflection as a whole; instead they are the implicit decision makers that determine what the conclusions of the reasoning say, and they are much more computationally efficient (as aspects of cheaper reasoning) than exact imitations. At the same time, if they are similar enough in a value-laden way to the originals, there is no need for better predictions, much less for exact imitation: the prediction/reasoning is itself the imitation we'd want to use, without any reference to an underlying exact process. (In a story simulation, there are no concrete states of the world, only references to states of knowledge, yet there are mesa-optimizers who are the people inhabiting it.)

If prediction is to be value-laden, with value defined by reflection built out of that same prediction, the only sensible way to set this up seems to be as a fixpoint of an operator that maps (states of knowledge about) values to (states of knowledge about) values-on-reflection computed by making use of the argument values to do value-laden efficient imitation. But if this setup is not performed correctly, then even if it's set up at all, we are probably going to get bad fixpoints, as it happens with things like bad Nash equilibria etc. And if it is performed correctly, then it might be much more sensible to allow an AI to influence what happens within the process of reflection more directly than merely by making systematic distortions in predicting/reasoning about it, thus hypothetical processes of reflection wouldn't need the isolation from AI's agency that normally makes them safer than the actual process of reflection.

Suppose that someone has trained a model F to predict y given x, and I want to extend it to a question-answering model that answers arbitrary questions in a way that reflects all of F's knowledge.

Two prototypical examples I am thinking of are:

  • F runs a low-level model of physics. We want to extract high-level features of the world from the intermediate physical states, which requires e.g. a cat-classifier that operates directly on physical states rather than pixels.
  • F performs logical deductions from a set of statements produced by an opaque "intuition module." We want to output all of the intermediate inferences, which requires knowing how to translate between its intermediate language and natural language.

Here's an approach that seems kind of silly to me but is probably worth exploring:

  • Let Z be all the activations of F. In our examples, this includes (amongst other things) all of the physical states computed by F or all of the intermediate deductions.
  • Train an extraction model E that answers questions given only the activations of F.
  • Define the question-answering model by composing them: to answer a question, run F and then run E on F's activations Z.
  • Optimize E aggressively for speed, e.g. choose the simplest circuit that produces good answers.
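
For concreteness, here's a rough training-loop sketch of this setup (PyTorch; the model shapes, the fake QA data, and the parameter-norm stand-in for a "speed" penalty are all my own assumptions, not the actual proposal):

```python
import torch
import torch.nn as nn

class F(nn.Module):
    """Stand-in predictor; imagine it was already trained to predict y from x."""
    def __init__(self, d_in=16, d_hidden=64, d_out=8):
        super().__init__()
        self.layer1 = nn.Linear(d_in, d_hidden)
        self.layer2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        z = torch.relu(self.layer1(x))   # intermediate "activations"
        return self.layer2(z), z          # expose activations for E

class E(nn.Module):
    """Extractor: answers a question given only F's activations."""
    def __init__(self, d_act=64, d_question=4, d_answer=2):
        super().__init__()
        self.head = nn.Linear(d_act + d_question, d_answer)

    def forward(self, activations, question):
        return self.head(torch.cat([activations.detach(), question], dim=-1))

f, e = F(), E()
opt = torch.optim.Adam(e.parameters(), lr=1e-3)
x, q, a = torch.randn(32, 16), torch.randn(32, 4), torch.randn(32, 2)  # fake QA data

for _ in range(10):
    _, acts = f(x)
    qa_loss = ((e(acts, q) - a) ** 2).mean()
    # Crude proxy for "optimize E aggressively for speed / circuit size":
    # penalize E's parameter norm (a real version would penalize compute).
    speed_penalty = sum(p.abs().sum() for p in e.parameters())
    loss = qa_loss + 1e-4 * speed_penalty
    opt.zero_grad(); loss.backward(); opt.step()
```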

This feels appealing for the same reason that you might hope minimal circuits are not deceptive. For example, it seems like E would never bother re-using F's model of a human because it would be faster to just hard-code all the facts about how humans answer questions.

In addition to feeling a bit silly, there are some obvious problems:

  • Maybe F does happen to compute the instrumental policy (e.g. because it's thinking about a human who is thinking about the instrumental policy...) and so the fastest thing for E to do is just to look up the instrumental policy and use it. And more generally there could be other things for E to do that are faster than the intended model.
  • The reason why we expect E to be able to extract knowledge from F's activations is that F itself needs to be able to extract that knowledge. For example, F needs to be able to read sentences written in its internal language in order to perform deduction with them, so we know there is some fast way to do that. However, there are some disanalogies between the problems faced by E and F. A small issue is that F is "cooperating" with itself rather than trying to read off values from a fixed intermediate. A potentially bigger issue is that they use different priors (and F may use lots of weight-sharing and be relatively computationally slow when utilizing its own intermediate representation), and it's totally unclear what architectures for E could potentially work.

Some random thoughts:

  • This approach has a vibe more like interpretability, where we first train a model and then search over interpretations of its activations, rather than other approaches I've described recently that are focused on jointly learning good predictions and question-answering. That said, in practice I do expect that we would need to learn E "as we go" in order for it to have a chance of "keeping up" with F (since otherwise F may have developed to a very arcane state from which SGD has no chance of learning a reasonable decoding).
  • It's kind of weird to use something like a speed prior over E rather than something that tries to be a closer match for the "real" human prior. I expect to use a more "real" prior in the end but at this point I don't really know what's going on and am just looking for something that gets traction.

The most fundamental reason that I don't expect this to work is that it gives up on "sharing parameters" between the extractor and the human model. But in many cases it seems possible to do so, and giving up on that feels extremely unstable since it's trying to push against competitiveness (i.e. the model will want to find some way to save those parameters, and you don't want your intended solution to involve subverting that natural pressure).

Intuitively, I can imagine three kinds of approaches to doing this parameter sharing:

  1. Introduce some latent structure Λ (e.g. semantics of natural language, what a cat "actually is") that is used to represent both humans and the intended question-answering policy. This is the diagram H ← Λ → E.
  2. Introduce some consistency check C between H and E. This is the diagram H → C ← E.
  3. Somehow extract E from H, or build it out of pieces derived from H. This is the diagram H → E. This is kind of like a special case of 1, but it feels pretty different.

(You could imagine having slightly more general diagrams corresponding to any sort of d-connection between H and E.)

Approach 1 is the most intuitive, and it seems appealing because we can basically leave it up to the model to introduce the factorization (and it feels like there is a good chance that it will happen completely automatically). There are basically two challenges with this approach:

  • It's not clear that we can actually jointly compress H and E. For example, what if we represent H in an extremely low-level way as a bunch of neurons firing; the neurons are connected in a complicated and messy way that learned to implement something like the intended structure, but need not have any simple representation in terms of it. Even if such a factorization is possible, it's completely unclear how to argue about how hard it is to learn. This is a lot of what motivates the compression-based approaches---we can just say "H is some mess, but you can count on it basically computing E" and then make simple arguments about competitiveness (it's basically just as hard as separately learning H and E).
  • If you overcame that difficulty, you'd still have to actually incentivize this kind of factorization in the model (rather than sharing parameters in the unintended way). It's unclear how to do that (maybe you're back to thinking about something speed-prior-like, and this is just a way to address my concern about the speed-prior-like proposals), but this feels more tractable than the first problem.

I've been thinking about approach 2 over the last 2 months. My biggest concern is that it feels like you have to pay the bits of H back "as you learn them" with SGD, but you may learn them in such a way that you don't really get a useful consistency update until you've basically specified all of H. (E.g. suppose you are exposed to brain scans of humans for a long time before you learn to answer questions in a human-like way. Then at the end you want to use that to pay back the bits of the brain scans, but in order to do so you need to imagine lots of different ways the brain scans could have looked. But there's no tractable way to do that, because you have to fill in the full brain scan before it really tells you about whether your consistency condition holds.)

Approach 3 is in some sense most direct. I think this naturally looks like imitative generalization, where you use a richer set of human answers to basically build the intended question-answerer on top of your model. I don't see how to make this kind of thing work totally on its own, but I'm probably going to spend a bit of time thinking about how to combine it with approaches 1 and 2.

I think the biggest problem is that F can compute the instrumental policy (or a different policy that works well, or a fragment of it). Some possible reasons:

  • Maybe some people in the world are incidentally thinking about the instrumental policy and F makes predictions about them.
  • Maybe an adversary computes a policy that performs well in order to try to attack the learning process (since E may just copy the adversary's policy in order to be fast if it works well on training, resulting in bad generalization).
  • Maybe no one computes the instrumental policy itself, but they compute helpful building blocks and E can save compute by reusing those.
  • Maybe F wants to predict question-answering policies, and computing optimal responses to the training objective is a simple way for it to forecast what they will do. (It's totally unclear if this is actually more efficient than the intended policy.) Then E can copy that.

I don't know if any of those particular failures are too likely. But overall it seems really bad to rely on F never computing something inconvenient, and it definitely doesn't look like it's going to work in the worst case.

What are some possible outs, if in fact F computes something adversarial to try to make it easy for E to learn something bad?

  • At the time when F is running it doesn't know the question Q that E will need to answer. So in order to cause trouble it would need to compute answers for all of these questions. But I'm afraid that this is doable in many cases, since F could translate a big chunk of its internal state into natural language and then E can do some simple QA on top of that.
  • We could try to ensure that the correctness/consistency checks are really complex or challenging to meet, at least for some questions. If the complexity of these checks scales up with the complexity of F itself (e.g. by using amplification where the evaluator consults a model as big as F), then maybe it's impossible for F to meet them. This idea can be applied without the factorization into F and E, but it's probably useful to have this kind of factorization in mind as a possible tool when exploring approaches based on tricky evaluations.
  • We could try to exploit some further structural facts about the parts of F that are used by E. For example, it feels like the intended model is going to be leveraging facts that are further "upstream." For example, suppose an attacker observes that there is a cat in the room, and so writes out "There is a cat in the room" as part of a natural-language description of what's going on, which it hopes E will eventually learn to copy. If F predicts the adversary's output, it must first predict that there is actually a cat in the room, which then ultimately flows downstream into predictions of the adversary's behavior. And so we might hope to prefer the "intended" E by having it preferentially read from the earlier activations (with shorter computational histories).

Overall this kind of approach feels like it's probably doomed, but it does capture part of the intuition for why we should "just" be able to learn a simple correspondence rather than getting some crazy instrumental policy. So I'm not quite ready to let it go yet. I'm particularly interested to push a bit on the third of these approaches.

Here's another approach to "shortest circuit" that is designed to avoid this problem:

  • Learn a circuit C that outputs an entire set of beliefs. (Or maybe some different architecture, but with ~0 weight sharing so that computational complexity = description complexity.)
  • Impose a consistency requirement on those beliefs, even in cases where a human can't tell the right answer.
  • Require C's beliefs about the observations to match F's predictions. We hope that this makes C an explication of "F's beliefs."
  • Optimize some combination of (complexity) vs (usefulness), or chart the whole pareto frontier, or whatever. I'm a bit confused about how this step would work, but there are similar difficulties for the other posts in this genre, so it's exciting if this proposal gets to that final step.
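
A toy, runnable rendering of that objective (the stand-in "circuit", its size measure, and the consistency checks are all my own assumptions, not part of the proposal):

```python
def score_circuit(C, F, inputs, consistency_checks, alpha=1.0, beta=1.0):
    """Lower is better: complexity + mismatch with F's predictions + inconsistency."""
    total = alpha * C["size"]          # with ~0 weight sharing, compute ~ description length
    for x in inputs:
        beliefs = C["beliefs"](x)
        total += beta * abs(beliefs["observation"] - F(x))   # C's belief about obs must match F
        for check in consistency_checks:
            total += check(beliefs)                          # 0 if the beliefs are consistent
    return total

# Tiny example: consistency says the belief "double" should equal 2 * the belief "value".
F = lambda x: 2 * x
C = {"size": 3.0,
     "beliefs": lambda x: {"observation": 2 * x, "value": x, "double": 2 * x}}
checks = [lambda b: abs(b["double"] - 2 * b["value"])]
print(score_circuit(C, F, inputs=[1.0, 2.0], consistency_checks=checks))  # -> 3.0
```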

The "intended" circuit  just follows along with the computation done by  and then translates its internal state into natural language.

What about the problem case where F computes some reasonable beliefs (e.g. using the instrumental policy, where the simplicity prior makes us skeptical about their generalization) that C could just read off? I'll imagine those being written down somewhere on a slip of paper inside of F's model of the world.

  • Suppose that the slip of paper is not relevant to predicting the observations, i.e. it's a spandrel from the weight sharing. Then the simplest circuit C just wants to cut it out. Whatever computation was done to write things down on the slip of paper can be done directly by C, so it seems like we're in business.
  • So suppose that the slip of paper is relevant for predicting the observations, e.g. because someone looks at the slip of paper and then takes an action that affects the observations. If the correct beliefs are themselves depicted on the slip of paper, then we can again cut out the slip of paper itself and just run the same computation (that was done by whoever wrote something on the slip of paper). Otherwise, the answers produced by C still have to contain both the items on the slip of paper as well as some facts that are causally downstream of the slip of paper (as well as hopefully some about the slip of paper itself). At that point it seems like we have a pretty good chance of getting a consistency violation out of C.

Probably nothing like this can work, but I now feel like there are two live proposals for capturing the optimistic minimal circuits intuition---the one in this current comment, and in this other comment. I still feel like the aggressive speed penalization is doing something, and I feel like probably we can either find a working proposal in that space or else come up with some clearer counterexample.

We could try to exploit some further structural facts about the parts of F that are used by E. For example, it feels like the intended model is going to be leveraging facts that are further "upstream." For example, suppose an attacker observes that there is a cat in the room, and so writes out "There is a cat in the room" as part of a natural-language description of what's going on, which it hopes E will eventually learn to copy. If F predicts the adversary's output, it must first predict that there is actually a cat in the room, which then ultimately flows downstream into predictions of the adversary's behavior. And so we might hope to prefer the "intended" E by having it preferentially read from the earlier activations (with shorter computational histories).

The natural way to implement this is to penalize E not for the computation it does, but for all the computation needed to compute its output (including computation within F). The basic problem with this approach is that it incentivizes E to do all of the computation of F from scratch in a way optimized for speed rather than complexity. I'd set this approach aside for a while because of this difficulty and the unnaturalness mentioned in the sibling (where we've given up on what seems to be an important form of parameter-sharing).
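
A toy rendering of that penalty (my own framing; the graph encoding, node names, and costs are illustrative stand-ins):

```python
def penalized_cost(e_reads, e_compute, parents, node_compute):
    """Charge E for its own compute plus the compute of every node of F that is
    upstream of any activation E reads (i.e. the full computational history)."""
    upstream, stack = set(), list(e_reads)
    while stack:
        n = stack.pop()
        if n in upstream:
            continue
        upstream.add(n)
        stack.extend(parents.get(n, []))
    return e_compute + sum(node_compute[n] for n in upstream)

# Example: reading a "late" activation drags in the cost of everything before it.
parents = {"late": ["mid"], "mid": ["early"], "early": []}
node_compute = {"early": 1.0, "mid": 5.0, "late": 10.0}
print(penalized_cost(["early"], 2.0, parents, node_compute))  # 3.0
print(penalized_cost(["late"], 2.0, parents, node_compute))   # 18.0
```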

Today I was thinking about some apparently-totally-different angles of attack for the ontology identification problem, and this idea seems to have emerged again, with a potential strategy for fixing the "recompute F from scratch" problem. (In the context of ontology identification, the parameter-sharing objection no longer applies.)

Here's the idea:

  • Train F as before.
  • Start with a bunch of facts and probabilistic relationships that the human knows, expressed in their own ontology. These might be facts like "Ice melts at 100 degrees" or "Dogs bark at cats" or whatever.
  • We are going to try to jointly learn (i) a correspondence c between F's internal ontology and the human's ontology, (ii) a set of "justifications" showing that, under c, F satisfies all of the relationships the human expects. I'm imagining justifications like simulating ice at 100 degrees and observing that it indeed melts, or sampling situations with dogs and cats and verifying that the dogs bark.
  • The correspondence c is constrained to map "the observations" (a concept in the human's ontology) to the output of F, but other than that there is no simplicity prior; it can be anything.
  • Our goal is to make the justifications as computationally simple as possible. (Which indirectly incentivizes us to make c as computationally simple as possible.)
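
In symbols, a loose formalization of the above (my own framing; φ ranges over the human's known facts and relationships, J_φ is the justification for φ, and cost is computation time):

$$\min_{c,\;\{J_\varphi\}} \;\sum_{\varphi} \operatorname{cost}(J_\varphi) \quad \text{s.t.} \quad J_\varphi \text{ certifies that } \varphi \text{ holds in } F \text{ under } c, \qquad c(\text{observations}) = \text{output}(F).$$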

This still feels a bit weird, but you could imagine it handling a bunch of cases in a promising way:

  • c has no reason to "reuse the human" or to use the human's inference procedure in order to save bits, because doing either of those things makes the justifications more computationally complex.
  • If there is a human embedded inside F who performs the correspondence and writes it down in an easy-to-read way, c has no motivation to read it: if c is defined in that way, then justifying facts will require simulating the human (even if that work was already done inside F). If c simply cut out the middleman and applied the correspondence itself, then it could save compute in the typical case (except when talking about facts about that human). This is subtle in a few ways but tentatively looks plausible to me.
  • c has no reason to ignore F and implement a new more-efficient-but-more-complex simulation F′, because (i) it ultimately needs to relate observations back to the output of F, and many of its concepts are related to observations (e.g. what cats look like), (ii) that forces F and F′ to have the same behavior, (iii) the justification would then need to show that the "observations" in F′ are the same as the observations in F, which is computationally costly.

But right now it's a pretty vague proposal, because it's unclear what the nature of these facts or justifications is. If you set that up in a naive way, then the justification effectively just needs to simulate all of F. That's a problem because it reintroduces the failure mode where you need to simulate the human, and therefore there's no extra cost to just simulating the human and then listening to whatever they say.

Overall I think that probably nothing like this works, but I'm still feeling a lot more optimistic than I was last week and want to explore it further. (This is partially for reasons not discussed in this comment, that several other approaches/motivations seem to converge on something similar.)

Here's a slightly more formal algorithm along these lines:

  • Assume that both the human's model and the AI's model are Bayesian networks, where you compute the probability distribution over a node X's value based on the values of its parents pa(X). I'll write V(X) for the set of values that a node X can take on (in either model), and V(S) for the joint values of a set of nodes S.
  • A correspondence tells you how to compute the value of each node X in the human's model. This consists of (i) a neighborhood N(X) in the AI's model which suffices to determine X, (ii) a function c_X: V(N(X)) → V(X).
  • Both the AI's model and the human's model contain some distinguished observation nodes. The correspondence must be the identity on these nodes.
  • An "explanation" of a correspondence consists of a set of nodes E(X) in the AI's model for each node X in the human's model. The intuition is that we can run a simulation involving only these nodes in order to reproduce the probability distribution of X given its parents' values.
  • In particular, E(X) contains N(X), as well as N(Y) for every parent Y of X. In order to check whether E(X) reproduces the right distribution, we first sample values at random for all the nodes in E(X) some of whose parents aren't in E(X). Then we sample values for the remaining nodes. We can use the correspondence to compute the corresponding values for X and all of its parents. And then we can compute the conditional distributions for X given each set of values for its parents.
  • We require that the explanations exactly reproduce the conditional probability over X given pa(X).
  • The "cost" of the explanation of X is the sum of the compute required to sample all the nodes in E(X). The "cost" of the correspondence c_X is the compute required to evaluate it.
  • We search for the set of correspondences and explanations for which the total cost is minimized.
  • (Maybe we also have some requirement where the correspondence agrees with some training data about the values of the human's nodes. I'm not really sure about that.)
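
To make the check concrete, here is a tiny runnable toy (the three-node AI model, two-node human model, and correspondences are all my own illustrative assumptions):

```python
import random
from collections import defaultdict

random.seed(0)

def sample_explanation_nodes():
    """Sample the nodes in an explanation E(X2) = {a, b, c} of the AI's model:
    the root a is sampled uniformly (standing in for nodes whose parents fall
    outside the explanation), the rest from their conditional distributions."""
    a = random.randint(0, 1)                   # root: sampled uniformly at random
    b = a if random.random() < 0.9 else 1 - a  # b copies a with probability 0.9
    c = b if random.random() < 0.9 else 1 - b  # c copies b with probability 0.9
    return {"a": a, "b": b, "c": c}

# Correspondences: human node X1 reads off a (c_X1), human node X2 reads off c (c_X2).
corr = {"X1": lambda v: v["a"], "X2": lambda v: v["c"]}

# The human model says X2 is a noisy copy of X1: P(X2 = X1 | X1) = 0.82.
human_conditional = 0.82

# Estimate the conditional induced by the correspondence + explanation and compare.
counts = defaultdict(int)
for _ in range(100_000):
    v = sample_explanation_nodes()
    counts[(corr["X1"](v), corr["X2"](v))] += 1
match = (counts[(0, 0)] + counts[(1, 1)]) / sum(counts.values())
print(f"induced P(X2 = X1): {match:.3f}  vs  human model: {human_conditional}")
```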

Reviewing how this behaves in each of the bad cases from the parent:

  • It's very bad to define the correspondence by computing the observation and then using the human's inference algorithm. The entire motivation for defining it this way was to save on description complexity, but the correspondence is only being penalized for computational complexity. (This also forces every single explanation E(X) to include the entire process that generates the observations, which seems like it should be an even bigger deal. But this feels less like the "right" reason and I think it might change for a future version of the scheme.)
  • If there is a human in the simulation who knows the value of X, it's extremely bad to define c_X to read the value off from that human. This is because the explanation E(X) will need to include the entire process by which the human computes that value. This is guaranteed to be more expensive than just performing that compute in the correspondence c_X itself.
  • We could define c_X to compute the value of X from scratch using the initial values of the simulation, using some more-efficient simulation. But we can't do this for the observation node, since it is required to match up with the observation node in the AI's model. So in order to explain the observation node we will need to have its explanation include the AI's entire model, or at least the parts leading up to the observation. This means we are basically paying for two full copies of the computation, and so we're not really benefiting from the fact that the second copy is more efficient. (We are also potentially overlapping a lot of computation between the different E(X)'s, but again I'm a bit less sure how robust that is and whether it will hold up in different formalizations.)

There are a lot of problems and missing details in this proposal:

  • This requires exactly reproducing the conditional probabilities in the human's model. But that's totally unrealistic for anything short of a full simulation---and even for the full simulation it wouldn't reproduce the probabilities since the human model isn't accurate. So we need some way to choose a good enough explanation, i.e. a way of balancing the computational complexity of the explanation against the quality of the conditional probabilities that come out.
  • We're sampling the inputs to each explanation uniformly at random. This seems unlikely to work in general. We could easily sample each node from its marginal, but most of the action is in the correlation. Allowing arbitrary correlations causes problems (since you could just specify the "human is accurate" correlation and then read off the correct answers from there). So I think probably some more flexible system is needed here; there are a lot of options but it's tricky.
  • There is something problematic about the overlapping explanations E(X). If they overlap you need to pay for all of them, but for the intended mapping there will often be quite significant overlap. This isn't inherently a problem, but I'm scared that it's going to introduce a lot of pressure towards some different correspondence that is able to avoid that problem. We need to penalize overlap because of the case where the training data is embedded in the model---the main problem with that model is that you need to separately explain every way in which the human is correct, with highly overlapping explanations. If you didn't penalize those then you may just end up with the embedded explanations (for which the correspondence is extremely cheap).
  • There is something tricky about uniformity in the model and in the implementations of the correspondences c_X.
  • I'm still scared about the "recompute everything from scratch" failure mode. That model does need a single explanation (for the observation node) that includes the whole AI model. But (i) it doesn't have to reproduce work, (ii) it can cut out all the stuff not on the path to the observation. So the obvious reason that this one loses is the duplicated work in the correspondence. Hopefully that's actually robust.
  • We are making really strong structural assumptions on the models and the correspondence between them. We get some things for free (because humans actually do have extra structure in our beliefs about the world that is properly part of the problem statement, and the AI's model is constrained by its architecture) but not nearly this much.

Overall I'm becoming significantly more optimistic that something like this will work (though still less likely than not). Trying to step back and see the big picture, it seems like there are three key active ingredients:

  • Using "speed" instead of "simplicity" as the ~only requirement for these correspondences.
  • Having separate correspondences for separate properties and not allowing them to share tons of computation with each other (to prevent re-running the whole simulation).
  • Forcing the model to explain correlations, so that using an "embedded" copy of the answers (like a simulation of the data-generating process) forces you to reproduce the computation that produced that answer.

My next step would probably be looking at cases where these high-level ingredients aren't sufficient (e.g. are there cases where "generate obs then do inference in the human model" is actually cheaper?). If they look pretty good, then I'll spend some more time trying to fill in the details in a more plausible way.

We might be able to get similar advantages with a more general proposal like:

Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.
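
In symbols (my loose reading of this, with ℓ the QA loss and λ trading off the two terms):

$$\min_f \; \mathbb{E}_{(Q,A)}\!\left[\ell\big(f(Q), A\big)\right] \;+\; \lambda \cdot \operatorname{cost}\big(\text{verifying that } f \text{ is consistent}\big).$$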

Then the idea is that matching the conditional probabilities from the human's model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.

It's not clear how to actually formulate that consistency condition, but it seems like an improvement over the prior situation (which was just baking in the obviously-untenable requirement of exactly matching). It's also not clear what happens if this consistency condition is soft.

It's not clear what "verify that the consistency conditions are met" means. You can always do the same proposal as in the parent, though it's not really clear if that's a convincing verification. But I think that's a fundamental philosophical problem that both of these proposals need to confront.

It's not clear how to balance computational cost and the QA objective. But you are able to avoid most of the bad properties just by being on the Pareto frontier, and I don't think this is worse than the prior proposal.

Overall this approach seems like it could avoid making such strong structural assumptions about the underlying model. It also helps a lot with the overlapping explanations + uniformity problem. And it generally seems to be inching towards feeling plausible.

One aspect of this proposal which I don't know how to do is evaluating the answers of the question-answerer. That looks to me very related to the deconfusion of universality that we discussed a few months ago, and without an answer to this, I feel like I don't even know how to run this silly approach.