Abram Demski

My Current Take on Counterfactuals

If PA is consistent, then the agent cannot prove U = -10 (or anything else inconsistent) under the assumption that the agent already crossed, and therefore Löb's theorem fails to apply. In this case, there is no weird certainty that crossing is doomed.

I think this is the wrong step. Why do you think this? Just because PA is consistent doesn't mean you can't prove weird things under assumption. Look at the structure of the proof. You're objecting to an assumption. ("Suppose PA proves that crossing -> U=-10") That's a pretty weird way to object to a proof. I'm allowed to make any assumptions I like.

My guess is that you are wrestling with Löb's theorem itself. Löb's theorem is pretty weird!
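For reference, here is the standard statement of Löb's theorem, writing \Box for "provable in PA". Taking P to be the sentence "crossing -> U=-10" is my reading of the assumption quoted above, not something spelled out in this comment.

```latex
% Löb's theorem (external form):
\text{If } \mathrm{PA} \vdash \Box P \rightarrow P, \text{ then } \mathrm{PA} \vdash P.

% Internalized form:
\mathrm{PA} \vdash \Box(\Box P \rightarrow P) \rightarrow \Box P

% Assumed instantiation for the bridge argument:
P \;:=\; (\text{cross} \rightarrow U = -10)
```

The proof only needs to establish \Box P \rightarrow P; it does so by assuming \Box P, which is exactly the step being objected to. Consistency of PA does not block reasoning under that assumption.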

Speculations against GPT-n writing alignment papers

It seems to me that the last paragraph should update you to thinking that this plan is no worse than the default. IE: yes, this plan creates additional risk because there are complicated pathways a malign GPT-n could use to get arbitrary code run on a big computer. But if people are giving it that chance anyway, it does seem like a small increase in risk with a large potential gain. (Small, not zero, for the chance that your specific GPT-n instance somehow becomes malign when others are safe, eg if something about the task actually activated a subtle malignancy not present during other tasks).

So for me a crux would be, if it's not malign, how good could we expect the papers to actually be?

An Intuitive Guide to Garrabrant Induction

First, I'm not sure exactly why you think this is bad. Care to say more? My guess is that it just doesn't fit the intuitive notion that updates should be heading toward some state of maximal knowledge. But we do fit this intuition in other ways; specifically, logical inductors eventually trust their future opinions more than their present opinions.

Personally, I found this result puzzling but far from damning.

Second, I've actually done some unpublished work on this. There is a variation of the logical induction criterion which is more relaxed (admits more things as rational), such that constant is ok. Let's call this "weak logical induction". However, it's more similar to the original criterion than you might expect. (Credit to Sam Eisenstat for doing most of the work finding the proof.) In particular, iirc, any function from deductive process history to market prices (computable or not) which is a weak logical inductor *for any deductive process* is also a logical inductor in the original sense.

In other words, there is room to weaken the criterion, but doing so won't broaden the class of algorithms satisfying the criterion (unless you're happy to custom-tailor algorithms to specific deductive processes, which replaces induction with simple foreknowledge).

Putting it a different way, define "universal" LIC (ULIC) to be the property of satisfying the LIC for any deductive process. We can similarly define universal weak logical induction, UWLIC. It turns out that even though LIC and WLIC are different (WLIC allows constant inductors), their universal versions are not different (again, iirc. There could have been more technical assumptions on the theorem.).
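Written out as a sketch (the notation is mine; D ranges over deductive processes, M is a function from deductive process history to market prices, and the final equivalence is the recalled result, possibly subject to technical assumptions I have not restated):

```latex
\mathrm{ULIC}(M)  \;:\Longleftrightarrow\; \forall D.\; M \text{ satisfies LIC relative to } D
\mathrm{UWLIC}(M) \;:\Longleftrightarrow\; \forall D.\; M \text{ satisfies WLIC relative to } D

% WLIC is strictly weaker for a fixed D (it admits constant inductors):
\mathrm{LIC} \Rightarrow \mathrm{WLIC}, \qquad \mathrm{WLIC} \not\Rightarrow \mathrm{LIC}

% Recalled claim: the universal versions coincide.
\mathrm{ULIC}(M) \;\Longleftrightarrow\; \mathrm{UWLIC}(M)
```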

I think the paper made a mistake by focusing on LIC rather than ULIC; Garrabrant induction is really only interesting because it's universal.

Did the paper also make a mistake by using LIC rather than WLIC? Maybe. I see no intuitive reason why our notion of rationality should be LIC rather than WLIC. Broader is better, if the specificity doesn't get you anything you intuitively want. But the theorem I'm referring to shows that the damage is minimal, since we really want the universal versions anyway.

My AGI Threat Model: Misaligned Model-Based RL Agent

So it's still in the observation-utility paradigm I think, or at least it seems to me that it doesn't have an automatic incentive to wirehead. It *could* want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn't have to. In the human example, some people are hedonists, but others aren't.

All sounds perfectly reasonable. I just hope you recognize that it's all a big mess (because it's difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or any other problematic interpretations). As I imagine you're aware, I think we need stuff from my 'learning normativity' agenda to dodge these bullets.

In particular, I would hesitate to commit to the idea that *rewards* are the only type of feedback we submit.

FWIW, I'm now thinking of your "value function" as *expected* utility in Jeffrey-Bolker terms. We need not assume a utility function to speak of expected utility. This perspective is nice in that it's a generalization of what RL people mean by "value function" anyway: the value function *is exactly* the expected utility of the event "I wind up in this specific situation" (at least, it is if value iteration has converged). The Jeffrey-Bolker view just opens up the possibility of explicitly representing the value of more events.
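To make the RL side of that concrete, here is a minimal value-iteration sketch. The toy MDP, the state and action names, and the discount factor are all hypothetical, chosen only for illustration. Once the loop converges, each entry of `values` is the expected discounted utility of the event "I wind up in this state" (under the greedy policy).

```python
from typing import Dict, List, Tuple

# Hypothetical toy MDP (not from the discussion): mdp[state][action] is a list of
# (probability, next_state, reward) outcomes.
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

TOY_MDP: Transitions = {
    "start": {
        "safe": [(1.0, "end", 1.0)],
        "risky": [(0.5, "end", 4.0), (0.5, "end", 0.0)],
    },
    "end": {"stay": [(1.0, "end", 0.0)]},
}


def value_iteration(mdp: Transitions, gamma: float = 0.9,
                    tol: float = 1e-9, max_iters: int = 10_000) -> Dict[str, float]:
    """Repeat the Bellman optimality backup until the values stop changing."""
    values = {state: 0.0 for state in mdp}
    for _ in range(max_iters):
        new_values = {}
        for state, actions in mdp.items():
            # Expected utility of each action, then take the best action.
            new_values[state] = max(
                sum(p * (reward + gamma * values[nxt]) for p, nxt, reward in outcomes)
                for outcomes in actions.values()
            )
        delta = max(abs(new_values[s] - values[s]) for s in mdp)
        values = new_values
        if delta < tol:
            break
    return values


values = value_iteration(TOY_MDP)
```

The Jeffrey-Bolker move would be to assign values like this to arbitrary events, not only to "being in state s".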

So let's see if we can pop up the conversational stack.

I guess the larger topic at hand was: how do we define whether a value function is "aligned" (in an inner sense, so, when compared to an outer objective which is being used for training it)?

Well, I think it boils down to whether the current value function makes "reliably good predictions" about the values of events. Not just good predictions on average, but predictions which are never catastrophically bad (or at least, catastrophically bad with *very* low probability, in some appropriate sense).

If we think of the true value function as V*(x), and our approximation as V(x), we want something like: under some distance metric, if there is a modification of V*(x) with catastrophic downsides, V(x) is *closer* to V*(x) than that modification. (OK that's a bit lame, but hopefully you get the general direction I'm trying to point in.)
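As a rough formalization of that condition (the metric d and the set \mathrm{Cat}(V^*) of catastrophic modifications are placeholders I am assuming, not things defined here):

```latex
\forall V' \in \mathrm{Cat}(V^*): \quad d(V, V^*) \;<\; d(V', V^*)
```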

Something like that?

My Current Take on Counterfactuals

OK, so, here is a question.

The abstract theory of InfraBayes (like the abstract theory of Bayes) elides computational concerns.

In reality, all of ML can more or less be thought of as using a big search for good models, where "good" means something approximately like MAP, although we can also consider more sophisticated variational targets. This introduces two different types of approximation:

- The optimization target is approximate.
- The optimization itself gives only approximate maxima.
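A toy sketch of the two approximations, assuming a hypothetical hypothesis class of coin biases with a uniform prior (none of these names come from the discussion). `exact_map` already uses an approximate target, since MAP keeps a single hypothesis instead of the full posterior; `searched_map` additionally searches only a random subset, so it can return a non-maximal hypothesis.

```python
import math
import random

# Hypothetical toy setup: each hypothesis is a coin bias, with a uniform prior.
HYPOTHESES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
PRIOR = {bias: 1.0 / len(HYPOTHESES) for bias in HYPOTHESES}


def log_posterior(bias, heads, tails):
    """Unnormalized log posterior: log prior plus log likelihood of the flips."""
    return math.log(PRIOR[bias]) + heads * math.log(bias) + tails * math.log(1.0 - bias)


def exact_map(heads, tails):
    """MAP by exhaustive search. The target is already approximate: it keeps one
    hypothesis and discards the rest of the posterior."""
    return max(HYPOTHESES, key=lambda b: log_posterior(b, heads, tails))


def searched_map(heads, tails, budget, rng):
    """MAP over a random subset of hypotheses. The optimization is approximate too,
    so the result may score below the true maximum."""
    candidates = rng.sample(HYPOTHESES, budget)
    return max(candidates, key=lambda b: log_posterior(b, heads, tails))
```

A regret guarantee for a system like this would have to survive both steps: the gap between the posterior and its MAP summary, and the gap between `exact_map` and `searched_map`.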

What we want out of InfraBayes is a bounded regret guarantee (in settings where we previously didn't know how to get one). What we have is a picture of how to get that if we can actually do the generalized Bayesian update. What we might want is a picture of how to do that more generally, when we can't actually compute the full update.

Can we get such a thing with InfraBayes?

In other words, search is a very basic type of logical uncertainty. Currently, we don't have much of a model of that, except "Bayesian Search" (which does not provide any nice regret bounds that I know of, although I may be ignorant). We might need such a thing in order to get nice guarantees for systems which employ search internally. Can we get it?

Obviously, we can do the bayesian-search thing with InfraBayes substituted in, which already probably provides some kind of guarantee which couldn't be gotten otherwise. However, the challenge is to get the guarantee to carry all the way through to the end result.

My Current Take on Counterfactuals

What I'm referring to is that LI gives *a notion of rational uncertain expectation* for the procrastination paradox -- so, less a positive result, more a framework for thinking about what behavior is reasonable.

However, I also think LIDT solves the problem in practical terms:

- In the pure procrastination-paradox problem, LIDT will eventually push the button if its logic is sound. If it did not, it would mean the conditional probability of ever pressing the button given *not pressing it today* remains forever higher than the conditional probability of ever pressing it given *pressing it today*. However, the expectation can be split into the probability it gets pushed today, and the probability that it gets pushed on any day later than today. The LI should eventually know that the conditional probability of ever pressing the button given pressing it today is arbitrarily close to 1. So in order to never press the button, the conditional probability of ever pressing it in the future (given not pressing today) would have to go to 1 (faster than the probability of it ever being pressed given pressing it today). I don't think this can happen, since there will be some nonzero limit probability that the button will never be pressed (that is, there will be supposing the button is in fact never pressed).
- In a situation where there is some actual reason to procrastinate (there are other sources of utility), but we place very high value on eventually pressing the button, it may be that the button will never be pressed? However, this will only happen if we're subjectively confident that it will eventually be pressed, and always have something better to do in the meantime. The second part seems pretty difficult. So maybe we can also prove that we eventually press the button in this case, as well.
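The first bullet's argument can be sketched in symbols. The notation is mine: P_n is the inductor's belief on day n, T_n is "the button is pressed on day n", L_n is "it is pressed on some day after n", and E = T_n \vee L_n is "it is ever pressed". I am also ignoring tie-breaking and exploration details of LIDT.

```latex
% Splitting the expectation:
P_n(E) \;=\; P_n(T_n) \;+\; P_n(\lnot T_n \wedge L_n)

% Pressing today guarantees the button is pressed:
P_n(E \mid T_n) \;\longrightarrow\; 1

% Never pressing requires, on every day n,
P_n(E \mid \lnot T_n) \;>\; P_n(E \mid T_n),
\quad\text{which forces}\quad P_n(E \mid \lnot T_n) \;\longrightarrow\; 1 .

% But if the button is in fact never pressed, the claim is that
\lim_{n \to \infty} P_n(\lnot E) \;>\; 0 ,
% which is in tension with P_n(E \mid \lnot T_n) \to 1 while P_n(\lnot T_n) stays high.
```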

My basic argument is *we can model this sort of preference, so why rule it out as a possible human preference?* You may be philosophically confident in finitist/constructivist values, but are you *so confident* that you'd want to lock unbounded quantifiers out of the space of possible values for value learning?

Formal Inner Alignment, Prospectus

Just want to note that although it's been a week this is still in my thoughts, and I intend to get around to continuing this conversation... but possibly not for another two weeks.

Formal Inner Alignment, Prospectus

I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice".

A couple of separate points:

- My main worry continues to be the way bad actors have control over an io channel, rather than the slowdown issue.
- I feel like there's something a bit wrong with the 'theory/practice' framing at the moment. My position is that certain theoretical concerns (eg, embeddedness) have a tendency to translate to practical concerns (eg, approximating AIXI misses some important aspects of intelligence). Solving those 'in theory' may or may not translate to solving the practical issues 'in practice'. Some forms of in-theory solution, like setting the computer outside of the universe, are particularly unrelated to solving the practical problems. Your particular in-theory solution to embeddedness strikes me as this kind. I would contest whether it's even an in-theory solution to embeddedness problems; after all, are you theoretically saying that the computer running the imitation learning has no causal influence over the human being imitated? (This relates to my questions about whether the learner specifically requests demonstrations, vs just requiring the human to do demonstrations forever.) I don't really think of something like that as a "theoretical solution" to the realizability problem at all. That's reserved for something like logical induction which has unrealistically high computational complexity, but does avoid a realizability assumption.

Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network's epistemic uncertainty, it's potentially a hard problem! [...] I have never tried to claim that analogizing this approach to neural networks will be easy, but I don't think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks;

Yeah, this is a fair point.

and 10 hours of unsuccessful search isn't even close to the amount of time needed to demote that area from "most promising".

To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about. I just perceive it to have hit diminishing returns. (This doesn't mean no one should ever think about it again, but it does seem worth communicating why the direction hasn't borne fruit, at least to the extent that that line of research is happy being public.)

I think the question we are discussing here is: "yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?"

Sounds right to me.

I don't see how this example makes that point. If the threshold of "unrealistic" is set in such a way that "realistic" models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally's affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.

Ah, ok! Basically this is a new way of thinking about it for me, and I'm not sure what I think yet. My picture was that we argue that the top-weighted "good" (benign+correct) hypothesis can get unlucky, but should never get too unlucky, such that we can set N so that the good guy is always in the top N. Without realizability, we would have no particular reason to think "the good guy" (which is now just benign + *reasonably* correct) never drops below N on the list, for any N (because oscillations can be unbounded).

(A frugal hypothesis generating function instead of a brute force search over all reasonable models might miss out on, say, the benign version of the model that understands Sally's allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we're just using the example to try to evaluate the effect of *just* removing the realizability assumption, but not removing the privilege of a brute search through reasonable models, then I stand by the choice to deem this paragraph parenthetical).

I don't really get why yet -- can you spell the (brute-force) argument out in more detail?

(going for now, will read+reply more later)

My Current Take on Counterfactuals

The continuity property is really important.

Right, I agree with this. The situation as I see it is that there's a concrete theory of rationality (logical induction) which I'm using in this way, and it is suggesting to me that *your* theory (InfraBayes) can still be extended somewhat.

My argument that we *want* this particular extension is basically as follows: human values can be thought of as the endpoint of human philosophical deliberation about values. (I am thinking of logical induction as a formalization of philosophical deliberation over time.) This endpoint seems limit-computable, but not necessarily computable. Now, it's also possible that at this endpoint, humans would have a more compact (ie, computable) representation of values. However, why assume this?

(My hope is that by appealing to deliberation like this, my argument has more force than if I was only relying on the strength of logical induction as a theory of rationality. The idea of deliberation gives us a general reason to expect that limit-computable is the right place to look.)

I'm not sure details matter very much here, but I'm provisionally happy to spell out LIDT as:

Concrete enough?