I agree inasmuch as we actually can model this sort of preference, for a sufficiently strong meaning of "model". I feel that it's much harder to be confident about any detailed claim about human values than about the validity of a generic theory of rationality. Therefore, if the ultimate generic theory of rationality imposes some conditions on utility functions (while still leaving a very rich space of different utility functions), that will lead me to try formalizing human values within those constraints. Of course, given a candidate theory, we should poke around and see whether it can be extended to weaken the constraints.
Right, I agree with this. The situation as I see it is that there's a concrete theory of rationality (logical induction) which I'm using in this way, and it is suggesting to me that your theory (InfraBayes) can still be extended somewhat.
My argument that we want this particular extension is basically as follows: human values can be thought of as the endpoint of human philosophical deliberation about values. (I am thinking of logical induction as a formalization of philosophical deliberation over time.) This endpoint seems limit-computable, but not necessarily computable. Now, it's also possible that at this endpoint, humans would have a more compact (ie, computable) representation of values. However, why assume this?
(My hope is that by appealing to deliberation like this, my argument has more force than if I was only relying on the strength of logical induction as a theory of rationality. The idea of deliberation gives us a general reason to expect that limit-computable is the right place to look.)
What is LIDT exactly?
I'm not sure details matter very much here, but I'm provisionally happy to spell out LIDT as:
If PA is consistent, then the agent cannot prove U = -10 (or anything else inconsistent) under the assumption that the agent already crossed, and therefore Löb's theorem fails to apply. In this case, there is no weird certainty that crossing is doomed.
I think this is the wrong step. Why do you think this? Just because PA is consistent doesn't mean you can't prove weird things under assumption. Look at the structure of the proof. You're objecting to an assumption. ("Suppose PA proves that crossing -> U=-10") That's a pretty weird way to object to a proof. I'm allowed to make any assumptions I like.
My guess is that you are wrestling with Löb's theorem itself. Löb's theorem is pretty weird!
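For reference, here is the standard statement of Löb's theorem, written with \Box P as shorthand for "PA proves P". The instantiation on the last line is my reading of the assumption quoted above ("Suppose PA proves that crossing -> U=-10"), so treat it as a sketch of the proof's shape rather than the full argument.

```latex
% Löb's theorem for PA, with \Box P meaning "PA proves P":
\text{If } \mathrm{PA} \vdash \Box P \rightarrow P, \text{ then } \mathrm{PA} \vdash P.

% Assumed instantiation for the bridge argument:
P \;:=\; (\text{cross} \rightarrow U = -10)
```

The point is that the hypothesis of the theorem is itself something proved under an assumption (\Box P), which is why objecting to the assumption does not block the proof.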
It seems to me that the last paragraph should update you to thinking that this plan is no worse than the default. IE: yes, this plan creates additional risk because there are complicated pathways a malign gpt-n could use to get arbitrary code run on a big computer. But if people are giving it that chance anyway, it does seem like a small increase in risk with a large potential gain. (Small, not zero, for the chance that your specific gpt-n instance somehow becomes malign when others are safe, eg if something about the task actually activated a subtle malignancy not present during other tasks).
So for me a crux would be, if it's not malign, how good could we expect the papers to actually be?
First, I'm not sure exactly why you think this is bad. Care to say more? My guess is that it just doesn't fit the intuitive notion that updates should be heading toward some state of maximal knowledge. But we do fit this intuition in other ways; specifically, logical inductors eventually trust their future opinions more than their present opinions.
Personally, I found this result puzzling but far from damning.
Second, I've actually done some unpublished work on this. There is a variation of the logical induction criterion which is more relaxed (admits more things as rational), such that constant inductors are ok. Let's call this "weak logical induction". However, it's more similar to the original criterion than you might expect. (Credit to Sam Eisenstat for doing most of the work finding the proof.) In particular, iirc, any function from deductive process history to market prices (computable or not) which is a weak logical inductor for any deductive process is also a logical inductor in the original sense.
In other words, there is room to weaken the criterion, but doing so won't broaden the class of algorithms satisfying the criterion (unless you're happy to custom-tailor algorithms to specific deductive processes, which replaces induction with simple foreknowledge).
Putting it a different way, define "universal" LIC (ULIC) to be the property of satisfying the LIC for any deductive process. We can similarly define universal weak logical induction, UWLIC. It turns out that even though LIC and WLIC are different (WLIC allows constant inductors), their universal versions are not different (again, iirc. There could have been more technical assumptions on the theorem.).
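In symbols, the definitions and the recalled claim look roughly like this. The notation is mine and purely illustrative: M stands for a function from deductive process history to market prices, and D ranges over deductive processes. The last line is the theorem as I remember it, possibly missing technical assumptions.

```latex
\begin{align*}
\mathrm{ULIC}(M)  &\iff \forall D:\ M \text{ satisfies LIC relative to } D \\
\mathrm{UWLIC}(M) &\iff \forall D:\ M \text{ satisfies WLIC relative to } D \\
\mathrm{LIC} \Rightarrow \mathrm{WLIC}, \quad
\mathrm{WLIC} \not\Rightarrow \mathrm{LIC}, \quad
&\text{but (as recalled)} \quad \mathrm{UWLIC}(M) \iff \mathrm{ULIC}(M)
\end{align*}
```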
I think the paper made a mistake by focusing on LIC rather than ULIC; Garrabrant induction is really only interesting because it's universal.
Did the paper also make a mistake by using LIC rather than WLIC? Maybe. I see no intuitive reason why our notion of rationality should be LIC rather than WLIC. Broader is better, if the specificity doesn't get you anything you intuitively want. But the theorem I'm referring to shows that the damage is minimal, since we really want the universal versions anyway.
So it's still in the observation-utility paradigm I think, or at least it seems to me that it doesn't have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn't have to. In the human example, some people are hedonists, but others aren't.
All sounds perfectly reasonable. I just hope you recognize that it's all a big mess (because it's difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or any other problematic interpretations). As I imagine you're aware, I think we need stuff from my 'learning normativity' agenda to dodge these bullets.
In particular, I would hesitate to commit to the idea that rewards are the only type of feedback we submit.
FWIW, I'm now thinking of your "value function" as expected utility in Jeffrey-Bolker terms. We need not assume a utility function to speak of expected utility. This perspective is nice in that it's a generalization of what RL people mean by "value function" anyway: the value function is exactly the expected utility of the event "I wind up in this specific situation" (at least, it is if value iteration has converged). The Jeffrey-Bolker view just opens up the possibility of explicitly representing the value of more events.
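To make the RL reading concrete, here is a minimal sketch of value iteration on a toy two-state MDP. The MDP, the names (`TRANSITIONS`, `GAMMA`, `value_iteration`), and the numbers are all made up for illustration. Once the loop has converged, `V[s]` is the expected discounted utility of the event "I wind up in state s" under the best available actions.

```python
# Hypothetical toy MDP: TRANSITIONS[state][action] = [(probability, next_state, reward), ...]
TRANSITIONS = {
    0: {
        "stay": [(1.0, 0, 0.0)],
        "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    },
    1: {"stay": [(1.0, 1, 0.0)]},  # absorbing state with no further reward
}
GAMMA = 0.5  # discount factor (assumed)


def value_iteration(transitions, gamma, tol=1e-10):
    """Repeat the Bellman optimality backup until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        new_values = {}
        for s, actions in transitions.items():
            # Expected utility of each action, then take the best one.
            new_values[s] = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            return values


V = value_iteration(TRANSITIONS, GAMMA)
```

Here the value function only assigns values to "being in state s". The Jeffrey-Bolker move would be to let the same kind of expectation be asked of arbitrary events, not just these.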
So let's see if we can pop up the conversational stack.
I guess the larger topic at hand was: how do we define whether a value function is "aligned" (in an inner sense, so, when compared to an outer objective which is being used for training it)?
Well, I think it boils down to whether the current value function makes "reliably good predictions" about the values of events. Not just good predictions on average, but predictions which are never catastrophically bad (or at least, catastrophically bad with very low probability, in some appropriate sense).
If we think of the true value function as V*(x), and our approximation as V(x), we want something like: under some distance metric, if there is a modification of V*(x) with catastrophic downsides, V(x) is closer to V*(x) than that modification. (OK that's a bit lame, but hopefully you get the general direction I'm trying to point in.)
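One way to write that down, writing V* for the true value function and V for the approximation. Both d (the distance metric) and \mathcal{C}(V^*) (the set of modifications of V* with catastrophic downsides) are placeholders I'm assuming here; pinning them down is the actual hard part.

```latex
\forall V' \in \mathcal{C}(V^*):\quad d(V, V^*) \;<\; d(V', V^*)
```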
Something like that?
OK, so, here is a question.
The abstract theory of InfraBayes (like the abstract theory of Bayes) elides computational concerns.
In reality, all of ML can more or less be thought of as using a big search for good models, where "good" means something approximately like MAP, although we can also consider more sophisticated variational targets. This introduces two different types of approximation:
What we want out of InfraBayes is a bounded regret guarantee (in settings where we previously didn't know how to get one). What we have is a picture of how to get that if we can actually do the generalized Bayesian update. What we might want is a picture of how to do that more generally, when we can't actually compute the full update.
Can we get such a thing with InfraBayes?
In other words, search is a very basic type of logical uncertainty. Currently, we don't have much of a model of that, except "Bayesian Search" (which does not provide any nice regret bounds that I know of, although I may be ignorant). We might need such a thing in order to get nice guarantees for systems which employ search internally. Can we get it?
Obviously, we can do the bayesian-search thing with InfraBayes substituted in, which already probably provides some kind of guarantee which couldn't be gotten otherwise. However, the challenge is to get the guarantee to carry all the way through to the end result.
What I'm referring to is that LI gives a notion of rational uncertain expectation for the procrastination paradox -- so, less a positive result, more a framework for thinking about what behavior is reasonable.
However, I also think LIDT solves the problem in practical terms:
My basic argument is we can model this sort of preference, so why rule it out as a possible human preference? You may be philosophically confident in finitist/constructivist values, but are you so confident that you'd want to lock unbounded quantifiers out of the space of possible values for value learning?
Just want to note that although it's been a week this is still in my thoughts, and I intend to get around to continuing this conversation... but possibly not for another two weeks.
I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice".
A couple of separate points:
Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network's epistemic uncertainty, it's potentially a hard problem! [...] I have never tried to claim that analogizing this approach to neural networks will be easy, but I don't think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks;
Yeah, this is a fair point.
and 10 hours of unsuccessful search isn't even close to the amount of time needed to demote that area from "most promising".
To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about. I just perceive it to have hit diminishing returns. (This doesn't mean no one should ever think about it again, but it does seem worth communicating why the direction hasn't borne fruit, at least to the extent that that line of research is happy being public.)
I think the question we are discussing here is: "yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?"
Sounds right to me.
I don't see how this example makes that point. If the threshold of "unrealistic" is set in such a way that "realistic" models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally's affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.
Ah, ok! Basically this is a new way of thinking about it for me, and I'm not sure what I think yet. My picture was that we argue that the top-weighted "good" (benign+correct) hypothesis can get unlucky, but should never get too unlucky, such that we can set N so that the good guy is always in the top N. Without realizability, we would have no particular reason to think "the good guy" (which is now just benign + reasonably correct) never drops below N on the list, for any N (because oscillations can be unbounded).
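Here is a toy sketch of the "how unlucky can the good guy get" picture, just to pin down what I mean by the rank. Everything in it is hypothetical: three coin-bias hypotheses, a fixed observation sequence, and the helper names. It tracks the worst rank the designated good hypothesis reaches under ordinary Bayesian updating, which is the smallest N that would have kept it in the top N on this run.

```python
def bayes_update(weights, likelihoods):
    """Multiply each weight by its likelihood for the new observation, then renormalize."""
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def rank_of(weights, index):
    """1-based rank of hypothesis `index`, counting only strictly heavier rivals."""
    return 1 + sum(1 for w in weights if w > weights[index])


def worst_rank(prior, biases, observations, good):
    """Worst rank the `good` hypothesis reaches over a sequence of coin flips."""
    weights = list(prior)
    worst = rank_of(weights, good)
    for obs in observations:
        # Each hypothesis is a coin bias: probability of observing 1.
        likelihoods = [b if obs == 1 else 1.0 - b for b in biases]
        weights = bayes_update(weights, likelihoods)
        worst = max(worst, rank_of(weights, good))
    return worst


# Hypothetical setup: hypothesis 0 (fair coin) plays the role of "the good guy".
biases = [0.5, 0.9, 0.1]
prior = [1 / 3, 1 / 3, 1 / 3]
observations = [1, 1, 0, 0, 1, 0, 1, 0]
result = worst_rank(prior, biases, observations, good=0)
```

On this sequence the fair-coin hypothesis is briefly overtaken after the opening run of ones and then recovers. The worry without realizability is that nothing forces this worst rank to stay bounded as the sequence grows.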
(A frugal hypothesis generating function instead of a brute force search over all reasonable models might miss out on, say, the benign version of the model that understands Sally's allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we're just using the example to try to evaluate the effect of just removing the realizability assumption, but not removing the privilege of a brute search through reasonable models, then I stand by the choice to deem this paragraph parenthetical).
I don't really get why yet -- can you spell the (brute-force) argument out in more detail?
(going for now, will read+reply more later)
The continuity property is really important.