The simplest version of the parenting idea involves an agent that is Bayes-optimal. Parenting would just be designed to help out a Bayesian reasoner, since there's not much you can say in general about how much a Bayesian reasoner will explore, or how much it will learn; it all depends on its prior. (Almost all policies are Bayes-optimal with respect to some (universal) prior.) There's still a fundamental trade-off between learning and staying safe, so while the Bayes-optimal agent does not do as bad a job of picking a point on that trade-off as the asymptotically optimal agent does, that doesn't quite allow us to say that it will pick the right point. As long as we have access to "parents" that might be able to guide an agent toward world-states where this trade-off is less severe, we might as well make use of them.
And I'd say it's a conclusion, but not a main one.
The last paragraph of the conclusion (maybe you read it?) is relevant to this.
Certainly for the true environment, the optimal policy exists and you could follow it. The only thing I’d say differently is that you’re pretty sure the laws of physics won’t change tomorrow. But more realistic forms of uncertainty doom us either to forgo knowledge (and potentially good policies) or to destroy ourselves. If one slowed down science in certain areas for reasons along the lines of the vulnerable world hypothesis, that would be taking the “safe” stance in this trade-off.
How does one make even weaker guarantees of good behavior?
I don't think there's really a good answer. Section 6 Theorem 4 is my only suggestion here.
Well, nothing in the paper has to do with MDPs! The results are for general computable environments. Does that answer the question?
In the scheme I described, the behavior can be described as "the agent tries to get the text 'you did what we wanted' to be sent to it." A great way to do this would be to intervene in the provision of text. So the scheme I described doesn't make any progress in avoiding the classic wireheading scenario. The second possibility I described, where there are some games played regarding how different parameters are trained (the RNN is only trained to predict observations, and then another neural network originates from a narrow hidden layer in the RNN and produces text predictions as output) has the exact same wireheading pathology too.
Changing the nature of the goal as a function of what text it sees also doesn't stop "take over world, and in particular, the provision of text" from being an optimal solution.
I'm still uncertain whether I'm missing some key detail in your proposal, but right now my impression is that it falls prey to the same sort of wireheading incentive that a standard reinforcement learner does.
I don't have a complete picture of the scheme. Is it: "From a trajectory of actions and observations, an English text sample is presented with each observation, and the agent has to predict this text alongside the observations, and then it acts according to some reward function like (and this is simplified) 1 if it sees the text 'you did what we wanted' and 0 otherwise"? If the scheme you're proposing is different than that, my guess is that you're imagining a recurrent neural network architecture and most of the weights are only trained to predict the observations, and then other weights are trained to predict the text samples. Am I in the right ballpark here?
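To make the scheme (as I understand it) concrete, here is a minimal sketch of the simplified reward function described above and why it invites wireheading. The names, payoffs, and trajectory format are all hypothetical illustrations, not anything from your proposal:

```python
# Hypothetical sketch of the scheme as described: the agent receives
# (action, observation, text) triples, and its (simplified) reward depends
# only on whether the supervisors' text matches a target phrase.

TARGET_TEXT = "you did what we wanted"  # the target phrase from the scheme

def reward(text: str) -> float:
    """Simplified reward: 1 if the provided text is the target phrase, else 0."""
    return 1.0 if text == TARGET_TEXT else 0.0

def episode_return(trajectory) -> float:
    """Sum rewards over a trajectory of (action, observation, text) tuples."""
    return sum(reward(text) for _action, _obs, text in trajectory)

# The wireheading pathology: a policy that seizes control of the text
# channel collects maximal reward regardless of what the humans wanted.
honest = [("act", "obs", "good job"), ("act", "obs", TARGET_TEXT)]
wirehead = [("seize channel", "obs", TARGET_TEXT), ("noop", "obs", TARGET_TEXT)]

print(episode_return(honest))    # 1.0
print(episode_return(wirehead))  # 2.0 -- intervening on the text channel dominates
```

Nothing about training most of the weights only on observation prediction changes this comparison, since the pathology lives in the reward function, not in the predictor.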
I jumped off a small cliff into a lake once, and when I was standing on the rock, I couldn't bring myself to jump. I stepped back to let another person go, and then I stepped onto the rock and jumped immediately. I might be able to do something similar.
But I wouldn't be able to endorse such behavior while reflecting on it if I were in that situation, given my conviction that I am unable to change math. Indeed, I don't think it would be wise of me to cooperate in that situation. What I really mean when I say that I would rather be someone who cooperated in a twin prisoner's dilemma is "conditioned on the (somewhat odd) hypothetical that I will at some point end up in a high-stakes twin prisoner's dilemma, I would rather it be the case that I am the sort of person who cooperates", which is really saying that I would rather play a twin prisoner's dilemma game against a cooperator than against a defector, which is just an obvious preference for a favorable event to befall me rather than an unfavorable one. In similar news, conditioned on my encountering a situation in the future where somebody checks to see if I am a good person, and destroys the world if I am, I would like to become a bad person. Conditioned on my encountering a situation in which someone saves the world if I am devout, I would like to become a devout person.
If I could turn off the part of my brain that forms the question "but why should I cooperate, when I can't change math?" that would be a path to becoming a reliable cooperator, but I don't see a path to silencing a valid argument in my brain without a lobotomy (short of possibly just cooperating really fast without thinking, and of course without forming the doubt "wait, why am I trying to do this really fast without thinking?").
If that's the case, then I assume that you defect in the twin prisoner's dilemma.
I do. I would rather be someone who didn't. But I don't see a path to becoming that person without lobotomizing myself. And it's not a huge concern of mine, since I don't expect to encounter such a dilemma. (Rarely am I the one pointing out that a philosophical thought experiment is unrealistic. It's not usually the point of thought experiments to be realistic--we usually only talk about them to evaluate the consequences of different positions. But it is worth noting here that I don't see this as a major issue for me.) I haven't written this up because I don't think it's particularly urgent to explain to people why I think CDT is correct over FDT. Indeed, on one view, it would be cruel of me to do so! And I don't think it matters much for AI alignment.
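For concreteness, here is the payoff structure driving both halves of this position. The numbers are the usual illustrative prisoner's dilemma payoffs, not anything specific to this discussion:

```python
# Row player's payoff in a standard prisoner's dilemma (illustrative numbers).
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

# The CDT reasoning: holding the other player's action fixed (since my
# choice can't causally change it), defection strictly dominates.
for twin in ("C", "D"):
    assert PAYOFF[("D", twin)] > PAYOFF[("C", twin)]

# The twin setup: both players run the same decision procedure, so the
# only outcomes actually realized are (C, C) and (D, D) -- and the
# cooperator's outcome is better, which is why one would "rather be"
# a cooperator while still judging defection correct in the moment.
print(PAYOFF[("C", "C")], PAYOFF[("D", "D")])  # 3 1
```

The two assertions capture the dominance argument; the final comparison captures the sense in which facing (or being) a cooperator is simply the more favorable event.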
Don't you think that's at least worth looking into?
This was partly why I decided to wade into the weeds, because absent a discussion of how plausible it is that we could affect things non-causally, yes, one's first instinct would be that we should at least look into it. And maybe, like, 0.1% of resources directed toward AI Safety should go toward whether we can change Math, but honestly, even that seems high. Because what we're talking about is changing logical facts. That might be number 1 on my list of intractable problems.
After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals.
This is getting subtle :) and it's hard to make sure our words mean things, but I submit that causal counterfactuals are much less fictitious than logical counterfactuals! I submit that it is less extravagant to claim we can affect this world than it is to claim that we can affect hypothetical worlds with which we are not in causal contact. No matter what action I pick, math stays the same. But it's not the case that no matter what action I pick, the world stays the same. (In the former case, which action I pick could in theory tell us something about what mathematical object the physical universe implements, but it doesn't change math.) In both cases, yes, there is only one action that I do take, but assuming we can reason both about causal and logical counterfactuals, we can still talk sensibly about the causal and logical consequences of picking actions I won't in fact end up picking. I don't have a complete answer to "how should we define causal/logical counterfactuals" but I don't think I need to for the sake of this conversation, as long as we both agree that we can use the terms in more or less the same way, which I think we are successfully doing.
I don't yet see why creating a CDT agent avoids catastrophe better than FDT.
I think running an aligned FDT agent would probably be fine. I'm just arguing that it wouldn't be any better than running a CDT agent (except for the interim phase before Son-of-CDT has been created). And indeed, I don't think any new decision theories will perform any better than Son-of-CDT, so it doesn't seem to me to be a priority for AGI safety. Finally, the fact that no FDT agent has actually been fully defined certainly weighs in favor of just going with a CDT agent.
Ah. I agree that this proposal would not optimize causally inaccessible areas of the multiverse, except by accident. I also think that nothing we do optimizes causally inaccessible areas of the multiverse, and we could probably have a long discussion about that, but putting a pin in that,
Let's take things one at a time. First, let's figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.
Regarding the longer discussion, and sorry if this is below my usual level of clarity: what do we have at our disposal to make counterfactual worlds with low utility inconsistent? Well, all that we humans have at our disposal is choices about actions. One can play with words, and say that we can choose not just what to do, but also who to be, and choosing who to be (i.e. editing our decision procedure) is supposed by some to have logical consequences, but I think that's a mistake. 1) Changing who we are is an action like any other. Actions don't have logical consequences, just causal consequences. 2) We might be changing which algorithm our brain executes, but we are not changing the output of any algorithm itself, the latter being the thing with supposedly far-reaching (logical) consequences on hypothetical worlds outside of causal contact. In general, I'm pretty bearish on the ability of humans to change math.
Consider the CDT person who adopts FDT. They are probably interested in the logical consequences of the fact that their brain in this world outputs certain actions. But no mathematical axioms have changed along the way, so no propositions have changed truth value. The fact that their brain now runs a new algorithm implies that (the math behind) physics ended up implementing that new algorithm. I don't see how it implies much else, logically. And I think the fact that no mathematical axioms have changed supports that intuition quite well!
The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn't change without axioms changing. So if you have ambitions of rendering low-utility worlds inconsistent, I guess my question is this: which axioms of Math would you like to change, and how? I understand you don't hope to causally affect this, but how could you even hope to affect this logically? (I'm struggling to even put words to that; the most charitable phrasing I can come up with, in case you don't like "affect this logically", is "manifest different logic", but I worry that phrasing is Confused.) Also, I'm capitalizing Math there because this whole conversation involves being Platonists about math, where Math is something that really exists, so you can't just invent a new axiomatization of math and say the world is different now.