Nate Soares


What I’ll be doing at MIRI
I have discussed with MIRI their decision to make their research non-disclosed-by-default and we agreed that my research agenda is a reasonable exception.

Small note: my view of MIRI's nondisclosed-by-default policy is that if all researchers involved with a research program think it should obviously be public then it should obviously be public, and that doesn't require a bunch of bureaucracy. I think this while simultaneously predicting that when researchers have a part of themselves that feels uncertain or uneasy about whether their research should be public, they will find that there are large benefits to instituting a nondisclosed-by-default policy. But the policy is there to enable researchers, not to annoy them and make them jump through hoops.

(Caveat: within ML, it's still rare for risk-based nondisclosure to be treated as a real option, and many social incentives favor publishing-by-default. I want to be very clear that within the context of those incentives, I expect many people to jump to "this seems obviously safe to me" when the evidence doesn't warrant it. I think it's important to facilitate an environment where it's not just OK-on-paper but also socially-hedonic to decide against publishing, and I think that these decisions often warrant serious thought. The aim of MIRI's disclosure policy is to remove undue pressures to make publication decisions prematurely, not to override researchers' considered conclusions.)

On motivations for MIRI's highly reliable agent design research

The second statement seems pretty plausible (when we consider human-accessible AGI designs, at least), but I'm not super confident of it, and I'm not resting my argument on it.

The weaker statement you provide doesn't seem like it's addressing my concern. I expect there are ways to get highly capable reasoning (sufficient for, e.g., gaining decisive strategic advantage) without understanding low-K "good reasoning"; the concern is that said systems are much more difficult to align.

On motivations for MIRI's highly reliable agent design research

As I noted when we chatted about this in person, my intuition is less "there is some small core of good consequentialist reasoning (it has “low Kolmogorov complexity” in some sense), and this small core will be quite important for AI capabilities" and more "good consequentialist reasoning is low-K and those who understand it will be better equipped to design AGI systems where the relevant consequentialist reasoning happens in transparent boxes rather than black boxes."

Indeed, if I thought one had to understand good consequentialist reasoning in order to design a highly capable AI system, I'd be less worried by a decent margin.

My current take on the Paul-MIRI disagreement on alignability of messy AI

Weighing in late here, I'll briefly note that my current stance on the difficulty of philosophical issues is (in colloquial terms) "for the love of all that is good, please don't attempt to implement CEV with your first transhuman intelligence". My strategy at this point is very much "build the minimum AI system that is capable of stabilizing the overall strategic situation, and then buy a whole lot of time, and then use that time to figure out what to do with the future." I might be more optimistic than you about how easy it will turn out to be to find a reasonable method for extrapolating human volition, but I suspect that that's a moot point either way, because regardless, thou shalt not attempt to implement CEV with humanity's very first transhuman intelligence.

Also, +1 to the overall point of "also pursue other approaches".

Paraconsistent Tiling Agents (Very Early Draft)

Nice work!

Minor note: in equation 1, I think the should be an .

I'm not all that familiar with paraconsistent logic, so many of the details are still opaque to me. However, I do have some intuitions about where there might be gremlins:

Solution 4.1 reads, "The agent could, upon realizing the contradiction, ..." You've got to be a bit careful here: the formalism you're using doesn't contain a reasoner that does something like "realize the contradiction." As stated, the agent is simply constructed to execute an action if it can prove ; it is not constructed to also reason about whether that proof was contradictory.

You could perhaps construct a system with an action condition of , but I expect that this will re-introduce many of the difficulties faced in a consistent logic (because this basically says "execute if consistently achieves ," and my current guess is that it's pretty hard to say "consistently" in a paraconsistent logic).

Or, in other words, I pretty strongly suspect that if you attempt to formalize a solution such as solution 4.1, you'll find lots of gremlins.

For similar reasons, I also expect solution 4.2 to be very difficult to formalize. What precisely is the action condition of an agent that "notices" when both and ? I don't know paraconsistent logic well enough yet to know how the obvious agent (with the action condition from two paragraphs above) behaves, but I'm guessing it's going to be a little difficult to work with.

Regardless, there do seem to be some promising aspects to the paraconsistent approach, and I'm glad you're looking into it!

Identity and quining in UDT

Thanks for the link! I appreciate your write-ups. A few points:

1. As you've already noticed, your anti-newcomb problem is an instance of Dr. Nick Bone's "problematic problems". Benja actually gave a formalism of the general class of problems in the context of provability logic in a recent forum post. We dub these problems "evil problems," and I'm not convinced that your XDT is a sane way to deal with evil problems.

For one thing, every decision theory has an evil problem. As shown in the links above, even if we consider "fair" games, there is always a problem that punishes a decision theory for acting like it does and rewards other decision theories for acting differently. XDT does not escape this problem. For example, consider the following scenario: there are two actions, 0 and 1. Any agent that takes the action which XDT does not take scores ten points. All other agents score zero points. In this scenario, CDT scores 10, but XDT scores 0.
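
The scenario above can be sketched in a few lines of Python. This is only an illustration: `xdt` and `cdt` are hypothetical stand-ins (the real XDT's output is not specified here, and the choice of 0 is arbitrary), and `evil_problem_for` is a name invented for this sketch.

```python
from typing import Callable

Agent = Callable[[], int]  # an agent outputs action 0 or 1

def xdt() -> int:
    # Hypothetical stand-in: assume XDT happens to output 0 on this problem.
    return 0

def cdt() -> int:
    # Hypothetical stand-in: an agent that ends up taking the other action.
    return 1 - xdt()

def evil_problem_for(target: Agent) -> Callable[[Agent], int]:
    """Build a game paying 10 to any agent whose action differs from target's."""
    def score(agent: Agent) -> int:
        return 10 if agent() != target() else 0
    return score

game = evil_problem_for(xdt)
print(game(cdt), game(xdt))          # 10 0
print(evil_problem_for(cdt)(cdt))    # 0: the same construction punishes CDT
```

The construction is generic in `target`, which is the sense in which every decision theory has its own evil problem.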

So while XDT two-boxes on its own anti-newcomb problem, it is still sometimes out-performed by CDT. Or, in other words, the sort of optimality that your XDT seems to be searching for is not a very good notion of optimality. (There are other notions of optimality that I'm more partial to, although they are not entirely satisfactory.) Finding the right notion of "optimality" is most of the problem.

Specifically, this notion that "a good decision theory two-boxes on its anti-newcomb problem" strikes me as a terrible plan! Correct me if I'm wrong, but I think that the reasoning you're using goes something like this: (a) UDT does not perform optimally on its anti-newcomb problem. (b) The ideal decision theory would perform optimally on its anti-newcomb problem. (c) But given how the anti-newcomb problem is defined, that means that the ideal decision theory would two-box on its own anti-newcomb problem. (d) Therefore I want to design an agent that two-boxes on its own anti-newcomb problem.

But this doesn't seem like the sort of reasoning that leads one to pick a sane decision theory: you can't build an agent that wins its own anti-newcomb problem (in the sense of getting $1001000) but you can build one that logically controls whether it gets $1000000 or $1000. The above reasoning process selects a decision theory that logically-causes the worse outcome, and I don't think that's the right move.

2. All these agents reason by conditioning on statements which are false (such as "what if the predecessor wrote my code except with this line prepended?"). The resulting agents will obviously fail on a large class of problems; in particular, they'll fail on problems where the payoff depends upon the facts that are violated.

For simple "unfair" games (in the sense defined in the links above) where this occurs, consider scenarios where the agent is paid if and only if the length of its program is exactly a certain length: clearly, agents (5) and (6) could be severely misled in games like these. If you're only trying to make agents that work well on "fair" games (where the obvious formalization of "fair" is "extensional" as defined above), then you should probably make that much more explicit :-)

For "fair" games where the counterfactuals considered by these agents will be misleading, consider modal combat type scenarios, where the agent is reasoning about other agents that are reasoning about the first agent's source code. In these cases, there seems to be no guarantee that the logical conditional (on a false statement) is going to give a sane counterfactual (e.g. one where the extra line of code was prepended both to the agent's actual source, and to the source code that the opponent is reading.) See also my post on why conditionals are not counterfactuals.

To make this point slightly more general, it seems like all of these agents are depending pretty heavily on the "logical conditional" black box working out correctly. If you assume that conditioning on a false logical fact magically gets you all the right counterfactuals, then these decision theories make more sense. However, these decision theories all strike me as explorations about what happens when you put the logical-counterfactual-black-box in various new scenarios. (What happens when we condition on the parent's output? What happens when we condition on the program having a line prepended? etc.) Whereas the type of progress that we've been trying to make in decision theory is mostly geared towards opening the black box: how, in theory, could we design a logical-counterfactual-box that reliably works as intended?

Your agents seem to be assuming that that part of the problem is solved, which doesn't seem to be the case. As such, I have the impression that the agents you define in this post, while interesting, aren't really attacking the core of the problem, which is this: how can one reason under false premises?

3. You say

It is often claimed that the use of logical uncertainty in UDT allows for agents in different universes to reach a Pareto optimal outcome using acausal trade. If this is the case, then agents which have the same utility function should cooperate acausally with ease.

but I'm very skeptical. First of all, it would sure be nice if we could formally show that UDT-type agents always end up making intuitively-good trades, but it turns out that that's a big hairy problem (Wei Dai pointed out this comment thread).

Secondly, what makes you think that the agent defined by equation (6) is a UDT? I am not even convinced that it trades with itself (in, say, a counterfactual mugging), never mind other UDT agents.

You also said "this argument should also make the use of full input-output mappings redundant in usual UDT," and I think this indicates a misunderstanding of updatelessness. UDT doesn't have some magical trades-with-other-UDTs property; rather, UDT choosing strategies without regard for its inputs is the mechanism by which it is able to trade with counterfactual versions of itself. If you take UDT and alter it so that it considers its input (instead of all I/O maps), then you get TDT, which definitely fails to trade with counterfactual versions of itself.

You can't say "updateless decision theory trades with counterfactual versions of itself, therefore it would still do so if we took away the updatelessness," because the updatelessness is how it's able to make those trades! For similar reasons, I'm quite unconvinced that the agents of equations (5) or (6) perform well in situations such as the counterfactual mugging.
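
To illustrate why the updatelessness is doing the work, here is a toy counterfactual mugging. The dollar amounts and the fair coin are the usual illustrative choices rather than anything stated in this thread, and `updateful_choice` is a deliberately crude stand-in for an agent that conditions on its input, not a faithful model of TDT.

```python
# Toy counterfactual mugging (assumed setup): on tails the agent is asked
# for $100; on heads it receives $10,000 only if it would pay on tails.
PRIOR = {"heads": 0.5, "tails": 0.5}

def payoff(pays_on_tails: bool, coin: str) -> int:
    if coin == "tails":
        return -100 if pays_on_tails else 0
    return 10_000 if pays_on_tails else 0

def updateless_choice() -> bool:
    # Score the whole policy across both branches, ignoring the observation.
    def expected(pays: bool) -> float:
        return sum(p * payoff(pays, coin) for coin, p in PRIOR.items())
    return max([True, False], key=expected)

def updateful_choice(observed: str) -> bool:
    # Condition on the observed branch and only count payoffs inside it.
    return max([True, False], key=lambda pays: payoff(pays, observed))

print(updateless_choice())        # True: paying is worth 4950 in expectation
print(updateful_choice("tails"))  # False: having seen tails, paying just loses $100
```

The only difference between the two functions is whether the observation is used to discard the other branch, which is exactly the step that removes the trade.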

4. And, finally, I'm not quite sure why you're so concerned with avoiding quining here. The big problems, as I see them, are things like "what sort of logical conditioning mechanism gives us good logical counterfactuals?", and "how do multiple slightly-asymmetric UDT agents actually divide trade gains?", and "how do we resolve the problem where sometimes it seems like agents with less computing power have some sort of logical-first-mover-advantage?", and so on.

The use of quining in UDT doesn't seem to have any fundamental bearing on these questions (and indeed, I think there's a pretty simple modification to Vladimir Slepnev's original formalism that has the agent reason according to a distribution over its source code, instead of assuming it has a perfect quine), and therefore I don't quite understand the malcontent.

With all that out of the way, I'd also like to say: Nice work! You're clearly doing lots of in-depth thinking about the big decision theory problems, and I definitely applaud the effort. There are certainly some places where our thinking has diverged, but it's also clear that you're able to think about these things on your own and generate novel & interesting ideas, and that's definitely something I want to encourage!

Single-bit reflective oracles are enough

Also, FYI, I tossed together reflective implementations of Solomonoff Induction and AIXI using Haskell, which you can find on the MIRI github. It's not very polished, but it typechecks.

Un-manipulable counterfactuals

We might be talking about different things when we talk about counterfactuals. Let me be more explicit:

Say an agent is playing against a copy of itself on the prisoner's dilemma. It must evaluate what happens if it cooperates, and what happens if it defects. To do so, it needs to be able to predict what the world would look like "if it took action A". That prediction is what I call a "counterfactual", and it's not always obvious how to construct one. (In the counterfactual world where the agent defects, is the action of its copy also set to 'defect', or is it held constant?)
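
A minimal sketch of the two candidate counterfactuals in that parenthetical, assuming a conventional prisoner's dilemma payoff table (the numbers are illustrative, and `agent`, `counterfactual_payoff` and `best_action` are names made up for this example):

```python
# Payoff to the agent for (its action, the copy's action); illustrative values.
PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

def agent() -> str:
    # What the agent (and hence its copy) actually outputs; assumed here.
    return "C"

def counterfactual_payoff(action: str, copy_moves_too: bool) -> int:
    # Either the copy's action is set along with the agent's,
    # or it is held constant at what the copy actually does.
    copy_action = action if copy_moves_too else agent()
    return PAYOFF[(action, copy_action)]

def best_action(copy_moves_too: bool) -> str:
    return max(["C", "D"], key=lambda a: counterfactual_payoff(a, copy_moves_too))

print(best_action(copy_moves_too=True))   # C
print(best_action(copy_moves_too=False))  # D
```

The same agent and the same payoff table recommend different actions depending only on how the counterfactual is constructed.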

In this scenario, how do you use a stochastic event to "construct a counterfactual"? (I can think of some easy ways of doing this, some of which are essentially equivalent to using CDT, but I'm not quite sure which one you want to discuss.)

Un-manipulable counterfactuals

Patrick and I discussed something like this at a previous MIRIx. I think the big problem is that (if I understand what you're suggesting) it basically just implements CDT.

For example, in Newcomb's problem, if X=1 implies Omega is correct and X=0 implies the agent won't necessarily act as predicted, and it acts conditioned on X=0, then it will two-box.
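
One way to see this, as a rough sketch: assume that conditioning on X=0 makes the box contents independent of the agent's action, so the opaque box is full with some fixed probability `p_full` (a parameter introduced only for this example), and use the standard $1,000,000 / $1,000 amounts.

```python
def expected_payoff(action: str, p_full: float) -> float:
    # Under X=0 the prediction no longer tracks the action (assumption),
    # so the opaque box's expected value is the same for both actions.
    opaque = 1_000_000 * p_full
    transparent = 1_000 if action == "two-box" else 0
    return opaque + transparent

def best_action_given_x0(p_full: float) -> str:
    return max(["one-box", "two-box"], key=lambda a: expected_payoff(a, p_full))

for p in (0.0, 0.5, 1.0):
    print(p, best_action_given_x0(p))  # two-box in every case
```

Two-boxing comes out ahead by $1,000 whatever `p_full` is, which is the CDT dominance argument.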

A different angle on UDT

Yeah, causation in logical uncertainty land would be nice. It wouldn't necessarily solve the whole problem, though. Consider the scenario

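# Placeholder strategy labels; any three distinct values would do.
Hi, Med, Low = "Hi", "Med", "Low"
outcomes = [3, 2, 1, None]
strategies = {Hi, Med, Low}
A = lambda: Low  # textually identical to l below
h = lambda: Hi
m = lambda: Med
l = lambda: Low
payoffs = {}
payoffs[h()] = 3
payoffs[m()] = 2
payoffs[l()] = 1
E = lambda: payoffs.get(A())  # None when A() has no entry in payoffs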

Now it's pretty unclear that (lambda: Low)()==Hi should logically cause E()=3.

When considering (lambda: Low)()==Hi, do we want to change l without A, A without l, or both? These correspond to answers None, 3, and 1 respectively.
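
The three interventions can be checked mechanically. The sketch below re-expresses the scenario as a function of what `A` and `l` output, so that each can be overridden separately; the string labels and the parameter names `a_out` and `l_out` are assumptions made for the illustration.

```python
Hi, Med, Low = "Hi", "Med", "Low"  # placeholder labels

def E(a_out: str, l_out: str):
    # Rebuild the payoff table in the same order as the scenario,
    # with h() and m() fixed and l()'s output overridable.
    payoffs = {}
    payoffs[Hi] = 3
    payoffs[Med] = 2
    payoffs[l_out] = 1   # overwrites the Hi entry when l_out == Hi
    return payoffs.get(a_out)

print(E(Low, Low))  # 1     (no intervention)
print(E(Low, Hi))   # None  (change l without A)
print(E(Hi, Low))   # 3     (change A without l)
print(E(Hi, Hi))    # 1     (change both)
```

Nothing in the code privileges one of the three overrides; which one counts as "the" counterfactual has to come from somewhere else.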

Ideally, a causal-logic graph would be able to identify all three answers, depending on which question you're asking. (This actually gives an interesting perspective on whether or not CDT should cooperate with itself on a one-shot PD: it depends; do you think you "could" change one but not the other? The answer depends on the "could.") I don't think there's an objective sense in which any one of these is "correct," though.
