Nate Soares


Biology-Inspired AGI Timelines: The Trick That Never Works

My take on the exercise:

Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI, thereby moving out its median further away in time?

Short version: Nah. For example, if you were wrong by dint of failing to consider the right hypothesis, you can correct for it by considering predictable properties of the hypotheses you missed (even if you don't think you can correctly imagine the true research pathway or w/e in advance). And if you were wrong in your calculations of the quantities you did consider, correction will regress you towards your priors, which are simplicity-based rather than maxent.

Long version: Let's set aside for the moment the question of what the "correct" maxent distribution on AGI timelines is (which, as others have noted, depends a bit on how you dice up the space of possible years). I don't think this is where the action is, anyway.

Let's suppose that we're an aspiring Bayesian considering that we may have made some mistakes in our calculations. Where might those mistakes have been? Perhaps:

1. We were mistaken about what we saw (and erroneously updated on observations that we did not make)?
2. We were wrong in our calculations of quantities of the form P(e|H) (the likelihoods) or P(H) (the priors), or the multiplications thereof?
3. We failed to consider a sufficiently wide space of hypotheses, in our efforts to complete our updating before the stars burn out?

Set aside for now that the correct answer is "it's #3, like we might stumble over #1 and #2 every so often but bounded reasoners are making mistake #3 day in and day out, it's obviously mostly #3", and take these one at a time:

Insofar as we were mistaken about what we saw, correcting our mistake should involve reverting an update (and then probably making a different update, because we saw something that we mistook, but set that aside). Reverting an update pushes us back towards our prior. This will often increase entropy, but not necessarily! (For example, if we thought we saw a counter-example to gravitation, that update might dramatically increase our posterior entropy, and reverting the update might revert us back to confident narrow predictions about phones falling.) Our prior is not a maxent prior but a simplicity prior (which is important if we ever want to learn anything at all).
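To make the gravitation example concrete, here is a minimal sketch with made-up numbers. The three hypotheses, the prior, and the likelihoods are all illustrative assumptions. It shows a case where the update raises entropy, so reverting it lowers entropy and returns us to a confident, non-maxent prior.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def bayes_update(prior, likelihoods):
    """Posterior over hypotheses, given a prior and P(e|H) for each hypothesis."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical prior: very confident in ordinary gravitation (first entry),
# with two fringe alternatives.
prior = [0.98, 0.01, 0.01]

# Hypothetical likelihoods of an apparent counter-example to gravitation:
# very unlikely under ordinary gravitation, fairly likely under the alternatives.
likelihoods = [0.01, 0.5, 0.5]

posterior = bayes_update(prior, likelihoods)

print("entropy after the (mistaken) update:", entropy(posterior))
print("entropy after reverting to the prior:", entropy(prior))
print("maximum possible entropy:", math.log2(len(prior)))
```

With these numbers, undoing the update moves the distribution away from maximum entropy rather than towards it.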

Insofar as we were wrong in our calculations of various quantities, correcting our mistake depends on which direction we were wrong, and for which hypotheses. In practice, a reflectively stable reasoner shouldn't be able to predict the (magnitude-weighted) direction of their error in calculating P(e|H): if we know that we tend to overestimate that value when e is floobish, we can just bump down our estimate whenever e is floobish, until we stop believing such a thing (or, more intelligently, trace down the source of the systematic error and correct it, but I digress). I suppose we could imagine humbly acknowledging that we're imperfect at estimating quantities of the form P(e|H), and then driving all such estimates towards 1/n, where n is the number of possible observations? This doesn't seem like a very healthy way to think, but its effect is to again regress us towards our prior. Which, again, is a simplicity prior and not a maxent prior. (If instead we start what-iffing about whether we're wrong in our intuitive calculations that vaguely correspond to the P(H) quantities, and decide to try to make all our P(H) estimates more similar to each other regardless of H as a symbol of our virtuous self-doubt, then we start regressing towards maximum entropy. We correspondingly lose our ability to learn. And of course, if you're actually worried that you're wrong in your estimates of the prior probabilities, I recommend checking whether you think your P(H)-style estimates are too high or too low in specific instances, rather than driving all such estimates to uniformity. But also ¯\_(ツ)_/¯, I can't argue good priors into a rock.)
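A small sketch of why driving the likelihoods towards 1/n regresses us to the prior rather than to maximum entropy. The prior, the likelihoods, and the `shrink_likelihoods` helper are hypothetical, chosen only for illustration. Once every P(e|H) equals the same constant, the constant cancels in Bayes' rule and the posterior is just the prior.

```python
import math

def bayes_update(prior, likelihoods):
    """Posterior over hypotheses, given a prior and P(e|H) for each hypothesis."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

def shrink_likelihoods(likelihoods, weight, n_observations):
    """Blend each P(e|H) with the flat value 1/n (weight=0: unchanged, weight=1: fully flat)."""
    flat = 1.0 / n_observations
    return [(1 - weight) * l + weight * flat for l in likelihoods]

# Hypothetical simplicity-weighted prior over three hypotheses (not uniform).
prior = [0.7, 0.2, 0.1]
# Hypothetical P(e|H) for an observation with two possible outcomes (n = 2).
likelihoods = [0.1, 0.6, 0.9]

for weight in (0.0, 0.5, 1.0):
    posterior = bayes_update(prior, shrink_likelihoods(likelihoods, weight, 2))
    print(weight, [round(p, 3) for p in posterior])

fully_humble = bayes_update(prior, shrink_likelihoods(likelihoods, 1.0, 2))
```

At `weight=1.0` the posterior matches the prior, which is still far from uniform. Only flattening the P(H) values themselves would push towards maximum entropy.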

Insofar as we were wrong because we were failing to consider a sufficiently wide array of hypotheses, correcting our mistake depends on which hypotheses we're missing. Indeed, much of Eliezer's dialog seems to me like Eliezer trying to say "it's mistake #3 guys, it's always #3", plus "just as the hypothesis that we'll get AGI at 20 watts doesn't seem relevant because we know the ways computers consume watts and the ways brains consume watts, and they're radically different, so too can we predict that whatever the correct specific hypothesis for how the first human-attained AGIs consume compute, it will make the amount of compute that humans consume seem basically irrelevant." Like, if we don't get AGI till 2050 then we probably can't consider the correct specific research path, a la #3, but we can predict various properties of all plausible unvisualized paths, and adjust our current probabilities accordingly, in acknowledgement of our current #3-style errors.

In sum: accounting for wrongness should look less like saying "I'd better inject more entropy into my distributions", and more like asking "are my estimates of P(e|H) off in a predictable direction when e looks like this and H looks like that?". The former is more like sacrificing some of your hard-won information on the altar of the gods of modesty; the latter is more like considering the actual calculations you did and where the errors might reside in them. And even if you insist on sacrificing some of your information because maybe you did the calculations wrong, you should regress towards a simplicity prior rather than towards maximum entropy (which in practice looks like reaching for fewer and simpler-seeming deep regularities in the world, rather than pushing median AGI timelines out to the year 52,021), which is also how things will look if you think you're missing most of the relevant information. Though of course, your real mistake was #3, you're ~always committing mistake #3. And accounting for #3 in practice does tend to involve increasing your error bars until they are wide enough to include the sorts of curveballs that reality tends to throw at you. But the reason for widening your error bars there is to include more curveballs, not just to add entropy for modesty's sake. And you're allowed to think about all the predictable-in-advance properties of likely ballcurves even if you know you can't visualize-in-advance the specific curve that the ball will take.

In fact, Eliezer's argument reads to me like it's basically "look at these few and simple-seeming deep regularities in the world" plus a side-order of "the way reality will actually go is hard to visualize in advance, but we can still predict some likely properties of all the concrete hypotheses we're failing to visualize (which in this case invalidate biological anchors, and pull my timelines closer than 2051)", both of which seem to me like hallmarks of accounting for wrongness.

What I’ll be doing at MIRI
I have discussed with MIRI their decision to make their research non-disclosed-by-default and we agreed that my research agenda is a reasonable exception.

Small note: my view of MIRI's nondisclosed-by-default policy is that if all researchers involved with a research program think it should obviously be public then it should obviously be public, and that doesn't require a bunch of bureaucracy. I think this while simultaneously predicting that when researchers have a part of themselves that feels uncertain or uneasy about whether their research should be public, they will find that there are large benefits to instituting a nondisclosed-by-default policy. But the policy is there to enable researchers, not to annoy them and make them jump through hoops.

(Caveat: within ML, it's still rare for risk-based nondisclosure to be treated as a real option, and many social incentives favor publishing-by-default. I want to be very clear that within the context of those incentives, I expect many people to jump to "this seems obviously safe to me" when the evidence doesn't warrant it. I think it's important to facilitate an environment where it's not just OK-on-paper but also socially-hedonic to decide against publishing, and I think that these decisions often warrant serious thought. The aim of MIRI's disclosure policy is to remove undue pressures to make publication decisions prematurely, not to override researchers' considered conclusions.)

On motivations for MIRI's highly reliable agent design research

The second statement seems pretty plausible (when we consider human-accessible AGI designs, at least), but I'm not super confident of it, and I'm not resting my argument on it.

The weaker statement you provide doesn't seem like it's addressing my concern. I expect there are ways to get highly capable reasoning (sufficient for, e.g., gaining decisive strategic advantage) without understanding low-K "good reasoning"; the concern is that said systems are much more difficult to align.

On motivations for MIRI's highly reliable agent design research

As I noted when we chatted about this in person, my intuition is less "there is some small core of good consequentialist reasoning (it has “low Kolmogorov complexity” in some sense), and this small core will be quite important for AI capabilities" and more "good consequentialist reasoning is low-K and those who understand it will be better equipped to design AGI systems where the relevant consequentialist reasoning happens in transparent boxes rather than black boxes."

Indeed, if I thought one had to understand good consequentialist reasoning in order to design a highly capable AI system, I'd be less worried by a decent margin.

My current take on the Paul-MIRI disagreement on alignability of messy AI

Weighing in late here, I'll briefly note that my current stance on the difficulty of philosophical issues is (in colloquial terms) "for the love of all that is good, please don't attempt to implement CEV with your first transhuman intelligence". My strategy at this point is very much "build the minimum AI system that is capable of stabilizing the overall strategic situation, and then buy a whole lot of time, and then use that time to figure out what to do with the future." I might be more optimistic than you about how easy it will turn out to be to find a reasonable method for extrapolating human volition, but I suspect that that's a moot point either way, because regardless, thou shalt not attempt to implement CEV with humanity's very first transhuman intelligence.

Also, +1 to the overall point of "also pursue other approaches".

Paraconsistent Tiling Agents (Very Early Draft)

Nice work!

Minor note: in equation 1, I think the should be an .

I'm not all that familiar with paraconsistent logic, so many of the details are still opaque to me. However, I do have some intuitions about where there might be gremlins:

Solution 4.1 reads, "The agent could, upon realizing the contradiction, ..." You've got to be a bit careful here: the formalism you're using doesn't contain a reasoner that does something like "realize the contradiction." As stated, the agent is simply constructed to execute an action if it can prove ; it is not constructed to also reason about whether that proof was contradictory.

You could perhaps construct a system with an action condition of , but I expect that this will re-introduce many of the difficulties faced in a consistent logic (because this basically says "execute if consistently achieves ," and my current guess is that it's pretty hard to say "consistently" in a paraconsistent logic).

Or, in other words, I pretty strongly suspect that if you attempt to formalize a solution such as solution 4.1, you'll find lots of gremlins.

For similar reasons, I also expect solution 4.2 to be very difficult to formalize. What precisely is the action condition of an agent that "notices" when both and ? I don't know paraconsistent logic well enough yet to know how the obvious agent (with the action condition from two paragraphs above) behaves, but I'm guessing it's going to be a little difficult to work with.

Regardless, there do seem to be some promising aspects to the paraconsistent approach, and I'm glad you're looking into it!

Identity and quining in UDT

1. As you've already noticed, your anti-newcomb problem is an instance of Dr. Nick Bone's "problematic problems". Benja actually gave a formalism of the general class of problems in the context of provability logic in a recent forum post. We dub these problems "evil problems," and I'm not convinced that your XDT is a sane way to deal with evil problems.

For one thing, every decision theory has an evil problem. As shown in the links above, even if we consider "fair" games, there is always a problem that punishes a decision theory for acting like it does and rewards other decision theories for acting differently. XDT does not escape this problem. For example, consider the following scenario: there are two actions, 0 and 1. Any agent that takes the action which XDT does not take scores ten points. All other agents score zero points. In this scenario, CDT scores 10, but XDT scores 0.

So while XDT two-boxes on its own anti-newcomb problem, it is still sometimes out-performed by CDT. Or, in other words, the sort of optimality that your XDT seems to be searching for is not a very good notion of optimality. (There are other notions of optimality that I'm more partial to, although they are not entirely satisfactory.) Finding the right notion of "optimality" is most of the problem, but I don't think the notion of optimality that XDT seems to be searching for is a very good one.

Specifically, this notion that "a good decision theory two-boxes on its anti-newcomb problem" strikes me as a terrible plan! Correct me if I'm wrong, but I think that the reasoning you're using goes something like this: (a) UDT does not perform optimally on its anti-newcomb problem. (b) The ideal decision theory would perform optimally on its anti-newcomb problem. (c) But given how the anti-newcomb problem is defined, that means that the ideal decision theory would two-box on its own anti-newcomb problem. (d) Therefore I want to design an agent that two-boxes on its own anti-newcomb problem.

But this doesn't seem like the sort of reasoning that leads one to pick a sane decision theory: you can't build an agent that wins its own anti-newcomb problem (in the sense of getting $1001000) but you can build one that logically controls whether it gets $1000000 or $1000. The above reasoning process selects a decision theory that logically-causes the worse outcome, and I don't think that's the right move.
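The two-action scenario from point 1 can be sketched in a few lines. This is a toy: the agents here are hypothetical constant-action stand-ins, not implementations of XDT or CDT. It only illustrates the construction "pay whoever does not do what the target agent does", which can be aimed at any fixed agent.

```python
from typing import Callable

# In this toy, an agent is just a function that returns action 0 or 1.
Agent = Callable[[], int]

def evil_problem_for(target: Agent) -> Callable[[Agent], int]:
    """Build the game that pays 10 to any agent taking the action `target` does NOT take."""
    def payoff(agent: Agent) -> int:
        return 10 if agent() != target() else 0
    return payoff

# Hypothetical stand-ins for two different decision procedures.
def always_zero() -> int:
    return 0

def always_one() -> int:
    return 1

agents = {"always_zero": always_zero, "always_one": always_one}

for name, target in agents.items():
    game = evil_problem_for(target)
    scores = {other: game(agent) for other, agent in agents.items()}
    print(f"evil problem for {name}: {scores}")
```

Whichever agent the game is built around scores zero on it while the other scores ten, so "does well on its own evil problem" cannot be a criterion that any single agent satisfies.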

2. All these agents reason by conditioning on statements which are false (such as "what if the predecessor wrote my code except with this line prepended?"). The resulting agents will obviously fail on a large class of problems; in particular, they'll fail on problems where the payoff depends upon the facts that are violated.

For simple "unfair" games (in the sense defined in the links above) where this occurs, consider scenarios where the agent is paid if and only if the length of its program is exactly a certain length: clearly, agents (5) and (6) could be severely misled in games like these. If you're only trying to make agents that work well on "fair" games (where the obvious formalization of "fair" is "extensional" as defined above), then you should probably make that much more explicit :-)

For "fair" games where the counterfactuals considered by these agents will be misleading, consider modal combat type scenarios, where the agent is reasoning about other agents that are reasoning about the first agent's source code. In these cases, there seems to be no guarantee that the logical conditional (on a false statement) is going to give a sane counterfactual (e.g. one where the extra line of code was prepended both to the agent's actual source, and to the source code that the opponent is reading.) See also my post on why conditionals are not counterfactuals.

To make this point slightly more general, it seems like all of these agents are depending pretty heavily on the "logical conditional" black box working out correctly. If you assume that conditioning on a false logical fact magically gets you all the right counterfactuals, then these decision theories make more sense. However, these decision theories all strike me as explorations about what happens when you put the logical-counterfactual-black-box in various new scenarios. (What happens when we condition on the parent's output? What happens when we condition on the program having a line prepended? etc.) Whereas the type of progress that we've been trying to make in decision theory is mostly geared towards opening the black box: how, in theory, could we design a logical-counterfactual-box that reliably works as intended?

Your agents seem to be assuming that that part of the problem is solved, which doesn't seem to be the case. As such, I have the impression that the agents you define in this post, while interesting, aren't really attacking the core of the problem, which is this: how can one reason under false premises?

3. You say

It is often claimed that the use of logical uncertainty in UDT allows for agents in different universes to reach a Pareto optimal outcome using acausal trade. If this is the case, then agents which have the same utility function should cooperate acausally with ease.

but I'm very skeptical. First of all, it would sure be nice if we could formally show that UDT-type agents always end up making intuitively-good trades, but it turns out that that's a big hairy problem (Wei Dai pointed out this comment thread).

Secondly, what makes you think that the agent defined by equation (6) is a UDT? I am not even convinced that it trades with itself (in, say, a counterfactual mugging), never mind other UDT agents.

You also said "this argument should also make the use of full input-output mappings redundant in usual UDT," and I think this indicates a misunderstanding of updatelessness. UDT doesn't have some magical trades-with-other-UDTs property; rather, UDT choosing strategies without regard for its inputs is the mechanism by which it is able to trade with counterfactual versions of itself. If you take UDT and alter it so that it considers its input (instead of all I/O maps), then you get TDT, which definitely fails to trade with counterfactual versions of itself.

You can't say "updateless decision theory trades with counterfactual versions of itself, therefore it would still do so if we took away the updatelessness," because the updatelessness is how it's able to make those trades! For similar reasons, I'm quite unconvinced that the agents of equations (5) or (6) perform well in situations such as the counterfactual mugging.
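Here is a minimal sketch of the point about updatelessness, using a counterfactual mugging with commonly used but assumed stakes: a fair coin; on tails the agent is asked for $100; on heads it receives $10000 if and only if it would have paid on tails. The function names are hypothetical, and this is a toy comparison of "choose a policy before seeing the input" versus "choose an action after conditioning on the input", not a formalization of UDT or TDT.

```python
PAY, REFUSE = "pay", "refuse"

def expected_value_of_policy(action_when_asked):
    """Value of a whole policy, evaluated before the coin flip is observed."""
    tails_outcome = -100 if action_when_asked == PAY else 0    # asked to pay
    heads_outcome = 10_000 if action_when_asked == PAY else 0  # rewarded iff it would pay
    return 0.5 * tails_outcome + 0.5 * heads_outcome

def updateless_choice():
    """Pick the best input-to-action policy without conditioning on the input."""
    return max((PAY, REFUSE), key=expected_value_of_policy)

def value_after_updating(action):
    """Value of an action once the agent has conditioned on 'the coin came up tails'."""
    return -100 if action == PAY else 0

def updating_choice():
    """Pick the best action after updating on the observed input."""
    return max((PAY, REFUSE), key=value_after_updating)

print("policy selection:", updateless_choice(), expected_value_of_policy(PAY))
print("after updating:  ", updating_choice())
```

The policy-level evaluation is the only place where the heads branch shows up at all, which is why removing it removes the trade.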

4. And, finally, I'm not quite sure why you're so concerned with avoiding quining here. The big problems, as I see them, are things like "what sort of logical conditioning mechanism gives us good logical counterfactuals?", and "how do multiple slightly-asymmetric UDT agents actually divide trade gains?", and "how do we resolve the problem where sometimes it seems like agents with less computing power have some sort of logical-first-mover-advantage?", and so on.

The use of quining in UDT doesn't seem to have any fundamental bearing on these questions (and indeed, I think there's a pretty simple modification to Vladimir Slepnev's original formalism that has the agent reason according to a distribution over its source code, instead of assuming it has a perfect quine), and therefore I don't quite understand the malcontent.

With all that out of the way, I'd also like to say: Nice work! You're clearly doing lots of in-depth thinking about the big decision theory problems, and I definitely applaud the effort. There are certainly some places where our thinking has diverged, but it's also clear that you're able to think about these things on your own and generate novel & interesting ideas, and that's definitely something I want to encourage!

Single-bit reflective oracles are enough

Also, FYI, I tossed together reflective implementations of Solomonoff Induction and AIXI using Haskell, which you can find on the MIRI github. It's not very polished, but it typechecks.

Un-manipulable counterfactuals

We might be talking about different things when we talk about counterfactuals. Let me be more explicit:

Say an agent is playing against a copy of itself on the prisoner's dilemma. It must evaluate what happens if it cooperates, and what happens if it defects. To do so, it needs to be able to predict what the world would look like "if it took action A". That prediction is what I call a "counterfactual", and it's not always obvious how to construct one. (In the counterfactual world where the agent defects, is the action of its copy also set to 'defect', or is it held constant?)
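To illustrate the parenthetical question, here is a toy sketch of the two ways of constructing the counterfactual. The payoff numbers are conventional prisoner's dilemma values that I am assuming for illustration, and the function names are hypothetical.

```python
# PAYOFF[(my_action, copy_action)] -> my payoff (assumed illustrative values).
PAYOFF = {
    ("C", "C"): 2,
    ("C", "D"): 0,
    ("D", "C"): 3,
    ("D", "D"): 1,
}

def counterfactual_copy_held_constant(my_action, copy_action):
    """Counterfactual in which the copy's action is held fixed while mine varies."""
    return PAYOFF[(my_action, copy_action)]

def counterfactual_copy_mirrors(my_action):
    """Counterfactual in which the copy's action is set to whatever I do."""
    return PAYOFF[(my_action, my_action)]

def best_action(evaluate):
    """Return the action with the highest counterfactual payoff."""
    return max(("C", "D"), key=evaluate)

# Holding the copy constant, defection looks better whatever the copy does.
for copy_action in ("C", "D"):
    choice = best_action(lambda a: counterfactual_copy_held_constant(a, copy_action))
    print(f"copy held at {copy_action}: choose {choice}")

# Letting the copy mirror the agent, cooperation looks better.
print("copy mirrors me: choose", best_action(counterfactual_copy_mirrors))
```

The two constructions recommend different actions on the same game, which is why the choice of counterfactual has to be spelled out.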

In this scenario, how do you use a stochastic event to "construct a counterfactual"? (I can think of some easy ways of doing this, some of which are essentially equivalent to using CDT, but I'm not quite sure which one you want to discuss.)

Un-manipulable counterfactuals

Patrick and I discussed something like this at a previous MIRIx. I think the big problem is that (if I understand what you're suggesting) it basically just implements CDT.

For example, in Newcomb's problem, if X=1 implies Omega is correct and X=0 implies the agent won't necessarily act as predicted, and it acts conditioned on X=0, then it will twobox.
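A rough sketch of why conditioning on X=0 yields two-boxing, under my reading of the proposal: under X=0 the prediction is treated as independent of the agent's action. The payoffs are the usual Newcomb amounts, and the function names and the probability parameter are hypothetical.

```python
ONE_BOX, TWO_BOX = "one-box", "two-box"

def payoff(action, prediction):
    """Standard Newcomb payoffs: the opaque box is full iff one-boxing was predicted."""
    opaque_box = 1_000_000 if prediction == ONE_BOX else 0
    transparent_box = 1_000 if action == TWO_BOX else 0
    return opaque_box + transparent_box

def expected_payoff_given_x0(action, p_predicted_one_box):
    """Expected payoff when the prediction does not depend on the action taken (X=0)."""
    return (p_predicted_one_box * payoff(action, ONE_BOX)
            + (1 - p_predicted_one_box) * payoff(action, TWO_BOX))

def choice_given_x0(p_predicted_one_box):
    """Action chosen by an agent that evaluates its options conditioned on X=0."""
    return max((ONE_BOX, TWO_BOX),
               key=lambda a: expected_payoff_given_x0(a, p_predicted_one_box))

for p in (0.0, 0.5, 1.0):
    print(p, choice_given_x0(p))
```

Two-boxing comes out ahead by $1000 for every value of the assumed probability, which is the CDT-style dominance argument.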