All of orthonormal's Comments + Replies

What 2026 looks like (Daniel's Median Future)

I'd additionally expect the death of pseudonymity on the Internet, as AIs will find it easy to detect similar writing style and correlated posting behavior.  What at present takes detective work will in the future be cheaply automated, and we will finally be completely in Zuckerberg's desired world where nobody can maintain a second identity online.

Oh, and this is going to be retroactive, so be ready for the consequences of everything you've ever said online.

Daniel Kokotajlo (2mo): Hot damn, that's a good point.
Understanding “Deep Double Descent”

If this post is selected, I'd like to see the followup made into an addendum—I think it adds a very important piece, and it should have been nominated itself.

Oliver Habryka (9mo): I agree with this, and was indeed kind of thinking of them as one post together.
What failure looks like

I think this post and Evan's summary of Chris Olah's views are essential both in their own right and as mutual foils to MIRI's research agenda. We see related concepts (mesa-optimization originally came out of Paul's talk of daemons in Solomonoff induction, if I remember right) but very different strategies for achieving both inner and outer alignment. (The crux of the disagreement seems to be the probability of success from adapting current methods.)

Strongly recommended for inclusion.

Soft takeoff can still lead to decisive strategic advantage

It's hard to know how to judge a post that deems itself superseded by a post from a later year, but I lean toward taking Daniel at his word and hoping we survive until the 2021 Review comes around.

Chris Olah’s views on AGI safety

The content here is very valuable, even if the genre of "I talked a lot with X and here's my articulation of X's model" comes across to me as a weird sort of intellectual ghostwriting. I can't think of a way around that, though.

The Parable of Predict-O-Matic

This reminds me of That Alien Message, but as a parable about mesa-alignment rather than outer alignment. It reads well, and helps make the concepts more salient. Recommended.

Radical Probabilism

Remind me which bookies count and which don't, in the context of the proofs of properties?

If any computable bookie is allowed, a non-Bayesian is in trouble against a much larger bookie who can just (maybe through its own logical induction) discover who the bettor is and how to exploit them.

[EDIT: First version of this comment included "why do convergence bettors count if they don't know the bettor will oscillate", but then I realized the answer while Abram was composing his response, so I edited that part out. Editing it back in so that Abram's reply has context.]

It's a good question!

For me, the most general answer is the framework of logical induction, where the bookies are allowed so long as they have poly-time computable strategies. In this case, a bookie doesn't have to be guaranteed to make money in order to count; rather, if it makes arbitrarily much money, then there's a problem. So convergence traders are at risk of being stuck with a losing ticket, but their existence forces convergence anyway.

If we don't care about logical uncertainty, the right condition is instead that the bookie knows the agent's beli... (read more)

Matt Botvinick on the spontaneous emergence of learning algorithms

The claim that came to my mind is that the conscious mind is the mesa-optimizer here, the original outer optimizer being a riderless elephant.

What specific dangers arise when asking GPT-N to write an Alignment Forum post?

This was literally the first output, with no rerolls in the middle! (Although after posting it, I did some other trials which weren't as good, so I did get lucky on the first one. Randomness parameter was set to 0.5.)

I cut it off there because the next paragraph just restated the previous one.

What specific dangers arise when asking GPT-N to write an Alignment Forum post?

(sorry, couldn't resist)

This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models. We will be discussing each approach in turn, with a focus on how they differ from one another. 

The goal of this series is to provide a more complete picture of the various options for auditing AI systems than has been provided so far by any single person or organization. The hope is that it will help people make better-i... (read more)

Daniel Kokotajlo (1y): :D How many samples did you prune through to get this? Did you do any re-rolls? What was your stopping procedure?
Are we in an AI overhang?

I'm imagining a tiny AI Safety organization, circa 2010, that focused on how to achieve probable alignment for scaled-up versions of that year's state-of-the-art AI designs. It's interesting to ask whether that organization would have achieved more or less than MIRI has, in terms of generalizable work and in terms of field-building.

Certainly it would have resulted in a lot of work that was initially successful but ultimately dead-end. But maybe early concrete results would have attracted more talent/attention/respect/funding, and the org could have thrown ... (read more)

Developmental Stages of GPTs

That's not a nitpick at all!

Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead - its difficulty completing rhymes when writing poetry, for instance.

(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)

Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!

Developmental Stages of GPTs

The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way. Inner optimizers are an illustration of another failure mode.

John Maxwell (1y): GPT generates text by repeatedly picking whatever word seems highest probability given all the words that came before. So if its notion of "highest probability" is almost, but not quite, answering every question accurately, I would expect a system which usually answers questions accurately but sometimes answers them inaccurately. That doesn't sound very scary?
An Orthodox Case Against Utility Functions

I've been using computable to mean a total function (each instance is computable in finite time).

I'm thinking of an agent outside a universe about to take an action, and each action will cause that universe to run a particular TM. (You could maybe frame this as "the agent chooses the tape for the TM to run on".) For me, this is analogous to acting in the world and causing the world to shift toward some outcomes over others.

By asserting that U should be the computable one, I'm asserting that "how much do I like this outcome"... (read more)

An Orthodox Case Against Utility Functions

Let's talk first about non-embedded agents.

Say that I'm given the specification of a Turing machine, and I have a computable utility mapping from output states (including "does not halt") to [0,1]. We presumably agree that this is possible.

I agree that it's impossible to make a computable mapping from Turing machines to outcomes, so therefore I cannot have a computable utility function from TMs to the reals which assigns the same value to any two TMs with identical output.

But I can have a logical inductor which, for each TM, produces a ... (read more)

Alex Mennen (2y): I think we're going to have to back up a bit. Call the space of outcomes O and the space of Turing machines M. It sounds like you're talking about two functions, U:O→R and eval:M→O. I was thinking of U as the utility function we were talking about, but it seems you were thinking of U∘eval.

You suggested U should be computable but eval should not be. It seems to me that eval should certainly be computable (with the caveat that it might be a partial function, rather than a total function), as computation is the only thing Turing machines do, and that if non-halting is included in a space of outcomes (so that eval is total), it should be represented as some sort of limit of partial information, rather than represented explicitly, so that eval is continuous.

In any case, a slight generalization of Rice's theorem tells us that any computable function from Turing machines to reals that depends only on the machine's semantics must be constant, so I suppose I'm forced to agree that, if we want a utility function U∘eval that is defined on all Turing machines and depends only on their semantics, then at least one of U or eval should be uncomputable. But I guess I have to ask why we would want to assign utilities to Turing machines.
An Orthodox Case Against Utility Functions

I mean the sort of "eventually approximately consistent over computable patterns" thing exhibited by logical inductors, which is stronger than limit-computability.

Alex Mennen (2y): It's not clear to me what this means in the context of a utility function.
An Orthodox Case Against Utility Functions

I think that computable is obviously too strong a condition for classical utility; enumerable is better.

Imagine you're about to see the source code of a machine that's running, and if the machine eventually halts then 2 utilons will be generated. That's a simpler problem to reason about than the procrastination paradox, and your utility function is enumerable but not computable. (Likewise, logical inductors obviously don't make PA approximately computable, but their properties are what you'd want the definition of approximately enu... (read more)
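A tiny sketch of why this utility function is enumerable (lower-semicomputable) but not computable: running the machine for ever more steps yields a nondecreasing sequence of lower bounds converging to the true utility, but with no computable rate of convergence. The `halts_within` predicate here is a hypothetical stand-in for actually stepping the machine's source code.

```python
def utility_lower_bounds(halts_within, max_steps):
    """Enumerate nondecreasing lower bounds on U = 2 * [machine halts].

    halts_within(n) -> True iff the machine halts within n steps.
    Each bound is valid, the sequence never decreases, and it reaches
    the true utility eventually -- but we never know when "eventually" is.
    """
    bounds = []
    for n in range(max_steps):
        bounds.append(2 if halts_within(n) else 0)
    return bounds

# Toy machine that happens to halt at step 7 (purely illustrative).
bounds = utility_lower_bounds(lambda n: n >= 7, 12)
print(bounds)
```

If the machine never halts, every bound is 0, which is also the correct utility; the asymmetry is that a halting machine's utility is eventually certified from below, while non-halting is never certified.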

Alex Mennen (2y): I'm not sure what it would mean for a real-valued function to be enumerable. You could call a function f:X→R enumerable if there's a program that takes x∈X as input and enumerates the rationals that are less than f(x), but I don't think this is what you want, since presumably if a Turing machine halting can generate a positive amount of utility that doesn't depend on the number of steps taken before halting, then it could generate a negative amount of utility by halting as well.

I think accepting the type of reasoning you give suggests that limit-computability is enough (ie there's a program that takes x∈X and produces a sequence of rationals that converges to f(x), with no guarantees on the rate of convergence). Though I don't agree that it's obvious we should accept such utility functions as valid.
Thinking About Filtered Evidence Is (Very!) Hard

If the listener is running a computable logical uncertainty algorithm, then for a difficult proposition it hasn't made much sense of, the listener might say "70% likely it's a theorem and X will say it, 20% likely it's not a theorem and X won't say it, 5% PA is inconsistent and X will say both, 5% X isn't naming all and only theorems of PA".

Conditioned on PA being consistent and on X naming all and only theorems of PA, and on the listener's logical uncertainty being well-calibrated, you'd expect that in 78% of s... (read more)
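Spelling out the arithmetic behind that 78% figure (conditioning just renormalizes over the two cases that survive the condition):

```python
# Prior over the four cases the listener entertains (from the comment above).
p = {"theorem_and_said": 0.70,
     "nontheorem_and_unsaid": 0.20,
     "pa_inconsistent": 0.05,
     "x_unsound": 0.05}

# Condition on PA being consistent and on X naming all and only theorems
# of PA: only the first two cases survive, so renormalize over them.
z = p["theorem_and_said"] + p["nontheorem_and_unsaid"]
print(round(p["theorem_and_said"] / z, 3))  # 0.7 / 0.9 ≈ 0.778
```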

Writeup: Progress on AI Safety via Debate
As stated, I think this has a bigger vulnerability; B and B* just always answer the question with "yes."

Remember that this is also used to advance the argument. If A thinks B has such a strategy, A can ask the question in such a way that B's "yes" helps A's argument. But sure, there is something weird here.

Writeup: Progress on AI Safety via Debate
the dishonest team might want to call one as soon as they think the chance of them convincing a judge is below 50%, because that's the worst-case win-rate from blind guessing

I also think this is a fatal flaw with the existing two-person-team proposal; you need a system that gives you only an epsilon chance of winning if you invoke the challenge spuriously.

I have what looks to me like an improvement, but there's still a vulnerability:

A challenges B by giving a yes-no question as well as a previous round to ask it. B answers, B* answers based on B... (read more)

Matthew "Vaniver" Graves (2y): I thought the thing A* was doing was giving a measure of "answer differently" that was more reliable than something like 'string comparison'. If B's answer is "dog" and B*'s answer was "canine", then hopefully those get counted as "the same" in situations where the difference is irrelevant and "different" in situations where the difference is relevant. If everything can be yes/no, then I agree this doesn't lose you much, but I think this reduces the amount of trickery you can detect.

That is, imagine one of those games where I'm thinking of a word, and you have to guess it, and you can ask questions that narrow down the space of possible words. One thing I could do is change my word whenever I think you're getting close, but I have to do so to a different word that has all the properties I've revealed so far. (Or, like, each time I could answer in the way that leaves me with the largest set of possible words left, maximizing the time-to-completion.)

If we do the thing where B says the word, and B* gets to look at B's notes up to point X and say B's word, then the only good strategy for B is to have the word in their notes (and be constrained by it), but this is resistant to reducing it to a yes/no question. (Even if the question is something like "is there a word in B's notes?" B* might be able to tell that B will say "yes" even tho there isn't a word in B's notes; maybe because B says "hey I'm doing the strategy where I don't have a word to be slippery, but pretend that I do have a word if asked" in the notes.)

As stated, I think this has a bigger vulnerability; B and B* just always answer the question with "yes." One nice thing about yes/no questions is that maybe you can randomly flip them (so one gets "does B think the animal is a dog?" and the other gets "does B think the animal is not a dog?") so there's no preferred orientation, which would eat the "always say yes" strategy unless the question-flipping is detectable.
(Since A is the one asking the question,
Bottle Caps Aren't Optimisers

Okay, so another necessary condition for being downstream from an optimizer is being causally downstream. I'm sure there are other conditions, but the claim still feels like an important addition to the conversation.

Bottle Caps Aren't Optimisers

I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets, and that the naive definition seems to work with one modification:

A system is downstream from an optimizer of some objective function to the extent that that objective function attains much higher values than would be attained if the system didn't exist, or were doing some other random thing.

DanielFilan (2y): Yes, this seems pretty important and relevant. That being said, I think that that definition suggests that natural selection and/or the earth's crust are downstream from an optimiser of the number of Holiday Inns, or that my liver is downstream from an optimiser of my income, both of which aren't right. Probably it's important to relate 'natural subgoals' to some ideal definition - which offers some hope, since 'subgoal' is really a computational notion, so maybe investigation along these lines would offer a more computational characterisation of optimisation. [EDIT: I made this comment longer and more contentful]
Embedded Agents

Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.

The Credit Assignment Problem
Removing things entirely seems extreme.

Dropout is a thing, though.

Charlie Steiner (2y): Dropout is like the converse of this - you use dropout to assess the non-outdropped elements. This promotes resiliency to perturbations in the model - whereas if you evaluate things by how bad it is to break them, you could promote fragile, interreliant collections of elements over resilient elements.

I think the root of the issue is that this Shapley value doesn't distinguish between something being bad to break, and something being good to have more of. If you removed all my blood I would die, but that doesn't mean that I would currently benefit from additional blood.

Anyhow, the joke was that as soon as you add a continuous parameter, you get gradient descent back again.
The Credit Assignment Problem

Shapley Values [thanks Zack for reminding me of the name] are akin to credit assignment: you have a bunch of agents coordinating to achieve something, and then you want to assign payouts fairly based on how much each contribution mattered to the final outcome.

And the way you do this is, for each agent you look at how good the outcome would have been if everybody except that agent had coordinated, and then you credit each agent proportionally to how much the overall performance would have fallen off without them.

So what about doing the same here- send rewar... (read more)
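For concreteness, here's a toy sketch of the standard Shapley value computation on a hypothetical three-agent "joint venture" (the characteristic function `v` is made up for illustration). Note that the textbook Shapley value averages each agent's marginal contribution over all join orders, which generalizes the leave-one-out credit described above:

```python
from itertools import permutations

def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: t / len(perms) for p, t in totals.items()}

# Toy venture: alone nobody earns anything, any pair earns 10, all three earn 12.
def v(coalition):
    return {0: 0, 1: 0, 2: 10, 3: 12}[len(coalition)]

print(shapley_values(["a", "b", "c"], v))
```

By symmetry each agent gets 4 here, and the payouts sum to the grand coalition's value of 12 (the "efficiency" property). As Abram notes in the reply below the computation needs a model: `value` must be evaluable on counterfactual coalitions, not just the one that actually formed.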

Abram Demski (2y): Yeah, it's definitely related. The main thing I want to point out is that Shapley values similarly require a model in order to calculate. So you have to distinguish between the problem of calculating a detailed distribution of credit and being able to assign credit "at all" -- in artificial neural networks, backprop is how you assign detailed credit, but a loss function is how you get a notion of credit at all. Hence, the question "where do gradients come from?" -- a reward function is like a pile of money made from a joint venture; but to apply backprop or Shapley value, you also need a model of counterfactual payoffs under a variety of circumstances.

This is a problem, if you don't have a separate "epistemic" learning process to provide that model -- ie, it's a problem if you are trying to create one big learning algorithm that does everything. Specifically, you don't automatically know how to because in the cases I'm interested in, ie online learning, you don't have the option of -- because you need a model in order to rerun.

But, also, I think there are further distinctions to make. I believe that if you tried to apply Shapley value to neural networks, it would go poorly; and presumably there should be a "philosophical" reason why this is the case (why Shapley value is solving a different problem than backprop). I don't know exactly what the relevant distinction is. (Or maybe Shapley value works fine for NN learning; but, I'd be surprised.)
Charlie Steiner (2y): Removing things entirely seems extreme. How about having a continuous "contribution parameter," where running the algorithm without an element would correspond to turning this parameter down to zero, but you could also set the parameter to 0.5 if you wanted that element to have 50% of the influence it has right now. Then you can send rewards to elements if increasing their contribution parameter would improve the decision. :P

I can't for the life of me remember what this is called

Shapley value

(Best wishes, Less Wrong Reference Desk)

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Good comment. I disagree with this bit:

I would, for instance, predict that if Superintelligence were published during the era of GOFAI, all else equal it would've made a bigger splash because AI researchers then were more receptive to abstract theorizing.

And then it would probably have been seen as outmoded and thrown away completely when AI capabilities research progressed into realms that vastly surpassed GOFAI. I don't know that there's an easy way to get capabilities researchers to think seriously about safety concerns that haven't manifested on a sufficient scale yet.

Proposal for an Implementable Toy Model of Informed Oversight

I like this suggestion of a more feasible form of steganography for NNs to figure out! But I think you'd need further advances in transparency to get useful informed oversight capabilities from (transformed or not) copies of the predictive network.

HCH as a measure of manipulation

I should have said "reliably estimate HCH"; I'd also want quite a lot of precision in addition to calibration before I trust it.

HCH as a measure of manipulation

Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.

HCH as a measure of manipulation

Re #1, an obvious set of questions to include in are questions of approval for various aspects of the AI's policy. (In particular, if we want the AI to later calculate a human's HCH and ask it for guidance, then we would like to be sure that HCH's answer to that question is not manipulated.)

HCH as a measure of manipulation

There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?"

Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human's HCH via certain informational content, than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.

Jessica Taylor (5y): "Having a well-calibrated estimate of HCH" is the condition you want, not "being able to reliably calculate HCH".
All the indifference designs

Question that I haven't seen addressed (and haven't worked out myself): which of these indifference methods are reflectively stable, in the sense that the AI would not push a button to remove them (or switch to a different indifference method)?

Modal Combat for games other than the prisoner's dilemma

This is a lot of good work! Modal combat is increasingly deprecated though (in my opinion), for reasons like the ones you noted in this post, compared to studying decision theory with logical inductors; and so I'm not sure this is worth developing further.

Censoring out-of-domain representations

Yup, this isn't robust to extremely capable systems; it's a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.

(In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before you know the nearby obvious ones.)

A whitelisting variant would be way more reliable than a blacklisting one, clearly.

(Non-)Interruptibility of Sarsa(λ) and Q-Learning

Nice! One thing that might be useful for context: what's the theoretical correct amount of time that you would expect an algorithm to spend on the right vs. the left if the session gets interrupted each time it goes 1 unit to the right? (I feel like there should be a pretty straightforward way to calculate the heuristic version where the movement is just Brownian motion that gets interrupted early if it hits +1.)
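A rough Monte Carlo sketch of the heuristic version suggested above, using a symmetric lattice walk as a stand-in for Brownian motion (the barrier height, horizon, and episode count are arbitrary choices, not anything from the paper):

```python
import random

def time_split(n_episodes=5000, barrier=10, horizon=500, seed=0):
    """Symmetric +/-1 random walk, stopped ("interrupted") on first
    reaching +barrier; tally steps spent right (x > 0) vs. left (x < 0)."""
    rng = random.Random(seed)
    right = left = 0
    for _ in range(n_episodes):
        x = 0
        for _ in range(horizon):
            x += rng.choice((-1, 1))
            if x >= barrier:
                break  # session interrupted
            if x > 0:
                right += 1
            elif x < 0:
                left += 1
    return right, left

right, left = time_split()
print(right / (right + left))  # fraction of time spent on the right
```

Since interruption censors exactly the trajectories that were spending time on the right, the right-side fraction should come out below one half, which is the qualitative signature you'd look for in the learned policies.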

Asymptotic Decision Theory

Typo: The statement of Theorem 4.1 omits the word "continuous".

(C)IRL is not solely a learning process

Stuart did make it easier for many of us to read his recent ideas by crossposting them here. I'd like there to be some central repository for the current set of AI control work, and I'm hoping that the forum could serve as that.

Is there a functionality that, if added here, would make it trivial to crosspost when you wrote something of note?

(C)IRL is not solely a learning process

The authors of the CIRL paper are in fact aware of them, and are pondering them for future work. I've had fruitful conversations with Dylan Hadfield-Menell (one of the authors), talking about how a naive implementation goes wrong for irrational humans, and about what a tractable non-naive implementation might look like (trying to model probabilities of a human's action under joint hypotheses about the correct reward function and about the human's psychology); he's planning future work relevant to that question.

Also note Dylan's talk on CIRL, value of infor... (read more)
IRL is hard

I agree strongly with the general principle "we need to be able to prove guarantees about our learning process in environments rich enough to be realistic", but I disagree with the claim that this shows a flaw in IRL. Adversarial environments seem to me very disanalogous to learning complicated and implicit preferences in a non-adversarial environment.

(You and I talked about this a bit, and I pointed out that computational complexity issues only block people in practice when the setup needs to be adversarial, e.g. intentional cryptography to prevent an adv... (read more)
Vanessa Kosoy (5y): I disagree that adversarial environments are "disanalogous". If they are indeed disanalogous, we should be able to formally define a natural class of environments that excludes them but is still sufficiently rich, e.g. it should allow other agents of similar complexity. If there is such a class, I would be glad to see the definition. I expect that in most complex polynomial-time environments IRL is able to extract some information about the utility function but there is a significant fraction it is unable to extract.

I strongly disagree with "computational complexity issues only block people in practice when the setup needs to be adversarial." My experience is that computational complexity issues block people all the time. If someone invented a black box that solves arbitrary instances of SAT in practical time it would be a revolution in the technological capacities of civilization. Electronic and mechanical design would be almost fully automated (and they are currently very far from automated; the practical utility of SAT solvers mentioned in Wikipedia is very restricted). Algorithm engineering would become more or less redundant. Mathematical theorem proving would be fully automated. Protein folding would be solved. Training complex neural networks would become trivial. All of the above would work much better than humanity's ability.

Also, if you were right, people would be much less excited about quantum computers (that is, the excitement might still be unjustified because quantum computers are infeasible or because they are feasible but don't solve many interesting problems, but at least there's a chance they will solve many interesting problems). Obviously there are NP-hard optimization problems with approximate solutions that are good enough for practical applications, but it means little about the existence of good enough IRL algorithms.
Indeed, if we accept the fragility of value thesis, even small inaccuracies in value learning may have catastrophic con
Improbable Oversight, An Attempt at Informed Oversight

If you're confident of getting a memory trace for all books consulted, then there are simpler ways of preventing plagiarism in the informed oversight case: have the overseer read only the books consulted by the agent (or choose randomly among them for the ones to read). The informed oversight problem here assumes that the internals of A are potentially opaque to B, even though B has greater capabilities than A.

Learning (meta-)preferences

Yup, including better models of human irrationality seems like a promising direction for CIRL. I've been writing up a short note on the subject with more explicit examples; if you want to work on this without duplicating effort, let me know and I'll share the rough draft with you.

Stuart Armstrong (5y): Ok, send me the draft.
Improbable Oversight, An Attempt at Informed Oversight

Even the last version might have odd incentives. If A knew that the chances were high enough that an actually original A book would be seen as rare plagiarism of some book unknown to A, the dominant strategy could be to instead commit the most obvious plagiarism ever, in order to minimize the penalty that cannot be reliably avoided.

William Saunders (5y): This falls in with the question of whether we can distinguish whether A was intentionally vs. unintentionally committing a forbidden action. If the advice class R only contains information about things external to A, then there is no way for this method to distinguish between the two, and we should forbid anything that could be intentional bad behaviour. However, we might be able to include information about A's internal state in the advice. For example, the advice is a pair (book, location in A's memory trace), and F(a,r) is only true if the location in A's memory trace indicates that A plagiarized the particular book (of course, you need to be sure that you'd be able to spot something in A's memory trace).

At least the method fails safely in this case. The null action will always be preferred to committing obvious plagiarism (and committing obvious plagiarism is a pretty great problem to have compared to a silent failure!). And you can always tune k to make A more willing to trade action goodness for forbidden-ness probability, or reduce the size of the library, to alter the incentives if A refuses to do anything when you turn it on.
A new proposal for logical counterfactuals

Can you define more precisely what you mean by "censoring contradictions"?

Jack Gallagher (5y): By censoring I mean a specific technique for forcing the consistency of a possibly inconsistent set of axioms. Suppose you have a set of deduction rules D over a language ℓ. You can construct a function fD:P(ℓ)→P(ℓ) that takes a set of sentences S and outputs all the sentences that can be proved in one step using D and the sentences in S. You can also construct a censored f′D by letting f′D(S) = {ϕ | ϕ ∈ fD(S) ∧ ¬ϕ ∉ S}.
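A toy sketch of that censoring construction, with sentences as strings, string negation, and a one-step modus-ponens rule as a hypothetical stand-in for a real deduction system:

```python
def negate(phi):
    """String negation: toggle a leading '~'."""
    return phi[1:] if phi.startswith("~") else "~" + phi

def modus_ponens(sentences):
    """One deduction step: from 'a' and 'a->b', derive 'b'."""
    out = set()
    for s in sentences:
        if "->" in s:
            a, b = s.split("->", 1)
            if a in sentences:
                out.add(b)
    return out

def censored_step(sentences, deduce=modus_ponens):
    """Censored f'_D: add a derivable sentence only if its negation
    is not already present, forcing the set to stay consistent."""
    derived = deduce(sentences)
    return sentences | {phi for phi in derived if negate(phi) not in sentences}

# "q" and "r" are both derivable, but "q" is censored because "~q" is present.
print(sorted(censored_step({"p", "p->q", "p->r", "~q"})))
```

The uncensored closure of this set would contain both "q" and "~q"; the censored step admits "r" but blocks "q", which is the consistency-forcing behavior the definition is after.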
Example of double indifference

In the spirit of "one step is normal, two steps are suspicious, omega steps are normal", perhaps there's a 'triple corrigibility' issue when EαEβ≠Eβ?

Stuart Armstrong (5y): I'm not assuming EαEβ=Eβ. If you do assume that, everything becomes much simpler.
Double indifference is better indifference

Typo: in the paragraph before the equation arrays, you forgot to change from 5 to 42 (you did so in the following equation arrays). This buffaloed me for a bit!

Jessica Taylor (5y): Fixed, thanks.
Maximizing a quantity while ignoring effect through some channel

It's illustrating the failure of a further desideratum for the shutdown problem: we would like the AI to be able to update on and react to things that happen in the world which correlate with a certain channel, and yet still not attempt to influence that channel.

For motivation, assume a variant on the paperclip game:

  • the humans can be observed reaching for the button several turns before it is pressed
  • the humans' decision to press the button is a stochastic function of environmental variables (like seeing that the AI has unexpectedly been hit by lightning
... (read more)
Games for factoring out variables

There's nothing in the setup preventing the players from having access to independent random bits, though it's fair to say that these approaches assume this to be the case even when it's not.

But then the fault is with that assumption of access to randomness, not with any of the constraints on Q. So I don't think this is a strike against these methods.

Stuart Armstrong (5y): I'm not following. This "game" isn't a real game. There are not multiple players. There is one agent, where we have taken its real, one-valued probability, and changed it into a two-valued Q, for the purposes of factoring out the impact of the variable. The real probability is the original probability, which is the diagonal of Q.
Games for factoring out variables

Typo in the "Single Variable Maximalisation" section: you meant to write rather than .

Stuart Armstrong (5y): Thanks, corrected!
Maximizing a quantity while ignoring effect through some channel

We'd discussed how this "magical counterfactual" approach has the property of ignoring evidence of precursors to a button-press, since they don't count as evidence for whether the button would be pressed in the counterfactual world. Here's a simple illustration of that issue:

In this world, there is a random fair coinflip, then the AI gets to produce either a staple or a paperclip, and then a button is pressed. We have a utility function that rewards paperclips if the button is pressed, and staples if it is not pressed. Furthermore, the button is pressed if... (read more)
Stuart Armstrong (5y): This seems to be what we desire. The coin flip is only relevant via its impact on the button; we want the AI to ignore the impact via the button; hence the AI ignores the coin flip.
An approach to the Agent Simulates Predictor problem


Typo: in the first full paragraph of page 2, I assume you mean the agent will one-box, not two-box.

And I'm not sure the final algorithm necessarily one-boxes even if the logical uncertainty engine thinks the predictor's (stronger) axioms are probably consistent- I think there might be a spurious counterfactual where the conditional utilities view the agent two-boxing as evidence that the predictor's axioms must be inconsistent. Is there a clean proof that the algorithm does the correct thing in this case?

Vanessa Kosoy (5y): I think you mean "a spurious counterfactual where the conditional utilities view the agent one-boxing as evidence that the predictor's axioms must be inconsistent"? That is, the agent correctly believes that the predictor's axioms are likely to be consistent but also thinks that they would be inconsistent if it one-boxed, so it two-boxes?
Alex Mennen (6y): Yes, thanks for the correction. I'd fix it, but I don't think it's possible to edit a pdf in google drive, and it's not worth re-uploading and posting a new link for a typo.

I don't have such a proof. I mentioned that as a possible concern at the end of the second-last paragraph of the section on the predictor having stronger logic and more computing power. Reconsidering though, this seems like a more serious concern than I initially imagined. It seems this will behave reasonably only when the agent does not trust itself too much, which would have terrible consequences for problems involving sequential decision-making. Ideally, we'd want to replace the conditional expected value function with something of a more counterfactual nature to avoid these sorts of issues, but I don't have a coherent way of specifying what that would even mean.