I'd additionally expect the death of pseudonymity on the Internet, as AIs will find it easy to detect similar writing style and correlated posting behavior. What at present takes detective work will in the future be cheaply automated, and we will finally be completely in Zuckerberg's desired world where nobody can maintain a second identity online.
Oh, and this is going to be retroactive, so be ready for the consequences of everything you've ever said online.
If this post is selected, I'd like to see the followup made into an addendum—I think it adds a very important piece, and it should have been nominated itself.
I think this post (and similarly, Evan's summary of Chris Olah's views) are essential both in their own right and as mutual foils to MIRI's research agenda. We see related concepts (mesa-optimization originally came out of Paul's talk of daemons in Solomonoff induction, if I remember right) but very different strategies for achieving both inner and outer alignment. (The crux of the disagreement seems to be the probability of success from adapting current methods.)
Strongly recommended for inclusion.
It's hard to know how to judge a post that deems itself superseded by a post from a later year, but I lean toward taking Daniel at his word and hoping we survive until the 2021 Review comes around.
The content here is very valuable, even if the genre of "I talked a lot with X and here's my articulation of X's model" comes across to me as a weird sort of intellectual ghostwriting. I can't think of a way around that, though.
This reminds me of That Alien Message, but as a parable about mesa-alignment rather than outer alignment. It reads well, and helps make the concepts more salient. Recommended.
Remind me which bookies count and which don't, in the context of the proofs of properties?
If any computable bookie is allowed, a non-Bayesian is in trouble against a much larger bookie who can just (maybe through its own logical induction) discover who the bettor is and how to exploit them.
[EDIT: First version of this comment included "why do convergence bettors count if they don't know the bettor will oscillate", but then I realized the answer while Abram was composing his response, so I edited that part out. Editing it back in so that Abram's reply has context.]
It's a good question!
For me, the most general answer is the framework of logical induction, where the bookies are allowed so long as they have poly-time computable strategies. In this case, a bookie doesn't have to be guaranteed to make money in order to count; rather, if it makes arbitrarily much money, then there's a problem. So convergence traders are at risk of being stuck with a losing ticket, but their existence forces convergence anyway.
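Roughly, and from memory (so check the logical induction paper for the exact statement): a trader $\overline{T}$ exploits the market $\overline{\mathbb{P}}$ relative to a deductive process $\overline{D}$ if the set of plausible values of its accumulated holdings,

$$\Big\{\, \mathbb{W}\Big(\textstyle\sum_{i \le n} T_i\Big) \;:\; n \in \mathbb{N},\ \mathbb{W} \text{ a world consistent with } D_n \Big\},$$

is bounded below but unbounded above, and the logical induction criterion is just that no polynomial-time computable trader exploits the market. So a bookie that risks eating a bounded loss still counts, as long as it could otherwise win unboundedly.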
If we don't care about logical uncertainty, the right condition is instead that the bookie knows the agent's beli...
The claim that came to my mind is that the conscious mind is the mesa-optimizer here, the original outer optimizer being a riderless elephant.
This was literally the first output, with no rerolls in the middle! (Although after posting it, I did some other trials which weren't as good, so I did get lucky on the first one. Randomness parameter was set to 0.5.)
I cut it off there because the next paragraph just restated the previous one.
(sorry, couldn't resist)
This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models. We will be discussing each approach in turn, with a focus on how they differ from one another.
The goal of this series is to provide a more complete picture of the various options for auditing AI systems than has been provided so far by any single person or organization. The hope is that it will help people make better-i...
I'm imagining a tiny AI Safety organization, circa 2010, that focused on how to achieve probable alignment for scaled-up versions of that year's state-of-the-art AI designs. It's interesting to ask whether that organization would have achieved more or less than MIRI has, in terms of generalizable work and in terms of field-building.
Certainly it would have resulted in a lot of work that was initially successful but ultimately a dead end. But maybe early concrete results would have attracted more talent/attention/respect/funding, and the org could have thrown ...
That's not a nitpick at all!
Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead: its difficulty completing rhymes when writing poetry, for instance.
(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)
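(One concrete candidate, as a sketch I haven't actually run against the API: random depth-2 game trees, where the model has to name the branch whose worst leaf is best. Since the trees are random there's nothing to memorize, and the answer seems to need at least one step of lookahead. The helper names and parameters below are made up for illustration.)

```python
import random

def make_tree(branching=3, leaves=3, lo=0, hi=9):
    """Random depth-2 game: you pick a branch, then an adversary picks
    the leaf within it that is worst for you."""
    return [[random.randint(lo, hi) for _ in range(leaves)]
            for _ in range(branching)]

def best_move(tree):
    """Minimax answer: pick the branch whose worst (minimum) leaf is largest."""
    return max(range(len(tree)), key=lambda i: min(tree[i]))

def to_prompt(tree):
    lines = ["You pick a branch; your opponent then picks the smallest number in it,",
             "and that number is your score. Branches:"]
    for i, leaf_values in enumerate(tree):
        lines.append(f"  {i}: {leaf_values}")
    lines.append("Best branch:")
    return "\n".join(lines)

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        tree = make_tree()
        print(to_prompt(tree), best_move(tree), "\n")
```

A few solved trees could be prepended as a few-shot prompt; the interesting measurement is whether accuracy on fresh random trees stays above the 1-in-branching baseline.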
Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!
The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way. Inner optimizers are an illustration of another failure mode.
I've been using "computable" to mean a total computable function (every instance halts with an answer in finite time).
I'm thinking of an agent outside a universe about to take an action, and each action will cause that universe to run a particular TM. (You could maybe frame this as "the agent chooses the tape for the TM to run on".) For me, this is analogous to acting in the world and causing the world to shift toward some outcomes over others.
By asserting that U should be the computable one, I'm asserting that "how much do I like this outcome" ...
Let's talk first about non-embedded agents.
Say that I'm given the specification of a Turing machine, and I have a computable utility mapping from output states (including "does not halt") to [0,1]. We presumably agree that this is possible.
I agree that it's impossible to make a computable mapping from Turing machines to outcomes, so therefore I cannot have a computable utility function from TMs to the reals which assigns the same value to any two TMs with identical output.
But I can have a logical inductor which, for each TM, produces a ...
I mean the sort of "eventually approximately consistent over computable patterns" thing exhibited by logical inductors, which is stronger than limit-computability.
I think that computable is obviously too strong a condition for classical utility; enumerable is better.
Imagine you're about to see the source code of a machine that's running, and if the machine eventually halts then 2 utilons will be generated. That's a simpler problem to reason about than the procrastination paradox, and your utility function is enumerable but not computable. (Likewise, logical inductors obviously don't make PA approximately computable, but their properties are what you'd want the definition of approximately enu...
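To spell out the halting example in symbols (my formalization of the above):

$$U(M) = \begin{cases} 2 & \text{if } M \text{ eventually halts} \\ 0 & \text{otherwise,} \end{cases} \qquad U_n(M) = \begin{cases} 2 & \text{if } M \text{ halts within } n \text{ steps} \\ 0 & \text{otherwise.} \end{cases}$$

No algorithm computes $U$ exactly (that would decide the halting problem), but each $U_n$ is computable and $U_n \uparrow U$, which is exactly the enumerable / lower-semicomputable condition.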
If the listener is running a computable logical uncertainty algorithm, then for a difficult proposition it hasn't made much sense of, the listener might say "70% likely it's a theorem and X will say it, 20% likely it's not a theorem and X won't say it, 5% PA is inconsistent and X will say both, 5% X isn't naming all and only theorems of PA".
Conditioned on PA being consistent and on X naming all and only theorems of PA, and on the listener's logical uncertainty being well-calibrated, you'd expect that in 78% of s...
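(For the arithmetic, assuming it's just the obvious conditioning: the two surviving outcomes renormalize to $0.70 / 0.90 \approx 0.78$ and $0.20 / 0.90 \approx 0.22$.)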
As stated, I think this has a bigger vulnerability; B and B* just always answer the question with "yes."
Remember that this is also used to advance the argument. If A thinks B has such a strategy, A can ask the question in such a way that B's "yes" helps A's argument. But sure, there is something weird here.
the dishonest team might want to call one as soon as they think the chance of them convincing a judge is below 50%, because that's the worst-case win-rate from blind guessing
I also think this is a fatal flaw with the existing two-person-team proposal; you need a challenge system that gives you only an epsilon chance of winning if you use it spuriously.
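To make the incentive explicit: if a spurious call-out wins with probability $\varepsilon$, a dishonest team prefers it only when its estimated chance of winning the remaining debate is below $\varepsilon$. With $\varepsilon \approx 0.5$ (the judge blind-guessing) that's exactly the bail-out described above; pushing $\varepsilon$ toward 0 leaves call-outs worthwhile only when the accusation is actually likely to be upheld.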
I have what looks to me like an improvement, but there's still a vulnerability:
A challenges B by giving a yes-no question as well as a previous round at which to ask it. B answers, B* answers based on B's ...
Okay, so another necessary condition for being downstream from an optimizer is being causally downstream. I'm sure there are other conditions, but the claim still feels like an important addition to the conversation.
I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets, and that the naive definition seems to work with one modification:
A system is downstream from an optimizer of some objective function to the extent that that objective function attains much higher values than would be attained if the system didn't exist, or were doing some other random thing.
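One way to make "much higher values" quantitative (my notation, just to pin the claim down): score a system $S$ against objective $f$ by

$$\mathrm{Downstream}(S, f) \;=\; \mathbb{E}\!\left[f \mid S \text{ present}\right] \;-\; \mathbb{E}\!\left[f \mid S \text{ absent, or replaced by something random}\right],$$

and call $S$ downstream from an optimizer of $f$ to the extent that this difference is large, with the causal condition above ruling out cases where the correlation isn't actually the optimizer's doing.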
Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.
Removing things entirely seems extreme.
Dropout is a thing, though.
Shapley Values [thanks Zack for reminding me of the name] are akin to credit assignment: you have a bunch of agents coordinating to achieve something, and then you want to assign payouts fairly based on how much each contribution mattered to the final outcome.
And the way you do this is, for each agent you look at how good the outcome would have been if everybody except that agent had coordinated, and then you credit each agent proportionally to how much the overall performance would have fallen off without them. (Strictly, the Shapley value averages an agent's marginal contribution over every order in which the coalition could have been assembled; the leave-one-out version here is the simplest approximation.)
So what about doing the same here: send rewar...
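For concreteness, here's a minimal sketch of the exact Shapley computation (it's exponential in the number of agents, so for neurons you'd have to sample permutations or use some other approximation; the characteristic function `v` below is a toy stand-in for "performance of this coalition"):

```python
from itertools import permutations

def shapley_values(agents, v):
    """Exact Shapley values: average each agent's marginal contribution
    over all orders in which the coalition could have been assembled."""
    values = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = set()
        for agent in order:
            before = v(frozenset(coalition))
            coalition.add(agent)
            values[agent] += v(frozenset(coalition)) - before
    return {a: total / len(orders) for a, total in values.items()}

# Toy characteristic function: A and B are redundant, C always adds 2.
def v(coalition):
    score = 2.0 if "C" in coalition else 0.0
    if "A" in coalition or "B" in coalition:
        score += 4.0
    return score

print(shapley_values(["A", "B", "C"], v))
# -> A and B split the 4.0 they jointly provide; C gets its full 2.0.
```

The leave-one-out rule described above is the special case of only comparing against the grand coalition minus one agent; averaging over orderings is what handles redundant contributors like A and B here.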
I can't for the life of me remember what this is called
I can't for the life of me remember what this is called
(Best wishes, Less Wrong Reference Desk)
Good comment. I disagree with this bit:
I would, for instance, predict that if Superintelligence were published during the era of GOFAI, all else equal it would've made a bigger splash because AI researchers then were more receptive to abstract theorizing.
And then it would probably have been seen as outmoded and thrown away completely when AI capabilities research progressed into realms that vastly surpassed GOFAI. I don't know that there's an easy way to get capabilities researchers to think seriously about safety concerns that haven't manifested on a sufficient scale yet.
I like this suggestion of a more feasible form of steganography for NNs to figure out! But I think you'd need further advances in transparency to get useful informed oversight capabilities from (transformed or not) copies of the predictive network.
I should have said "reliably estimate HCH"; I'd also want quite a lot of precision in addition to calibration before I trust it.
Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.
Re #1, an obvious set of questions to include in q are questions of approval for various aspects of the AI's policy. (In particular, if we want the AI to later calculate a human's HCH and ask it for guidance, then we would like to be sure that HCH's answer to that question is not manipulated.)
There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?"
Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human's HCH via certain informational content, than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.
Question that I haven't seen addressed (and haven't worked out myself): which of these indifference methods are reflectively stable, in the sense that the AI would not push a button to remove them (or switch to a different indifference method)?
This is a lot of good work! Modal combat is increasingly deprecated though (in my opinion), for reasons like the ones you noted in this post, compared to studying decision theory with logical inductors; and so I'm not sure this is worth developing further.
Yup, this isn't robust to extremely capable systems; it's a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.
(In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before you know the nearby obvious ones.)
A whitelisting variant would be way more reliable than a blacklisting one, clearly.
Nice! One thing that might be useful for context: what's the theoretical correct amount of time that you would expect an algorithm to spend on the right vs. the left if the session gets interrupted each time it goes 1 unit to the right? (I feel like there should be a pretty straightforward way to calculate the heuristic version where the movement is just Brownian motion that gets interrupted early if it hits +1.)
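Here's a quick Monte Carlo sketch of the heuristic version (my assumptions, since the post may define things differently: a simple random walk standing in for Brownian motion, a fixed maximum session length, interruption the moment the walk reaches +1, and "time on the right/left" counted as steps spent at positive/negative positions):

```python
import random

def session(max_steps=10_000, step=0.1):
    """One session of a random walk from 0, interrupted if it reaches +1."""
    x, right, left = 0.0, 0, 0
    for _ in range(max_steps):
        x += step if random.random() < 0.5 else -step
        if x >= 1.0:
            break
        if x > 0:
            right += 1
        elif x < 0:
            left += 1
    return right, left

def estimate(n_sessions=2_000):
    right_total = left_total = 0
    for _ in range(n_sessions):
        r, l = session()
        right_total += r
        left_total += l
    total = right_total + left_total
    return right_total / total, left_total / total

if __name__ == "__main__":
    random.seed(1)
    print("fraction of time right vs. left:", estimate())
```

The answer presumably depends on the step size and the session-length cutoff, which is part of why it would be nice to see the exact theoretical setup written down.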
Typo: The statement of Theorem 4.1 omits the word "continuous".
Stuart did make it easier for many of us to read his recent ideas by crossposting them here. I'd like there to be some central repository for the current set of AI control work, and I'm hoping that the forum could serve as that.
Is there some functionality that, if added here, would make it trivial to crosspost when you've written something of note?
The authors of the CIRL paper are in fact aware of them, and are pondering them for future work. I've had fruitful conversations with Dylan Hadfield-Menell (one of the authors), talking about how a naive implementation goes wrong for irrational humans, and about what a tractable non-naive implementation might look like (trying to model probabilities of a human's action under joint hypotheses about the correct reward function and about the human's psychology); he's planning future work relevant to that question.
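One concrete shape that "joint hypotheses about the reward function and the human's psychology" could take (my sketch, not necessarily what Dylan has in mind) is a Boltzmann-rational model of the human:

$$P(a \mid s, \theta, \beta) \;\propto\; \exp\!\big(\beta\, Q^*_\theta(s, a)\big), \qquad P(\theta, \beta \mid s, a) \;\propto\; P(a \mid s, \theta, \beta)\, P(\theta)\, P(\beta),$$

where $\theta$ parametrizes the reward function, $\beta$ is a rationality (inverse-temperature) parameter, and $\beta \to \infty$ recovers the perfectly rational human that the naive implementation assumes.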
Also note Dylan's talk on CIRL, value of infor
I agree strongly with the general principle "we need to be able to prove guarantees about our learning process in environments rich enough to be realistic", but I disagree with the claim that this shows a flaw in IRL. Adversarial environments seem to me very disanalogous to learning complicated and implicit preferences in a non-adversarial environment.
(You and I talked about this a bit, and I pointed out that computational complexity issues only block people in practice when the setup needs to be adversarial, e.g. intentional cryptography to prevent an adv
If you're confident of getting a memory trace for all books consulted, then there are simpler ways of preventing plagiarism in the informed oversight case: have the overseer read only the books consulted by the agent (or choose randomly among them for the ones to read). The informed oversight problem here assumes that the internals of A are potentially opaque to B, even though B has greater capabilities than A.
Yup, including better models of human irrationality seems like a promising direction for CIRL. I've been writing up a short note on the subject with more explicit examples; if you want to work on this without duplicating effort, let me know and I'll share the rough draft with you.
Even the last version might have odd incentives. If A knew that the chances were high enough that an actually original A book would be seen as rare plagiarism of some book unknown to A, the dominant strategy could be to instead commit the most obvious plagiarism ever, in order to minimize the penalty that cannot be reliably avoided.
Can you define more precisely what you mean by "censoring contradictions"?
In the spirit of "one step is normal, two steps are suspicious, omega steps are normal", perhaps there's a 'triple corrigibility' issue when E_α E_β ≠ E_β?
Typo: in the paragraph before the equation arrays, you forgot to change E[v|u→v] from 5 to 42 (you did so in the following equation arrays). This buffaloed me for a bit!
It's illustrating the failure of a further desideratum for the shutdown problem: we would like the AI to be able to update on and react to things that happen in the world which correlate with a certain channel, and yet still not attempt to influence that channel.
For motivation, assume a variant on the paperclip game:
There's nothing in the setup preventing the players from having access to independent random bits, though it's fair to say that these approaches assume this to be the case even when it's not.
But then the fault is with that assumption of access to randomness, not with any of the constraints on Q. So I don't think this is a strike against these methods.
Typo in the "Single Variable Maximalisation" section: you meant to write R(a)R′(a′) rather than R(a)R(a′).
We'd discussed how this "magical counterfactual" approach has the property of ignoring evidence of precursors to a button-press, since they don't count as evidence for whether the button would be pressed in the counterfactual world. Here's a simple illustration of that issue:
In this world, there is a random fair coinflip, then the AI gets to produce either a staple or a paperclip, and then a button either is or isn't pressed. We have a utility function that rewards paperclips if the button is pressed, and staples if it is not pressed. Furthermore, the button is pressed if
Typo: in the first full paragraph of page 2, I assume you mean the agent will one-box, not two-box.
And I'm not sure the final algorithm necessarily one-boxes even if the logical uncertainty engine thinks the predictor's (stronger) axioms are probably consistent; I think there might be a spurious counterfactual where the conditional utilities view the agent two-boxing as evidence that the predictor's axioms must be inconsistent. Is there a clean proof that the algorithm does the correct thing in this case?