All of Rob Bensinger's Comments + Replies

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

Copying over a Slack comment from Abram Demski:

I think this post could be pretty important.

It offers a formal treatment of "goal-directedness" and its relationship to coherence theorems such as VNM, a topic which has seen some past controversy but which has -- till now -- been dealt with only quite informally. Personally I haven't known how to engage with the whole goal-directedness debate, and I think part of the reason for that is the vagueness of the idea. Goal-directedness doesn't seem that cruxy for most of my thinking, but some other people seem to r

... (read more)
Alignment Research = Conceptual Alignment Research + Applied Alignment Research

OK, thanks for the clarifications!

Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on building an aligned one. Can you explain me why that's inaccurate?

I don't know what you mean by "perfectly rational AGI". (Perfect rationality isn't achievable, rationality-in-general is convergently instrumental, and rationality is insufficient for getting good outcomes. So why would that be the goal?)

I think of the basic case for HRAD this way:

  • We seem to be pretty confused about a lot of aspects of opti
... (read more)
Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Maybe what I want is a two-dimensional "prosaic AI vs. novel AI" and "whiteboards vs. code". Then I can more clearly say that I'm pretty far toward 'novel AI' on one dimension (though not as far as I was in 2015), separate from whether I currently think the bigger bottlenecks (now or in the future) are more whiteboard-ish problems vs. more code-ish problems.

Adam Shimi (2mo): What you propose seems valuable, although not an alternative to my distinction IMO. This 2-D grid is more about what people consider the most promising way of getting aligned AGI and how to get there, whereas my distinction focuses on separating two different types of research which have very different methods, epistemic standards, and needs in terms of field-building.
Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Cool, that makes sense!

I abused the hyperbole in that case. What I was pointing out is the impression that old-school MIRI (a lot of the HRAD work) thinks that solving the alignment problem requires deconfusing every related philosophical problem in terms of maths, and then implementing that. Such a view doesn't seem shared by many in the community for a couple of reasons:

I'm still not totally clear here about which parts were "hyperbole" vs. endorsed. You say that people's "impression" was that MIRI wanted to deconfuse "every related philosophical problem... (read more)

Adam Shimi (2mo): I think that the issue is that I have a mental model of this process you describe that summarizes it as "you need to solve a lot of philosophical issues for it to work", and so that's what I get by default when I query for that agenda. Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on building an aligned one. Can you explain me why that's inaccurate?

Yeah, I think this is a pretty common perspective on that work from outside MIRI. That's my take (that there isn't enough time to solve all of the necessary components) and the one I've seen people use in discussing MIRI multiple times.

A really important point is that the division isn't meant to split researchers themselves but research. So the experiment part would be applied alignment research and the rest conceptual alignment research. What's interesting is that this is a good example of applied alignment research that doesn't have the benefits I mention of more prosaic applied alignment research: being publishable at big ML/AI conferences, being within an accepted paradigm of modern AI...

I would say that the non-prosaic approaches require at least some conceptual alignment research (because the research can't be done fully inside current paradigms of ML and AI), but probably encompass some applied research. Maybe Steve's work [https://www.alignmentforum.org/users/steve2152] is a good example, with a proposed split of two of his posts in this comment.
Rob Bensinger (2mo): Maybe what I want is a two-dimensional "prosaic AI vs. novel AI" and "whiteboards vs. code". Then I can more clearly say that I'm pretty far toward 'novel AI' on one dimension (though not as far as I was in 2015), separate from whether I currently think the bigger bottlenecks (now or in the future) are more whiteboard-ish problems vs. more code-ish problems.
Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Nowadays, the vast majority of the field disagree that there’s any hope of formalizing all of philosophy and then just implementing that to get an aligned AGI.

What do you mean by "formalizing all of philosophy"? I don't see 'From Philosophy to Math to Engineering' as arguing that we should turn all of philosophy into math (and I don't even see the relevance of this to Friendly AI). It's just claiming that FAI research begins with fuzzy informal ideas/puzzles/goals (like the sort you might see philosophers debate), then tries to move in more formal directio... (read more)

Adam Shimi (2mo): Thanks for the comment!

I abused the hyperbole in that case. What I was pointing out is the impression that old-school MIRI (a lot of the HRAD work) thinks that solving the alignment problem requires deconfusing every related philosophical problem in terms of maths, and then implementing that. Such a view doesn't seem shared by many in the community for a couple of reasons:

  • Some doubt that the level of mathematical formalization required is even possible.
  • If timelines are quite short, we probably don't have the time to do all that.
  • If AGI turns out to be prosaic AGI (which sounds like one of the best bets to make now), then what matters is aligning neural nets, not finding a way to write down a perfectly aligned AGI from scratch (related to the previous point, because it seems improbable that the formalization will be finished before neural nets reach AGI in such a prosaic setting).

Thanks for that clarification, it makes sense to me. That being said, multiple people (both me a couple of years ago and people I mentor/talk to) seem to have been pushed by MIRI's work in general to think that they need an extremely high level of maths and formalism to even contribute to alignment, which I disagree with, and apparently Luke and you do too.

Reading the linked post, what jumps out at me is the focus on friendly AI being about turning philosophy into maths, and I think that's the culprit. That is part of the process, an important one and great if we manage it. But expressing and thinking through problems of alignment at a less formal level is still very useful and important; that's how we got most of the big insights and arguments in the field.

Funnily, it sounds like MIRI itself (specifically Scott) has called that into doubt with Finite Factored Sets [https://www.lesswrong.com/s/kxs3eeEti9ouwWFzr]. This work isn't throwing away all of Pearl's work, but it argues that some parts were missing/some assumptions unwarranted. Even a case of deconfusion as groun
"Existential risk from AI" survey results

One-off, though Carlier, Clarke, and Schuett have a similar survey coming out in the next week.

Coherence arguments imply a force for goal-directed behavior

Maybe changing the title would prime people less to have the wrong interpretation? E.g., to 'Coherence arguments require that the system care about something'.

Even just 'Coherence arguments do not entail goal-directed behavior' might help, since colloquial "imply" tends to be read probabilistically, whereas you mean the math/logic sense of "imply". Or 'Coherence theorems do not entail goal-directed behavior on their own'.

The case for aligning narrowly superhuman models

I've copied over comments by MIRI's Evan Hubinger and Eliezer Yudkowsky on a slightly earlier draft of Ajeya's post — as a separate post, since it's a lot of text.

Distinguishing claims about training vs deployment

Two examples of MIRI talking about orthogonality, instrumental convergence, etc.: "Five Theses, Two Lemmas, and a Couple of Strategic Implications" (2013) and "So Far: Unfriendly AI Edition" (2016). The latter is closer to how I'd start a discussion with a random computer scientist today, if they thought AGI alignment isn't important to work on and I wanted to figure out where the disagreement lies.

I think "Five Theses..." is basically a list of 'here are the five things Ray Kurzweil is wrong about'. A lot of people interested in AGI early on held Kurzweil... (read more)

Commentary on AGI Safety from First Principles

After seeing this post last month, Eliezer mentioned to me that he likes your recent posts, and would want to spend money to make more posts like this exist, if that were an option.

(I've poked Richard about this over email already, but wanted to share the Eliezer-praise here too.)

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).

I don't think the discussion stands great on its own, but it may be helpful for:

  • people familiar with AI alignment who want to better understand some human factors behind 'the field isn't coordinating or converging on safety'.
  • people new to AI alignment who want to use the views of leaders in the field to help them orient.
AI Safety "Success Stories"

Seems like a good starting point for discussion. Researchers need to have some picture of what AI alignment is "for," in order to think about what research directions look most promising.

Soft takeoff can still lead to decisive strategic advantage

I'm not a slow-takeoff proponent, and I don't agree with everything in this post; but I think it's asking a lot of the right questions and introducing some useful framings.

Multiplicative Operations on Cartesian Frames

And now I've made a LW post collecting most of the definitions in the sequence so far, so they're easier to find: https://www.lesswrong.com/posts/kLLu387fiwbis3otQ/cartesian-frames-definitions 

Additive and Multiplicative Subagents

I'm collecting most of the definitions from this sequence on one page, for easier reference: https://www.lesswrong.com/posts/kLLu387fiwbis3otQ/cartesian-frames-definitions 

Multiplicative Operations on Cartesian Frames

For my personal use when I was helping review Scott's drafts, I made some mnemonics (complete with silly emojis to keep track of the small Cartesian frames and operations) here: https://docs.google.com/drawings/d/1bveBk5Pta_tml_4ezJ0oWiq-qudzgnsRlfbGJgZ1qv4/.

(Also includes my crude visualizations of morphism composition and homotopy equivalence to help those concepts stick better in my brain.)

DanielFilan (1y): Thanks!
Biextensional Equivalence

Scott's post explaining the relationship between  and  exists as of now: Functors and Coarse Worlds.

Controllables and Observables, Revisited

To get an intuition for morphisms, I tried listing out every frame that has a morphism going to a simple 2x2 frame C0.

Are any of the following wrong? And, am I missing any?

Frames I think have a morphism going to C0:

[The candidate frames were given as images that haven't survived here.]

Every frame that looks like a frame on this list (other than ), but with extra columns added — regardless of what's in those columns. (As a special case, this... (read more)

Scott Garrabrant (1y): You can also duplicate rows in C0, and then add columns, so you can get things like the 3x3 frame (w0 w1 w2 / w0 w1 w3 / w2 w3 w0). There are infinitely many biextensional Cartesian frames over {w0, w1, w2, w3} with a morphism to C0, with arbitrarily large dimensions.
Scott Garrabrant (1y): C0* is wrong. You can see it has Ensurables that C0 does not have.
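The frame images in the thread above haven't survived, so here is a small brute-force sketch of the morphism condition being discussed (my own illustration, not from the thread); it assumes C0 is the 2x2 frame (w0 w1 / w2 w3) over {w0, w1, w2, w3}, which is what Scott's reply suggests. A morphism from frame M to frame N pairs a map g on rows(M) with a map h on cols(N) such that N[g(a)][f] = M[a][h(f)] for every row a of M and column f of N.

```python
from itertools import product

def has_morphism(M, N):
    """Brute-force search for a Cartesian-frame morphism from M to N.

    Frames are given as matrices (lists of rows) of world labels. A morphism
    is a pair of maps g: rows(M) -> rows(N) and h: cols(N) -> cols(M) with
    N[g(a)][f] == M[a][h(f)] for every row a of M and column f of N.
    """
    rows_M, cols_N = range(len(M)), range(len(N[0]))
    for g in product(range(len(N)), repeat=len(M)):            # all maps rows(M) -> rows(N)
        for h in product(range(len(M[0])), repeat=len(N[0])):  # all maps cols(N) -> cols(M)
            if all(N[g[a]][f] == M[a][h[f]] for a in rows_M for f in cols_N):
                return True
    return False

C0 = [["w0", "w1"],
      ["w2", "w3"]]

# Scott's example: C0 with its first row duplicated and an arbitrary extra column.
bigger = [["w0", "w1", "w2"],
          ["w0", "w1", "w3"],
          ["w2", "w3", "w0"]]

print(has_morphism(bigger, C0))  # True
```

Running it on Scott's duplicated-and-extended frame confirms his point that such frames do have a morphism to C0.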
Biextensional Equivalence

An example of frames that are biextensionally equivalent to C1:

(w8 w9 / w10 w11 / w8 w9 / w10 w11) ≃ (w8 w9 / w10 w11 / w8 w9) ≃ (w8 w9 / w10 w11 / w10 w11) ≃ (w8 w9 / w10 w11)

... or any frame that enlarges one of those four frames by adding extra copies of any of the rows and/or columns.

ESRogs (1y): This is helpful. Thanks!
Biextensional Equivalence

They're not equivalent. If two frames are 'homotopy equivalent' / 'biextensionally equivalent' (two names for the same thing, in Cartesian frames), it means that you can change one frame into the other (ignoring the labels of possible agents and environments, i.e., just looking at the possible worlds) by doing some combination of 'make a copy of a row', 'make a copy of a column', 'delete a row that's a copy of another row', and/or 'delete a column that's a copy of another column'.

The entries of  and  are totally different (... (read more)

Rob Bensinger (1y): An example of frames that are biextensionally equivalent to C1:

(w8 w9 / w10 w11 / w8 w9 / w10 w11) ≃ (w8 w9 / w10 w11 / w8 w9) ≃ (w8 w9 / w10 w11 / w10 w11) ≃ (w8 w9 / w10 w11)

... or any frame that enlarges one of those four frames by adding extra copies of any of the rows and/or columns.
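To make the row/column-duplication description above concrete, here is a small Python sketch (mine, not from the thread) of the relation Rob describes: collapse each frame by deleting duplicate rows and columns, then compare the collapsed matrices up to reordering of rows and columns, which is how I'm reading "ignoring the labels of possible agents and environments".

```python
from itertools import permutations

def collapse(frame):
    """Delete rows that duplicate an earlier row, then columns that duplicate
    an earlier column (the 'delete a copy' moves described above)."""
    rows = []
    for r in frame:
        if r not in rows:
            rows.append(r)
    cols = []
    for c in zip(*rows):
        if c not in cols:
            cols.append(c)
    return [tuple(r) for r in zip(*cols)]  # transpose back to a list of rows

def biextensionally_equivalent(f, g):
    """Compare collapsed frames up to reordering of rows and columns."""
    f, g = collapse(f), collapse(g)
    if len(f) != len(g) or len(f[0]) != len(g[0]):
        return False
    return any(
        [tuple(r) for r in zip(*cols)] == g
        for rows in permutations(f)
        for cols in permutations(zip(*rows))
    )

c1_small = [("w8", "w9"), ("w10", "w11")]
padded = [("w8", "w9"), ("w10", "w11"), ("w8", "w9"), ("w10", "w11")]
print(biextensionally_equivalent(c1_small, padded))  # True
```

This is only a brute-force check, suitable for tiny frames like the ones in these examples.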
Relevant pre-AGI possibilities

Yet almost everyone agrees the world will likely be importantly different by the time advanced AGI arrives.

Why do you think this? My default assumption is generally that the world won't be super different from how it looks today in strategically relevant ways. (Maybe it will be, but I don't see a strong reason to assume that, though I strongly endorse thinking about big possible changes!)

Daniel Kokotajlo (1y): Maybe I was overconfident here. I was generalizing from the sample of people I'd talked to. Also, as you'll see by reading the entries on the list, I have a somewhat low bar for strategic relevance.
Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

A part I liked and thought was well-explained:

I think there's a strong argument for deception being simpler than corrigibility. Corrigibility has some fundamental difficulties in terms of... If you're imagining gradient descent process, which is looking at a proxy aligned model and is trying to modify it so that it makes use of this rich input data, it has to do some really weird things to make corrigibility work.

It has to first make a very robust pointer. With corrigibility, if it's pointing at all incorrectly to the wrong thing in the input data, wrong t

... (read more)
Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Lucas Perry: I guess I imagine the coordination here is that information on relative training competitiveness and performance competitiveness in systems is evaluated within AI companies and then possibly fed to high power decision makers who exist in strategy and governance for coming up with the correct strategy, given the landscape of companies and AI systems which exist?

Evan Hubinger: Yeah, that’s right.

I asked Evan about this and he said he misheard Lucas as asking roughly 'Are training competitiveness and performance competitiveness important for rese

... (read more)
Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Transcription errors:

I think it’s useful to sort of have in the back of your mind this analogy to evolution, but I would also be careful not to take it too far. I imagine that everything is going to generalize to the case of machine learning because it is a different process.

Should be "I think it’s useful to sort of have in the back of your mind this analogy to evolution, but I would also be careful not to take it too far and imagine that everything is going to generalize to the case of machine learning, because it is a different process."

HDH process

Should

... (read more)
Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate

World 3 doesn't strike me as a thing you can get in the critical period when AGI is a new technology. Worlds 1 and 2 sound approximately right to me, though the way I would say it is roughly: We can use math to better understand reasoning, and the process of doing this will likely improve our informal and heuristic descriptions of reasoning too, and will likely involve us recognizing that we were in some ways using the wrong high-level concepts to think about reasoning.

I haven't run the characterization above by any MIRI researchers, and different MIRI res

... (read more)
What are the high-level approaches to AI alignment?

I agree with Ben and Richard's summaries; see https://www.lesswrong.com/posts/uKbxi2EJ3KBNRDGpL/comment-on-decision-theory:

We aren't working on decision theory in order to make sure that AGI systems are decision-theoretic, whatever that would involve. We're working on decision theory because there's a cluster of confusing issues here (e.g., counterfactuals, updatelessness, coordination) that represent a lot of holes or anomalies in our current best understanding of what high-quality reasoning is and how it works.

[...] The idea behind looking at (e.g.) coun

... (read more)
Possible takeaways from the coronavirus pandemic for slow AI takeoff

In September 2017, based on some conversations with MIRI and non-MIRI folks, I wrote:

I think that at least 80% of the AI safety researchers at MIRI, FHI, CHAI, OpenAI, and DeepMind would currently assign a >10% probability to this claim: "The research community will fail to solve one or more technical AI safety problems, and as a consequence there will be a permanent and drastic reduction in the amount of value in our future."

People may have become more optimistic since then, but most people falling in the 1-10% range would still surprise me a... (read more)

What is the alternative to intent alignment called?

That would imply that 'intent alignment' is about aligning AI systems with what humans intend. But 'intent alignment' is about making AI systems intend to 'do the good thing'. (Where 'do the good thing' could be cashed out as 'do what some or all humans want', 'achieve humans' goals', or many other things.)

The thing I usually contrast with 'intent alignment' (≈ the AI's intentions match what's good) is something like 'outcome alignment' (≈ the AI's causal effects match what's good). As I personally think about it, the value of the former category is that i

... (read more)
Richard Ngo (1y): So I guess more specifically what I'm trying to ask is: how do we distinguish between interpreting the good thing as "human intentions for the agent" versus "human goals"? In other words, we have at least four options here:

1. AI intends to do what the human wants it to do.
2. AI actually achieves what the human wants it to do.
3. AI intends to pursue the human's true goals.
4. AI actually achieves the human's true goals.

So right now intent alignment (as specified by Paul) describes 1, and outcome alignment (as I'm inferring from your description) describes 4. But it seems quite important to have a name for 3 in particular.
AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

More links:

... (read more)
Rohin Shah (1y): That is in fact what I meant :)
AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

I emailed Luke some corrections to the transcript above, most of which are now implemented. The changes that seemed least trivial to me (noted in underline):

... (read more)
Rob Bensinger (1y): More links:

  • I googled 'daniel ellsberg nuclear first strikes' and found U.S. Planned Nuclear First Strike to Destroy Soviets and China – Daniel Ellsberg on RAI (6/13) [https://therealnews.com/stories/u-s-planned-nuclear-first-strike-to-destroy-soviets-and-china-daniel-ellsberg-on-rai-6-8] and U.S. Refuses to Adopt a Nuclear Weapon No First Use Pledge – Daniel Ellsberg on RAI (7/13) [https://therealnews.com/stories/u-s-refuses-to-adopt-a-nuclear-weapon-no-first-use-pledge-daniel-ellsberg-on-rai-7-8].
  • Rohin Shah mentions a paper arguing image classifiers vulnerable to adversarial examples are "picking up on real imperceptible features that do generalize to the test set, that humans can't detect". This might be the MIT paper Adversarial Examples are not Bugs, they are Features [https://papers.nips.cc/paper/8307-adversarial-examples-are-not-bugs-they-are-features.pdf].
  • MIRI's AI Risk for Computer Scientists workshop [https://intelligence.org/ai-risk-for-computer-scientists/]. Workshops are on hold due to COVID-19, but you're welcome to apply, get in touch with us, etc.
A list of good heuristics that the case for AI x-risk fails

This doesn't seem like it belongs on a "list of good heuristics", though!

A list of good heuristics that the case for AI x-risk fails

I helped make this list in 2016 for a post by Nate, partly because I was dissatisfied with Scott's list (which includes people like Richard Sutton, who thinks worrying about AI risk is carbon chauvinism):

Stuart Russell’s Cambridge talk is an excellent introduction to long-term AI risk. Other leading AI researchers who have expressed these kinds of concerns about general AI include Francesca Rossi (IBM), Shane Legg (Google DeepMind), Eric Horvitz (Microsoft), Bart Selman (Cornell), Ilya Sutskever (OpenAI), Andrew Davison (Imperial College London), David McA

... (read more)
Optimization Amplifies

One of the main explanations of the AI alignment problem I link people to.

Toward a New Technical Explanation of Technical Explanation

When I read this post, it struck me as a remarkably good introduction to logical induction, and the whole discussion seemed very core to the formal-epistemology projects on LW and AIAF.

Misconceptions about continuous takeoff
My intuition is that it'd probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, mis-aligned superhuman AIs, and had cheap, robust methods to tell if a particular AI was misaligned.

This sounds different from how I model the situation; my views agree here with Nate's (emphasis added):

I would rephrase 3 as "There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” I
... (read more)
John Maxwell (2y): Has the "alignment roadblock" scenario been argued for anywhere? Like Lanrian, I think it sounds implausible. My intuition is that understanding human values is a hard problem, but taking over the world is a harder problem. For example, the AI which can talk its way out of a box probably has a very deep understanding of humans--a deeper understanding than most humans have of humans! In order to have such a deep understanding, it must have lower-level building blocks for making sense of the world which work extremely well, and could be used for a value learning system.

BTW, coincidentally, I quoted this same passage in a post [https://www.lesswrong.com/posts/2Z8pMDfDduAwtwpcX/three-stories-for-how-agi-comes-before-fai] I wrote recently which discussed this scenario (among others). Is there a particular subscenario of this I outlined which seems especially plausible to you?
Lukas Finnveden (2y): That sounds right. I was thinking about an infinitely robust misalignment-oracle to clarify my thinking, but I agree that we'll need to be very careful with any real-world tests.

If I imagine writing code and using the misalignment-oracle on it, I think I mostly agree with Nate's point. If we have the code and compute to train a superhuman version of GPT-2, and the oracle tells us that any agent coming out of that training process is likely to be misaligned, we haven't learned much new, and it's not clear how to design a safe agent from there.

I imagine a misalignment-oracle to be more useful if we use it during the training process, though. Concretely, it seems like a misalignment-oracle would be extremely useful for achieving inner alignment in IDA: as soon as the AI becomes misaligned, we can either rewind the training process and figure out what we did wrong, or directly use the oracle as a training signal that severely punishes any step that makes the agent misaligned. Coupled with the ability to iterate on designs, since we won't accidentally blow up the world on the way, I'd guess that something like this is more likely to work than not. This idea is extremely sensitive to (c), though.
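A minimal sketch of the training-time use Lukas describes, under stated assumptions: `is_misaligned` stands in for the hypothetical oracle and `update` for an ordinary training step (both names are mine, not from the thread). It shows the "rewind to the last safe checkpoint" option; the alternative Lukas mentions would fold the oracle's verdict into the training signal as a penalty instead.

```python
def train_with_oracle(model, batches, update, is_misaligned):
    """Toy sketch: `update(model, batch)` performs one ordinary training step;
    `is_misaligned(model) -> bool` stands in for the assumed oracle."""
    safe_checkpoint = model
    for batch in batches:
        candidate = update(safe_checkpoint, batch)
        if is_misaligned(candidate):
            # Rewind: discard this step (and, in practice, investigate what
            # went wrong) rather than continuing from a misaligned model.
            continue
        safe_checkpoint = candidate
    return safe_checkpoint
```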
AI Alignment Open Thread August 2019
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)

This in particular doesn't match my model. Quoting some relevant bits from Embedded Agency:

So I'm not talking about agents who know their own actions because I think there's going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the conseque
... (read more)
Alignment Newsletter One Year Retrospective

One option that's smaller than link posts might be to mention in the AF/LW version of the newsletter which entries are new to AIAF/LW as far as you know; or make comment threads in the newsletter for those entries. I don't know how useful these would be either, but it'd be one way to create common knowledge 'this is currently the one and only place to discuss these things on LW/AIAF'.

Comparison of decision theories (with a focus on logical-counterfactual decision theories)

Agents need to consider multiple actions and choose the one that has the best outcome. But we're supposing that the code representing the agent's decision only has one possible output. E.g., perhaps an agent is going to choose between action A and action B, and will end up choosing A. Then a sufficiently close examination of the agent's source code will reveal that the scenario "the agent chooses B" is logically inconsistent. But then it's not clear how the agent can reason about the desirability of "the agent chooses B" while evaluating its outcomes, if not via some mechanism for nontrivially reasoning about outcomes of logically inconsistent situations.

Chris Leong (3y): Do we need the ability to reason about logically inconsistent situations? Perhaps we could attempt to transform the question of logical counterfactuals into a question about consistent situations instead, as I describe in this post [https://www.lesswrong.com/posts/BRuWm4GxcTNPn4XDX/deconfusing-logical-counterfactuals]? Or to put it another way, is the idea of logical counterfactuals an analogy or something that is supposed to be taken literally?
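A toy illustration (mine, with made-up names, not from the thread) of the situation Rob describes above: the agent's source code fixes its output, so the premise "the agent chooses B" is logically inconsistent, and naive evaluation of that counterfactual has no well-defined answer.

```python
def agent():
    # The agent's decision procedure. Inspecting this source code shows that
    # the only possible output is "A".
    return "A"

def outcome(action):
    # The agent's model of how good each action's outcome would be.
    return {"A": 5, "B": 10}[action]

# To choose well, the agent wants to compare outcome("A") with outcome("B").
# But the premise agent() == "B" contradicts the agent's own source code, so a
# purely logical reasoner can derive anything from it, and "what happens if the
# agent chooses B" has no well-defined consequences. Logical-counterfactual
# decision theories are attempts to say what this comparison should mean.
print(max(["A", "B"], key=outcome))  # naive evaluation picks "B"
print(agent())                       # but the code only ever outputs "A"
```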
Comparison of decision theories (with a focus on logical-counterfactual decision theories)

The comment starting "The main datapoint that Rob left out..." is actually by Nate Soares. I cross-posted it to LW from an email conversation.

Embedded Agency (full-text version)

The above is the full Embedded Agency sequence, cross-posted from the MIRI website so that it's easier to find the text version on AIAF/LW (via search, sequences, author pages, etc.).

Scott and Abram have added a new section on self-reference to the sequence since it was first posted, and slightly expanded the subsequent section on logical uncertainty and the start of the robust delegation section.

Rob Bensinger (1y): Abram added a lot of additional material to this today: https://www.lesswrong.com/posts/9vYg8MyLL4cMMaPQJ/updates-and-additions-to-embedded-agency