Copying over a Slack comment from Abram Demski:
I think this post could be pretty important.
It offers a formal treatment of "goal-directedness" and its relationship to coherence theorems such as VNM, a topic which has seen some past controversy but which has -- till now -- been dealt with only quite informally. Personally I haven't known how to engage with the whole goal-directedness debate, and I think part of the reason for that is the vagueness of the idea. Goal-directedness doesn't seem that cruxy for most of my thinking, but some other people seem to r
OK, thanks for the clarifications!
Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on how to build an aligned one. Can you explain to me why that's inaccurate?
I don't know what you mean by "perfectly rational AGI". (Perfect rationality isn't achievable, rationality-in-general is convergently instrumental, and rationality is insufficient for getting good outcomes. So why would that be the goal?)
I think of the basic case for HRAD this way:
Maybe what I want is a two-dimensional framing: "prosaic AI vs. novel AI" and "whiteboards vs. code". Then I can more clearly say that I'm pretty far toward 'novel AI' on one dimension (though not as far as I was in 2015), separate from whether I currently think the bigger bottlenecks (now or in the future) are more whiteboard-ish problems vs. more code-ish problems.
Cool, that makes sense!
I abused the hyperbole in that case. What I was pointing out is the impression that old-school MIRI (a lot of the HRAD work) thinks that solving the alignment problem requires deconfusing every related philosophical problem in terms of maths, and then implementing that. Such a view doesn't seem shared by many in the community for a couple of reasons:
I'm still not totally clear here about which parts were "hyperbole" vs. endorsed. You say that people's "impression" was that MIRI wanted to deconfuse "every related philosophical problem... (read more)
Nowadays, the vast majority of the field disagree that there’s any hope of formalizing all of philosophy and then just implementing that to get an aligned AGI.
What do you mean by "formalizing all of philosophy"? I don't see 'From Philosophy to Math to Engineering' as arguing that we should turn all of philosophy into math (and I don't even see the relevance of this to Friendly AI). It's just claiming that FAI research begins with fuzzy informal ideas/puzzles/goals (like the sort you might see philosophers debate), then tries to move in more formal directio... (read more)
One-off, though Carlier, Clarke, and Schuett have a similar survey coming out in the next week.
Source for the blacksmith analogy: I Still Don't Get Foom
Maybe changing the title would prime people less to have the wrong interpretation? E.g., to 'Coherence arguments require that the system care about something'.
Even just 'Coherence arguments do not entail goal-directed behavior' might help, since colloquial "imply" tends to be probabilistic, but you mean math/logic "imply" instead. Or 'Coherence theorems do not entail goal-directed behavior on their own'.
Some skepticism from Eliezer here: https://twitter.com/ESRogs/status/1337869362678571008
I've copied over comments by MIRI's Evan Hubinger and Eliezer Yudkowsky on a slightly earlier draft of Ajeya's post — as a separate post, since it's a lot of text.
Previously linked here: https://www.alignmentforum.org/posts/wsBpJn7HWEPCJxYai/excerpt-from-arbital-solomonoff-induction-dialogue
Two examples of MIRI talking about orthogonality, instrumental convergence, etc.: "Five Theses, Two Lemmas, and a Couple of Strategic Implications" (2013) and "So Far: Unfriendly AI Edition" (2016). The latter is closer to how I'd start a discussion with a random computer scientist today, if they thought AGI alignment isn't important to work on and I wanted to figure out where the disagreement lies.
I think "Five Theses..." is basically a list of 'here are the five things Ray Kurzweil is wrong about'. A lot of people interested in AGI early on held Kurzweil... (read more)
After seeing this post last month, Eliezer mentioned to me that he likes your recent posts, and would want to spend money to make more posts like this exist, if that were an option.
(I've poked Richard about this over email already, but wanted to share the Eliezer-praise here too.)
I agree with this post.
May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).
I don't think the discussion stands great on its own, but it may be helpful for:
Seems like a good starting point for discussion. Researchers need to have some picture of what AI alignment is "for," in order to think about what research directions look most promising.
I want to see more attempts to answer this question. Also related to another post I nominated: https://www.lesswrong.com/posts/PKy8NuNPknenkDY74/soft-takeoff-can-still-lead-to-decisive-strategic-advantage
I'm not a slow-takeoff proponent, and I don't agree with everything in this post; but I think it's asking a lot of the right questions and introducing some useful framings.
I've added the section-2 definitions above to https://www.lesswrong.com/posts/kLLu387fiwbis3otQ/cartesian-frames-definitions.
And now I've made a LW post collecting most of the definitions in the sequence so far, so they're easier to find: https://www.lesswrong.com/posts/kLLu387fiwbis3otQ/cartesian-frames-definitions
I'm collecting most of the definitions from this sequence on one page, for easier reference: https://www.lesswrong.com/posts/kLLu387fiwbis3otQ/cartesian-frames-definitions
For my personal use when I was helping review Scott's drafts, I made some mnemonics (complete with silly emojis to keep track of the small Cartesian frames and operations) here: https://docs.google.com/drawings/d/1bveBk5Pta_tml_4ezJ0oWiq-qudzgnsRlfbGJgZ1qv4/.
(Also includes my crude visualizations of morphism composition and homotopy equivalence to help those concepts stick better in my brain.)
Scott's post explaining the relationship between C0 and C1 exists as of now: Functors and Coarse Worlds.
To get an intuition for morphisms, I tried listing out every frame that has a morphism going to a simple 2x2 frame
$$C_0 = \begin{array}{c|cc} & f_0 & f_1 \\ \hline b_0 & w_0 & w_1 \\ b_1 & w_2 & w_3 \end{array}$$
Are any of the following wrong? And, am I missing any?
Frames I think have a morphism going to C0:
Every frame that looks like a frame on this list (other than 0), but with extra columns added — regardless of what's in those columns. (As a special case, this... (read more)
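To make this concrete, here's a brute-force sketch I find helpful (my own toy code, not anything from the sequence; the matrix representation and names are illustrative). It checks whether one small frame has a morphism to another, using the sequence's definition of a morphism from C to D as a pair of maps g: Agent(C) → Agent(D) and h: Env(D) → Env(C) satisfying Eval_C(a, h(f)) = Eval_D(g(a), f):

```python
# Toy brute-force morphism check. Frames are matrices of possible worlds:
# c[a][e] is the world reached when the agent plays a and the environment plays e.
from itertools import product

def has_morphism(c, d):
    """Return True iff there is some morphism from frame c to frame d."""
    A, E = range(len(c)), range(len(c[0]))
    B, F = range(len(d)), range(len(d[0]))
    for g in product(B, repeat=len(A)):      # candidate maps g: A -> B
        for h in product(E, repeat=len(F)):  # candidate maps h: F -> E
            if all(c[a][h[f]] == d[g[a]][f] for a in A for f in F):
                return True
    return False

c0 = [["w0", "w1"],
      ["w2", "w3"]]

# C0 has a morphism to itself (take g and h to be identity maps):
print(has_morphism(c0, c0))  # True

# Adding an extra column (environment option) with arbitrary contents still
# leaves a morphism to C0, as claimed in the list above:
c0_extra_col = [["w0", "w1", "w3"],
                ["w2", "w3", "w0"]]
print(has_morphism(c0_extra_col, c0))  # True
```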
Four examples of frames that are biextensionally equivalent to C1:
... or any frame that enlarges one of those four frames by adding extra copies of any of the rows and/or columns.
They're not equivalent. If two frames are 'homotopy equivalent' / 'biextensionally equivalent' (two names for the same thing, in Cartesian frames), it means that you can change one frame into the other (ignoring the labels of possible agents and environments, i.e., just looking at the possible worlds) by doing some combination of 'make a copy of a row', 'make a copy of a column', 'delete a row that's a copy of another row', and/or 'delete a column that's a copy of another column'.
The entries of C0 and C1 are totally different (Image(C0)... (read more)
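For the row/column operations described above, a toy check looks something like this (my own illustrative sketch, not code from the sequence; it treats a frame as a bare matrix of possible worlds and ignores agent/environment labels). It deletes duplicate rows and columns from each frame and then compares the reduced matrices up to reordering of rows and columns:

```python
# Toy biextensional-equivalence check via the copy/delete-row-and-column picture.
from itertools import permutations

def dedupe_rows(m):
    seen, out = set(), []
    for row in m:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(list(row))
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def reduce_frame(m):
    """Delete rows/columns that are copies of other rows/columns."""
    return transpose(dedupe_rows(transpose(dedupe_rows(m))))

def biextensionally_equivalent(m1, m2):
    r1, r2 = reduce_frame(m1), reduce_frame(m2)
    if len(r1) != len(r2) or len(r1[0]) != len(r2[0]):
        return False
    # Labels don't matter, so compare up to reordering of rows and columns.
    cols = range(len(r1[0]))
    for row_perm in permutations(r1):
        for col_perm in permutations(cols):
            if [[row[c] for c in col_perm] for row in row_perm] == r2:
                return True
    return False

frame = [["w0", "w1"],
         ["w2", "w3"]]
padded = [["w0", "w1", "w1"],     # same frame with a copied row and column
          ["w2", "w3", "w3"],
          ["w0", "w1", "w1"]]
print(biextensionally_equivalent(frame, padded))                        # True
print(biextensionally_equivalent(frame, [["w0", "w1"], ["w2", "w2"]]))  # False
```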
Scott's Sunday talk, covering content from this post and the Intro post: https://www.youtube.com/watch?v=H1tJdaCvcck
Abram added a lot of additional material to this today: https://www.lesswrong.com/posts/9vYg8MyLL4cMMaPQJ/updates-and-additions-to-embedded-agency.
Yet almost everyone agrees the world will likely be importantly different by the time advanced AGI arrives.
Why do you think this? My default assumption is generally that the world won't be super different from how it looks today in strategically relevant ways. (Maybe it will be, but I don't see a strong reason to assume that, though I strongly endorse thinking about big possible changes!)
A part I liked and thought was well-explained:
I think there's a strong argument for deception being simpler than corrigibility. Corrigibility has some fundamental difficulties in terms of... If you're imagining gradient descent process, which is looking at a proxy aligned model and is trying to modify it so that it makes use of this rich input data, it has to do some really weird things to make corrigibility work.
It has to first make a very robust pointer. With corrigibility, if it's pointing at all incorrectly to the wrong thing in the input data, wrong t
Lucas Perry: I guess I imagine the coordination here is that information on relative training competitiveness and performance competitiveness in systems is evaluated within AI companies and then possibly fed to high power decision makers who exist in strategy and governance for coming up with the correct strategy, given the landscape of companies and AI systems which exist?
Evan Hubinger: Yeah, that’s right.
I asked Evan about this and he said he misheard Lucas as asking roughly 'Are training competitiveness and performance competitiveness important for rese
I think it’s useful to sort of have in the back of your mind this analogy to evolution, but I would also be careful not to take it too far. I imagine that everything is going to generalize to the case of machine learning because it is a different process.
Should be "I think it’s useful to sort of have in the back of your mind this analogy to evolution, but I would also be careful not to take it too far and imagine that everything is going to generalize to the case of machine learning, because it is a different process."
World 3 doesn't strike me as a thing you can get in the critical period when AGI is a new technology. Worlds 1 and 2 sound approximately right to me, though the way I would say it is roughly: We can use math to better understand reasoning, and the process of doing this will likely improve our informal and heuristic descriptions of reasoning too, and will likely involve us recognizing that we were in some ways using the wrong high-level concepts to think about reasoning.
I haven't run the characterization above by any MIRI researchers, and different MIRI res
I agree with Ben and Richard's summaries; see https://www.lesswrong.com/posts/uKbxi2EJ3KBNRDGpL/comment-on-decision-theory:
We aren't working on decision theory in order to make sure that AGI systems are decision-theoretic, whatever that would involve. We're working on decision theory because there's a cluster of confusing issues here (e.g., counterfactuals, updatelessness, coordination) that represent a lot of holes or anomalies in our current best understanding of what high-quality reasoning is and how it works.
[...] The idea behind looking at (e.g.) coun
In September 2017, based on some conversations with MIRI and non-MIRI folks, I wrote:
I think that at least 80% of the AI safety researchers at MIRI, FHI, CHAI, OpenAI, and DeepMind would currently assign a >10% probability to this claim: "The research community will fail to solve one or more technical AI safety problems, and as a consequence there will be a permanent and drastic reduction in the amount of value in our future."
People may have become more optimistic since then, but most people falling in the 1-10% range would still surprise me a... (read more)
That would imply that 'intent alignment' is about aligning AI systems with what humans intend. But 'intent alignment' is about making AI systems intend to 'do the good thing'. (Where 'do the good thing' could be cashed out as 'do what some or all humans want', 'achieve humans' goals', or many other things.)
The thing I usually contrast with 'intent alignment' (≈ the AI's intentions match what's good) is something like 'outcome alignment' (≈ the AI's causal effects match what's good). As I personally think about it, the value of the former category is that i
I emailed Luke some corrections to the transcript above, most of which are now implemented. The changes that seemed least trivial to me (underlined below):
This doesn't seem like it belongs on a "list of good heuristics", though!
I helped make this list in 2016 for a post by Nate, partly because I was dissatisfied with Scott's list (which includes people like Richard Sutton, who thinks worrying about AI risk is carbon chauvinism):
Stuart Russell’s Cambridge talk is an excellent introduction to long-term AI risk. Other leading AI researchers who have expressed these kinds of concerns about general AI include Francesca Rossi (IBM), Shane Legg (Google DeepMind), Eric Horvitz (Microsoft), Bart Selman (Cornell), Ilya Sutskever (OpenAI), Andrew Davison (Imperial College London), David McA
One of the main explanations of the AI alignment problem I link people to.
When I read this post, it struck me as a remarkably good introduction to logical induction, and the whole discussion seemed very core to the formal-epistemology projects on LW and AIAF.
My intuition is that it'd probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, mis-aligned superhuman AIs, and had cheap, robust methods to tell if a particular AI was misaligned.
This sounds different from how I model the situation; my views agree here with Nate's (emphasis added):
I would rephrase 3 as “There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.”
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
This in particular doesn't match my model. Quoting some relevant bits from Embedded Agency:
So I'm not talking about agents who know their own actions because I think there's going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the conseque
One option that's smaller than link posts might be to mention in the AF/LW version of the newsletter which entries are new to AIAF/LW as far as you know; or make comment threads in the newsletter for those entries. I don't know how useful these would be either, but it'd be one way to create common knowledge 'this is currently the one and only place to discuss these things on LW/AIAF'.
Agents need to consider multiple actions and choose the one that has the best outcome. But we're supposing that the code representing the agent's decision only has one possible output. E.g., perhaps an agent is going to choose between action A and action B, and will end up choosing A. Then a sufficiently close examination of the agent's source code will reveal that the scenario "the agent chooses B" is logically inconsistent. But then it's not clear how the agent can reason about the desirability of "the agent chooses B" while evaluating its outcomes, if not via some mechanism for nontrivially reasoning about outcomes of logically inconsistent situations.
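Here's a minimal toy sketch of that last point (my own illustration; all names are made up for the example). If we treat "the agent chooses B" as an ordinary material conditional evaluated against the agent's actual, deterministic code, then every claim about B's outcome comes out vacuously true, which is exactly the kind of trivial reasoning we need to avoid:

```python
def agent_source() -> str:
    """The agent's actual (deterministic) decision code: it always returns 'A'."""
    return "A"

def outcome(action: str) -> int:
    """The payoff each action receives in this toy environment."""
    return {"A": 10, "B": 5}[action]

def conditional_holds(action: str, payoff: int) -> bool:
    """Material conditional 'the agent chooses `action` -> the payoff is `payoff`',
    checked against the one trace the agent's code can actually produce."""
    actual = agent_source()
    if actual != action:
        return True          # false antecedent: vacuously true
    return outcome(actual) == payoff

# For the action actually taken, only its real payoff "follows":
print(conditional_holds("A", 10), conditional_holds("A", 0))      # True False
# For the un-taken action, *any* payoff "follows" -- the problem described above:
print(conditional_holds("B", 5), conditional_holds("B", -1000))   # True True
```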
The comment starting "The main datapoint that Rob left out..." is actually by Nate Soares. I cross-posted it to LW from an email conversation.
The above is the full Embedded Agency sequence, cross-posted from the MIRI website so that it's easier to find the text version on AIAF/LW (via search, sequences, author pages, etc.).
Scott and Abram have added a new section on self-reference to the sequence since it was first posted, and slightly expanded the subsequent section on logical uncertainty and the start of the robust delegation section.