Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.


Alignment Research = Conceptual Alignment Research + Applied Alignment Research

OK, thanks for the clarifications!

Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on building an aligned one. Can you explain to me why that's inaccurate?

I don't know what you mean by "perfectly rational AGI". (Perfect rationality isn't achievable, rationality-in-general is convergently instrumental, and rationality is insufficient for getting good outcomes. So why would that be the goal?)

I think of the basic case for HRAD this way:

  • We seem to be pretty confused about a lot of aspects of optimization, reasoning, decision-making, etc. (Embedded Agency is talking about more or less the same set of questions as HRAD, just with subsystem alignment added to the mix.)
  • If we were less confused, it might be easier to steer toward approaches to AGI that make it easier to do alignment work like 'understand what cognitive work the system is doing internally', 'ensure that none of the system's compute is being used to solve problems we don't understand / didn't intend', 'ensure that the amount of quality-adjusted thinking the system is putting into the task at hand is staying within some bound', etc.

    These approaches won't look like decision theory, but being confused about basic ground-floor things like decision theory is a sign that you're likely not in an epistemic position to efficiently find such approaches, much like being confused about how/whether chess is computable is a sign that you're not in a position to efficiently steer toward good chess AI designs.

Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Maybe what I want is a two-dimensional breakdown: "prosaic AI vs. novel AI" on one axis and "whiteboards vs. code" on the other. Then I can more clearly say that I'm pretty far toward 'novel AI' on one dimension (though not as far as I was in 2015), separate from whether I currently think the bigger bottlenecks (now or in the future) are more whiteboard-ish problems vs. more code-ish problems.

Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Cool, that makes sense!

I abused the hyperbole in that case. What I was pointing out is the impression that old-school MIRI (a lot of the HRAD work) thinks that solving the alignment problem requires deconfusing every related philosophical problem in terms of maths, and then implementing that. Such a view doesn't seem shared by many in the community for a couple of reasons:

I'm still not totally clear here about which parts were "hyperbole" vs. endorsed. You say that people's "impression" was that MIRI wanted to deconfuse "every related philosophical problem", which suggests to me that you think there's some gap between the impression and reality. But then you say "such a view doesn't seem shared by many in the community" (as though the "impression" is an actual past-MIRI-view others rejected, rather than a misunderstanding).

HRAD has always been about deconfusion (though I agree we did a terrible job of articulating this), not about trying to solve all of philosophy or "write down a perfectly aligned AGI from scratch". The spirit wasn't 'we should dutifully work on these problems because they're Important-sounding and Philosophical'; from my perspective, it was more like 'we tried to write down a sketch of how to align an AGI, and immediately these dumb issues with self-reference and counterfactuals and stuff cropped up, so we tried to get those out of the way fast so we could go back to sketching how to aim an AGI at intended targets'. As Eliezer put it,

It was a dumb kind of obstacle to run into—or at least it seemed that way at that time. It seemed like if you could get a textbook from 200 years later, there would be one line of the textbook telling you how to get past that.

From my perspective, the biggest reason MIRI started diversifying approaches away from our traditional focus was shortening timelines. We still felt that "conceptual" progress was crucial, and still felt that marginal progress on the Agent Foundations directions would be useful; but we now assigned more probability to 'there may not be enough time to finish the core AF stuff', enough to want to put a lot of time into other problems too.

Actually, I'm not sure how to categorize MIRI's work using your conceptual vs. applied division. I'd normally assume "conceptual", because our work is so far away from prosaic alignment; but you also characterize applied alignment research as being about "experimentally testing these ideas [from conceptual alignment]", which sounds like the 2017-initiated lines of research we described in our 2018 update. If someone is running software experiments to test ideas about "Seeking entirely new low-level foundations for optimization" outside the current ML paradigm, where does that fall?

If AGI turns out to be prosaic AGI (which sounds like one of the best bets to make now), then what matters is aligning neural nets, not finding a way to write down a perfectly aligned AGI from scratch.

Prosaic AGI alignment and "write down a perfectly aligned AGI from scratch" both seem super doomed to me, compared to approaches that are neither prosaic nor perfectly-neat-and-tidy. Where does research like that fall?

Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Nowadays, the vast majority of the field disagree that there’s any hope of formalizing all of philosophy and then just implementing that to get an aligned AGI.

What do you mean by "formalizing all of philosophy"? I don't see 'From Philosophy to Math to Engineering' as arguing that we should turn all of philosophy into math (and I don't even see the relevance of this to Friendly AI). It's just claiming that FAI research begins with fuzzy informal ideas/puzzles/goals (like the sort you might see philosophers debate), then tries to move in more formal directions.

I imagine part of Luke's point in writing the post was to push back against the temptation to see formal and informal approaches as opposed ('MIRI does informal stuff, so it must not like formalisms'), and to push back against the idea that analytic philosophers 'own' whatever topics they happen to have historically discussed.

Conceptual alignment research isn’t just turning philosophy into mathematics. This is a failure mode I warned against recently: what matters is deconfusion, not formalization.

Pearl's causality (the main example of "turning philosophy into mathematics" Luke uses) was an example of achieving deconfusion about causality, not an example of 'merely formalizing' something. I agree that calling this deconfusion is a clearer way of pointing at the thing, though!

"Existential risk from AI" survey results

One-off, though Carlier, Clarke, and Schuett have a similar survey coming out in the next week.

Coherence arguments imply a force for goal-directed behavior

Maybe changing the title would prime people less to have the wrong interpretation? E.g., to 'Coherence arguments require that the system care about something'.

Even just 'Coherence arguments do not entail goal-directed behavior' might help, since colloquial "imply" tends to be probabilistic, but you mean math/logic "imply" instead. Or 'Coherence theorems do not entail goal-directed behavior on their own'.

The case for aligning narrowly superhuman models

I've copied over comments by MIRI's Evan Hubinger and Eliezer Yudkowsky on a slightly earlier draft of Ajeya's post — as a separate post, since it's a lot of text.
