owencb — AI Alignment Forum

Decomposing Agency — capabilities without desires

owencb 1mo 2 0 Review for 2024 Review

I like this post and am glad that we wrote it.

Despite that, I feel keenly aware that it's asking a lot more questions than it's answering. I don't think I've got massively further in the intervening year in having good answers to those questions. The way this thinking seems to me to be most helpful is as a background model to help avoid confused assumptions when thinking about the future of AI. I do think this has impacted the way I think about AI risk, but I haven't managed to articulate that well yet (maybe in 2026 ...).

Acausal trade: double decrease

owencb 9y 1 0

I think the double decrease effect kicks in with uncertainty, but not with confident expectation of a smaller network.

A permutation argument for comparing utility functions

owencb 9y 0 0

I'm not sure I've fully followed, but I'm suspicious that you seem to be getting something for nothing in your shift from a type of uncertainty that we don't know how to handle to a type we do.

It seems to me like you must be making an implicit assumption somewhere. My guess is that this is where you used $i$ to pair $S$ with $S'$. If you'd instead chosen $j = i \circ \rho$ as the matching then you'd have uncertainty between whether $m$ should be $j$ or $\rho^{-1} \circ j$. My guess is that generically this gives different recommendations from your approach.

Learning Impact in RL

owencb 9y 2 0

Seems to me like there are a bunch of challenges. For example, you need extra structure on your space to add things or tell what's small, and you really want to keep track of long-term impact, not just impact at the next time-step. The long-term one in particular seems thorny (for low-impact in general, not just for this).

Nevertheless, I think this idea looks promising enough to explore further; I would also like to hear David's reasons.

On motivations for MIRI's highly reliable agent design research

owencb 9y 2 0

For #5, OK, there's something to this. But:

It's somewhat plausible that stabilising pivotal acts will be available before world-destroying ones;
Actually there's been a supposition smuggled in already with "the first AI systems capable of performing pivotal acts". Perhaps there will at no point be a system capable of a pivotal act. I'm not quite sure whether it's appropriate to talk about the collection of systems that exist being together capable of pivotal acts if they will not act in concert. Perhaps we'll have a collection of systems which, if aligned, would produce a win, or, if acting together towards an unaligned goal, would produce catastrophe. If they each have different unaligned goals, it's unclear that we necessarily get catastrophe (though it's certainly not a comfortable scenario).

I like your framing for #1.

On motivations for MIRI's highly reliable agent design research

owencb 9y 2 0

Thanks for the write-up, this is helpful for me (Owen).

My initial takes on the five steps of the argument as presented, in approximately decreasing order of how much I am on board:

Number 3 is a logical entailment; no quarrel here.
Number 5 is framed as "therefore", but adds the assumption that this will lead to catastrophe. I think this is quite likely if the systems in question are extremely powerful, but less likely if they are of modest power.
Number 4 splits my intuitions. I begin with some intuition that selection pressure would significantly constrain the goal (towards something reasonable in many cases), but the example of Solomonoff Induction was surprising to me and makes me more unsure. I feel inclined to defer intuitions on this to others who have considered it more.
Number 2 I don't have a strong opinion on. I can tell myself stories which point in either direction, and neither feels compelling.
Number 1 is the step I feel most sceptical about. It seems to me likely that the first AIs which can perform pivotal acts will not perform fully general consequentialist reasoning. I expect that they will perform consequentialist reasoning within certain domains (e.g. AlphaGo in some sense reasons about consequences of moves, but has no conception of consequences in the physical world). This isn't enough to alleviate concern: some such domains might be general enough that something misbehaving in them would cause large problems. But it is enough for me to think that paying attention to scope of domains is a promising angle.
