Stuart Armstrong

Sequences

AI Safety Subprojects
Practical Guide to Anthropics
Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...

Comments

How an alien theory of mind might be unlearnable

It's an interesting question as to whether Alice is actually overconfident. Her predictions about human behaviour may be spot on at this point - much better than our own predictions about ourselves. So her confidence depends on whether she has the right kind of philosophical uncertainty.

Are there alternatives to solving value transfer and extrapolation?

I actually don't think that Alice could help a (sufficiently alien) alien. She needs an alien theory of mind to understand what the alien wants, how they would extrapolate, how to help that extrapolation without manipulating it, and so on. Without that, she's just projecting human assumptions onto alien behaviour and statements.

General alignment plus human values, or alignment via human values?

Yes, but we would be mostly indifferent to shifts in the distribution that preserve most of the features - eg if the weather was the same but delayed or advanced by six days.

Are there alternatives to solving value transfer and extrapolation?

I have some draft posts explaining some of this stuff better; I can share them privately, or you can hang on another month or two. :)

I'd like to see them. I'll wait for the final (posted) versions, I think.

Research Agenda v0.9: Synthesising a human's preferences into a utility function

Because our preferences are inconsistent, and if an AI says "your true preferences are U", we're likely to react by saying "no! No machine will tell me what my preferences are. My true preferences are U′, which are different in subtle ways".

General alignment plus human values, or alignment via human values?

Thanks for developing the argument. This is very useful.

The key point seems to be whether we can develop an AI that can successfully behave as a low impact AI - not an "on balance, things are ok" AI, but a genuinely low impact AI that ensures we don't move towards a world where our preferences might be ambiguous or underdefined.

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?
