Stuart Armstrong

Sequences

AI Safety Subprojects
Practical Guide to Anthropics
Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...

Comments

General alignment plus human values, or alignment via human values?

Thanks for developing the argument. This is very useful.

The key point seems to be whether we can develop an AI that successfully behaves as a low-impact AI - not one that merely leaves things "on balance, ok", but a genuinely low-impact AI that ensures we don't move towards a world where our preferences might be ambiguous or underdefined.

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?
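To make the "genuinely low impact" idea concrete, here is a minimal sketch of one common way impact measures are formalised: the agent's task reward is penalised by how far its actions push the world away from an "inaction" baseline. The feature map, the distance metric, and the penalty weight are all illustrative assumptions, not a proposal from the post or comment.

```python
import numpy as np

def low_impact_value(task_reward, state_features, baseline_features, penalty_weight=1.0):
    """Task reward minus a penalty for deviation from the do-nothing baseline.

    A hypothetical sketch: 'state_features' describe the world after acting,
    'baseline_features' describe it had the agent done nothing.
    """
    deviation = np.linalg.norm(
        np.asarray(state_features, dtype=float) - np.asarray(baseline_features, dtype=float),
        ord=1,
    )
    return task_reward - penalty_weight * deviation

# Example: making tea earns reward 1.0; a plan that barely disturbs the world
# keeps most of that reward, while a plan that rearranges many world-features
# is heavily penalised.
print(low_impact_value(1.0, state_features=[1, 0, 0], baseline_features=[0, 0, 0]))
print(low_impact_value(1.0, state_features=[1, 5, 3], baseline_features=[0, 0, 0]))
```

The scenario above is exactly where such a penalty runs into trouble: if either successor design moves the world far from any baseline, the penalty cannot distinguish them in a useful way.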

General alignment plus human values, or alignment via human values?

The successor problem is important, but it assumes we have the values already.

I'm imagining algorithms designing successors with imperfect values (that they know to be imperfect). It's a somewhat different problem (though solving the classical successor problem is also important).

General alignment plus human values, or alignment via human values?

I agree there are superintelligent unconstrained AIs that could accomplish tasks (making a cup of tea) without destroying the world. But I feel such an AI would have to have so much knowledge of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little would remain to define full alignment.

AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Those are very relevant to this project, thanks. I want to see how far we can push these approaches; maybe some people you know would like to take part?

Force neural nets to use models, then detect these

Vertigo, lust, pain reactions, some fear responses, and so on, don't involve a model. Some versions of "learning that it's cold outside" don't involve a model, just looking out and shivering; the model aspect comes in when you start reasoning about what to do about it. People often drive to work without consciously modelling anything on the way.

Think model-based learning versus Q-learning: anything that's closer to Q-learning is not model-based.
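As a toy illustration of that contrast (the function names and the two-action setup are assumptions for the example, not anything from the comment): a model-free learner updates state-action values directly from experience and never represents how the world changes, while a model-based learner maintains an explicit transition and reward model that it can later plan against.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, actions=(0, 1)):
    """Model-free: adjust the value of (s, a) from one experienced transition,
    without ever representing the dynamics of the environment."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def model_based_update(T, R, s, a, r, s_next):
    """Model-based: record an explicit (count-based) transition model and a
    reward model, which a planner could later query."""
    T[(s, a)][s_next] += 1
    R[(s, a)] = r

Q = defaultdict(float)
T = defaultdict(lambda: defaultdict(int))
R = {}

# One toy transition: in state 0, action 1 yields reward 1.0 and moves to state 1.
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
model_based_update(T, R, s=0, a=1, r=1.0, s_next=1)
```

The first learner ends up with nothing but a table of action values; only the second has anything you could point to and call a model.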

Force neural nets to use models, then detect these

"I think the question of whether any particular plastic synapse is or is not part of the information content of the model will have a straightforward yes-or-no answer."

I don't think it has an easy yes-or-no answer (at least without some thought as to what constitutes a model within the mess of human reasoning), and I'm sure that even if it does, it's not a straightforward one.

"since we probably won't have those kinds of real-time-brain-scanning technologies, right?"

One hope would be that, by the time we have those technologies, we'd know what to look for.

What does GPT-3 understand? Symbol grounding and Chinese rooms

I have only very limited access to GPT-3; it would be interesting if others played around with my instructions, making them easier for humans to follow, while checking that GPT-3 still failed.
