Stuart Armstrong




GPT-3 and concept extrapolation

The aim of this post is not to catch out GPT-3; it's to see what concept extrapolation could look like for a language model.

Attainable Utility Preservation: Scaling to Superhuman

To see this, imagine the AUP agent builds a subagent to make the attainable auxiliary values of acting and of doing nothing equal in all future states, in order to neutralize the penalty term. This means it can't make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.

I believe this is incorrect. The action and the no-op compared in the penalty term are both actions of the AUP agent. The subagent just needs to cripple the AUP agent so that all of its actions are equivalent, then go about maximising the primary reward to the utmost.
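The disagreement can be made concrete with a toy calculation. This is a minimal sketch, not the paper's formalism: the Q-values, state and action names are illustrative, and the penalty is taken as the absolute difference in attainable auxiliary value between acting and doing nothing, summed over auxiliary rewards.

```python
def aup_penalty(q_aux, state, action, noop="noop"):
    """AUP-style penalty: sum over auxiliary reward functions of
    |Q_aux(state, action) - Q_aux(state, noop)| (hypothetical numbers)."""
    return sum(abs(q[state][action] - q[state][noop]) for q in q_aux)

# Normally, an impactful action differs from the no-op, so the penalty is non-zero:
q_aux = [{"s": {"a": 5.0, "noop": 2.0}}]
print(aup_penalty(q_aux, "s", "a"))  # 3.0

# If a subagent cripples the AUP agent so that every action it can take
# is equivalent to doing nothing, the penalty vanishes -- while the
# subagent itself remains free to optimise the primary reward:
q_crippled = [{"s": {"a": 2.0, "noop": 2.0}}]
print(aup_penalty(q_crippled, "s", "a"))  # 0.0
```

The point of the sketch: the penalty only constrains the AUP agent's own action/no-op comparison, so equalising those Q-values zeroes it regardless of what the subagent does.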

$100/$50 rewards for good references

Hey there! Sorry for the delay. $50 awarded to you for fastest good reference. PM me your bank details.

The Goldbach conjecture is probably correct; so was Fermat's last theorem

I'm not sure why you picked .

Because it's the first case I thought of where the probability numbers work out, and I just needed one example to round off the post :-)

Why I'm co-founding Aligned AI

It's worth writing up your point and posting it - that tends to clarify the issue, for yourself as well as for others.

Why I'm co-founding Aligned AI

I've posted on the theoretical difficulties of aggregating the utilities of different agents. But doing it in practice is much more feasible (scale the utilities to some not-too-unreasonable common scale, add them, and maximise the sum).
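The practical recipe in the parenthetical can be sketched in a few lines. This is an illustration under assumed inputs (the agents, outcomes, and the choice of min-max rescaling are all hypothetical), not a claim about how Aligned AI aggregates utilities.

```python
def rescale(utility):
    """Affinely map a utility function (dict: outcome -> value) onto [0, 1],
    a simple choice of 'not-too-unreasonable' common scale."""
    lo, hi = min(utility.values()), max(utility.values())
    return {o: (v - lo) / (hi - lo) for o, v in utility.items()}

def aggregate_choice(utilities):
    """Rescale each agent's utility, add them, and pick the maximising outcome."""
    rescaled = [rescale(u) for u in utilities]
    return max(utilities[0], key=lambda o: sum(r[o] for r in rescaled))

# Two agents with opposed favourites; the compromise outcome "z" wins the sum.
alice = {"x": 0.0, "y": 10.0, "z": 4.0}
bob = {"x": 5.0, "y": 0.0, "z": 4.5}
print(aggregate_choice([alice, bob]))  # z
```

Note that the answer depends on the rescaling chosen - which is exactly where the theoretical difficulties re-enter in practice.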

But value extrapolation is different from human value aggregation; for example, low power (or low impact) AIs can be defined with value extrapolation, and that doesn't need human value aggregation.

Why I'm co-founding Aligned AI

Yes, those are important to provide, and we will.

Why I'm co-founding Aligned AI

I do not put too much weight on that intuition, except as an avenue to investigate (how do humans do it, exactly? If it depends on the social environment, can the conditions of that be replicated?).

Why I'm co-founding Aligned AI

We're aiming to solve the problem in a way that is acceptable to one given human, and then generalise from that.
