Stuart Armstrong


Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...


Non-poisonous cake: anthropic updates are normal

More SIAish for conventional anthropic problems. Other theories are more applicable for more specific situations, specific questions, and for duplicate issues.

The reverse Goodhart problem

Cheers, these are useful classifications.

The reverse Goodhart problem

The idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting the way it's used in practice.

After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".
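A minimal sketch of the "suboptimal improvement" case, under assumptions that are not from the post: the proxy is the true utility plus independent Gaussian noise, and the agent simply picks the option with the highest proxy. The function name, parameters, and distributions are all hypothetical choices made for illustration.

```python
import random

def select_by_proxy(n_options=1000, noise=1.0, seed=0):
    """Pick the option with the highest proxy score and report how it
    fares on true utility. Assumed model: proxy = true utility + noise."""
    rng = random.Random(seed)
    true_utils = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
    proxies = [u + rng.gauss(0.0, noise) for u in true_utils]
    # Maximise the proxy, not the true utility.
    chosen = max(range(n_options), key=lambda i: proxies[i])
    return {
        "chosen_true": true_utils[chosen],    # what we actually get
        "chosen_proxy": proxies[chosen],      # what the proxy promised
        "best_true": max(true_utils),         # what was achievable
        "mean_true": sum(true_utils) / n_options,  # baseline: random pick
    }

if __name__ == "__main__":
    runs = [select_by_proxy(seed=s) for s in range(50)]
    for key in ("mean_true", "chosen_true", "best_true", "chosen_proxy"):
        print(key, round(sum(r[key] for r in runs) / len(runs), 2))
```

In this toy setting, maximising the proxy typically still beats a random pick on true utility while falling short of the best available option. The stronger claim discussed above, that maximising the proxy ends up reducing true utility, needs additional structure that this sketch deliberately leaves out.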

Introduction To The Infra-Bayesianism Sequence

I want a formalism capable of modelling and imitating how humans handle these situations, and we don't usually have dynamic consistency (nor do boundedly rational agents).

Now, I don't want to weaken requirements "just because", but it may be that dynamic consistency is too strong a requirement to properly model what's going on. It's also useful to have AIs model human changes of morality, to figure out what humans count as values, so getting closer to human reasoning would be necessary.

Introduction To The Infra-Bayesianism Sequence

Hum... how about seeing enforcement of dynamic consistency as having a complexity/computation cost, and Dutch books (by other agents or by the environment) providing incentives to pay the cost? And hence the absence of these Dutch books meaning there is little incentive to pay that cost?

Introduction To The Infra-Bayesianism Sequence

Desideratum 1: There should be a sensible notion of what it means to update a set of environments or a set of distributions, which should also give us dynamic consistency.

I'm not sure how important dynamic consistency should be. When I talk about model splintering, I'm thinking of a bounded agent making fundamental changes to their model (though possibly gradually), a process that is essentially irreversible and contingent on the circumstances of discovering new scenarios. The strongest arguments for dynamic consistency are the Dutch-book type arguments, which depend on returning to a scenario very similar to the starting scenario, and these seem absent from model splintering as I'm imagining it.

Now, adding dynamic inconsistency is not useful; it's just that removing all of it (especially for a bounded agent) doesn't seem worth the effort.

Is there some form of "not lose too much utility to dynamic inconsistency" requirement that could be formalised?

Human priors, features and models, languages, and Solomonoff induction

For real humans, I think this is a more gradual process - they learn and use some distinctions, and forget others, until their mental models are quite different a few years down the line.

The splintering can happen when a single feature splinters; it doesn't have to be dramatic.

Counterfactual control incentives

Thanks. I think we mainly agree here.

Preferences and biases, the information argument

Look at the paper linked for more details.

Basically, "humans are always fully rational and always take the action they want to" is a full explanation of all human behaviour, one that is strictly simpler than any explanation which includes human biases and bounded rationality.

Model splintering: moving from one imperfect model to another

But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation

I'm thinking more of how we could automate navigating these situations. Detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.
