Stuart Armstrong


Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...


Counterfactual control incentives

Thanks. I think we mainly agree here.

Preferences and biases, the information argument

See the linked paper for more details.

Basically, "humans are always fully rational and always take the action they want to" is a full explanation of all human behaviour, and it is strictly simpler than any explanation that includes human biases and bounded rationality.

Model splintering: moving from one imperfect model to another

But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation

I'm more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.

Model splintering: moving from one imperfect model to another

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).

Model splintering: moving from one imperfect model to another

Thanks! Lots of useful insights in there.

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.

Generalised models as a category

Cheers! My opinion on category theory has changed a bit, because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.

Stuart_Armstrong's Shortform

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution.

That's a $Q$ that defines $Q(A \mid B)$ for some, but not all, $A$ and $B$ (with $Q(A) = Q(A \mid \Omega)$ for $\Omega$ being the whole set of outcomes).

This is a partial probability distribution iff there exists a probability distribution $P$ that is equal to $Q$ wherever $Q$ is defined. Call this $P$ a full extension of $Q$.

Suppose that $Q(C \mid D)$ is not defined. We can, however, say that $Q(C \mid D) = x$ is a logical implication of $Q$ if every full extension $P$ has $P(C \mid D) = x$.

Eg: $Q(A)$, $Q(B)$, $Q(A \cup B)$ will logically imply the value of $Q(A \cap B)$.
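To make the idea of a logical implication concrete, here is a minimal Python sketch. It is my own illustration, not code from the research. It uses a toy outcome space of four atoms (in A or not, in B or not) and only searches probability distributions whose weights are multiples of 1/10, so it is a finite approximation of "every full extension". The helper names (`grid_distributions`, `implied_values`) and the numbers 1/2, 2/5 and 7/10 are assumptions chosen for the example. The sketch collects the values a query event takes across all extensions found. A single value means the partial assignment pins the query down. Several values mean it does not.

```python
from fractions import Fraction
from itertools import product

# Four atoms of a toy outcome space, encoded as (in A?, in B?).
ATOMS = [(True, True), (True, False), (False, True), (False, False)]

def prob(dist, event):
    """Probability of `event` (a predicate on atoms) under `dist`."""
    return sum(p for atom, p in zip(ATOMS, dist) if event(atom))

def grid_distributions(steps):
    """Yield every distribution over ATOMS with weights that are multiples of 1/steps."""
    for ks in product(range(steps + 1), repeat=len(ATOMS) - 1):
        rest = steps - sum(ks)
        if rest >= 0:
            yield tuple(Fraction(k, steps) for k in ks + (rest,))

def implied_values(partial, query, steps=10):
    """Values of `query` across all grid distributions that agree with `partial`."""
    values = set()
    for dist in grid_distributions(steps):
        if all(prob(dist, event) == value for event, value in partial):
            values.add(prob(dist, query))
    return values

in_a = lambda atom: atom[0]
in_b = lambda atom: atom[1]
a_or_b = lambda atom: atom[0] or atom[1]
a_and_b = lambda atom: atom[0] and atom[1]

# Hypothetical partial assignment: probabilities of A, B and their union only.
q_full = [(in_a, Fraction(1, 2)), (in_b, Fraction(2, 5)), (a_or_b, Fraction(7, 10))]
# Weaker assignment: drop the union, keeping only A and B.
q_weak = q_full[:2]

pinned = implied_values(q_full, a_and_b)  # one value: the intersection is implied
loose = implied_values(q_weak, a_and_b)   # several values: nothing is implied
```

With all three values given, every extension found agrees on the probability of the intersection (inclusion-exclusion forces 1/2 + 2/5 - 7/10 = 1/5). With the union dropped, different extensions disagree, so no value is implied. A full treatment would replace the grid search with a linear-programming check over all distributions.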

Introduction to Cartesian Frames

I like it. I'll think about how it fits with my ways of thinking (eg model splintering).

Counterfactual control incentives

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.
