Stuart Armstrong

Sequences

Subagents and impact measures
If I were a well-intentioned AI...

Comments

Model splintering: moving from one imperfect model to another

I agree that once you have landed in the bad situation, the mitigation options might be much the same, e.g. switching off the agent.

I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).
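For concreteness, here is a minimal sketch of what "mitigation the agent takes itself" might look like: the agent checks whether an input looks out-of-distribution relative to its training data and, if so, falls back to deferring rather than optimising. The class name, the Mahalanobis-distance test, and the threshold are my own illustrative assumptions, not anything from the post.

```python
import numpy as np

class CautiousAgent:
    """Illustrative sketch: defer when an observation looks out-of-distribution."""

    def __init__(self, train_features: np.ndarray, threshold: float = 3.0):
        # Summarise the training distribution with a mean and (pseudo-)inverse covariance.
        self.mean = train_features.mean(axis=0)
        self.cov_inv = np.linalg.pinv(np.cov(train_features, rowvar=False))
        self.threshold = threshold  # assumed Mahalanobis-distance cutoff

    def is_out_of_distribution(self, x: np.ndarray) -> bool:
        diff = x - self.mean
        distance = float(np.sqrt(diff @ self.cov_inv @ diff))
        return distance > self.threshold

    def act(self, x: np.ndarray) -> str:
        if self.is_out_of_distribution(x):
            # Self-applied mitigation: defer instead of confidently optimising.
            return "defer_to_human"
        return "normal_policy_action"


# Example usage with synthetic training data.
train = np.random.default_rng(0).normal(size=(500, 4))
agent = CautiousAgent(train)
print(agent.act(np.zeros(4)))        # in-distribution -> normal_policy_action
print(agent.act(np.full(4, 10.0)))   # far from training data -> defer_to_human
```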

Model splintering: moving from one imperfect model to another

Thanks! Lots of useful insights in there.

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.

Generalised models as a category

Cheers! My opinion on category theory has changed a bit because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.

Introduction to Cartesian Frames

I like it. I'll think about how it fits with my ways of thinking (e.g. model splintering).

Counterfactual control incentives

Cheers! Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with the terminology used in control theory.

We agree that a lack of control incentive on X does not mean that X is safe from influence by the agent, since the agent may influence X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.
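A toy illustration of the point (my own construction, not from the original discussion): the agent's reward depends only on the utility node U, yet because U and X share a latent factor, a policy chosen purely to increase U also drives X, so the agent ends up acting as if it were manipulating X.

```python
import numpy as np

rng = np.random.default_rng(0)

def environment(action: float):
    # A latent factor couples U and X: acting on U unavoidably moves X too.
    latent = action + rng.normal(scale=0.1)
    U = latent          # utility node: the only thing the agent is rewarded for
    X = 0.8 * latent    # correlated variable with no explicit incentive on it
    return U, X

# Pick the action that maximises (estimated) expected utility U.
actions = np.linspace(-1.0, 1.0, 201)
best_action = max(actions, key=lambda a: np.mean([environment(a)[0] for _ in range(50)]))

U, X = environment(best_action)
print(f"chosen action={best_action:.2f}, U={U:.2f}, X={X:.2f}")
# X has been pushed to a large value as a side effect, even though the
# reward signal never mentioned X.
```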

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)

Cheers, that would be very useful.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)

I feel that the whole AI alignment problem can be seen as a problem of ontological shifts: https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1
