# All of Stuart_Armstrong's Comments + Replies

GPT-3 and concept extrapolation

The aim of this post is not to catch out GPT-3; it's to see what concept extrapolation could look like for a language model.

3 Daniel Kokotajlo 2mo
OK, cool. I think I was confused.
Attainable Utility Preservation: Scaling to Superhuman

To see this, imagine the AUP agent builds a subagent to make for all future , in order to neutralize the penalty term. This means it can't make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.

I believe this is incorrect. The and are the actions of the AUP agent. The subagent just needs to cripple the AUP agent so that all actions are equivalent, then go about maximising to the utmost.

$100/$50 rewards for good references

Hey there! Sorry for the delay. \$50 awarded to you for fastest good reference. PM me your bank details.

The Goldbach conjecture is probably correct; so was Fermat's last theorem

I'm not sure why you picked .

Because it's the first case I thought of where the probability numbers work out, and I just needed one example to round off the post :-)

Why I'm co-founding Aligned AI

It's worth writing up your point and posting it - that tends to clarify the issue, for yourself as well as for others.

Why I'm co-founding Aligned AI

I've posted on the theoretical difficulties of aggregating the utilities of different agents. But doing it in practice is much more feasible (scale the utilities to some not-too-unreasonable scale, add them, maximise sum).
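As a rough illustration of the practical recipe (scale, add, maximise), here is a minimal sketch. The min-max scaling to [0, 1] is just one assumed choice of "not-too-unreasonable scale", and the function names, agents and options are hypothetical.

```python
from typing import Dict, Hashable, List

Utility = Dict[Hashable, float]  # maps each option to one agent's utility


def normalise(utility: Utility) -> Utility:
    """Rescale one agent's utilities to [0, 1] (min-max scaling, an assumed choice)."""
    lo, hi = min(utility.values()), max(utility.values())
    if hi == lo:
        # An indifferent agent contributes nothing to the comparison.
        return {option: 0.0 for option in utility}
    return {option: (value - lo) / (hi - lo) for option, value in utility.items()}


def aggregate_and_maximise(utilities: List[Utility]) -> Hashable:
    """Scale each agent's utilities, add them, and return the option with the largest sum."""
    scaled = [normalise(u) for u in utilities]
    totals = {option: sum(u[option] for u in scaled) for option in scaled[0]}
    return max(totals, key=totals.get)


# Hypothetical agents with utilities on very different raw scales.
alice = {"park": 10.0, "museum": 4.0, "cinema": 0.0}
bob = {"park": 1.0, "museum": 3.0, "cinema": 2.0}
best = aggregate_and_maximise([alice, bob])
```

The choice of scaling does real work here: a different normalisation could change which option wins, which is where the theoretical difficulties re-enter.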

But value extrapolation is different from human value aggregation; for example, low power (or low impact) AIs can be defined with value extrapolation, and that doesn't need human value aggregation.

3 David Manheim 4mo
I'm skeptical that many of the problems with aggregation don't both apply to actual individual human values once extrapolated, and generalize to AIs with closely related values, but I'd need to lay out the case for that more clearly. (I did discuss the difficulty of cooperation even given compatible goals a bit in this paper [https://deepai.org/publication/overoptimization-failures-and-specification-gaming-in-multi-agent-systems], but it's nowhere near complete in addressing this issue.)
Why I'm co-founding Aligned AI

Yes, those are important to provide, and we will.

Why I'm co-founding Aligned AI

I do not put too much weight on that intuition, except as an avenue to investigate (how do humans do it, exactly? If it depends on the social environment, can the conditions of that be replicated?).

Why I'm co-founding Aligned AI

We're aiming to solve the problem in a way that is acceptable to one given human, and then generalise from that.

2 David Manheim 4mo
This seems fragile in ways that make me less optimistic about the approach overall. We have strong reasons to think that value aggregation is intractable, and (by analogy,) in some ways the problem of coherence in CEV is the tricky part. That is, the problem of making sure that we're not Dutch book-able is, IIRC, NP-complete, and even worse, the problem of aggregating preferences has several impossibility results. Edit: To clarify, I'm excited about the approach overall, and think it's likely to be valuable, but this part seems like a big problem.
Why I'm co-founding Aligned AI

CEV is based on extrapolating the person; the values are what the person would have had, had they been smarter, known more, had more self-control, etc... Once you have defined the idealised person, the values emerge as a consequence. I've criticised this idea in the past, mainly because the process to generate the idealised person seems vulnerable to negative attractors (Eliezer's most recent version of CEV has less of this problem).

Value extrapolation and model splintering are based on extrapolating features and concepts in models, to other models. This c...

Why I'm co-founding Aligned AI

Currently UK-based; Rebecca Gorman is the other co-founder.

Why I'm co-founding Aligned AI

Firstly, because the problem feels central to AI alignment, in the way that other approaches didn't. So making progress in this is making general AI alignment progress; there won't be such a "one error detected and all the work is useless" problem. Secondly, we've had success generating some key concepts, implying the problem is ripe for further progress.

How an alien theory of mind might be unlearnable

It's an interesting question as to whether Alice is actually overconfident. Her predictions about human behaviour may be spot on, at this point - much better than human predictions about ourselves. So her confidence depends on whether she has the right kind of philosophical uncertainty.

Are there alternative to solving value transfer and extrapolation?

I actually don't think that Alice could help a (sufficiently alien) alien. She needs an alien theory of mind to understand what the alien wants, how they would extrapolate, how to help that extrapolation without manipulating it, and so on. Without that, she's just projecting human assumptions onto alien behaviour and statements.

2 Rohin Shah 7mo
Absolutely, I would think that the first order of business would be to learn that alien theory of mind (and be very conservative until that's done). Maybe you're saying that this alien theory of mind is unlearnable, even for a very intelligent Alice? That seems pretty surprising, and I don't feel the force of that intuition (despite the Occam's razor impossibility result).
General alignment plus human values, or alignment via human values?

Yes, but we would be mostly indifferent to shifts in the distribution that preserve most of the features - eg if the weather was the same but delayed or advanced by six days.

Are there alternative to solving value transfer and extrapolation?

I have some draft posts explaining some of this stuff better, I can share them privately, or hang on another month or two. :)

I'd like to see them. I'll wait for the final (posted) versions, I think.

Research Agenda v0.9: Synthesising a human's preferences into a utility function

Because our preferences are inconsistent, and if an AI says "your true preferences are ", we're likely to react by saying "no! No machine will tell me what my preferences are. My true preferences are , which are different in subtle ways".

1 Evan R. Murphy 7mo
So the subtle manipulation is to compensate for those rebellious impulses making U_H unstable? Why not just let the human have those moments and alter their U_H if that's what they think they want? Over time, then they may learn that being capricious with their AI doesn't ultimately serve them very well. But if they find out the AI is trying to manipulate them, that could make them want to rebel even more and have less trust for the AI.
General alignment plus human values, or alignment via human values?

Thanks for developing the argument. This is very useful.

The key point seems to be whether we can develop an AI that can successfully behave as a low impact AI - not as an "on balance, things are ok" AI, but a genuinely low impact AI that ensures that we don't move towards a world where our preferences might be ambiguous or underdefined.

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?

2 Steve Byrnes 8mo
Hmm,

1. I want the AI to have criteria that qualify actions as acceptable, e.g. "it pattern-matches less than 1% to 'I'm causing destruction', and it pattern-matches less than 1% to 'the supervisor wouldn't like this', and it pattern-matches less than 1% to 'I'm changing my own motivation and control systems', and … etc. etc."
2. If no action is acceptable, I want NOOP to be hardcoded as an always-acceptable default, a.k.a. "being paralyzed by indecision" in the face of a situation where all the options seem problematic. And then we humans are responsible for not putting the AI in situations where fast decisions are necessary and inaction is dangerous, like running the electric grid or driving a car. (At some point we do want an AI that can run the electric grid and drive a car etc. But maybe we can bootstrap our way there, and/or use less-powerful narrow AIs in the meantime.)
3. A failure mode of (2) is that we could get an AI that is paralyzed by indecision always, and never does anything. To avoid this failure mode, we want the AI to be able to (and motivated to) gather evidence that might show that a course of action deemed problematic is in fact acceptable after all. This would probably involve asking questions to the human supervisor.
4. A failure mode of (3) is that the AI frames the questions in order to get an answer that it wants. To avoid this failure mode, we would set things up such that the AI's normal motivation system is not in charge of choosing what words to say when querying the human [https://www.lesswrong.com/posts/frApEhpyKQAcFvbXJ/reward-is-not-enough#3__Deceptive_AGIs]. For example, maybe the AI is not really "asking a question" at all, at least not in the normal sense; instead it's sending a data-dump to the human, and the human then inspects this data-dump with interpretability tools, and makes an edit to the AI's motivation parameter
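Points 1 and 2 of this proposal (thresholded acceptability criteria, with NOOP as the hardcoded fallback) can be sketched roughly as below. This is only a toy under stated assumptions: the pattern-match scores are taken as given, and the names `acceptable`, `choose_action`, the action labels and the score values are all hypothetical. The evidence-gathering and querying steps in points 3 and 4 are not modelled.

```python
from typing import Dict, List

NOOP = "NOOP"      # hardcoded, always-acceptable default (point 2)
THRESHOLD = 0.01   # "pattern-matches less than 1%" (point 1)


def acceptable(scores: Dict[str, float], threshold: float = THRESHOLD) -> bool:
    """An action is acceptable only if every warning pattern scores below the threshold."""
    return all(score < threshold for score in scores.values())


def choose_action(candidates: List[str],
                  pattern_scores: Dict[str, Dict[str, float]]) -> str:
    """Return the first acceptable candidate (in preference order), else NOOP."""
    for action in candidates:
        if acceptable(pattern_scores[action]):
            return action
    return NOOP


# Hypothetical scores; how they would be produced is outside this sketch.
pattern_scores = {
    "rewire_grid": {"causing_destruction": 0.20, "supervisor_disapproves": 0.05},
    "water_plants": {"causing_destruction": 0.001, "supervisor_disapproves": 0.002},
}
picked = choose_action(["rewire_grid", "water_plants"], pattern_scores)
fallback = choose_action(["rewire_grid"], pattern_scores)
```

With these made-up numbers the preferred but risky action is filtered out, and when nothing passes the filter the agent falls back to doing nothing.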
General alignment plus human values, or alignment via human values?

The successor problem is important, but it assumes we have the values already.

I'm imagining algorithms designing successors with imperfect values (that they know to be imperfect). It's a somewhat different problem (though solving the classical successor problem is also important).

General alignment plus human values, or alignment via human values?

I agree there are superintelligent unconstrained AIs that can accomplish tasks (making a cup of tea) without destroying the world. But I feel such an AI would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.

Ah, so you are arguing against (3)? (And what's your stance on (1)?)

Let's say you are assigned to be Alice's personal assistant.

• Suppose Alice says "Try to help me as much as you can, while being VERY sure to avoid actions that I would regard as catastrophically bad. When in doubt, just don't do anything at all, that's always OK with me." I feel like Alice is not asking too much of you here. You'll observe her a lot, and ask her a lot of questions especially early on, and sometimes you'll fail to be useful, because helping her would require choosing among o
Force neural nets to use models, then detect these

Thanks! Lots of useful thoughts here.

AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Those are very relevant to this project, thanks. I want to see how far we can push these approaches; maybe some people you know would like to take part?

3 Rohin Shah 9mo
Hmm, you might want to reach out to CHAI folks, though I don't have a specific person in mind at the moment. (I myself am working on different things now.)
Force neural nets to use models, then detect these

Vertigo, lust, pain reactions, some fear responses, and so on, don't involve a model. Some versions of "learning that it's cold outside" don't involve a model, just looking out and shivering; the model aspect comes in when you start reasoning about what to do about it. People often drive to work without consciously modelling anything on the way.

Think model-based learning versus Q-learning. Anything that's more Q-learning is not model based.
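To make the contrast concrete, here is a minimal sketch of the two styles in a tabular setting. The state and action names, the learning rate and the discount factor are all illustrative assumptions; the first function is the standard Q-learning update, the second a one-step lookahead through an explicit model.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def q_learning_update(q: Dict[Tuple[str, str], float], state: str, action: str,
                      reward: float, next_state: str, actions: List[str],
                      alpha: float = 0.5, gamma: float = 0.9) -> None:
    """Model-free: adjust Q(s, a) directly from one experienced transition."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


def model_based_value(model: Dict[Tuple[str, str], Tuple[str, float]],
                      state: str, action: str,
                      values: Dict[str, float], gamma: float = 0.9) -> float:
    """Model-based: consult an explicit model to predict the outcome before acting."""
    next_state, reward = model[(state, action)]
    return reward + gamma * values[next_state]


actions = ["shiver", "put_on_coat"]

# Model-free: the agent only caches how good the action turned out to be.
q = defaultdict(float)
q_learning_update(q, "cold", "put_on_coat", 1.0, "warm", actions)

# Model-based: the agent reasons through a (hypothetical) predictive model.
model = {("cold", "put_on_coat"): ("warm", 1.0)}
values = {"warm": 2.0}
predicted = model_based_value(model, "cold", "put_on_coat", values)
```

In the first case nothing resembling "coats make you warm" is stored anywhere, only a number attached to a state-action pair; in the second, that prediction is an explicit object that could in principle be located and inspected.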

Force neural nets to use models, then detect these

I think the question of whether any particular plastic synapse is or is not part of the information content of the model will have a straightforward yes-or-no answer.

I don't think it has an easy yes or no answer (at least without some thought as to what constitutes a model within the mess of human reasoning) and I'm sure that even if it does, it's not straightforward.

since we probably won't have those kinds of real-time-brain-scanning technologies, right?

One hope would be that, by the time we have those technologies, we'd know what to look for.

1 Steve Byrnes 9mo
I was writing a kinda long reply but maybe I should first clarify: what do you mean by "model"? Can you give examples of ways that I could learn something (or otherwise change my synapses within a lifetime) that you wouldn't characterize as "changes to my mental model"? For example, which of the following would be "changes to my mental model"?

1. I learn that Brussels is the capital of Belgium
2. I learn that it's cold outside right now
3. I taste a new brand of soup and find that I really like it
4. I learn to ride a bicycle, including
   1. maintaining balance via fast hard-to-describe responses where I shift my body in certain ways in response to different sensations and perceptions
   2. being able to predict how the bicycle and I would move if I swung my arm around
5. I didn't sleep well so now I'm grumpy

FWIW my inclination is to say that 1-4 are all "changes to my mental model". And 5 involves both changes to my mental model (knowing that I'm grumpy), and changes to the inputs to my mental model (I feel different "feelings" than I otherwise would; I think of those as inputs going into the model, just like visual inputs go into the model). Is there anything wrong / missing / suboptimal about that definition?
What does GPT-3 understand? Symbol grounding and Chinese rooms

I have only very limited access to GPT-3; it would be interesting if others played around with my instructions, making them easier for humans to follow, while still checking that GPT-3 failed.

Non-poisonous cake: anthropic updates are normal

More SIAish for conventional anthropic problems. Other theories are more applicable for more specific situations, specific questions, and for duplicate issues.

The reverse Goodhart problem

Cheers, these are useful classifications.

The reverse Goodhart problem

The idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting the way it's used in practice.

After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".

3 G Gordon Worley III 1y
Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting. It's a bit hard to think of when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left with regret that I put in too much effort. Most examples I can think of that look like potential anti-Goodharting seem the same: I don't mind that I overshot the target, but I do mind that I wasn't as efficient as I could have been.
Introduction To The Infra-Bayesianism Sequence

I want a formalism capable of modelling and imitating how humans handle these situations, and we don't usually have dynamic consistency (nor do boundedly rational agents).

Now, I don't want to weaken requirements "just because", but it may be that dynamic consistency is too strong a requirement to properly model what's going on. It's also useful to have AIs model human changes of morality, to figure out what humans count as values, so getting closer to human reasoning would be necessary.

1 Vanessa Kosoy 1y
Boundedly rational agents definitely can have dynamic consistency, I guess it depends on just how bounded you want them to be. IIUC what you're looking for is a model that can formalize "approximately rational but doesn't necessarily satisfy any crisp desideratum". In this case, I would use something like my quantitative AIT definition of intelligence [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=Tg7A7rSYQSZPASm9s].
Introduction To The Infra-Bayesianism Sequence

Hum... how about seeing enforcement of dynamic consistency as having a complexity/computation cost, and Dutch books (by other agents or by the environment) providing incentives to pay the cost? And hence the absence of these Dutch books meaning there is little incentive to pay that cost?

Introduction To The Infra-Bayesianism Sequence

Desideratum 1: There should be a sensible notion of what it means to update a set of environments or a set of distributions, which should also give us dynamic consistency.

I'm not sure how important dynamic consistency should be. When I talk about model splintering, I'm thinking of a bounded agent making fundamental changes to their model (though possibly gradually), a process that is essentially irreversible and contingent on the circumstances of discovering new scenarios. The strongest arguments for dynamic consistency are the Dutch-book type arguments, wh...

1 Vanessa Kosoy 1y
I'm not sure why we would need a weaker requirement if the formalism already satisfies a stronger requirement? Certainly when designing concrete learning algorithms we might want to use some kind of simplified update rule, but I expect that to be contingent on the type of algorithm and design constraints. We do have some speculations in that vein, for example I suspect that, for communicating infra-MDPs, an update rule that forgets everything except the current state would only lose something like O(1−γ) expected utility.
1 Diffractor 1y
I don't know, we're hunting for it, relaxations of dynamic consistency would be extremely interesting if found, and I'll let you know if we turn up anything nifty.
Human priors, features and models, languages, and Solmonoff induction

For real humans, I think this is a more gradual process - they learn and use some distinctions, and forget others, until their mental models are quite different a few years down the line.

The splintering can happen when a single feature splinters; it doesn't have to be dramatic.

Counterfactual control incentives

Thanks. I think we mainly agree here.

Preferences and biases, the information argument

Look at the paper linked for more details (https://arxiv.org/abs/1712.05812).

Basically "humans are always fully rational and always take the action they want to" is a full explanation of all of human behaviour that is strictly simpler than any explanation which includes human biases and bounded rationality.

Model splintering: moving from one imperfect model to another

But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation

I'm more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.

Model splintering: moving from one imperfect model to another

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).

1 Koen Holtman 1y
Model splintering: moving from one imperfect model to another

Thanks! Lots of useful insights in there.

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.

3 Koen Holtman 1y
The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent's environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely. I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
Generalised models as a category

Cheers! My opinion on category theory has changed a bit, because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.

3 Koen Holtman 1y
Definitely, it has also been my experience that you can often get new insights by constructing mappings to different models or notations.
Stuart_Armstrong's Shortform

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution.

That's a $Q$ that defines $Q(A \mid B)$ for some but not all $A$ and $B$ (with $Q(A) = Q(A \mid \Omega)$ for $\Omega$ being the whole set of outcomes).

This $Q$ is a partial probability distribution iff there exists a probability distribution $P$ that is equal to $Q$ wherever $Q$ is defined. Call this $P$ a full extension of $Q$.

Suppose that $Q(C \mid D)$ is not defined. We can, however, say that $Q(C \mid D) = x$ is a logical implication of $Q$ if all full extensions $P$ have $P(C \mid D) = x$.

Eg: , , w...
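A small worked example may help; it is an illustration constructed here rather than the truncated one above. Write $Q$ for the partial distribution (the symbol used in the reply below) and $P$ for any full extension (a label chosen for this sketch), on a hypothetical three-outcome set:

$$
\Omega = \{1, 2, 3\}, \qquad Q(\{1\}) = \tfrac{1}{2}, \qquad Q(\{2\} \mid \{2, 3\}) = \tfrac{1}{2}.
$$

Any full extension $P$ must agree with both values, so

$$
P(\{2, 3\}) = 1 - P(\{1\}) = \tfrac{1}{2}, \qquad
P(\{2\}) = P(\{2\} \mid \{2, 3\}) \, P(\{2, 3\}) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}.
$$

So although $Q(\{2\})$ is not defined, $Q(\{2\}) = \tfrac{1}{4}$ is a logical implication of $Q$ in the sense described above: every full extension assigns that value.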

3 Diffractor 1y
Sounds like a special case of crisp infradistributions (ie, all partial probability distributions have a unique associated crisp infradistribution). Given some $Q$, we can consider the (nonempty) set of probability distributions equal to $Q$ where $Q$ is defined. This set is convex (clearly, a mixture of two probability distributions which agree with $Q$ about the probability of an event will also agree with $Q$ about the probability of an event). Convex (compact) sets of probability distributions = crisp infradistributions.
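The convexity claim can be checked in a couple of lines, including for conditional values. This is a sketch: $P_1$, $P_2$, $\lambda$ and $q$ are labels introduced here, and the conditional statement assumes the conditioning event has positive probability under the mixture. Suppose $P_1$ and $P_2$ both agree with $Q$ on some defined value $q = Q(A \mid B)$, i.e. $P_i(A \cap B) = q \, P_i(B)$ for $i = 1, 2$. For $\lambda \in [0, 1]$, let $P_\lambda = \lambda P_1 + (1 - \lambda) P_2$. Then

$$
P_\lambda(A \cap B) = \lambda \, q \, P_1(B) + (1 - \lambda) \, q \, P_2(B) = q \, P_\lambda(B),
$$

so $P_\lambda(A \mid B) = q$ whenever $P_\lambda(B) > 0$. The mixture therefore agrees with $Q$ wherever the two components do, which is the convexity used in the reply.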
Introduction to Cartesian Frames

I like it. I'll think about how it fits with my ways of thinking (eg model splintering).

Counterfactual control incentives

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it was specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.

1 Tom Everitt 1y
Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. (Sorry btw for slow reply; I keep missing alignmentforum notifications.)
1 Koen Holtman 1y
On recent terminology innovation: For exactly the same reason, in my own recent paper Counterfactual Planning [https://arxiv.org/abs/2102.00834], I introduced the terms direct incentive and indirect incentive, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence [https://www.alignmentforum.org/posts/BZKLf629NDNfEkZzJ/creating-agi-safety-interlocks] I develop and apply this terminology in the case of an agent emergency stop button. In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology suppresses the incentive instead of removes the incentive. I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome. Do they have some standard phrasing where they can say things like 'no value to control' while subtly reminding the reader that 'this does not imply there will be no side effects?'