All of Stuart_Armstrong's Comments + Replies

How an alien theory of mind might be unlearnable

It's an interesting question whether Alice is actually overconfident. Her predictions about human behaviour may be spot on at this point - much better than humans' predictions about ourselves. So her confidence depends on whether she has the right kind of philosophical uncertainty.

Are there alternative to solving value transfer and extrapolation?

I actually don't think that Alice could help a (sufficiently alien) alien. She needs an alien theory of mind to understand what the alien wants, how they would extrapolate, how to help that extrapolation without manipulating it, and so on. Without that, she's just projecting human assumptions onto alien behaviour and statements.

Rohin Shah (2mo): Absolutely, I would think that the first order of business would be to learn that alien theory of mind (and be very conservative until that's done). Maybe you're saying that this alien theory of mind is unlearnable, even for a very intelligent Alice? That seems pretty surprising, and I don't feel the force of that intuition (despite the Occam's razor impossibility result).
General alignment plus human values, or alignment via human values?

Yes, but we would be mostly indifferent to shifts in the distribution that preserve most of the features - eg if the weather was the same but delayed or advanced by six days.

Are there alternative to solving value transfer and extrapolation?

I have some draft posts explaining some of this stuff better; I can share them privately, or hang on for another month or two. :)

I'd like to see them. I'll wait for the final (posted) versions, I think.

Research Agenda v0.9: Synthesising a human's preferences into a utility function

Because our preferences are inconsistent, and if an AI says "your true preferences are U_H", we're likely to react by saying "no! No machine will tell me what my preferences are. My true preferences are U_H′, which are different in subtle ways".

Evan R. Murphy (2mo): So the subtle manipulation is to compensate for those rebellious impulses making U_H unstable? Why not just let the human have those moments and alter their U_H if that's what they think they want? Over time, then they may learn that being capricious with their AI doesn't ultimately serve them very well. But if they find out the AI is trying to manipulate them, that could make them want to rebel even more and have less trust for the AI.
General alignment plus human values, or alignment via human values?

Thanks for developing the argument. This is very useful.

The key point seems to be whether we can develop an AI that can successfully behave as a low impact AI - not merely an "on balance, things are ok" AI, but a genuinely low impact AI that ensures we don't move towards a world where our preferences might be ambiguous or underdefined.

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?

Steve Byrnes (3mo): Hmm,

1. I want the AI to have criteria that qualify actions as acceptable, e.g. "it pattern-matches less than 1% to 'I'm causing destruction', and it pattern-matches less than 1% to 'the supervisor wouldn't like this', and it pattern-matches less than 1% to 'I'm changing my own motivation and control systems', and … etc. etc."
2. If no action is acceptable, I want NOOP to be hardcoded as an always-acceptable default—a.k.a. "being paralyzed by indecision" in the face of a situation where all the options seem problematic. And then we humans are responsible for not putting the AI in situations where fast decisions are necessary and inaction is dangerous, like running the electric grid or driving a car. (At some point we do want an AI that can run the electric grid and drive a car etc. But maybe we can bootstrap our way there, and/or use less-powerful narrow AIs in the meantime.)
3. A failure mode of (2) is that we could get an AI that is paralyzed by indecision always, and never does anything. To avoid this failure mode, we want the AI to be able to (and motivated to) gather evidence that might show that a course of action deemed problematic is in fact acceptable after all. This would probably involve asking questions to the human supervisor.
4. A failure mode of (3) is that the AI frames the questions in order to get an answer that it wants. To avoid this failure mode, we would set things up such that the AI's normal motivation system is not in charge of choosing what words to say when querying the human [https://www.lesswrong.com/posts/frApEhpyKQAcFvbXJ/reward-is-not-enough#3__Deceptive_AGIs]. For example, maybe the AI is not really "asking a question" at all, at least not in the normal sense; instead it's sending a data-dump to the human, and the human then inspects this data-dump with interpretability tools, and makes an edit to the AI's motivation parameter
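Purely as an illustration of points 1-2 in the comment above, here is a minimal sketch of a thresholded acceptability check with a hardcoded NOOP fallback. The pattern names, thresholds, and the `pattern_match` and `value` functions are hypothetical stand-ins, not anything specified in the comment:

```python
# Hypothetical sketch of the accept-or-NOOP filter (points 1-2 above).
# The patterns, thresholds and scoring/value functions are illustrative only.
from typing import Callable, Dict, List

NOOP = "NOOP"  # always-acceptable default: do nothing

# An action is unacceptable if it pattern-matches above 1% to any flagged concept.
THRESHOLDS: Dict[str, float] = {
    "I'm causing destruction": 0.01,
    "the supervisor wouldn't like this": 0.01,
    "I'm changing my own motivation and control systems": 0.01,
}

def is_acceptable(action: str, pattern_match: Callable[[str, str], float]) -> bool:
    """Acceptable only if every flagged pattern scores below its threshold."""
    return all(pattern_match(action, p) < t for p, t in THRESHOLDS.items())

def choose_action(candidates: List[str],
                  pattern_match: Callable[[str, str], float],
                  value: Callable[[str], float]) -> str:
    """Pick the highest-value acceptable action; if none qualifies, default to NOOP."""
    acceptable = [a for a in candidates if is_acceptable(a, pattern_match)]
    if not acceptable:
        return NOOP  # "paralyzed by indecision" rather than taking a risky action
    return max(acceptable, key=value)
```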
General alignment plus human values, or alignment via human values?

The successor problem is important, but it assumes we have the values already.

I'm imagining algorithms designing successors with imperfect values (that they know to be imperfect). It's a somewhat different problem (though solving the classical successor problem is also important).

General alignment plus human values, or alignment via human values?

I agree there are superintelligent unconstrained AIs that can accomplish tasks (making a cup of tea) without destroying the world. But I feel such an AI would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little more would be needed to define full alignment.

Ah, so you are arguing against (3)? (And what's your stance on (1)?)

Let's say you are assigned to be Alice's personal assistant.

  • Suppose Alice says "Try to help me as much as you can, while being VERY sure to avoid actions that I would regard as catastrophically bad. When in doubt, just don't do anything at all, that's always OK with me." I feel like Alice is not asking too much of you here. You'll observe her a lot, and ask her a lot of questions especially early on, and sometimes you'll fail to be useful, because helping her would require choosing among o
... (read more)
AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Those are very relevant to this project, thanks. I want to see how far we can push these approaches; maybe some people you know would like to take part?

Rohin Shah (4mo): Hmm, you might want to reach out to CHAI folks, though I don't have a specific person in mind at the moment. (I myself am working on different things now.)
Force neural nets to use models, then detect these

Vertigo, lust, pain reactions, some fear responses, and so on, don't involve a model. Some versions of "learning that it's cold outside" don't involve a model, just looking out and shivering; the model aspect comes in when you start reasoning about what to do about it. People often drive to work without consciously modelling anything on the way.

Think model-based learning versus Q-learning. Anything that's closer to Q-learning is not model-based.
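To make the distinction concrete, here is a toy sketch (my own illustration, not from the post): the model-free Q-learner stores only a table of action values, while the model-based learner stores an explicit transition/reward model and plans against it.

```python
# Toy sketch contrasting model-free and model-based learning.
# States, actions and rewards here are placeholders.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9

# --- Model-free (Q-learning): no model of the environment is stored ---
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions):
    """One temporal-difference update on the action-value table."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# --- Model-based: learn transition counts and rewards, then plan by rollout ---
counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: visit count}
rewards = defaultdict(float)                     # (s, a) -> last observed reward

def model_update(s, a, r, s_next):
    """Store the experience in an explicit world model."""
    counts[(s, a)][s_next] += 1
    rewards[(s, a)] = r

def plan_value(s, actions, depth=3):
    """Estimate the value of s by rolling the learned model forward."""
    if depth == 0:
        return 0.0
    best = 0.0
    for a in actions:
        total = sum(counts[(s, a)].values())
        if total == 0:
            continue
        expected_next = sum(n / total * plan_value(s2, actions, depth - 1)
                            for s2, n in counts[(s, a)].items())
        best = max(best, rewards[(s, a)] + GAMMA * expected_next)
    return best
```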

Force neural nets to use models, then detect these

I think the question of whether any particular plastic synapse is or is not part of the information content of the model will have a straightforward yes-or-no answer.

I don't think it has an easy yes or no answer (at least without some thought as to what constitutes a model within the mess of human reasoning) and I'm sure that even if it does, it's not straightforward.

since we probably won't have those kinds of real-time-brain-scanning technologies, right?

One hope would be that, by the time we have those technologies, we'd know what to look for.

Steve Byrnes (4mo): I was writing a kinda long reply but maybe I should first clarify: what do you mean by "model"? Can you give examples of ways that I could learn something (or otherwise change my synapses within a lifetime) that you wouldn't characterize as "changes to my mental model"? For example, which of the following would be "changes to my mental model"?

1. I learn that Brussels is the capital of Belgium
2. I learn that it's cold outside right now
3. I taste a new brand of soup and find that I really like it
4. I learn to ride a bicycle, including
   1. maintaining balance via fast hard-to-describe responses where I shift my body in certain ways in response to different sensations and perceptions
   2. being able to predict how the bicycle and me would move if I swung my arm around
5. I didn't sleep well so now I'm grumpy

FWIW my inclination is to say that 1-4 are all "changes to my mental model". And 5 involves both changes to my mental model (knowing that I'm grumpy), and changes to the inputs to my mental model (I feel different "feelings" than I otherwise would—I think of those as inputs going into the model, just like visual inputs go into the model). Is there anything wrong / missing / suboptimal about that definition?
What does GPT-3 understand? Symbol grounding and Chinese rooms

I have only very limited access to GPT-3; it would be interesting if others played around with my instructions, making them easier for humans to follow, while still checking that GPT-3 failed.

Non-poisonous cake: anthropic updates are normal

More SIA-ish for conventional anthropic problems. Other theories are more applicable to more specific situations and questions, and to issues involving duplicates.

The reverse Goodhart problem

Cheers, these are useful classifications.

The reverse Goodhart problem

The idea that maximising the proxy will inevitably end up reducing the true utility seems to be a strong implicit part of Goodharting, the way the term is used in practice.

After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".

G Gordon Worley III (8mo): Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting. It's a bit hard to think when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left with regret that I put in too much effort. Most examples I can think of that look like potential anti-Goodharting seem the same: I don't mind that I overshot the target, but I do mind that I wasn't as efficient as I could have been.
Introduction To The Infra-Bayesianism Sequence

I want a formalism capable of modelling and imitating how humans handle these situations, and we don't usually have dynamic consistency (nor do boundedly rational agents).

Now, I don't want to weaken requirements "just because", but it may be that dynamic consistency is too strong a requirement to properly model what's going on. It's also useful to have AIs model human changes of morality, to figure out what humans count as values, so getting closer to human reasoning would be necessary.

Vanessa Kosoy (8mo): Boundedly rational agents definitely can have dynamic consistency, I guess it depends on just how bounded you want them to be. IIUC what you're looking for is a model that can formalize "approximately rational but doesn't necessarily satisfy any crisp desideratum". In this case, I would use something like my quantitative AIT definition of intelligence [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=Tg7A7rSYQSZPASm9s].
Introduction To The Infra-Bayesianism Sequence

Hum... how about seeing enforcement of dynamic consistency as having a complexity/computation cost, and Dutch books (by other agents or by the environment) providing incentives to pay the cost? And hence the absence of these Dutch books meaning there is little incentive to pay that cost?

Introduction To The Infra-Bayesianism Sequence

Desideratum 1: There should be a sensible notion of what it means to update a set of environments or a set of distributions, which should also give us dynamic consistency.

I'm not sure how important dynamic consistency should be. When I talk about model splintering, I'm thinking of a bounded agent making fundamental changes to their model (though possibly gradually), a process that is essentially irreversible and contingent on the circumstances of discovering new scenarios. The strongest arguments for dynamic consistency are the Dutch-book type arguments, wh... (read more)

Vanessa Kosoy (8mo): I'm not sure why we would need a weaker requirement if the formalism already satisfies a stronger requirement? Certainly when designing concrete learning algorithms we might want to use some kind of simplified update rule, but I expect that to be contingent on the type of algorithm and design constraints. We do have some speculations in that vein, for example I suspect that, for communicating infra-MDPs, an update rule that forgets everything except the current state would only lose something like O(1−γ) expected utility.
Diffractor (9mo): I don't know, we're hunting for it, relaxations of dynamic consistency would be extremely interesting if found, and I'll let you know if we turn up with anything nifty.
Human priors, features and models, languages, and Solomonoff induction

For real humans, I think this is a more gradual process - they learn and use some distinctions, and forget others, until their mental models are quite different a few years down the line.

The splintering can happen when a single feature splinters; it doesn't have to be dramatic.

Counterfactual control incentives

Thanks. I think we mainly agree here.

Preferences and biases, the information argument

Look at the paper linked for more details ( https://arxiv.org/abs/1712.05812 ).

Basically "humans are always fully rational and always take the action they want to" is a full explanation of all of human behaviour, that is strictly simpler than any explanation which includes human biases and bounded rationality.

Model splintering: moving from one imperfect model to another

But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation

I'm more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.

Model splintering: moving from one imperfect model to another

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).

Koen Holtman (1y): OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns or self-improves. This is a subcase that can be expressed in your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimism with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are. I might phrase this question as: If the humans start to splinter their idea of what a certain kind of morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned?

The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper [https://papers.nips.cc/paper/7253-inverse-reward-design.pdf] comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like 'ask the supervisor' or 'slow down' or 'avoid going into that part of the state space'. The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I'd like to ask you: what is your sense of optimism or pessimism in this area?
Model splintering: moving from one imperfect model to another

Thanks! Lots of useful insights in there.

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.

Koen Holtman (1y): The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent's environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely. I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
Generalised models as a category

Cheers! My opinion on category theory has changed a bit, because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.

Koen Holtman (1y): Definitely, it has also been my experience that you can often get new insights by constructing mappings to different models or notations.
Stuart_Armstrong's Shortform

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution.

That's a function Q that defines Q(A|B) for some, but not all, events A and B (with Q(A) being Q(A|Ω), for Ω the whole set of outcomes).

This Q is a partial probability distribution iff there exists a probability distribution P that is equal to Q wherever Q is defined. Call this P a full extension of Q.

Suppose that Q(A|B) is not defined. We can, however, say that Q(A|B)=p is a logical implication of Q if all full extensions P have P(A|B)=p.

Eg: , , w... (read more)

Diffractor (10mo): Sounds like a special case of crisp infradistributions (ie, all partial probability distributions have a unique associated crisp infradistribution). Given some Q, we can consider the (nonempty) set of probability distributions equal to Q where Q is defined. This set is convex (clearly, a mixture of two probability distributions which agree with Q about the probability of an event will also agree with Q about the probability of an event). Convex (compact) sets of probability distributions = crisp infradistributions.
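As a minimal sketch of how one might check the "full extension exists" condition computationally for a finite outcome set (my own illustration, not from the shortform or the reply; the outcome set and the constraint values are made up): encode each defined value of Q as a linear constraint on a candidate distribution P and test feasibility.

```python
# Check whether a partial assignment Q has a full extension P over a finite
# outcome set, by posing linear constraints and testing feasibility.
import numpy as np
from scipy.optimize import linprog

outcomes = ["w1", "w2", "w3", "w4"]              # hypothetical outcome set
idx = {w: i for i, w in enumerate(outcomes)}
n = len(outcomes)

unconditional = [({"w1", "w2"}, 0.5)]            # Q(A) = q
conditional = [({"w1"}, {"w1", "w2"}, 0.4)]      # Q(A|B) = q

A_eq, b_eq = [np.ones(n)], [1.0]                 # probabilities sum to 1
for A, q in unconditional:                       # sum_{w in A} p_w = q
    row = np.zeros(n)
    for w in A:
        row[idx[w]] = 1.0
    A_eq.append(row); b_eq.append(q)
for A, B, q in conditional:                      # sum_{A∩B} p_w - q * sum_{B} p_w = 0
    row = np.zeros(n)
    for w in B:
        row[idx[w]] = -q
    for w in A & B:
        row[idx[w]] += 1.0
    A_eq.append(row); b_eq.append(0.0)

# Zero objective: we only care whether the constraint set is feasible.
res = linprog(c=np.zeros(n), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, 1)] * n)
print("Q has a full extension:", res.success)
```

Running the same check with an added constraint fixing P(A|B)=p, and seeing whether only a single value of p remains feasible, would test the "logical implication" notion above.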
Introduction to Cartesian Frames

I like it. I'll think about how it fits with my ways of thinking (eg model splintering).

Counterfactual control incentives

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it was specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.

Tom Everitt (10mo): Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. (Sorry btw for slow reply; I keep missing alignmentforum notifications.)
Koen Holtman (1y): On recent terminology innovation: For exactly the same reason, in my own recent paper Counterfactual Planning [https://arxiv.org/abs/2102.00834], I introduced the terms direct incentive and indirect incentive, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence [https://www.alignmentforum.org/posts/BZKLf629NDNfEkZzJ/creating-agi-safety-interlocks] I develop and apply this terminology in the case of an agent emergency stop button.

In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology suppresses the incentive instead of removes the incentive. I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome. Do they have some standard phrasing where they can say things like 'no value to control' while subtly reminding the reader that 'this does not imply there will be no side effects'?
AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)

Cheers, that would be very useful.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)

I feel that the whole AI alignment problem can be seen as problems with ontological shifts: https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1

Rohin Shah (1y): I think I agree at least that many problems can be seen this way, but I suspect that other framings are more useful for solutions. (I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.) What I was claiming in the sentence you quoted was that I don't see ontological shifts as a huge additional category of problem that isn't covered by other problems, which is compatible with saying that ontological shifts can also represent many other problems.
Extortion beats brinksmanship, but the audience matters

The connection to AI alignment is combining the different utilities of different entities without extortion ruining the combination, and dealing with threats and acausal trade.

Extortion beats brinksmanship, but the audience matters

That's a misspelling that's entirely my fault, and has now been corrected.

Extortion beats brinksmanship, but the audience matters

(1) You say that releasing nude photos is in the blackmail category. But who's the audience?

The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.

(2) For n=1, m large: Is an example of brinkmanship here a monopolistic buyer who will only choose suppliers giving cut-rate prices?

Interesting example that I hadn't really considered. I'd say it fits more under extortion than brinksmanship, though. A small supplier has to sell, or they won't stay in business. If there's a single buyer, "I won't buy ... (read more)

romeostevensit (1y): Releasing one photo from a previously-believed-to-be-secure set of photos, where other photos in the same set are compromising, can suffice for the single-member-audience case.
Anthropomorphisation vs value learning: type 1 vs type 2 errors

A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".

Steve Byrnes (1y): Gotcha, thanks. I have corrected my comment two above [https://www.lesswrong.com/posts/LkytHQSKbQFf6toW5/anthropomorphisation-vs-value-learning-type-1-vs-type-2?commentId=aN6BRpDtqC6pHg7eG#2jhXuQhx7d2qPBKx5] by striking out the words "boundedly-rational", but I think the point of that comment still stands.
Anthropomorphisation vs value learning: type 1 vs type 2 errors

that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

Steve Byrnes (1y): Sorry for the stupid question, but what's the difference between "boundedly-rational agent pursuing a reward function" and "any sort of agent pursuing a reward function"?
Anthropomorphisation vs value learning: type 1 vs type 2 errors

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on it - even if you know every physical... (read more)

Steve Byrnes (1y): It's your first day working at the factory, and you're assigned to shadow Alice as she monitors the machines on the line. She walks over to the Big Machine and says, "Looks like it's flooping again," whacks it, and then says "I think that fixed it". This happens a few times a day perpetually. Over time, you learn what flooping is, kinda. When the Big Machine is flooping, it usually (but not always) makes a certain noise, it usually (but not always) has a loose belt, and it usually (but not always) has a gear that shifted out of place. Now you know what it means for the Big Machine to be flooping, although there are lots of edge cases where neither you nor Alice has a good answer for whether or not it's truly flooping, vs sorta flooping, vs not flooping.

By the same token, you could give some labeled examples of "wants to take a walk" to the aliens, and they can find what those examples have in common and develop a concept of "wants to take a walk", albeit with edge cases. Then you can also give labeled examples of "wants to braid their hair", "wants to be accepted", etc., and after enough cycles of this, they'll get the more general concept of "want", again with edge cases.

I don't think I'm saying anything that goes against your Occam's razor paper. As I understood it (and you can correct me!!), that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function", and proved that there's no objectively best way to do it, where "objectively best" includes things like fidelity and simplicity. (My perspective on that is, "Well yeah, duh, humans are not boundedly-rational agents pursuing a utility function! The model doesn't fit! There's no objectively best way to hammer a square peg into a round hole! (ETA: the model doesn't fit except insofar as the model is tautologically applicable to anything [https://www.lesswrong.com/s/4dHMdK5TLN6xcqtyc/p/NxF5G6CJiof6cemTw])") I don't see how the paper rules o
Learning human preferences: black-box, white-box, and structured white-box access

Humans have a theory of mind that makes certain types of modularization easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.

Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements, are easy to deduce (there's an informal equivalence result; if one of those is easy to deduce, all the others are).

So we need to figure out if we're in the optimistic or the pessimistic scenario.
