All of Stuart_Armstrong's Comments + Replies

Non-poisonous cake: anthropic updates are normal

More SIAish for conventional anthropic problems. Other theories are more applicable for more specific situations, specific questions, and for duplicate issues.

The reverse Goodhart problem

Cheers, these are useful classifications.

The reverse Goodhart problem

The idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting the way it's used in practice.

After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".

[3] G Gordon Worley III, 2mo: Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting. It's a bit hard to think of when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left with regret that I put in too much effort. Most examples I can think of that look like potential anti-Goodharting seem the same: I don't mind that I overshot the target, but I do mind that I wasn't as efficient as I could have been.
Introduction To The Infra-Bayesianism Sequence

I want a formalism capable of modelling and imitating how humans handle these situations, and we don't usually have dynamic consistency (nor do boundedly rational agents).

Now, I don't want to weaken requirements "just because", but it may be that dynamic consistency is too strong a requirement to properly model what's going on. It's also useful to have AIs model human changes of morality, to figure out what humans count as values, so getting closer to human reasoning would be necessary.

[1] Vanessa Kosoy, 2mo: Boundedly rational agents definitely can have dynamic consistency; I guess it depends on just how bounded you want them to be. IIUC what you're looking for is a model that can formalize "approximately rational but doesn't necessarily satisfy any crisp desideratum". In this case, I would use something like my quantitative AIT definition of intelligence.
Introduction To The Infra-Bayesianism Sequence

Hum... how about seeing enforcement of dynamic consistency as having a complexity/computation cost, and Dutch books (by other agents or by the environment) providing incentives to pay the cost? And hence the absence of these Dutch books meaning there is little incentive to pay that cost?

Introduction To The Infra-Bayesianism Sequence

Desideratum 1: There should be a sensible notion of what it means to update a set of environments or a set of distributions, which should also give us dynamic consistency.

I'm not sure how important dynamic consistency should be. When I talk about model splintering, I'm thinking of a bounded agent making fundamental changes to their model (though possibly gradually), a process that is essentially irreversible and contingent on the circumstance of discovering new scenarios. The strongest arguments for dynamic consistency are the Dutch-book type arguments, wh...

[1] Vanessa Kosoy, 2mo: I'm not sure why we would need a weaker requirement if the formalism already satisfies a stronger requirement. Certainly when designing concrete learning algorithms we might want to use some kind of simplified update rule, but I expect that to be contingent on the type of algorithm and design constraints. We do have some speculations in that vein; for example, I suspect that, for communicating infra-MDPs, an update rule that forgets everything except the current state would only lose something like O(1−γ) expected utility.
[1] Diffractor, 3mo: I don't know; we're hunting for it. Relaxations of dynamic consistency would be extremely interesting if found, and I'll let you know if we turn up anything nifty.
Human priors, features and models, languages, and Solomonoff induction

For real humans, I think this is a more gradual process - they learn and use some distinctions, and forget others, until their mental models are quite different a few years down the line.

The splintering can happen when a single feature splinters; it doesn't have to be dramatic.

Counterfactual control incentives

Thanks. I think we mainly agree here.

Preferences and biases, the information argument

Look at the linked paper for more details.

Basically "humans are always fully rational and always take the action they want to" is a full explanation of all of human behaviour, that is strictly simpler than any explanation which includes human biases and bounded rationality.
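
A toy sketch of why behaviour alone can't separate these explanations (this is an illustration only, not code from the paper; the states, actions, and planner names are all hypothetical). Three different planner/reward pairs reproduce exactly the same observed policy, so the observations by themselves don't tell us which pair is "the" human:

```python
# Toy illustration: several (planner, reward) pairs that all yield the
# same observed policy. Every name here is a hypothetical stand-in.

ACTIONS = ["tea", "coffee"]
STATES = ["morning", "evening"]

# The behaviour we observe and want to explain.
observed_policy = {"morning": "coffee", "evening": "tea"}


def rational_planner(reward):
    """Pick the reward-maximising action in each state."""
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def anti_rational_planner(reward):
    """Pick the reward-minimising action in each state."""
    return {s: min(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def indifferent_planner(reward):
    """Ignore the reward entirely and just emit the observed behaviour."""
    return dict(observed_policy)


# Reward 1 for whatever the human actually does, 0 otherwise.
reward = {
    (s, a): 1.0 if observed_policy[s] == a else 0.0
    for s in STATES
    for a in ACTIONS
}
neg_reward = {key: -value for key, value in reward.items()}
zero_reward = {key: 0.0 for key in reward}

explanations = {
    "fully rational, reward R": rational_planner(reward),
    "fully anti-rational, reward -R": anti_rational_planner(neg_reward),
    "indifferent, reward 0": indifferent_planner(zero_reward),
}

# All three decompositions predict the same behaviour.
all_match = all(policy == observed_policy for policy in explanations.values())
print(all_match)
```

Choosing between these requires some extra assumption beyond fit to behaviour, and a simplicity prior won't do that job if the "fully rational" pair is among the simplest.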

Model splintering: moving from one imperfect model to another

But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation

I'm more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.

Model splintering: moving from one imperfect model to another

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).

[1] Koen Holtman, 5mo: OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns or self-improves. This is a subcase that can be expressed by your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimism with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are. I might phrase this question as: If the humans start to splinter their idea of what a certain kind of morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned? The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like 'ask the supervisor' or 'slow down' or 'avoid going into that part of the state space'. The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I'd like to ask you: what is your sense of optimism or pessimism in this area?
Model splintering: moving from one imperfect model to another

Thanks! Lots of useful insights in there.

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.

[3] Koen Holtman, 5mo: The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent's environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely. I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
Generalised models as a category

Cheers! My opinion on category theory has changed a bit, because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.

[3] Koen Holtman, 5mo: Definitely, it has also been my experience that you can often get new insights by constructing mappings to different models or notations.
Stuart_Armstrong's Shortform

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution.

That's a Q that defines Q(A | B) for some but not all A and B (with Q(A) = Q(A | Ω), for Ω being the whole set of outcomes).

This is a partial probability distribution iff there exists a probability distribution P that is equal to Q wherever Q is defined. Call this P a full extension of Q.

Suppose that Q(C | D) is not defined. We can, however, say that Q(C | D) = x is a logical implication of Q if every full extension P has P(C | D) = x.

Eg: , , w...

[3] Diffractor, 4mo: Sounds like a special case of crisp infradistributions (ie, all partial probability distributions have a unique associated crisp infradistribution). Given some Q, we can consider the (nonempty) set of probability distributions equal to Q where Q is defined. This set is convex (clearly, a mixture of two probability distributions which agree with Q about the probability of an event will also agree with Q about the probability of an event). Convex (compact) sets of probability distributions = crisp infradistributions.
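
The convexity claim can be checked on a small case. What follows is a minimal sketch under simplifying assumptions: a four-element outcome set, a Q that is defined on a single unconditional event, and hand-picked full extensions. The names `OUTCOMES`, `is_full_extension`, and `mix` are illustrative, not from the comment above.

```python
from fractions import Fraction as F

# Hypothetical finite outcome set.
OUTCOMES = ["a", "b", "c", "d"]

# A partial assignment Q: it only specifies the probability of one event.
A = frozenset({"a", "b"})
Q = {A: F(1, 2)}


def prob(P, event):
    """Probability of an event (a set of outcomes) under distribution P."""
    return sum(P[w] for w in event)


def is_full_extension(P, Q):
    """P is a probability distribution that agrees with Q wherever Q is defined."""
    valid = all(p >= 0 for p in P.values()) and sum(P.values()) == 1
    return valid and all(prob(P, event) == q for event, q in Q.items())


def mix(P1, P2, lam):
    """Mixture lam * P1 + (1 - lam) * P2, computed outcome by outcome."""
    return {w: lam * P1[w] + (1 - lam) * P2[w] for w in OUTCOMES}


# Two different full extensions of Q.
P1 = {"a": F(1, 2), "b": F(0), "c": F(1, 2), "d": F(0)}
P2 = {"a": F(1, 4), "b": F(1, 4), "c": F(0), "d": F(1, 2)}

mixture = mix(P1, P2, F(1, 3))

# The mixture is again a full extension, as the convexity argument says.
print(is_full_extension(P1, Q), is_full_extension(P2, Q), is_full_extension(mixture, Q))

# The extensions disagree about the single outcome "a", so Q does not pin
# that probability down; they agree on the complement of A.
not_A = frozenset(OUTCOMES) - A
print(prob(P1, {"a"}), prob(P2, {"a"}))
print(prob(P1, not_A), prob(P2, not_A), prob(mixture, not_A))
```

Exact fractions are used so the equality checks are not affected by floating-point rounding. Conditional constraints of the form P(A ∩ B) = q · P(B) are also linear in P, so the same mixture argument should carry over, though that case is not exercised here.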
Introduction to Cartesian Frames

I like it. I'll think about how it fits with my ways of thinking (eg model splintering).

Counterfactual control incentives

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it was specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.

[1] Tom Everitt, 4mo: Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. (Sorry btw for slow reply; I keep missing alignmentforum notifications.)
[1] Koen Holtman, 5mo: On recent terminology innovation: For exactly the same reason, in my own recent paper Counterfactual Planning, I introduced the terms direct incentive and indirect incentive, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence I develop and apply this terminology in the case of an agent emergency stop button. In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology suppresses the incentive instead of removes the incentive. I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome. Do they have some standard phrasing where they can say things like 'no value to control' while subtly reminding the reader that 'this does not imply there will be no side effects'?
AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)

Cheers, that would be very useful.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)

I feel that the whole AI alignment problem can be seen as problems with ontological shifts:

[4] Rohin Shah, 7mo: I think I agree at least that many problems can be seen this way, but I suspect that other framings are more useful for solutions. (I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.) What I was claiming in the sentence you quoted was that I don't see ontological shifts as a huge additional category of problem that isn't covered by other problems, which is compatible with saying that ontological shifts can also represent many other problems.
Extortion beats brinksmanship, but the audience matters

The connection to AI alignment is combining the different utilities of different entities without extortion ruining the combination, and dealing with threats and acausal trade.

Extortion beats brinksmanship, but the audience matters

That's a misspelling that's entirely my fault, and has now been corrected.

Extortion beats brinksmanship, but the audience matters

(1) You say that releasing nude photos is in the blackmail category. But who's the audience?

The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.

(2) For n=1, m large: Is an example of brinkmanship here a monopolistic buyer who will only choose suppliers giving cut-rate prices?

Interesting example that I hadn't really considered. I'd say it fits more under extortion than brinksmanship, though. A small supplier has to sell, or they won't stay in business. If there's a single buyer, "I won't buy ...

[2] romeostevensit, 8mo: Releasing one photo from a set previously believed to be secure, where other photos in the same set are compromising, can suffice for the single-member audience case.
Anthropomorphisation vs value learning: type 1 vs type 2 errors

A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".

[1] Steve Byrnes, 10mo: Gotcha, thanks. I have corrected my comment two above by striking out the words "boundedly-rational", but I think the point of that comment still stands.
Anthropomorphisation vs value learning: type 1 vs type 2 errors

that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

[2] Steve Byrnes, 10mo: Sorry for the stupid question, but what's the difference between "boundedly-rational agent pursuing a reward function" and "any sort of agent pursuing a reward function"?
Anthropomorphisation vs value learning: type 1 vs type 2 errors

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on it - even if you know every physical...

[1] Steve Byrnes, 10mo: It's your first day working at the factory, and you're assigned to shadow Alice as she monitors the machines on the line. She walks over to the Big Machine and says, "Looks like it's flooping again," whacks it, and then says "I think that fixed it". This happens a few times a day perpetually. Over time, you learn what flooping is, kinda. When the Big Machine is flooping, it usually (but not always) makes a certain noise, it usually (but not always) has a loose belt, and it usually (but not always) has a gear that shifted out of place. Now you know what it means for the Big Machine to be flooping, although there are lots of edge cases where neither you nor Alice has a good answer for whether or not it's truly flooping, vs sorta flooping, vs not flooping. By the same token, you could give some labeled examples of "wants to take a walk" to the aliens, and they can find what those examples have in common and develop a concept of "wants to take a walk", albeit with edge cases. Then you can also give labeled examples of "wants to braid their hair", "wants to be accepted", etc., and after enough cycles of this, they'll get the more general concept of "want", again with edge cases. I don't think I'm saying anything that goes against your Occam's razor paper. As I understood it (and you can correct me!!), that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function", and proved that there's no objectively best way to do it, where "objectively best" includes things like fidelity and simplicity. (My perspective on that is, "Well yeah, duh, humans are not boundedly-rational agents pursuing a utility function! The model doesn't fit! There's no objectively best way to hammer a square peg into a round hole! (ETA: the model doesn't fit except insofar as the model is tautologically applicable to anything)") I don't see how the paper rules o
Learning human preferences: black-box, white-box, and structured white-box access

Humans have a theory of mind, that makes certain types of modularizations easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.

Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements, are easy to deduce (there's an informal equivalence result; if one of those is easy to deduce, all the others are).

So we need to figure out if we're in the optimistic or the pessimistic scenario.

Learning human preferences: black-box, white-box, and structured white-box access

My understanding of the OP was that there is a robot [...]

That understanding is correct.

Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?

I agree that preferences is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but it's also po...

Learning human preferences: black-box, white-box, and structured white-box access

modularization is super helpful for simplifying things.

The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).

[1] John Maxwell, 1y: Let's say I'm trying to describe a hockey game. Modularizing the preferences from other aspects of the team algorithm makes it much easier to describe what happens at the start of the second period, when the two teams switch sides. The fact that humans find an abstraction useful is evidence that an AI will as well. The notion that agents have preferences helps us predict how people will change their plans for achieving their goals when they receive new information. Same for an AI.
Learning human preferences: black-box, white-box, and structured white-box access

but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?

Then isn't that just a model at another level, a (labelled) model in the heads of the onlookers?

[1] G Gordon Worley III, 1y: Any model is going to be in the head of some onlooker. This is the tough part about the white box approach: it's always an inference about what's "really" going on. Of course, this is true even of the boundaries of black boxes, so it's a fully general problem. And I think that suggests it's not a problem except insofar as we have normal problems setting up correspondence between map and territory.
[1] Sammy Martin, 1y: Glad you think so! I think that methods like using multiple information sources might be a useful way to reduce the number of (potentially mistaken) normative assumptions you need in order to model a single human's preferences. The other area of human preference learning where you seem, inevitably, to need a lot of strong normative assumptions is in preference aggregation. If we assume we have elicited the preferences of lots of individual humans, and we're then trying to aggregate their preferences (with each human's preference represented by a separate model) I think the same basic principle applies, that you can reduce the normative assumptions you need by using a more complicated voting mechanism, in this case one that considers agents' ability to vote strategically as an opportunity to reach stable outcomes. I talk about this idea here. As with using approval/actions to improve the elicitation of an individual's preferences, you can't avoid making any normative assumptions by using a more complicated aggregation method, but perhaps you end up having to make fewer of them. Very speculatively, if you can combine a robust method of eliciting preferences with few inbuilt assumptions with a similarly robust method of aggregating preferences, you're on your way to a full solution to ambitious value learning.
The ground of optimization

Very good. A lot of potential there, I feel.

Dynamic inconsistency of the inaction and initial state baseline

Why do the absolute values cancel?

Because , so you can remove the absolute values.

Tradeoff between desirable properties for baseline choices in impact measures

I also think the pedestrian example illustrates why we need more semantic structure: "pedestrian alive" -> "pedestrian dead" is bad, but "pigeon on road" -> "pigeon in flight" is fine.

[3] Vika, 1y: I don't think the pedestrian example shows a need for semantic structure. The example is intended to illustrate that an agent with the stepwise inaction baseline has no incentive to undo the delayed effect that it has set up. We want the baseline to incentivize the agent to undo any delayed effect, whether it involves hitting a pedestrian or making a pigeon fly. The pedestrian and pigeon effects differ in the magnitude of impact, so it is the job of the deviation measure to distinguish between them and penalize the pedestrian effect more. Optionality-based deviation measures (AU and RR) capture this distinction because hitting the pedestrian eliminates more options than making the pigeon fly.
Models, myths, dreams, and Cheshire cat grins

Thanks! Good insights there. Am reproducing the comment here for people less willing to click through:

I haven't read the literature on "how counterfactuals ought to work in ideal reasoners" and have no opinion there. But the part where you suggest an empirical description of counterfactual reasoning in humans, I think I basically agree with what you wrote.

I think the neocortex has a zoo of generative models, and a fast way of detecting when two are compatible, and if they are, snapping them together like Legos into a larger model.

For example, the m

...
Cortés, Pizarro, and Afonso as Precedents for Takeover

...which also means that they didn't have an empire to back them up?

[1] Daniel Kokotajlo, 1y: Yes. Distinguishing between not having an empire and not being willing to fight all-out, they suffered from the first problem, whereas (perhaps, we shall see) the other port cities suffered from the second.
Cortés, Pizarro, and Afonso as Precedents for Takeover

Thanks for your research, especially the Afonso stuff. One question for that: were these empires used to gaining/losing small pieces of territory? ie did they really dedicate all their might to getting these ports back, or did they eventually write them off as minor losses not worth the cost of fighting (given Portuguese naval advantages)?

[1] Daniel Kokotajlo, 1y: Good question; I'll find out. Malacca at least was a city-state, so the Portuguese attack was an existential threat.
Cortés, Pizarro, and Afonso as Precedents for Takeover

Based on what I recall reading about Pizarro's conquest, I feel you might be underestimating the importance of horses. It took centuries for European powers to figure out how to break a heavy cavalry charge with infantry; the Amerindians didn't have the time to figure it out (see various battles where small cavalry forces routed thousands of troops). Once they had got more used to horses, later Inca forces (though much diminished) were more able to win open battles against the Spanish.

Maybe this was the problem for these empires: they were used to winning

...
[1] Daniel Kokotajlo, 1y: Mmm, interesting. I'm now reading a 1400-page history book on the subject (after all the attention my post got, I figured I should read more than just a bunch of wiki pages!) so we'll know one way or another soon enough. Thanks for the tip.
Reward functions and updating assumptions can hide a multitude of sins

My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of "update a distribution over reward functions" has lots of as-yet-unsolved problems.

Ah, ok, I see ^_^ Thanks for making me write this post, though, as it has useful things for other people to see, that I had been meaning to write up for some time.

On your main point: if the prior and updating process are over things that are truly beyond the AI's influence, then there will be no rigging (

...