All of Stuart_Armstrong's Comments + Replies

Counterfactual control incentives

Thanks. I think we mainly agree here.

Preferences and biases, the information argument

Look at the paper linked for more details.

Basically "humans are always fully rational and always take the action they want to" is a full explanation of all of human behaviour, that is strictly simpler than any explanation which includes human biases and bounded rationality.

Model splintering: moving from one imperfect model to another

But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation

I'm more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.

Model splintering: moving from one imperfect model to another

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).

1 Koen Holtman (2mo): OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns or self-improves. This is a subcase that can be expressed by your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimism with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are. I might phrase this question as: If the humans start to splinter their idea of what a certain kind of morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned? The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like 'ask the supervisor' or 'slow down' or 'avoid going into that part of the state space'. The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I'd like to ask you: what is your sense of optimism or pessimism in this area?
Model splintering: moving from one imperfect model to another

Thanks! Lots of useful insights in there.

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.

3 Koen Holtman (2mo): The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent's environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely. I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
Generalised models as a category

Cheers! My opinion on category theory has changed a bit, because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.

3 Koen Holtman (2mo): Definitely, it has also been my experience that you can often get new insights by constructing mappings to different models or notations.
Stuart_Armstrong's Shortform

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution.

That's a $Q$ that defines $Q(A \mid B)$ for some but not all $A$ and $B$ (with $Q(A) = Q(A \mid \Omega)$ for $\Omega$ being the whole set of outcomes).

This is a partial probability distribution iff there exists a probability distribution $P$ that is equal to $Q$ wherever $Q$ is defined. Call this a full extension of $Q$.

Suppose that $Q(C \mid D)$ is not defined. We can, however, say that $Q(C \mid D) = x$ is a logical implication of $Q$ if every full extension $P$ has $P(C \mid D) = x$.

Eg: , , w... (read more)

3 Diffractor (18d): Sounds like a special case of crisp infradistributions (ie, all partial probability distributions have a unique associated crisp infradistribution). Given some Q, we can consider the (nonempty) set of probability distributions equal to Q where Q is defined. This set is convex (clearly, a mixture of two probability distributions which agree with Q about the probability of an event will also agree with Q about the probability of an event). Convex (compact) sets of probability distributions = crisp infradistributions.
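A small illustration of the two points above (logical implication via full extensions, and convexity of the set of full extensions) may help. It is only a sketch under simplifying assumptions that do not come from the comments themselves: a four-element outcome set, a Q that is defined on unconditional events only, and made-up probability values. The helpers `prob`, `is_full_extension` and `mix` are hypothetical names chosen for this example.

```python
from fractions import Fraction as F

# Assumed toy setting: four outcomes, two overlapping events.
OUTCOMES = (0, 1, 2, 3)
A = frozenset({0, 1})
B = frozenset({1, 2})

def prob(p, event):
    """Probability of an event under a full distribution p (outcome -> probability)."""
    return sum(p[w] for w in event)

def is_full_extension(p, q):
    """True if p is a probability distribution that agrees with q wherever q is defined."""
    valid = all(x >= 0 for x in p.values()) and sum(p.values()) == 1
    return valid and all(prob(p, e) == v for e, v in q.items())

def mix(p1, p2, lam):
    """Convex mixture lam * p1 + (1 - lam) * p2 of two distributions."""
    return {w: lam * p1[w] + (1 - lam) * p2[w] for w in OUTCOMES}

# Case 1: Q is defined on A and B only. Two full extensions disagree on
# the intersection, so no value for it is logically implied by Q.
q = {A: F(1, 2), B: F(1, 3)}
p1 = {0: F(1, 2), 1: F(0), 2: F(1, 3), 3: F(1, 6)}
p2 = {0: F(1, 6), 1: F(1, 3), 2: F(0), 3: F(1, 2)}

# Convexity: a mixture of two full extensions is again a full extension.
p_mix = mix(p1, p2, F(1, 4))

# Case 2: Q is also defined on the union. Inclusion-exclusion now pins
# down the intersection in every full extension.
q_more = {A: F(1, 2), B: F(1, 3), A | B: F(2, 3)}
implied = q_more[A] + q_more[B] - q_more[A | B]
p3 = {0: F(1, 3), 1: F(1, 6), 2: F(1, 6), 3: F(1, 3)}
```

Exact fractions are used so that the equality checks against Q are not disturbed by floating-point rounding.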
Introduction to Cartesian Frames

I like it. I'll think about how it fits with my ways of thinking (eg model splintering).

Counterfactual control incentives

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.

1 Tom Everitt (1mo): Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. (Sorry btw for slow reply; I keep missing alignmentforum notifications.)
1 Koen Holtman (2mo): On recent terminology innovation: For exactly the same reason, in my own recent paper Counterfactual Planning, I introduced the terms direct incentive and indirect incentive, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence I develop and apply this terminology in the case of an agent emergency stop button. In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology suppresses the incentive instead of removes the incentive. I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome. Do they have some standard phrasing where they can say things like 'no value to control' while subtly reminding the reader that 'this does not imply there will be no side effects'?
AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)

Cheers, that would be very useful.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)

I feel that the whole AI alignment problem can be seen as problems with ontological shifts:

4 Rohin Shah (3mo): I think I agree at least that many problems can be seen this way, but I suspect that other framings are more useful for solutions. (I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.) What I was claiming in the sentence you quoted was that I don't see ontological shifts as a huge additional category of problem that isn't covered by other problems, which is compatible with saying that ontological shifts can also represent many other problems.
Extortion beats brinksmanship, but the audience matters

The connection to AI alignment is combining the different utilities of different entities without extortion ruining the combination, and dealing with threats and acausal trade.

Extortion beats brinksmanship, but the audience matters

That's a misspelling that's entirely my fault, and has now been corrected.

Extortion beats brinksmanship, but the audience matters

(1) You say that releasing nude photos is in the blackmail category. But who's the audience?

The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.

(2) For n=1, m large: Is an example of brinkmanship here a monopolistic buyer who will only choose suppliers giving cutrate prices?

Interesting example that I hadn't really considered. I'd say it fits more under extortion than brinksmanship, though. A small supplier has to sell, or they won't stay in business. If there's a single buyer, "I won't buy ... (read more)

2 romeostevensit (5mo): Releasing one photo from a set previously believed to be secure, where other photos in the same set are compromising, can suffice for the single-member audience case.
Anthropomorphisation vs value learning: type 1 vs type 2 errors

A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".

1 Steve Byrnes (7mo): Gotcha, thanks. I have corrected my comment two above by striking out the words "boundedly-rational", but I think the point of that comment still stands.
Anthropomorphisation vs value learning: type 1 vs type 2 errors

that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

2 Steve Byrnes (7mo): Sorry for the stupid question, but what's the difference between "boundedly-rational agent pursuing a reward function" and "any sort of agent pursuing a reward function"?
Anthropomorphisation vs value learning: type 1 vs type 2 errors

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on it - even if you know every physical... (read more)

1 Steve Byrnes (7mo): It's your first day working at the factory, and you're assigned to shadow Alice as she monitors the machines on the line. She walks over to the Big Machine and says, "Looks like it's flooping again," whacks it, and then says "I think that fixed it". This happens a few times a day perpetually. Over time, you learn what flooping is, kinda. When the Big Machine is flooping, it usually (but not always) makes a certain noise, it usually (but not always) has a loose belt, and it usually (but not always) has a gear that shifted out of place. Now you know what it means for the Big Machine to be flooping, although there are lots of edge cases where neither you nor Alice has a good answer for whether or not it's truly flooping, vs sorta flooping, vs not flooping. By the same token, you could give some labeled examples of "wants to take a walk" to the aliens, and they can find what those examples have in common and develop a concept of "wants to take a walk", albeit with edge cases. Then you can also give labeled examples of "wants to braid their hair", "wants to be accepted", etc., and after enough cycles of this, they'll get the more general concept of "want", again with edge cases. I don't think I'm saying anything that goes against your Occam's razor paper. As I understood it (and you can correct me!!), that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function", and proved that there's no objectively best way to do it, where "objectively best" includes things like fidelity and simplicity. (My perspective on that is, "Well yeah, duh, humans are not boundedly-rational agents pursuing a utility function! The model doesn't fit! There's no objectively best way to hammer a square peg into a round hole! (ETA: the model doesn't fit except insofar as the model is tautologically applicable to anything)") I don't see how the paper rules o
Learning human preferences: black-box, white-box, and structured white-box access

Humans have a theory of mind, that makes certain types of modularizations easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.

Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements, are easy to deduce (there's an informal equivalence result; if one of those is easy to deduce, all the others are).

So we need to figure out if we're in the optimistic or the pessimistic scenario.

Learning human preferences: black-box, white-box, and structured white-box access

My understanding of the OP was that there is a robot [...]

That understanding is correct.

Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?

I agree that preferences is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but it's also po... (read more)

Learning human preferences: black-box, white-box, and structured white-box access

modularization is super helpful for simplifying things.

The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).

1 John Maxwell (8mo): Let's say I'm trying to describe a hockey game. Modularizing the preferences from other aspects of the team algorithm makes it much easier to describe what happens at the start of the second period, when the two teams switch sides. The fact that humans find an abstraction useful is evidence that an AI will as well. The notion that agents have preferences helps us predict how people will change their plans for achieving their goals when they receive new information. Same for an AI.
Learning human preferences: black-box, white-box, and structured white-box access

but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?

Then isn't that just a model at another level, a (labelled) model in the heads of the onlookers?

1 G Gordon Worley III (8mo): Any model is going to be in the head of some onlooker. This is the tough part about the white box approach: it's always an inference about what's "really" going on. Of course, this is true even of the boundaries of black boxes, so it's a fully general problem. And I think that suggests it's not a problem except insofar as we have normal problems setting up correspondence between map and territory.
1 Sammy Martin (8mo): Glad you think so! I think that methods like using multiple information sources might be a useful way to reduce the number of (potentially mistaken) normative assumptions you need in order to model a single human's preferences. The other area of human preference learning where you seem, inevitably, to need a lot of strong normative assumptions is in preference aggregation. If we assume we have elicited the preferences of lots of individual humans, and we're then trying to aggregate their preferences (with each human's preference represented by a separate model) I think the same basic principle applies, that you can reduce the normative assumptions you need by using a more complicated voting mechanism, in this case one that considers agents' ability to vote strategically as an opportunity to reach stable outcomes. I talk about this idea here. As with using approval/actions to improve the elicitation of an individual's preferences, you can't avoid making any normative assumptions by using a more complicated aggregation method, but perhaps you end up having to make fewer of them. Very speculatively, if you can combine a robust method of eliciting preferences with few inbuilt assumptions with a similarly robust method of aggregating preferences, you're on your way to a full solution to ambitious value learning.
The ground of optimization

Very good. A lot of potential there, I feel.

Dynamic inconsistency of the inaction and initial state baseline

Why do the absolute values cancel?

Because , so you can remove the absolute values.

Tradeoff between desirable properties for baseline choices in impact measures

I also think the pedestrian example illustrates why we need more semantic structure: "pedestrian alive" -> "pedestrian dead" is bad, but "pigeon on road" -> "pigeon in flight" is fine.

3 Vika (9mo): I don't think the pedestrian example shows a need for semantic structure. The example is intended to illustrate that an agent with the stepwise inaction baseline has no incentive to undo the delayed effect that it has set up. We want the baseline to incentivize the agent to undo any delayed effect, whether it involves hitting a pedestrian or making a pigeon fly. The pedestrian and pigeon effects differ in the magnitude of impact, so it is the job of the deviation measure to distinguish between them and penalize the pedestrian effect more. Optionality-based deviation measures (AU and RR) capture this distinction because hitting the pedestrian eliminates more options than making the pigeon fly.
Models, myths, dreams, and Cheshire cat grins

Thanks! Good insights there. Am reproducing the comment here for people less willing to click through:

I haven't read the literature on "how counterfactuals ought to work in ideal reasoners" and have no opinion there. But the part where you suggest an empirical description of counterfactual reasoning in humans, I think I basically agree with what you wrote.

I think the neocortex has a zoo of generative models, and a fast way of detecting when two are compatible, and if they are, snapping them together like Legos into a larger model.

For example, the m

... (read more)
Cortés, Pizarro, and Afonso as Precedents for Takeover

...which also means that they didn't have an empire to back them up?

1 Daniel Kokotajlo (1y): Yes. Distinguishing between not having an empire and not being willing to fight all-out, they suffered from the first problem, whereas (perhaps, we shall see) the other port cities suffered from the second.
Cortés, Pizarro, and Afonso as Precedents for Takeover

Thanks for your research, especially the Afonso stuff. One question for that: were these empires used to gaining/losing small pieces of territory? ie did they really dedicate all their might to getting these ports back, or did they eventually write them off as minor losses not worth the cost of fighting (given Portuguese naval advantages)?

1 Daniel Kokotajlo (1y): Good question; I'll find out. Malacca at least was a city-state, so the Portuguese attack was an existential threat.
Cortés, Pizarro, and Afonso as Precedents for Takeover

Based on what I recall reading about Pizarro's conquest, I feel you might be underestimating the importance of horses. It took centuries for European powers to figure out how to break a heavy cavalry charge with infantry; the Amerindians didn't have the time to figure it out (see various battles where small cavalry forces routed thousands of troops). Once they had got more used to horses, later Inca forces (though much diminished) were more able to win open battles against the Spanish.

Maybe this was the problem for these empires: they were used to winning

... (read more)
1 Daniel Kokotajlo (1y): Mmm, interesting. I'm now reading a 1400-page history book on the subject (after all the attention my post got, I figured I should read more than just a bunch of wiki pages!) so we'll know one way or another soon enough. Thanks for the tip.
Reward functions and updating assumptions can hide a multitude of sins

My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of "update a distribution over reward functions" has lots of as-yet-unsolved problems.

Ah, ok, I see ^_^ Thanks for making me write this post, though, as it has useful things for other people to see, that I had been meaning to write up for some time.

On your main point: if the prior and updating process are over things that are truly beyond the AI's influence, then there will be no rigging (

... (read more)
How should AIs update a prior over human preferences?

I agree that for such a system, the optimal policy of the actor is to rig the estimator, and to "intentionally" bias it towards easy-to-satisfy rewards like "the human loves heroin".

The part that confuses me is why we're having two separate systems with different objectives where one system is dumb and the other system is smart.

We don't need to have two separate systems. There are two meanings to your "bias it towards" phrase: the first one is the informal human one, where "the human loves heroin" is clearly a bias. The second is some formal definition

... (read more)
2 Charlie Steiner (1y): I think Rohin's point is that the model of is more IRL than CIRL. It doesn't necessarily assume that the human knows their own utility function and is trying to play a cooperative strategy with the AI that maximizes that same utility function. If I knew that what would really maximize utility is having that second hit of heroin, I'd try to indicate it to the AI I was cooperating with. Problems with IRL look like "we modeled the human as an agent based on representative observations, and now we're going to try to maximize the modeled values, and that's bad." Problems with CIRL look like "we're trying to play this cooperative game with the human that involves modeling it as an agent playing the same game, and now we're going to try to take actions that have really high EV in the game, and that's bad."
2 Rohin Shah (1y): The key point is not that the AI knows what is or isn't "rigging", or that the AI "knows what a bias is". The key point is that in a CIRL game, by construction there is a true (unknown) reward function, and thus an optimal policy must be viewable as being Bayesian about the reward function, and in particular its actions must be consistent with conservation of expected evidence about the reward function; anything which "rigs" the "learning process" does not satisfy this property and so can't be optimal. You might reasonably ask where the magic happens. The CIRL game that you choose would have to commit to some connection between rewards and behavior. It could be that in one episode the human wants heroin (but doesn't know it) and in another episode the human doesn't want heroin (this depends on the prior over rewards). However, it could never be the case that in a single episode (where the reward must be fixed) the human doesn't want heroin, and then later in the same episode the human does want heroin. Perhaps in the real world this can happen; that would make this policy suboptimal in the real world. (What it does then is unclear since it depends on how the policy generalizes out of distribution.) If this doesn't clarify it, I'll probably table this discussion until publishing an upcoming paper on CIRL games (where it will probably be renamed to assistance games). EDIT: Perhaps another way to put this: I agree that if you train an AI system to act such that it maximizes the expected reward under the posterior inferred by a fixed update rule looking at the AI system's actions and resulting states, the AI will tend to gain reward by choosing actions which when plugged into the update rule lead to a posterior that is "easy to maximize".
This seems like training the controller but not training the estimator, and so the controller learns information about the world that allows it to "trick" the estimator into updating in a particular direction (something that would b
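The "conservation of expected evidence" property mentioned in this comment can be checked numerically in a toy case. The sketch below is an illustration only, not the setup of any paper discussed here: the two reward hypotheses, the prior, the likelihoods, and the "rigging" action that forces the observation are all assumed values, and `posterior_r1` and `expected_posterior` are hypothetical helper names.

```python
from fractions import Fraction as F

# Assumed toy model: two candidate reward functions R0 and R1, one binary
# observation o, and a fixed update rule that assumes P(o = 1 | reward).
PRIOR_R1 = F(1, 2)
LIK = {"R1": F(4, 5), "R0": F(3, 10)}

def posterior_r1(o, prior=PRIOR_R1):
    """Posterior probability of R1 after seeing o, using the fixed update rule."""
    l1 = LIK["R1"] if o == 1 else 1 - LIK["R1"]
    l0 = LIK["R0"] if o == 1 else 1 - LIK["R0"]
    return prior * l1 / (prior * l1 + (1 - prior) * l0)

def expected_posterior(p_obs1):
    """Expected posterior on R1 when o = 1 actually occurs with probability p_obs1."""
    return p_obs1 * posterior_r1(1) + (1 - p_obs1) * posterior_r1(0)

# If observations really come from the model the rule assumes, the expected
# posterior equals the prior (conservation of expected evidence).
honest_p_obs1 = PRIOR_R1 * LIK["R1"] + (1 - PRIOR_R1) * LIK["R0"]
honest = expected_posterior(honest_p_obs1)

# If the agent can force o = 1 whatever the reward is, while the rule still
# treats o as evidence, the expected posterior drifts away from the prior.
rigged = expected_posterior(F(1))
```

In this toy case the honest expectation stays at the prior, while the forced observation pushes the expected posterior on R1 above it, which is one way to read the distinction between a Bayesian learner and a fixed update rule that the actor can exploit.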
If I were a well-intentioned AI... I: Image classifier

I think that might be a generally good critique, but I don't think it applies to this post (it may apply better to post #3 in the series).

I used "metal with knobs" and "beefy arm" as human-parsable examples, but the main point is detecting when something is out-of-distribution, which relies on the image being different in AI-detectable ways, not on the specifics of the categories I mentioned.

1 Charlie Steiner (1y): I don't think this is necessarily a critique - after all, it's inevitable that AI-you is going to inherit some anthropomorphic powers. The trick is figuring out what they are and seeing if it seems like a profitable research avenue to try and replicate them :) In this case, I think this is an already-known problem, because detecting out-of-distribution images in a way that matches human requirements requires the AI's distribution to be similar to human distribution (and conversely, mismatches in distribution allow for adversarial examples). But maybe there's something different in part 4 where I think there's some kind of "break down actions in obvious ways" power that might not be as well-analyzed elsewhere (though it's probably related to self-supervised learning of hierarchical planning problems).
Problem relaxation as a tactic

A key AI safety skill is moving back and forth, as needed, between "could we solve problem X if we assume Y?" and "can we assume Y?".

Databases of human behaviour and preferences?

Suggested elsewhere by Max Daniel:

  • Ultimatum game or other widely studied games in psych/behavioral econ?
  • Ebay bidding, or other auctions?
  • Chess or other games?
  • Voting in elections
  • Gambling: casinos, online poker ...
  • Online dating behavior

Suggested by Ozzie Gooen:

  • This sounds a bit to me like psychology experiments with children, or perhaps some well studied psychology experiments (where there are large amounts of data, with relatively narrow options).
  • Websites would have more than enough data for narrow decisions, like, “Which ad will this user click”
... (read more)