Sami Petersen - AI Alignment Forum

Scott Garrabrant rejects the Independence of Irrelevant Alternatives axiom

*Independence, not IIA. Wikipedia is wrong (as of today).

A simple case for extreme inner misalignment

Two nitpicks and a reference:

an agent’s goals might not be linearly decomposable over possible worlds due to risk-aversion

Risk aversion doesn't violate additive separability. E.g., for $u (x) = x^{a}$ we always get $E [u (x)] = \sum_{i} p_{i} x_{i}^{a}$, whether $a = 1$ (risk neutrality) or $a = 1 / 2$ (risk aversion). Though some alternatives to expected utility, like Buchak's REU theory, can allow certain sources of risk aversion to violate separability.
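
To make this concrete, here is a minimal sketch assuming the power utility $u(x) = x^{a}$ and a two-state lottery with made-up numbers. The function and variable names are mine, purely for illustration. It checks that changing the outcome in one state shifts expected utility by the same amount whatever happens in the other state, even when $a = 1/2$.

```python
import math

def expected_utility(probs, outcomes, a):
    """Expected utility under u(x) = x**a: a sum of per-state terms p_i * x_i**a."""
    return sum(p * x**a for p, x in zip(probs, outcomes))

probs = [0.5, 0.5]          # illustrative state probabilities
lottery = [0.0, 100.0]      # risky prospect
sure = [50.0, 50.0]         # the lottery's mean, received in both states

# Premium for certainty: zero under risk neutrality, positive under risk aversion.
risk_neutral_gap = expected_utility(probs, sure, 1.0) - expected_utility(probs, lottery, 1.0)
risk_averse_gap = expected_utility(probs, sure, 0.5) - expected_utility(probs, lottery, 0.5)

def delta_from_state0(x0_old, x0_new, x1, a):
    """Change in expected utility from altering state 0's outcome, holding state 1 at x1."""
    return (expected_utility(probs, [x0_new, x1], a)
            - expected_utility(probs, [x0_old, x1], a))

# Additive separability: the change does not depend on the other state's outcome.
delta_low = delta_from_state0(16.0, 36.0, 4.0, 0.5)
delta_high = delta_from_state0(16.0, 36.0, 900.0, 0.5)
```

The risk-averse agent pays a premium for certainty, yet `delta_low` and `delta_high` coincide, which is all that separability across states asks for.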

when features have fixed marginal utility, rather than being substitutes

Perfect substitutes have fixed marginal utility. E.g., $v (x, y) = x + 2 y$ always has marginal utilities of 1 and 2.
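
A quick numerical check of the same point, as a sketch only: the finite-difference helper and the sample points are my own choices, not anything from the post.

```python
def v(x, y):
    """Perfect-substitutes utility from the example: v(x, y) = x + 2y."""
    return x + 2 * y

def marginal_utilities(x, y, h=1.0):
    """Finite-difference marginal utilities of v with respect to x and y at (x, y)."""
    mu_x = (v(x + h, y) - v(x, y)) / h
    mu_y = (v(x, y + h) - v(x, y)) / h
    return mu_x, mu_y

# Arbitrary bundles: the marginal utilities come out the same at each of them.
points = [(0, 0), (3, 7), (100, 1)]
marginals = [marginal_utilities(x, y) for x, y in points]
```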

I'll focus on linearly decomposable goals which can be evaluated by adding together evaluations of many separate subcomponents. More decomposable goals are simpler

There's an old literature on separability in consumer theory that's since been tied to bounded rationality. One move that's made is to grant weak separability across groups of objects (features) to rationalise the behaviour of optimising across groups first, and within groups second. Pretnar et al (2021) describe how this can arise from limited cognitive resources.

Meaning & Agency

Sami Petersen8mo30

I argued that the signal-theoretic^[1] analysis of meaning (which is the most common Bayesian analysis of communication) fails to adequately define lying, and fails to offer any distinction between denotation and connotation or literal content vs conversational implicature.

In case you haven't come across this, here are two papers on lying by the founders of the modern economics literature on communication. I've only skimmed your discussion but if this is relevant, here's a great non-technical discussion of lying in that framework. A common thread in these discussions is that the apparent "no-lying" implication of the analysis of language in the Lewis-Skyrms/Crawford-Sobel signalling tradition relies importantly on common knowledge of rationality and, implicitly, on common knowledge of the game being played, i.e. of the available actions and all the players' preferences.

Invulnerable Incomplete Preferences: A Formal Statement

Sami Petersen10mo10

In your example, DSM permits the agent to end up with either A+ or B. Neither is strictly dominated, and neither has become mandatory for the agent to choose over the other. The agent won't have reason to push probability mass from one towards the other.

You can think of me as trying to run an obvious-to-me assertion test on code which I haven't carefully inspected, to see if the result of the test looks sane.

This is reasonable but I think my response to your comment will mainly involve re-stating what I wrote in the post, so maybe it'll be easier to point to the relevant sections: 3.1. for what DSM mandates when the agent has beliefs about its decision tree, 3.2.2 for what DSM mandates when the agent hadn't considered an actualised continuation of its decision tree, and 3.3. for discussion of these results. In particular, the following paragraphs are meant to illustrate what DSM mandates in the least favourable epistemic state that the agent could be in (unawareness with new options appearing):

It seems we can’t guarantee non-trammelling in general and between all prospects. But we don’t need to guarantee this for all prospects to guarantee it for some, even under awareness growth. Indeed, as we’ve now shown, there are always prospects with respect to which the agent never gets trammelled, no matter how many choices it faces. In fact, whenever the tree expansion does not bring about new prospects, trammelling will never occur (Proposition 7). And even when it does, trammelling is bounded above by the number of comparability classes (Proposition 10).
And it’s intuitive why this would be: we’re simply picking out the best prospects in each class. For instance, suppose prospects were representable as pairs $(s, c)$ that are comparable iff the $s$-values are the same, and then preferred to the extent that $c$ is large. Then here’s the process: for each value of $s$, identify the options that maximise $c$. Put all of these in a set. Then choice between any options in that set will always remain arbitrary; never trammelled.
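
The process in that paragraph can be sketched as follows. This is only an illustration of the intuition pump, not of DSM itself; the function name, the class labels `"s1"`/`"s2"`, and the numbers are all made up.

```python
def best_in_each_class(prospects):
    """For each s-value, keep the prospects whose c is maximal within that class.

    prospects: iterable of (s, c) pairs, comparable iff they share the same s.
    """
    best_c = {}
    for s, c in prospects:
        if s not in best_c or c > best_c[s]:
            best_c[s] = c
    return {(s, c) for s, c in prospects if c == best_c[s]}

# Two comparability classes with illustrative c-values.
prospects = [("s1", 1), ("s1", 3), ("s2", 2), ("s2", 5)]
selected = best_in_each_class(prospects)

# Members of the selected set come from different classes, so no pair is comparable.
any_comparable = any(x != y and x[0] == y[0] for x in selected for y in selected)
```

Since every surviving pair comes from distinct classes, nothing in the set can be ranked above anything else in it.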

Invulnerable Incomplete Preferences: A Formal Statement

Sami Petersen10mo41

Great, I think bits of this comment help me understand what you're pointing to.

the desired behavior implies a revealed preference gap

I think this is roughly right, together with all the caveats about the exact statements of Thornley's impossibility theorems. Speaking precisely here will be cumbersome so for the sake of clarity I'll try to restate what you wrote like this:

Useful agents satisfying completeness and other properties X won't be shutdownable.
Properties X are necessary for an agent to be useful.
So, useful agents satisfying completeness won't be shutdownable.
So, if a useful agent is shutdownable, its preferences are incomplete.

This argument would let us say that observing usefulness and shutdownability reveals a preferential gap.

I think the question I'm interested in is: "do trammelling-style issues imply that DSM agents will not have a revealed preference gap (under reasonable assumptions about their environment and capabilities)?"

A quick distinction: an agent can (i) reveal p, (ii) reveal ¬p, or (iii) neither reveal p nor ¬p. The problem of underdetermination of preference is of the third form.

We can think of some of the properties we've discussed as 'tests' of incomparability, which might or might not reveal preferential gaps. The test in the argument just above is whether the agent is useful and shutdownable. The test I use for my results above (roughly) is 'arbitrary choice'. The reason I use that test is that my results are self-contained; I don't make use of Thornley's various requirements for shutdownability. Of course, arbitrary choice isn't what we want for shutdownability. It's just a test for incomparability that I used for an agent that isn't yet endowed with Thornley's other requirements.

The trammelling results, though, don't give me any reason to think that DSM is problematic for shutdownability. I haven't formally characterised an agent satisfying DSM as well as TND, Stochastic Near-Dominance, and so on, so I can't yet give a definitive or exact answer to how DSM affects the behaviour of a Thornley-style agent. (This is something I'll be working on.) But regarding trammelling, I think my results are reasons for optimism if anything. Even in the least convenient case that I looked at—awareness growth—I wrote this in section 3.3. as an intuition pump:

we’re simply picking out the best prospects in each class. For instance, suppose prospects were representable as pairs $(s, c)$ that are comparable iff the $s$-values are the same, and then preferred to the extent that $c$ is large. Then here’s the process: for each value of $s$, identify the options that maximise $c$. Put all of these in a set. Then choice between any options in that set will always remain arbitrary; never trammelled.

That is, we retain the preferential gap between the options we want a preferential gap between.

[As an aside, the description in your first paragraph of what we want from a shutdownable agent doesn't quite match Thornley's setup; the relevant part to see this is section 10.1. here.]

Invulnerable Incomplete Preferences: A Formal Statement

Sami Petersen10mo10

On my understanding, the argument isn’t that your DSM agent can be made better off, but that the reason it can’t be made better off is because it is engaging in trammeling/“collusion”, and that the form of “trammeling” you’ve ruled out isn’t the useful kind.

I don't see how this could be right. Consider the bounding results on trammelling under unawareness (e.g. Proposition 10). They show that there will always be a set of options between which DSM does not require choosing one over the other. Suppose these are X and Y. The agent will always be able to choose either one. They might end up always choosing X, always Y, switching back and forth, whatever. This doesn't look like the outcome of two subagents, one preferring X and the other Y, negotiating to get some portion of the picks.

As far as an example goes, consider a sequence of actions which, starting from an unpressed world state, routes through a pressed world state (or series of pressed world states), before eventually returning to an unpressed world state with higher utility than the initial state.

Forgive me; I'm still not seeing it. For coming up with examples, I think for now it's unhelpful to use the shutdown problem, because the actual proposal from Thornley includes several more requirements. I think it's perfectly fine to construct examples about trammelling and subagents using something like this: A is a set of options with typical member $a_{i}$. These are all comparable and ranked according to their subscripts. That is, $a_{1}$ is preferred to $a_{2}$, and so on. Likewise with set B. And all options in A are incomparable to all options in B.
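
For anyone who wants to build such an example, here is one way to encode that structure, as a sketch with three options per set (the sizes and the tuple encoding are arbitrary choices of mine). It only encodes strict preference and checks that the relation is a strict partial order with gaps across the two sets; it says nothing about what DSM mandates.

```python
from itertools import product

# Options are (set_name, subscript) pairs; three per set is an arbitrary choice.
A = [("A", i) for i in (1, 2, 3)]
B = [("B", i) for i in (1, 2, 3)]
options = A + B

def prefers(x, y):
    """x is strictly preferred to y iff both are in the same set and x has the smaller subscript."""
    return x[0] == y[0] and x[1] < y[1]

# Strict partial order checks.
irreflexive = not any(prefers(x, x) for x in options)
transitive = all(
    prefers(x, z)
    for x, y, z in product(options, repeat=3)
    if prefers(x, y) and prefers(y, z)
)

# Every cross-set pair lacks a strict preference in either direction.
incomparable_pairs = [
    (x, y) for x in A for y in B
    if not prefers(x, y) and not prefers(y, x)
]
```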

If your proposed DSM agent passes up this action sequence on the grounds that some of the intermediate steps need to bridge between “incomparable” pressed/unpressed trajectories, then it does in fact pass up the certain gain. Conversely, if it doesn’t pass up such a sequence, then its behavior is the same as that of a set of negotiating subagents cooperating in order to form a larger macroagent.

This looks to me like a misunderstanding that I tried to explain in section 3.1. Let me know if not, though, ideally with a worked-out example of the form: "here's the decision tree(s), here's what DSM mandates, here's why it's untrammelled according to the OP definition, and here's why it's problematic."

Invulnerable Incomplete Preferences: A Formal Statement

Sami Petersen10mo43

That makes sense, yeah.

Let me first make some comments about revealed preferences that might clarify how I'm seeing this. Preferences are famously underdetermined by limited choice behaviour. If A and B are available and I pick A, you can't infer that I like A more than B — I might be indifferent or unable to compare them. Worse, under uncertainty, you can't tell why I chose some lottery over another even if you assume I have strict preferences between all options — the lottery I choose depends on my beliefs too. In expected utility theory, beliefs and preferences together induce choice, so if we only observe a choice, we have one equation in two unknowns.^[1] Given my choice, you'd need to read my mind's probabilities to be able to infer my preferences (and vice versa).^[2]
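
A toy illustration of the "one equation in two unknowns" point, with made-up numbers: two hypothetical agents with different beliefs and different utilities make the same observed choice, then come apart once the win probability is fixed at a common value. The agent definitions and the `choose` helper are assumptions for the sketch.

```python
def choose(belief_win, utilities):
    """Pick 'lottery' or 'sure' by expected utility.

    belief_win: subjective probability that the lottery pays the prize.
    utilities: dict with keys 'prize', 'nothing', 'sure'.
    """
    eu_lottery = belief_win * utilities["prize"] + (1 - belief_win) * utilities["nothing"]
    return "lottery" if eu_lottery > utilities["sure"] else "sure"

# Hypothetical agents: optimistic with a highly valued sure thing, vs pessimistic with a lowly valued one.
agent_1 = {"belief": 0.9, "u": {"prize": 1.0, "nothing": 0.0, "sure": 0.6}}
agent_2 = {"belief": 0.3, "u": {"prize": 1.0, "nothing": 0.0, "sure": 0.2}}

# Observed choice under each agent's own beliefs: identical.
observed = [choose(a["belief"], a["u"]) for a in (agent_1, agent_2)]

# Later choice where both agents know the win probability is 0.5: they diverge.
later = [choose(0.5, a["u"]) for a in (agent_1, agent_2)]
```

Anyone who saw only the first choice and inferred a single preference would mispredict at least one of the two agents in the second.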

In that sense, preferences (mostly) aren't actually revealed. Economists often assume various things to apply revealed preference theory, e.g. setting beliefs equal to 'objective chances', or assuming a certain functional form for the utility function.

But why do we care about preferences per se, rather than what's revealed? Because we want to predict future behaviour. If you can't infer my preferences from my choices, you can't predict my future choices. In the example above, if my 'revealed preference' between A and B is that I prefer A, then you might make false predictions about my future behaviour (because I might well choose B next time).

Let me know if I'm on the right track for clarifying things. If I am, could you say how you see trammelling/shutdown connecting to revealed preferences as described here, and I'll respond to that?

^[1]
^[2] The situation is even worse when you can't tell what I'm choosing between, or what my preference relation is defined over.

Invulnerable Incomplete Preferences: A Formal Statement

Sami Petersen10mo10

I disagree; see my reply to John above.

Invulnerable Incomplete Preferences: A Formal Statement

Sami Petersen10mo30

if the subagents representing a set of incomplete preferences would trade with each other to emulate more complete preferences, then an agent with the plain set of incomplete preferences would precommit to act in the same way

My results above on invulnerability preclude the possibility that the agent can predictably be made better off by its own lights through an alternative sequence of actions. So I don't think that's possible, though I may be misreading you. Could you give an example of a precommitment that the agent would take? In my mind, an example of this would have to show that the agent (not the negotiating subagents) strictly prefers the commitment to what it otherwise would've done according to DSM etc.

Yeah, I wasn't using Bradley. The full set of coherent completions is overkill, we just need to nail down the partial order.

I agree the full set won't always be needed, at least when we're just after ordinal preferences, though I personally don't have a clear picture of when exactly that holds.

Invulnerable Incomplete Preferences: A Formal Statement

Sami Petersen10mo30

On John's-simplified-model-of-Thornley's-proposal, we have complete preference orderings over trajectories-in-which-the-button-isn't-pressed and trajectories-in-which-the-button-is-pressed, separately, but no preference between any button-pressed and button-not-pressed trajectory pair.

For the purposes of this discussion, this is right. I don't think the differences between this description and the actual proposal matter in this case.

Represented as subagents, those incomplete preferences require two subagents:
One subagent always prefers button pressed to unpressed, is indifferent between unpressed trajectories, and has the original complete order on pressed trajectories.
The other subagent always prefers button unpressed to pressed, is indifferent between pressed trajectories, and has the original complete order on unpressed trajectories.

I don't think this representation is quite right, although not for a reason I expect to matter for this discussion. It's a technicality but I'll mention it for completeness. If we're using Bradley's representation theorem from section 2.1., the set of subagents must include every coherent completion of the agent's preferences. E.g., suppose there are three possible trajectories. Let $p$ denote a pressed trajectory and $u_{1}, u_{2}$ two unpressed trajectories, where $u_{1}$ gets you strictly more coins than $u_{2}$. Then there'll be five (ordinal) subagents, described in order of preference: $\langle u_{1}, u_{2}, p \rangle$, $\langle u_{1}, u_{2} \sim p \rangle$, $\langle u_{1}, p, u_{2} \rangle$, $\langle u_{1} \sim p, u_{2} \rangle$, and $\langle p, u_{1}, u_{2} \rangle$.
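
The count of five can be checked by brute force. The sketch below assumes a "coherent completion" here means a weak order (ties allowed) over the three trajectories that keeps $u_{1}$ strictly above $u_{2}$, and it reads the second and fourth orderings in the list as containing a tie. The helper names are mine.

```python
from itertools import combinations

def weak_orders(items):
    """Yield every weak order on items as a tuple of tiers, best tier first.

    Items in the same tier (a frozenset) are tied.
    """
    items = list(items)
    if not items:
        yield ()
        return
    for size in range(1, len(items) + 1):
        for top in combinations(items, size):
            rest = [x for x in items if x not in top]
            for tail in weak_orders(rest):
                yield (frozenset(top),) + tail

def rank(order, item):
    """Index of the tier containing item (0 is most preferred)."""
    return next(i for i, tier in enumerate(order) if item in tier)

all_orders = list(weak_orders(["u1", "u2", "p"]))

# Keep only the orders that preserve the agent's strict preference for u1 over u2.
completions = [o for o in all_orders if rank(o, "u1") < rank(o, "u2")]

for order in completions:
    print(" > ".join(" ~ ".join(sorted(tier)) for tier in order))
```

Of the thirteen weak orders on three items, exactly five respect the strict preference, one for each position $p$ can take relative to $u_{1}$ and $u_{2}$.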

But you and Thornley want to interpret "no preference" as "just choose arbitrarily/randomly" rather than "don't trade either for the other", so... this agent just chooses arbitrarily/randomly between the best-available pressed-option and the best-available unpressed-option? But that would imply that the agent is choosing (albeit arbitrarily/randomly) between button-pressed and button-unpressed, which is not what we want, so presumably you're imagining something else?

Indeed, this wouldn't be good, and isn't what Thornley's proposal does. The agent doesn't choose arbitrarily between the best pressed vs unpressed options. Thornley's proposal adds more requirements on the agent to ensure this. My use of 'arbitrary' in the post is a bit misleading in that context. I'm only using it to identify when the agent has multiple permissible options available, which is what we're after to get TND. If no other requirements are added to the agent, and it's acting under certainty, this could well lead it to actually choose arbitrarily. But it doesn't have to in general, and under uncertainty and together with the rest of Thornley's requirements, it doesn't. (The requirements are described in his proposal.)

With that flagged, on to the trammelling issue. The potential problem is that the two subagents might want to trade, so that the system sometimes tries to make the button be pressed (in cases which get relatively high pressed-utility), and sometimes tries to make the button be unpressed (in cases which get relatively high unpressed-utility). (E.g. the pink arrows in the diagram.) And... locking in a trajectory at the start doesn't seem to help that problem at all? Like, the subagents do that trading in logical time (i.e. time zero), that adds preferences, and then sometimes they lock in a plan which involves manipulating the button.

I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness.

That said, I will spend some more time thinking about the subagent idea, and I do think collusion between them seems like the major initial hurdle for this approach to creating an agent with preferential gaps.
