MDP models are determined by the agent architecture and the environmental dynamics

This was why I gave a precise definition of model-irrelevance. I'll step through your points using the definition,

- Consider the underlying environment (assumed Markovian)
- Consider different state/action encodings (**model-irrelevant abstractions**) we might supply the agent.
- For each, fix a reward function distribution
- See what the theory predicts

The problem I'm trying to highlight lies in point three. Each task is a reward function you could have the agent attempt to optimize. Every abstraction/encoding fixes a set of rewards under which the abstraction is mo... (read more)

81y

I read your formalism, but I didn't understand what prompted you to write it. I
don't yet see the connection to my claims.
Yeah, I don't want you to spend too much time on a bulletproof grounding of your
argument, because I'm not yet convinced we're talking about the same thing.
In particular, if the argument's like, "we usually express reward functions in
some featurized or abstracted way, and it's not clear how the abstraction will
interact with your theorems" / "we often use different abstractions to express
different task objectives", then that's something I've been thinking about but
not what I'm covering here. I'm not considering practical expressibility issues
over the encoded MDP: ("That's also a claim that we can, in theory, specify
reward functions which distinguish between 5 googolplex variants of
red-ghost-game-over.")
If this doesn't answer your objection - can you give me an English description
of a situation where the objection holds? (Let's taboo 'model', because it's
overloaded in this context)

MDP models are determined by the agent architecture and the environmental dynamics

I still see room for reasonable objection.

An MDP model (technically, a rewardless MDP) is a tuple

I need to be pedantic. The equivocation here is where I think the problem is. To assign a reward function we need a map from the state-action space to the reals. It's not enough to just consider a 'rewardless MDP'.

When we define state and action encodings, this implicitly defines an "interface" between the agent and the environment.

As you note, the choice of state-action encoding is an implicit modeling assumption. It could be wrong, but to even... (read more)

81y

Why would we need that, and what is the motivation for "models"? The moment we
give the agent sensors and actions, we're done specifying the rewardless MDP
(and its model).
ETA: potential confusion - in some MDP theory, the “model” is a model of the
environment dynamics. E.g., in deterministic environments, the model is shown with
a directed graph. I don’t use “model” to refer to an agent’s world model over
which it may have an objective function. I should have chosen a better word, or
clarified the distinction.
If, by "tasks", you mean "different agent deployment scenarios" - I'm not
claiming that. I'm saying that if we want to predict what happens, we:
1. Consider the underlying environment (assumed Markovian)
2. Consider different state/action encodings we might supply the agent.
3. For each, fix a reward function distribution (what goals we expect to assign
to the agent)
4. See what the theory predicts.
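As a minimal sketch of steps 1-4 (not the paper's formalism): the five-state environment, the two encodings, the uniform reward distribution, the discount of 0.9, and all function names below are hypothetical choices made purely for illustration.

```python
import random

# Step 1: hypothetical underlying dynamics (deterministic, Markovian).
# Each state maps action names to next states; terminal states self-loop.
FULL_ENCODING = {
    "start": {"left": "A", "right": "hub"},
    "hub": {"to_b": "B", "to_c": "C"},
    "A": {"stay": "A"},
    "B": {"stay": "B"},
    "C": {"stay": "C"},
}

# Step 2: a second encoding we might supply instead. Same dynamics, but the
# agent's action interface at "hub" no longer exposes "to_c".
RESTRICTED_ENCODING = dict(FULL_ENCODING, hub={"to_b": "B"})


def optimal_first_action(encoding, reward, gamma=0.9, sweeps=100):
    """Run value iteration with state-based reward; return the best action at "start"."""
    values = {state: 0.0 for state in encoding}
    for _ in range(sweeps):
        values = {
            state: reward[state] + gamma * max(values[nxt] for nxt in actions.values())
            for state, actions in encoding.items()
        }
    start_actions = encoding["start"]
    return max(start_actions, key=lambda action: values[start_actions[action]])


def fraction_going_right(encoding, num_samples=1000, seed=0):
    """Steps 3 and 4: sample rewards IID uniform on [0, 1] per state, then count choices."""
    rng = random.Random(seed)
    right = 0
    for _ in range(num_samples):
        reward = {state: rng.random() for state in encoding}
        if optimal_first_action(encoding, reward) == "right":
            right += 1
    return right / num_samples


if __name__ == "__main__":
    print("full encoding:      ", fraction_going_right(FULL_ENCODING))
    print("restricted encoding:", fraction_going_right(RESTRICTED_ENCODING))
```

In this toy, the restricted interface removes one of the options behind "right", so the two fractions need not agree. How much the encoding matters in realistic settings is a separate question that this sketch does not address.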
There's a further claim (which seems plausible, but which I'm not yet making)
that (2) won't affect (4) very much in practice. The point of this post is that
if you say "the MDP has a different model", you're either disagreeing with (1)
the actual dynamics, or claiming that we will physically supply the agent with a
different state/action encoding (2).
I don't follow. Can you give a concrete example?

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.

What I'm suggesting is that volume in high-dimensions can concentrate on the boundary. To be clear, when I say SGD only typically reaches the boundary, I'm talking about early stopping and the main experimental setup in your paper where training is stopped upon reaching zero train error.

... (read more)

We have done overtraining, which should allow SGD to

21y

Yes. I imagine this is why overtraining doesn't make a huge difference.
See e.g., page 47 in the main paper [https://arxiv.org/pdf/2006.15191.pdf].

Developmental Stages of GPTs

The paper doesn't draw the causal diagram "Power → instrumental convergence", it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions.

This definitely feels like the place where I'm missing something. What is the formal definition of 'power seeking'? My understanding is that power is the rescaled value function, that it is decreasing in the limit of farsightedness, and that in the context of terminal state reachability it always goes to zero. The agent literally *gives up* power to achiev... (read more)

21y

The freshly updated paper [https://arxiv.org/pdf/1912.01683.pdf] answers this
question in great detail; see section 6 and also appendix B.

12y

Great question. One thing you could say is that an action is power-seeking
compared to another, if your expected (non-dominated subgraph; see Figure 19)
power is greater for that action than for the other.
Power is kinda weird when defined for optimal agents, as you say - when γ=1,
POWER can only decrease. See Power as Easily Exploitable Opportunities
[https://www.lesswrong.com/posts/eqov4SEYEbeFMXegR/power-as-easily-exploitable-opportunities]
for more on this.
Shortly after Theorem 19, the paper says: "In appendix C.6.2, we extend this
reasoning to k-cycles (k > 1) via theorem 53 and explain how theorem 19 correctly
handles fig. 7". In particular, see Figure 19.
The key insight is that Theorem 19 talks about how many agents end up in a set
of terminal states, not how many go through a state to get there. If you have
two states with disjoint reachable terminal state sets, you can reason about the
phenomenon pretty easily. Practically speaking, this should often suffice: for
example, the off-switch state is disjoint from everything else.
If not, you can sometimes consider the non-dominated subgraph in order to regain
disjointness. This isn't in the main part of the paper, but basically you toss
out transitions which aren't part of a trajectory which is strictly optimal for
some reward function. Figure 19 gives an example of this.
The main idea, though, is that you're reasoning about what the agent's end goals
tend to be, and then say "it's going to pursue some way of getting there with
much higher probability, compared to this small set of terminal states (i.e.,
shutdown)". Theorem 17 tells us that in the limit, cycle reachability totally
controls POWER.
I think I still haven't clearly communicated all my mental models here, but I
figured I'd write a reply now while I update the paper.
Thank you for these comments, by the way. You're pointing out important
underspecifications. :)
I think one problem is that power-seeking agents are generally not that
corrigible, w

Developmental Stages of GPTs

Thanks for the comment! I think max-ent brings up a related point. In IRL we observed behavior and infer a reward function (using max-ent also?). Ultimately, there is a relationship between state/action frequency and reward. This would considerably constrain the distribution of reward functions to be considered in instrumental/power analysis.

I think I get confused about the usage of power the most. It seems like you can argue that given a random reward to optimize the agent will try to avoid getting turned off without invoking power. If there's a collectio... (read more)

22y

To clarify, I don't assume that. The terminal states, even those representing
the off-switch, also have their reward drawn from the same distribution. When
you distribute reward IID over states, the off-state is in fact optimal for some
low-measure subset of reward functions.
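A rough Monte Carlo sketch of that last point, under assumptions that are mine rather than the paper's: a star-shaped environment where the off-state and every other terminal state are each one step from the start, rewards drawn IID uniform on [0, 1], and an agent farsighted enough that only the terminal state's reward matters. The function name and parameters are hypothetical.

```python
import random


def fraction_preferring_off(num_other_terminals=9, num_samples=20000, seed=0):
    """Estimate how often the off-state is the optimal destination.

    Assumes a farsighted agent that simply goes to the terminal state with
    the highest reward, with all terminal rewards drawn IID uniform on [0, 1].
    """
    rng = random.Random(seed)
    off_is_best = 0
    for _ in range(num_samples):
        off_reward = rng.random()
        other_rewards = [rng.random() for _ in range(num_other_terminals)]
        if off_reward > max(other_rewards):
            off_is_best += 1
    return off_is_best / num_samples


if __name__ == "__main__":
    for n in (1, 9, 99):
        print(n, fraction_preferring_off(num_other_terminals=n))
```

Because the rewards are IID and continuous, each of the n + 1 terminal states is equally likely to be the best one, so the estimate should sit near 1 / (n + 1): nonzero, but shrinking as the environment offers more alternatives to shutdown.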
But, maybe you're saying "for realistic distributions, the agent won't get any
reward for being shut off and therefore π∗ won't ever let itself be shut off". I
agree, and this kind of reasoning is captured by Theorem 3 of Generalizing the
Power-Seeking Theorems
[https://www.lesswrong.com/posts/nyDnLif4cjeRe9DSv/generalizing-the-power-seeking-theorems#Reward_distribution_generalization]
. The problem is that this is just a narrow example of the more general
phenomenon. What if we add transient "obedience" rewards, what then? For some
level of farsightedness (γ close enough to 1), the agent will still disobey, and
simultaneously disobedience gives it more control over the future.
The paper doesn't draw the causal diagram "Power → instrumental convergence", it
gives sufficient conditions for power-seeking being instrumentally convergent.
Cycle reachability preservation is one of those conditions.
Yes, right. The point isn't that alignment is impossible, but that you have to
hit a low-measure set of goals which will give you aligned or non-power-seeking
behavior. The paper helps motivate why alignment is generically hard and
catastrophic if you fail.
Yes, if r=h, introduce the agent. You can formalize a kind of "alignment
capability" by introducing a joint distribution over the human's goals and the
induced agent goals (preliminary Overleaf notes
[https://www.overleaf.com/read/jjfmwmymtvtg]). So, if we had goal X, we'd
implement an agent with goal X', and so on. You then take our expected optimal
value under this distribution and find whether you're good at alignment, or
whether you're bad and you'll build agents whose optimal policies tend to
obstruct you.
The doubling depends on the environment st

Developmental Stages of GPTs

I think this is a slight misunderstanding of the theory in the paper.

I disagree. What I'm trying to do is outline a reinterpretation of the 'power seeking' claim. I'm citing the pre-task section and theorem 17 to insist that power-seeking can only really happen in the pre-task because,

The way the theory does this is by saying that first a reward function is drawn from the distribution, then it is given to the agent, then the agent thinks really hard, and then the agent executes the optimal policy.

The agent is done optimizing before the main portion ... (read more)

32y

I do think this is mostly a matter of translation of math to English being hard.
Like, when Alex says "optimal agents seek power", I think you should translate
it as "when we don't know what goal an optimal agent has, we should assign
higher probability that it will go to states that have higher power", even
though the agent itself is not thinking "ah, this state is powerful, I'll go
there".
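A small sketch of that translation, under assumptions of my own: a hypothetical deterministic environment where "wide" leads to three terminal states and "narrow" to one, rewards drawn IID uniform on [0, 1], and a discount of 0.9. The average normalized optimal value below is only a crude stand-in for power, not the paper's POWER definition.

```python
import random

# Hypothetical deterministic environment: each state lists its successors.
# "narrow" leads to one terminal state, "wide" leads to three.
NEXT = {
    "start": ["narrow", "wide"],
    "narrow": ["n1"],
    "wide": ["w1", "w2", "w3"],
    "n1": ["n1"],
    "w1": ["w1"],
    "w2": ["w2"],
    "w3": ["w3"],
}


def optimal_values(reward, gamma=0.9, sweeps=100):
    """The agent's planning step: value iteration for one known reward function."""
    values = {state: 0.0 for state in NEXT}
    for _ in range(sweeps):
        values = {
            state: reward[state] + gamma * max(values[nxt] for nxt in NEXT[state])
            for state in NEXT
        }
    return values


def observer_summary(num_goals=1000, seed=0, gamma=0.9):
    """Our perspective: average over goals we are uncertain about."""
    rng = random.Random(seed)
    went_wide = 0
    avg_value = {"narrow": 0.0, "wide": 0.0}
    for _ in range(num_goals):
        reward = {state: rng.random() for state in NEXT}
        values = optimal_values(reward, gamma)
        # The agent compares values under its own reward; nothing here mentions power.
        if values["wide"] > values["narrow"]:
            went_wide += 1
        for state in avg_value:
            avg_value[state] += (1 - gamma) * values[state] / num_goals
    return went_wide / num_goals, avg_value


if __name__ == "__main__":
    print(observer_summary())
```

The planning function only ever sees one concrete reward function. The distribution, the visit frequency, and the power-like average all live in `observer_summary`, which is the outside view.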

Developmental Stages of GPTs

I appreciate the more concrete definition of IC presented here. However, I have an interpretation that is a bit different from yours. I'm following the formal presentation.

My base understanding is that a cycle with max average reward is optimal. This is essentially just a definition. In the case where the agent doesn't know the reward function, it seems clear that the agent ought to position itself in a state which gives it access to as many of these cycles as possible.

In your paper, theorem 19 suggests that given a choice between two sets of 1-cycles and ... (read more)

12y

Great observation. Similarly, a hypothesis called "Maximum Causal Entropy" once
claimed that physical systems involving intelligent actors tended towards
states where the future could be specialized towards many different final
states, and that maybe this was even part of what intelligence was. However,
people objected: (monogamous) individuals don't perpetually maximize their
potential partners -- they actually pick a partner, eventually.
My position on the issue is: most agents steer towards states which afford them
greater power, and sometimes most agents give up that power to achieve their
specialized goals. The point, however, is that they end up in the high-power
states at some point in time along their optimal trajectory. I imagine that this
is sufficient for the catastrophic power-stealing incentives: the AI only has to
disempower us once for things to go irreversibly wrong.

32y

I think this is a slight misunderstanding of the theory in the paper. I'd
translate the theory of the paper to English as:
Any time the paper talks about "distributions" over reward functions, it's
talking from our perspective. The way the theory does this is by saying that
first a reward function is drawn from the distribution, then it is given to the
agent, then the agent thinks really hard, and then the agent executes the
optimal policy. All of the theoretical analysis in the paper is done "before"
the reward function is drawn, but there is no step where the agent is doing
optimization but doesn't know its reward.
I'd rewrite this as:

I don’t think it’s a good use of time to get into this if you weren’t being specific about your usage of ‘model’ or about the claim you made previously, because I already pointed out a concrete difference: I claim it’s reasonable to say there are three alternatives, while you claim there are two.

(If it helps you, you can search-replace ‘model-irrelevant’ with ‘state-abstraction’, because I don’t use the term ‘model’ in my previous reply anyway.)