Repetitive Experimenter

MDP models are determined by the agent architecture and the environmental dynamics

I don’t think it’s a good use of time to get into this if you weren’t being specific about your usage of ‘model’ or the claim you made previously because I already pointed out a concrete difference: I claim it’s reasonable to say there are three alternatives while you claim there are two alternatives.

(If it helps you, you can search-replace model-irrelevant to state-abstraction because I don’t use the term model in my previous reply anyway.)

MDP models are determined by the agent architecture and the environmental dynamics

This was why I gave a precise definition of model-irrelevance. I'll step through your points using the definition,

  1. Consider the underlying environment (assumed Markovian)
  2. Consider different state/action encodings (model-irrelevant abstractions) we might supply the agent.
  3. For each, fix a reward function distribution
  4. See what the theory predicts

The problem I'm trying to highlight lies in point three. Each task is a reward function you could have the agent attempt to optimize. Every abstraction/encoding fixes a set of rewards under which the abstraction is model-irrelevant. This means the agent can successfully optimize these rewards.

[I]f you say "the MDP has a different model", you're either disagreeing with (1) the actual dynamics, or claiming that we will physically supply the agent with a different state/action encoding (2).

My claim is that there is a third alternative: you may claim that the reward function given to the agent does not satisfy model-irrelevance. This can be the case even if the underlying dynamics are Markovian and the abstraction of the transitions satisfies model-irrelevance.

I don't follow. Can you give a concrete example?

That may take a while. The argument above is a reasonable candidate for a lemma. A useful example would show that the third alternative exists. Do you agree this is the crux of your disagreement with my objection? If so, I might try to formalize it.

MDP models are determined by the agent architecture and the environmental dynamics

I still see room for reasonable objection.

An MDP model (technically, a rewardless MDP) is a tuple

I need to be pedantic. The equivocation here is where I think the problem is. To assign a reward function we need a map from the state-action space to the reals. It's not enough to just consider a 'rewardless MDP'.

When we define state and action encodings, this implicitly defines an "interface" between the agent and the environment.

As you note, the choice of state-action encoding is an implicit modeling assumption. It could be wrong, but to even discuss that we do have to be technical. To be concrete, perhaps we agree that there’s some underlying dynamics that is Markovian. The moment we give the agent sensors we create our state abstraction for the MDP. Moreover, say we agree that our state abstraction needs to be model-irrelevant.

Given a 'true' MDP $M = (S, A, P, R)$ and a state abstraction $\phi$ that operates on $S$, we'll say that $\phi$ is model-irrelevant if, for all $s_1, s_2 \in S$ where $\phi(s_1) = \phi(s_2)$, and for all $a \in A$ and abstract states $\bar{x}$, we have,

$$R(s_1, a) = R(s_2, a), \qquad \sum_{s' \in \phi^{-1}(\bar{x})} P(s' \mid s_1, a) = \sum_{s' \in \phi^{-1}(\bar{x})} P(s' \mid s_2, a).$$

Strictly speaking, model-irrelevance is at least as hard to satisfy for a collection of MDPs as for a single MDP. In other words, we may be able to properly model a single task with an MDP, but a priori there should be skepticism that all tasks can be modeled with a specific state-abstraction. Later on you seem to agree with this conclusion,

That's also a claim that we can, in theory, specify reward functions which distinguish between 5 googolplex variants of red-ghost-game-over. If that were true, then yes - optimal policies really would tend to "die" immediately, since they'd have so many choices.

Specifically, the agent architecture is an implicit constraint on available reward functions. I'd suspect this does generalize into a fragility/impossibility result any time the reward is given to the agent in a way that's decoupled from the agent's sensors, which is likely to be the prominent case in practice. In conclusion, you can try to work with a variable/rewardless MDP, but then this argument will apply and severely limit the usefulness of the generic theoretical analysis.
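To make the "third alternative" concrete, here is a minimal sketch of a model-irrelevance check. It assumes the standard condition: two states mapped to the same abstract state must have equal rewards and equal total transition probability into every abstract state, for every action. The four-state environment, the names `phi`, `P`, `R_ok`, `R_bad`, and the function `is_model_irrelevant` are all hypothetical and chosen only for illustration.

```python
from itertools import product


def is_model_irrelevant(states, actions, phi, P, R, tol=1e-9):
    """Check the model-irrelevance condition for abstraction phi.

    phi maps each state to an abstract state, P[(s, a)] is a dict of
    next-state probabilities, and R[(s, a)] is a scalar reward.
    """
    # Group ground states by their abstract state.
    blocks = {}
    for s in states:
        blocks.setdefault(phi[s], []).append(s)

    for members in blocks.values():
        for s1, s2 in product(members, repeat=2):
            for a in actions:
                # Rewards must agree within a block.
                if abs(R[(s1, a)] - R[(s2, a)]) > tol:
                    return False
                # Aggregated transition mass into each block must agree.
                for target in blocks.values():
                    p1 = sum(P[(s1, a)].get(t, 0.0) for t in target)
                    p2 = sum(P[(s2, a)].get(t, 0.0) for t in target)
                    if abs(p1 - p2) > tol:
                        return False
    return True


# Hypothetical environment: the sensors cannot tell 0 from 1, or 2 from 3.
states = [0, 1, 2, 3]
actions = ["stay", "move"]
phi = {0: "A", 1: "A", 2: "B", 3: "B"}

P = {(s, "stay"): {s: 1.0} for s in states}
P.update({(0, "move"): {2: 1.0}, (1, "move"): {3: 1.0},
          (2, "move"): {0: 1.0}, (3, "move"): {1: 1.0}})

# A reward that depends only on the abstract state.
R_ok = {(s, a): 1.0 if phi[s] == "B" else 0.0 for s in states for a in actions}
# A reward that singles out ground state 3, which the sensors cannot see.
R_bad = {(s, a): 1.0 if s == 3 else 0.0 for s in states for a in actions}

print(is_model_irrelevant(states, actions, phi, P, R_ok))
print(is_model_irrelevant(states, actions, phi, P, R_bad))
```

In this toy case the dynamics are Markovian and the transitions respect `phi`, yet `R_bad` still fails the check because it distinguishes states the encoding merges. That is the situation the third alternative describes, shown on one hand-built example rather than proved in general.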

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.

What I'm suggesting is that volume in high dimensions can concentrate on the boundary. To be clear, when I say SGD only typically reaches the boundary, I'm talking about early stopping and the main experimental setup in your paper where training is stopped upon reaching zero train error.

We have done overtraining, which should allow SGD to penetrate into the region. This doesn’t seem to make much difference for the probabilities we get.

This does seem to invalidate the model. However, something tells me that the difference here is more about degree. Since you use the word 'should' I'll use the wiggle room to propose an argument for what 'should' happen.

If SGD is run with early stopping, as described above, then my argument is that this is roughly equivalent to random sampling via an appeal to concentration of measure in high dimensions.

If SGD is not run with early stopping, it ends up inside the region enclosed by the boundary of zero-train-error functions. Because these functions most likely lie in the interior, they are unlikely to be produced by random sampling. Thus, on a log-log plot I'd expect overtraining to 'tilt' the correspondence between SGD and random sampling likelihoods downward.

Falsifiable Hypothesis: Compare SGD with overtraining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will become less likely under SGD with overtraining.
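The appeal to concentration of measure above can be illustrated with a rough sketch. It assumes, purely for illustration, that the region of interest is a unit ball in parameter space; the real zero-train-error region need not look like this. Under that assumption the fraction of volume within relative distance `eps` of the surface is `1 - (1 - eps) ** dim`. The function name `shell_fraction` is hypothetical.

```python
def shell_fraction(dim, eps):
    """Fraction of a unit ball's volume lying within eps of its surface.

    Volume scales as radius ** dim, so the inner ball of radius (1 - eps)
    holds a (1 - eps) ** dim share and the outer shell holds the rest.
    """
    return 1.0 - (1.0 - eps) ** dim


# How the outer 1% shell's share of the volume grows with dimension.
for dim in (2, 10, 100, 1000, 10000):
    print(dim, shell_fraction(dim, 0.01))
```

In low dimensions almost none of the volume sits in the thin outer shell, while in high dimensions almost all of it does. So a uniformly random sample would be expected to land near the boundary, and an optimizer pushed into the interior would not.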

Developmental Stages of GPTs

The paper doesn't draw the causal diagram "Power → instrumental convergence", it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions.

This definitely feels like the place where I'm missing something. What is the formal definition of 'power seeking'? My understanding is that power is the rescaled value function, that it is decreasing in the limit of farsightedness, and that in the context of terminal state reachability it always goes to zero. The agent literally gives up power to achieve its goal.

Now, I realize this might just be naming convention confusion. I do, I think, understand the idea that preserving cycle reachability could be instrumental. However,

Cycle reachability preservation is one of those conditions.

this seems circular to me. My understanding of figure 7 of your paper indicates that cycle reachability cannot be a sufficient condition.

You can formalize a kind of "alignment capability" by introducing a joint distribution over the human's goals and the induced agent goals

This is very interesting to me. Thank you for sharing. I wonder what you mean by,

The point isn't that alignment is impossible, but that you have to hit a low-measure set of goals which will give you aligned or non-power-seeking behavior.

Given your definitions it's clear that the set of aligned goals must be low-measure. Also, by your reasoning, 'non-power-seeking behavior' is not instrumental. However, in a curriculum, power-seeking must be instrumental or else the agent is less likely to achieve its goals. It seems there's a two-out-of-three condition (aligned/general/non-power-seeking) here. My philosophy is that aligned/general is OK based on a shared (?) premise that,

If the rewards are $\epsilon$-close in sup-norm, then you can get nice regret bounds, sure.
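For reference, here is a sketch of the kind of regret bound the quoted premise points at. It assumes the discounted setting with discount $\gamma \in [0, 1)$. The symbols $R$, $R'$, $\pi^*$, and $\pi'$ are introduced here for illustration and are not taken from the thread.

```latex
% Assumption: \|R - R'\|_\infty \le \epsilon and discount \gamma \in [0, 1).
% Step 1: values under the two rewards differ by at most \epsilon / (1 - \gamma).
\left| V^{\pi}_{R}(s) - V^{\pi}_{R'}(s) \right|
  \le \sum_{t=0}^{\infty} \gamma^{t} \epsilon
  = \frac{\epsilon}{1 - \gamma}
  \quad \text{for every policy } \pi \text{ and state } s.

% Step 2: let \pi^* be optimal for R and \pi' be optimal for R'.
V^{\pi^*}_{R}(s) - V^{\pi'}_{R}(s)
  = \underbrace{V^{\pi^*}_{R}(s) - V^{\pi^*}_{R'}(s)}_{\le\, \epsilon / (1 - \gamma)}
  + \underbrace{V^{\pi^*}_{R'}(s) - V^{\pi'}_{R'}(s)}_{\le\, 0}
  + \underbrace{V^{\pi'}_{R'}(s) - V^{\pi'}_{R}(s)}_{\le\, \epsilon / (1 - \gamma)}
  \le \frac{2 \epsilon}{1 - \gamma}.
```

So an agent that optimizes the induced reward $R'$ perfectly loses at most $2\epsilon / (1 - \gamma)$ on the intended reward $R$. The bound becomes loose as $\gamma \to 1$.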

Developmental Stages of GPTs

Thanks for the comment! I think max-ent brings up a related point. In IRL we observe behavior and infer a reward function (using max-ent also?). Ultimately, there is a relationship between state/action frequency and reward. This would considerably constrain the distribution of reward functions to be considered in instrumental/power analysis.

I think I get confused about the usage of power the most. It seems like you can argue that given a random reward to optimize the agent will try to avoid getting turned off without invoking power. If there's a collection of 'turned-off' terminal states where the agent receives no further reward for all time then every optimized policy will try to avoid such a state. It seems as though we could define for each and then we'd have,

It seems like this would extend out to a full definition. The advantage here being that you can say, “If one action in this state is more instrumental than another then the return is likely to be greater as well”.

I imagine that this is sufficient for the catastrophic power-stealing incentives

I'm not confident analysis in the single-agent case extends to the multi-agent setting. If our goal is fixed as and the agent's varies then I might argue it's instrumental for us to align the agent's goal with ours and vice versa. In general, I'd suspect that there are goals we could give the agent that significantly reduce our gain. However, I'd also suspect the opposite.

Say we have the capability to introduce a second agent with a reward . Would we want to introduce the agent? It seems reasonable to argue that we would if we could guarantee . There might be a way to argue over randomness and say this would double our gain. More speculatively, what if ?

Developmental Stages of GPTs

I think this is a slight misunderstanding of the theory in the paper.

I disagree. What I'm trying to do is outline a reinterpretation of the 'power seeking' claim. I'm citing the pre-task section and theorem 17 to insist that power-seeking can only really happen in the pre-task because,

The way the theory does this is by saying that first a reward function is drawn from the distribution, then it is given to the agent, then the agent thinks really hard, and then the agent executes the optimal policy.

The agent is done optimizing before the main portion of the paper even begins. I do not see how the agent 'seeks' out powerful states because, as you say, the agent is fixed. Now, when you say,

If we do not know an agent's goal, but we know that the agent knows its goal and is optimal w.r.t it, then from our perspective the agent is more likely to go to higher-power states. (From the agent's perspective, there is no probability, it always executes the deterministic perfect policy for its reward function.)

My issue is that Figure 19 shows an example where the agent doesn't display this behavior. Tautologically, the agent tends to do what is instrumentally convergent. If power were tied to instrumental convergence then we could also say the agent tends to do what is powerful. However, it seems as though a state can be arbitrarily powerful without having the instrumental property, which breaks the analogy.

From here I could launch a counter-argument: if power can be arbitrarily removed from the instrumental convergence phenomenon, then agent 'wireheading', while a powerful state, is sufficiently out of the way from most goals that the agent most likely won't reach it. To be clear, I don't have any strong opinions, I'm just confused about these interpretive details.

Developmental Stages of GPTs

I appreciate the more concrete definition of IC presented here. However, I have an interpretation that is a bit different from yours. I'm following the formal presentation.

My base understanding is that a cycle with max average reward is optimal. This is essentially just a definition. In the case the agent doesn't know the reward function, it seems clear that the agent ought to position itself in a state which gives it access to as many of these cycles as possible.

In your paper, theorem 19 suggests that given a choice between two sets of 1-cycles and the agent is more likely to select the larger set. This makes sense. What doesn't make sense is the conclusion (theorem 17) that the agent selects states with more power. This is because at the very start of the paper it's mentioned that,

As an alternative motivation, consider an agent in a communicating MDP which is periodically assigned a task from a known distribution . Between tasks, the agent has as much downtime as required. To maximize return over time, the optimal policy during downtime is to navigate to the state with maximal .

According to theorem 17, losing access to states means that power goes down (or stays constant). This seems to indicate power (cycle access) is really some sort of Lyapunov function for the dynamics. So at the outset, it seems clear that the agent will prefer states that maximize power, but then as soon as a determination is made on what the actual reward function is, power goes down, not up.

What I'm trying to point out here is that I find the distinction between pre-task optimization and execution to be loose. This is to such a degree that I find myself drawing the exact opposite conclusion: agents optimizing a generic reward will tend to give up power.

At the moment, I find myself agreeing with the idea that an agent unaware of its task will seek power, but also conclude that an agent aware of its task will give up power. My current opinion is that power-seeking behavior is concentrated in the pre-task step. Giving the AI unrestricted 'free-time' to optimize with should 'never' be allowed. Now, I could be misunderstanding parts of the paper, but hopefully I've made things clear enough!
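The "cycle access can only shrink" reading above can be sketched on a toy deterministic graph. This uses the number of reachable 1-cycles (self-loops) as a crude stand-in for power. That is an assumption made for illustration and not the paper's definition. The graph `succ` and the helper names are hypothetical, and the graph is deliberately not communicating, so that access can actually be lost.

```python
def reachable(succ, start):
    """Return every state reachable from start (including start itself)."""
    seen = {start}
    stack = [start]
    while stack:
        s = stack.pop()
        for t in succ[s]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def one_cycles(succ, start):
    """Return the reachable states that have a self-loop (1-cycles)."""
    return {s for s in reachable(succ, start) if s in succ[s]}


# Hypothetical deterministic environment: succ[s] lists the next states.
succ = {
    "hub": ["left", "right"],
    "left": ["a"],
    "right": ["b", "c"],
    "a": ["a"],
    "b": ["b"],
    "c": ["c"],
}

# Cycle access never grows along a transition: it acts like a Lyapunov function.
for s, nexts in succ.items():
    for t in nexts:
        assert one_cycles(succ, t) <= one_cycles(succ, s)

for s in succ:
    print(s, sorted(one_cycles(succ, s)))
```

Under this proxy the starting state `hub` has the most access, and every step taken toward a particular cycle either keeps or reduces it. That matches the claim that an agent executing a known task gives up access over time. Whether the same holds for the paper's formal power measure is exactly the interpretive question raised here.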