axioman - AI Alignment Forum

Avoiding Side Effects in Complex Environments

I'm curious whether AUP or the autoencoder/random projection does more work here. Did you test how well AUP and AUP_proj perform with a discount factor of 0 for the AUP Q-functions?
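To make the proposed ablation concrete, here is a minimal tabular sketch of what a discount factor of 0 does to an auxiliary Q-function. Everything in it is an assumption for illustration: the toy deterministic MDP, the names `q_iteration` and `aup_penalty`, the choice of action 0 as the no-op, and the penalty form |Q_R(s, a) - Q_R(s, no-op)| for a single auxiliary reward. It is not the setup from the post.

```python
import random


def q_iteration(rewards, transitions, gamma, sweeps=200):
    """Tabular Q-iteration for a small deterministic MDP (hypothetical toy setup).

    rewards[s][a] is the immediate auxiliary reward and
    transitions[s][a] is the index of the next state.
    """
    n_states = len(rewards)
    n_actions = len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        # Bellman optimality backup applied to every state-action pair.
        q = [
            [rewards[s][a] + gamma * max(q[transitions[s][a]])
             for a in range(n_actions)]
            for s in range(n_states)
        ]
    return q


def aup_penalty(q, state, action, noop=0):
    """Assumed penalty form: |Q_R(s, a) - Q_R(s, no-op)| for one auxiliary reward."""
    return abs(q[state][action] - q[state][noop])


rng = random.Random(0)
n_states, n_actions = 4, 3
rewards = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
transitions = [[rng.randrange(n_states) for _ in range(n_actions)]
               for _ in range(n_states)]

# With gamma = 0 the backup drops the future term, so Q equals the immediate reward.
q0 = q_iteration(rewards, transitions, gamma=0.0)

# With gamma > 0 the Q-values also reflect what remains attainable later on.
q_long = q_iteration(rewards, transitions, gamma=0.9)
```

Under these assumptions, the penalty at a discount factor of 0 only compares immediate auxiliary rewards, so any remaining benefit in that ablation would have to come from the learned representation rather than from long-horizon attainable value.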

Do mesa-optimizer risk arguments rely on the train-test paradigm?

axioman4y30

You don't even need a catastrophe in any global sense. Disrupting the training procedure at step t should be sufficient.

AI Unsafety via Non-Zero-Sum Debate

axioman4y10

"My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which."

Interesting. Do you have some examples of types of questions you expect to be safe, or potential features of safe questions? Is it mostly about the downstream consequences that answers would have, or more about instrumental goals that the questions induce for debaters?

Tradeoff between desirable properties for baseline choices in impact measures

axioman4y20

I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.

I am a bit confused about the section on the Markov property: I was imagining that the reason you want the property is to make applying standard RL techniques more straightforward (or to avoid making already existing partial observability more complicated). However, if I understand correctly, the second modification has the (expectation of the) penalty as a function of the complete agent policy, and I don't really see how that would help. Is there another reason to want the Markov property, or am I missing some way in which the modification would simplify applying RL methods?

Attainable Utility Preservation: Scaling to Superhuman

axioman5y10

I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state $s_{t + 1}$ where

$$Q_{R}(s_{t+1}, \emptyset) = V_{R}(\pi^{*}, s_{t+1})$$

for all auxiliary rewards $R$, where $\pi^{*}$ is the optimal policy according to the main reward, while making sure that there exists an action $a_{R}$ such that

$$R(t) + \gamma Q_{R}(s_{t+1}, a_{R}) \approx Q_{R}(s_{t}, \emptyset)$$

for every $R$. So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the subagent random orders at $t + 1$.

Probably, there is a way to circumvent this again, though? Also, I am unsure about the other properties of randomized baselines.

How Low Should Fruit Hang Before We Pick It?

axioman5y00

Where does

$$u(\bar{a}') - \frac{I(\bar{a}')}{R_{1}} = u(\bar{a}) - \frac{I(\bar{a})}{R_{2}}$$

come from?

Also, the equation seems to imply $R_{1} = R_{2}$.

Edit: I focused too much on what I suppose is a typo. Clearly you can just rewrite the first and last equality as equality of an affine linear function $\mathbb{R} \to \mathbb{R}$ at two points, which gives you equality everywhere.
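The last step can be spelled out as follows. This is only a sketch in generic notation: the symbols $h$, $\alpha$, $\beta$, $x_{1}$, $x_{2}$ are introduced here for illustration and do not appear in the post, with $h$ standing for the difference of the two sides viewed as an affine function of the free variable.

```latex
% Assumption: h is affine and vanishes at two distinct points x_1 \neq x_2.
\begin{align*}
h(x) &= \alpha x + \beta, \qquad h(x_{1}) = h(x_{2}) = 0, \qquad x_{1} \neq x_{2} \\
0 &= h(x_{1}) - h(x_{2}) = \alpha (x_{1} - x_{2}) \;\Longrightarrow\; \alpha = 0 \\
0 &= h(x_{1}) = \beta \;\Longrightarrow\; h(x) = 0 \text{ for all } x \in \mathbb{R}
\end{align*}
```

So agreement at two distinct points already forces agreement everywhere, provided both sides really are affine in the same variable.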

How Low Should Fruit Hang Before We Pick It?

axioman5y10

I do not understand your proof for proposition 2.
