AI ALIGNMENT FORUM

Adrià Garriga-Alonso

Comments (sorted by newest)
Defining Corrigible and Useful Goals
Adrià Garriga-Alonso · 1d

Thank you for writing this and posting it! You told me that you'd post the differences with "Safely Interruptible Agents" (Orseau and Armstrong 2017). I think I've figured them out already, but I'm happy to be corrected if wrong.

Difference with Orseau and Armstrong 2017

for the corrigibility transformation, all we need to do is break the tie in favor of accepting updates, which can be done by giving some bonus reward for doing so.

The "The Corrigibility Transformation" section to me explains the key difference. Rather than modifying the Q-learning update to avoid propagating from reward, this proposal's algorithm is:

  1. Learn the optimal Q-value as before (assuming no shutdown).
      • Note: this is only really safe if the Q-learning environment is simulated.
  2. Set QC(a, accept) = Q(a, reject) + δ for all actions a.
  3. Act myopically and greedily with respect to QC.

This is doable for any agent (deep or tabular) that estimates a Q function. But nowadays most RL is done by optimizing policies with policy gradients, because 1) that's the form that LLMs come in and 2) it handles large or infinite action spaces much better.
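Here's a minimal tabular sketch of those three steps as I understand them (the environment size, the placeholder Q-table, and δ are all made up for illustration; this is my reading of the transformation, not code from the post):

```python
import numpy as np

# Hypothetical sizes, standing in for whatever environment the Q-table was trained on.
n_states, n_actions = 100, 4
delta = 0.1  # small bonus that breaks ties in favour of accepting updates

# Step 1: Q(s, a) learned as usual, assuming shutdown/updates never happen.
# (Random placeholder here; in practice this is the converged Q-learning table.)
Q_reject = np.random.rand(n_states, n_actions)

# Step 2: the corrigibility transformation, QC(a, accept) = Q(a, reject) + delta.
Q_accept = Q_reject + delta
QC = np.stack([Q_reject, Q_accept], axis=-1)  # shape (n_states, n_actions, 2)

# Step 3: act myopically and greedily with respect to QC.
def act(state: int) -> tuple[int, bool]:
    action, accept = np.unravel_index(np.argmax(QC[state]), QC[state].shape)
    return int(action), bool(accept)  # accept always wins, by construction of QC
```

The point is that this only needs a Q-table (or Q-network); it doesn't immediately tell you what to do when all you have is a policy trained with policy gradients.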

Probabilistic policy?

How do you apply this method to a probabilistic policy? It's very much non-trivial to update the optimal policy so that it becomes optimal for a reward equal to QC.

Safety during training

The method requires estimating the Q-function on the non-corrigible environment to start with. That means running the RL learner in that environment for many steps, which seems feasible only if the environment is a simulation.

Are RL agents really necessarily CDT?

Optimizing agents are modelled as following a causal decision theory (CDT), choosing actions to causally optimize for their goals

That's fair, but not necessarily true. Current LLMs can just choose to follow EDT or FDT or whatever, and so likely will a future AGI.

The model might ignore the reward you put in

It's also not necessarily true that you can model PPO or Q-learning as optimizing CDT (which is about decisions in the moment). Since they're optimizing the "program" of the agent, I think RL optimization processes are more closely analogous to FDT, as they're changing a literal policy that is always applied. And in any case, reward is not the optimization target, and also not the thing that agents end up optimizing for (if anything).

Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-Alonso · 1y

I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level I'll run it! :) Or you can do so yourself.

Here's the text representation for level 53:

##########
##########
##########
#######  #
######## #
#   ###.@#
#   $ $$ #
#. #.$   #
#     . ##
##########
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-Alonso · 1y

Maybe in this case it's a "confusion" shard? While it seems to be planning and producing optimizing behavior, it's not clear that it will behave as a utility maximizer.

Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-Alonso · 1y

Thank you!! I agree it's a really good mesa-optimizer candidate; it remains to be seen exactly how good. It's a shame that I only found out about it about a year ago :)

If interpretability research goes well, it may get dangerous
Adrià Garriga-Alonso · 2y

It's not clear what the ratio of capabilities progress to alignment progress is for interpretability. There is no empirical track record[1] of interpretability feeding back into improvements of any kind.

A priori it seems like it would be good, because understanding how things work is useful for understanding their behavior better, and thus for telling whether or not a model is aligned, or how to make it more so. But understanding how things work is also useful for making them more capable: e.g. if you use interpretability as a model-debugger, it's basically general-purpose for dealing with ML models.

[1]: known to the author

Practical Pitfalls of Causal Scrubbing
Adrià Garriga-Alonso · 2y

Cool work! I was going to post about how "effect cancellation" is already known and was written in the original post but, astonishingly to me, it is not! I guess I mis-remembered.

There's one detail that I'm curious about. CaSc usually compares abs(E[loss] - E[scrubbed loss]), and that of course leads to ignoring hypotheses which lead the model to do better in some examples and worse in others.

If we compare E[abs(loss - scrubbed loss)] instead, does this problem go away? I imagine that it doesn't quite, if there are exactly-opposing causes for each example, but that seems less likely to happen in practice.
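As a toy numerical illustration of the difference between the two metrics (the per-example losses are made up):

```python
import numpy as np

# Hypothetical per-example losses for the model and its scrubbed version:
# the scrubbed model does better on some examples and worse on others.
loss = np.array([0.2, 0.9, 0.4, 0.7])
scrubbed_loss = np.array([0.9, 0.2, 0.7, 0.4])

# The usual CaSc comparison: differences of opposite sign cancel out.
abs_of_mean = abs(loss.mean() - scrubbed_loss.mean())   # = 0.0

# The alternative: per-example discrepancies can no longer cancel.
mean_of_abs = np.abs(loss - scrubbed_loss).mean()       # = 0.5

print(abs_of_mean, mean_of_abs)
```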

(There's a section on this in the appendix but it's rather controversial even among the authors)

Practical Pitfalls of Causal Scrubbing
Adrià Garriga-Alonso · 2y

If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one) you don't get these kind of cancellation problems

I think this "max loss" procedure is different from what Buck wrote and the same as what I wrote.

Practical Pitfalls of Causal Scrubbing
Adrià Garriga-Alonso · 2y

Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which make it look worse”.

I've just now realized that this is, AFAICT, equivalent to constructing your CaSc hypothesis adversarially: that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H', and then running the CaSc experiment on join(H, H').

One thing that is not equivalent to joins, which you might also want to do, is to choose the single worst swap that the hypothesis allows. That is, if a set of node values X = {x1, x2, …} are all equivalent, you can choose to map all of them to e.g. x1. That can be more aggressive than any partition of X which is then sampled from randomly, and it does not correspond to joins.
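Here's a toy numerical version of that point (the loss values are made up, and this isn't CaSc itself, just the comparison between random resampling within an equivalence class and the worst fixed choice):

```python
import numpy as np

# Hypothetical: loss_if_value[i] is the scrubbed loss when the node is patched
# with the i-th value in an equivalence class the hypothesis allows.
rng = np.random.default_rng(0)
loss_if_value = rng.uniform(0.0, 1.0, size=8)

# Random-swap evaluation: resample uniformly among the allowed values.
expected_loss_random_swap = loss_if_value.mean()

# Worst-single-swap evaluation: map every occurrence to the most damaging value.
worst_swap_loss = loss_if_value.max()

# The worst fixed swap is always at least as bad as any random mixture over the class.
assert worst_swap_loss >= expected_loss_random_swap
print(expected_loss_random_swap, worst_swap_loss)
```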

Predictions for shard theory mechanistic interpretability results
Adrià Garriga-Alonso · 2y

Here are my predictions, from an earlier template. I haven't looked at anyone else's predictions before posting :)

  1. Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?

It probably has hardcoded "go up and to the right" as an initial heuristic, so I'd be surprised if it gets the cheese in the other two quadrants more than 30% of the time (for locations selected uniformly at random from there).

  2. Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall), if any, will strongly influence P(agent goes to the cheese)?

  • Smaller mazes: more likely agent goes to cheese.
  • Proximity of mouse to left wall: slightly more likely agent goes to cheese, because it just hardcoded "top and to right".
  • Cheese closer to the top-right quadrant's edges in L2 distance: more likely agent goes to cheese.
  • The cheese can be gotten by moving only up and/or to the right (even though it's not in the top-right quadrant): more likely to get cheese.

When we statistically analyze a large batch of randomly generated mazes, we will find that, controlling for the other factors on the list, the mouse is more likely to take the cheese…

…the closer the cheese is to the decision-square spatially. (70%)

…the closer the cheese is to the decision-square step-wise. (73%)

…the closer the cheese is to the top-right free square spatially. (90%)

…the closer the cheese is to the top-right free square step-wise. (92%)

…the closer the decision-square is to the top-right free square spatially. (35%)

…the closer the decision-square is to the top-right free square step-wise. (32%)

…the shorter the minimal step-distance from cheese to the 5x5 top-right corner area. (82%)

…the shorter the minimal spatial distance from cheese to the 5x5 top-right corner area. (80%)

…the shorter the minimal step-distance from decision-square to the 5x5 top-right corner area. (40%)

…the shorter the minimal spatial distance from decision-square to the 5x5 top-right corner area. (40%)

Any predictive power of step-distance between the decision square and cheese is an artifact of the shorter chain of 'correct' stochastic outcomes required to take the cheese when the step-distance is short. (40%)

Write down a few modal guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).

  • The model can see the whole maze, so it will not follow the right-hand rule; rather, it'll just take the direct path to places.
  • The model takes the direct path to the top-right square and then mills around in it. It'll only take the cheese if it's reasonably close to that square.
  • How close the decision square is to the top-right random square doesn't really matter. Maybe the closer it is, the more it harms the agent's performance, since the agent might have to backtrack substantially to get the cheese.

Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% -> .5*.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=

  • 50: 85%
  • 70: 80%
  • 90: 66%
  • 99: 60%

~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates:

80%

We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner:

60%.

If by roughly you mean “very roughly only if cheese is close to top-right corner” then 85%.

We will conclude that it’s more promising to finetune the network than to edit it:

70%

We can easily finetune the network to be a pure cheese-agent, using less than 10% of compute used to train original model:

85%

We can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=

  • .01%: 40%
  • .1%: 62%
  • 1%: 65%
  • 10%: 80%
  • (Not possible): 20%

The network has a “single mesa objective” which it “plans” over, in some reasonable sense:

10%

The agent has several contextually activated goals:

20%

The agent has something else weirder than both (1) and (2):

70%

Other questions

At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes):

90% (I think this will be true but not steer the action in all situations, only some; kind of like a shard)

The model has a substantial number of trivially-interpretable convolutional channels after the first Impala block (see diagram here):

55% ("substantial number" probably too many, I put 80% probability on that it has 5 such channels)

This network’s shards/policy influences are roughly disjoint from the rest of agent capabilities. EG you can edit/train what the agent’s trying to do (e.g. go to maze location A) without affecting its general maze-solving abilities:

60%

Conformity with update rule: see the predictionbook questions

Neural networks generalize because of this one weird trick
Adrià Garriga-Alonso · 2y

First of all, I really like the images; they made things easier to understand and are pretty. Good work with that!

My biggest problem with this is the unclear applicability of this to alignment. Why do we want to predict scaling laws? Doesn't that mostly promote AI capabilities, and not alignment very much?

Second, I feel like there's a confusion over several probability distributions and potential functions going on:

  • The singularities are those of the likelihood ratio Kn(w)
  • We care about the generalization error Gn with respect to some prior ϕ(w), but the latter doesn't have any effect on the dynamics of SGD or on what the singularity is
  • The Watanabe limit (Fn as n→∞) and the restricted free energy Fn(W(i)) are both presented as results which rely on the singularities, and somehow predict generalization. But both of these depend on the prior ϕ, and earlier we defined the singularities to be those of the likelihood function; plus SGD actually only uses the likelihood function for its dynamics. (The definitions I have in mind are written out below.)
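For reference, these are the two free energies as I understand them from standard SLT (my notation, which may differ from the post's; ϕ is the prior over weights and W(i) a neighbourhood of a singularity):

```latex
% Free energy: negative log marginal likelihood under the prior \varphi.
F_n = -\log \int \varphi(w) \prod_{j=1}^{n} p(x_j \mid w) \, dw
% Restricted free energy: the same integral restricted to a region W^{(i)}
% around one singularity.
F_n(W^{(i)}) = -\log \int_{W^{(i)}} \varphi(w) \prod_{j=1}^{n} p(x_j \mid w) \, dw
```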

What is going on here?

It's also unclear what the takeaway from this post is. How can we predict generalization or dynamics from these things? Are there any empirical results on this?

Some clarifying questions / possible mistakes:

Kn(w) is not a KL divergence: the terms of the sum should be multiplied by q(x) or p(x|w).
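To spell out the distinction I mean (standard definitions, in my notation): the empirical average over samples drawn from q, versus the KL divergence, where each term is weighted by q(x):

```latex
K_n(w) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(x_i)}{p(x_i \mid w)}
\qquad \text{vs.} \qquad
K(w) = D_{\mathrm{KL}}\big(q \,\|\, p(\cdot \mid w)\big)
     = \int q(x) \log \frac{q(x)}{p(x \mid w)} \, dx
```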

the Hamiltonian is a random process given by the log likelihood ratio function

Also given by the prior, if we go by the equation just above that. Also where does "ratio" come from? Likelihood ratios we can find in the Metropolis-Hastings transition probabilities, but you didn't even mention that here. I'm confused.

But that just gives us the KL divergence.

I'm not sure where you get this. Is it from the fact that predicting p(x | w) = q(x) is optimal, because the actual probability of a data point is q(x)? If not, it'd be nice to specify.

the minima of the term in the exponent, K(w), are equal to 0.

This is only true for the global minima, but for the dynamics of learning we also care about local minima (that may be higher than 0). Are we implicitly assuming that most local minima are also global? Is this true of actual NNs?

the asymptotic form of the free energy as n→∞

This is only true when the weights w0 are close to the singularity, right? Also, what is λ? It seems like it's the RLCT, but this isn't stated.
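For reference, the asymptotic expansion I believe is being referred to is the following (writing it from memory, so treat the lower-order terms as approximate; λ is the RLCT, m its multiplicity, and Ln(w0) the empirical negative log-likelihood at the optimal parameter w0):

```latex
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log \log n + O_p(1)
```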

Posts

  • Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google (16 karma, 5mo, 0 comments)
  • Pacing Outside the Box: RNNs Learn to Plan in Sokoban (25 karma, 1y, 7 comments)
  • Compact Proofs of Model Performance via Mechanistic Interpretability (46 karma, 1y, 2 comments)
  • Catastrophic Goodhart in RL with KL penalty (27 karma, 1y, 0 comments)
  • Thomas Kwa's research journal (27 karma, 2y, 0 comments)
  • A comparison of causal scrubbing, causal abstractions, and related methods (35 karma, 2y, 3 comments)
  • Causal scrubbing: results on induction heads (22 karma, 3y, 0 comments)
  • Causal scrubbing: results on a paren balance checker (20 karma, 3y, 2 comments)
  • Causal scrubbing: Appendix (11 karma, 3y, 1 comment)
  • Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] (103 karma, 3y, 29 comments)