Thank you for writing this up! I'm still not sure I understand condensation. I would summarize it as: instead of encoding the givens, we encode some latents which can be used to compute the set of possible answers to the givens (so we need a distribution over questions).
Also, the total cost of condensation has to be at least the entropy of the answer distribution (generated by the probability distribution over questions, applied to the givens), because of Shannon's bound.
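To make sure I have the bound right, here's how I'd write it (my notation, not the post's): let Q be the question distribution, G the givens, and A the answer they induce; then Shannon's source-coding bound says the expected cost can't beat the entropy of A:

```latex
% My notation, not the post's: Q is the question distribution, G the givens,
% and A = f(Q, G) the induced answer distribution.
\mathbb{E}[\text{cost of condensation}] \;\ge\; H(A) = H\bigl(f(Q, G)\bigr)
```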
I feel like if the optimal condensation setup is indeed 1 book per question, then it's not a very good model of latent variables, no? But perhaps it's going in the right direction.
Thank you for writing this and posting it! You told me that you'd post the differences with "Safely Interruptible Agents" (Orseau and Armstrong 2017). I think I've figured them out already, but I'm happy to be corrected if wrong.
The "The Corrigibility Transformation" section explains the key difference to me. Rather than modifying the Q-learning update to avoid propagating from the reward, this proposal's algorithm is:

> for the corrigibility transformation, all we need to do is break the tie in favor of accepting updates, which can be done by giving some bonus reward for doing so.
This is doable for any agent (deep or tabular) that estimates a Q-function. But nowadays all RL is done by optimizing policies with policy gradients, because 1) that's the form that LLMs come in, and 2) it handles large or infinite action spaces much better.
How do you apply this method to a probabilistic policy? It's very much non-trivial to update the policy to be optimal for a reward equal to a Q-value estimate.
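To illustrate why I think this is non-trivial, a toy contrast (my own names, not from the post): for a value-based agent a tiny bonus can break an argmax tie, but a policy-gradient agent samples from a distribution, so there is no tie to break and a small bonus reward only shifts the policy after further gradient updates.

```python
import numpy as np

def q_based_action(q_values, accept_idx, bonus=1e-3):
    # Value-based agent: action selection is a hard argmax over Q-values,
    # so a tiny bonus on the accept-update action is enough to break ties.
    q = np.array(q_values, dtype=float)
    q[accept_idx] += bonus
    return int(np.argmax(q))

def policy_gradient_action(logits, rng=None):
    # Policy-gradient agent: actions are sampled from a softmax distribution.
    # There is no tie to break; a bonus reward only moves the logits slowly,
    # through further training.
    rng = np.random.default_rng() if rng is None else rng
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```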
The method requires estimating the Q-function on the non-corrigible environment to start with. This requires running the RL learner in that environment for many steps, which seems feasible only if it's a simulation.
> Optimizing agents are modelled as following a causal decision theory (CDT), choosing actions to causally optimize for their goals
That's fair, but not necessarily true. Current LLMs can just choose to follow EDT or FDT or whatever, and so likely will a future AGI.
It's also not necessarily true that you can model PPO or Q-learning as optimizing CDT (which is about decisions in the moment). Since they're optimizing the "program" of the agent, I think RL optimization processes are more closely analogous to FDT as they're changing a literal policy that is always applied. And in any case, reward is not the optimization target, and also not the thing that agents end up optimizing for (if anything).
I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level I'll run it! :) Or you can do so yourself.
Here's the text representation for level 53:

```
##########
##########
##########
####### #
######## #
# ###.@#
# $ $$ #
#. #.$ #
# . ##
##########
```
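(In case it's useful, here's a minimal sketch of how I'd parse a grid like this, assuming the usual Sokoban text symbols: '#' wall, '$' box, '.' target, '@' player. The function name is mine.)

```python
def parse_level(text):
    """Parse a Sokoban level given as text into a grid plus object positions."""
    grid = [list(row) for row in text.splitlines()]
    player = next((r, c) for r, row in enumerate(grid)
                  for c, ch in enumerate(row) if ch == '@')
    boxes = [(r, c) for r, row in enumerate(grid)
             for c, ch in enumerate(row) if ch == '$']
    targets = [(r, c) for r, row in enumerate(grid)
               for c, ch in enumerate(row) if ch == '.']
    return grid, player, boxes, targets
```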
Maybe in this case it's a "confusion" shard? While it seems to be planning and producing optimizing behavior, it's not clear that it will behave as a utility maximizer.
Thank you!! I agree it's a really good mesa-optimizer candidate; it now remains to be seen exactly how good. It's a shame that I only found out about it about a year ago :)
It's not clear what the ratio of capabilities/alignment progress is for interpretability. There is no empirical track record[^1] of interpretability feeding back into improvements of any kind.
A priori it seems like it would be good, because understanding how things work is useful for understanding their behavior better, and thus for telling whether or not a model is aligned, or how to make it more so. But understanding how things work is also useful for making them more capable; e.g. if you use interpretability as a model debugger, it's basically a general-purpose tool for dealing with ML models.
[^1]: Known to the author.
Cool work! I was going to post about how "effect cancellation" is already known and was mentioned in the original post but, astonishingly to me, it is not! I guess I misremembered.
There's one detail that I'm curious about. CaSc usually compares abs(E[loss] - E[scrubbed loss]), and that of course leads to ignoring hypotheses which lead the model to do better in some examples and worse in others.
If we compare E[abs(loss - scrubbed loss)], does this problem go away? I imagine that it doesn't quite, if there are exactly-opposing causes for each example, but that seems less likely to happen in practice.
(There's a section on this in the appendix but it's rather controversial even among the authors)
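A made-up two-example case to show what I mean by the two metrics behaving differently:

```python
import numpy as np

# Made-up per-example losses, purely to illustrate the cancellation.
loss          = np.array([1.0, 3.0])   # original model
scrubbed_loss = np.array([3.0, 1.0])   # after scrubbing under the hypothesis

metric_a = abs(loss.mean() - scrubbed_loss.mean())   # |E[loss] - E[scrubbed loss]| = 0.0
metric_b = np.abs(loss - scrubbed_loss).mean()       # E[|loss - scrubbed loss|]   = 2.0

# metric_a looks perfect because the per-example effects cancel exactly;
# metric_b exposes the per-example disagreement.
```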
> If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one) you don't get these kinds of cancellation problems
I think this "max loss" procedure is different from what Buck wrote and the same as what I wrote.
> Why focus on the fullest set of swaps? An obvious alternative to "evaluate the hypothesis using the fullest set of swaps" is "evaluate the hypothesis by choosing the set of swaps allowed by H which make it look worse".
I've just now realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then running the CaSc experiment on join(H, H’).
One thing that is not equivalent to joins, which you might also want to do, is to choose the single worst swap that the hypothesis allows. That is, if a set of node values are all equivalent, you can choose to map all of them to e.g. the single value among them that makes the loss worst. And that can be more aggressive than any partition of X which is then chosen from randomly, and does not correspond to joins.
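A toy sketch of the distinction (hypothetical setup, my names): random resampling among allegedly-equivalent values versus deliberately picking the single worst one.

```python
import random

equivalent_values = [0.9, 1.0, 1.3]   # values the hypothesis H claims are interchangeable

def scrubbed_loss(value):
    # Stand-in for running the scrubbed model with `value` swapped into the node.
    return (value - 1.0) ** 2

# CaSc-style resampling: swap in one of the allegedly-equivalent values at random.
random_swap_loss = scrubbed_loss(random.choice(equivalent_values))

# The more aggressive check: the single worst swap the hypothesis allows.
worst_swap_loss = max(scrubbed_loss(v) for v in equivalent_values)
```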
I do think this is a good insight. Or like, it's not new (SAEs do this), but it's a fresh way of looking at it that yields: perhaps SAEs are trying to impose a particular structure on the input too much, and instead we should just try to compress the latent stream, perhaps using diffusion or similar techniques.