David Reber - AI Alignment Forum

Could you clarify how this relates to e.g. the PC (Peter-Clark) or FCI (Fast Causal Inference) algorithms for causal structure learning?

Like, are you making different assumptions (than e.g. minimality, faithfulness, etc)?

Shutdown-Seeking AI

David Reber, 1y

Distinguish two types of shutdown goals: temporary and permanent. These types of goals may differ with respect to entrenchment. AGIs that seek temporary shutdown may be incentivized to protect themselves during their temporary shutdown. Before shutting down, the AGI might set up cyber defenses that prevent humans from permanently disabling it while ‘asleep’. This is especially pressing if the AGI has a secondary goal, like paperclip manufacturing. In that case, protection from permanent disablement increases its expected goal satisfaction. On the other hand, AGIs that desire permanent shutdown may be less incentivized to entrench.

It seems like an AGI built to desire permanent shutdown may have an incentive to permanently disempower humanity, then shut down. Otherwise, there's a small chance that humanity may revive the AGI, right?

Steering GPT-2-XL by adding an activation vector

David Reber, 1y

Another related work: Concept Algebra for Text-Controlled Vision Models. (Disclosure: I did not author this paper, but I am in the PhD lab that did, under Victor Veitch at UChicago. Any mistakes in this comment are my own.) We haven't prioritized a blog post about the paper, so it makes sense that this community isn't familiar with it.

The Concept Algebra paper demonstrates that for text-to-image models like Stable Diffusion, there exist linear subspaces in the score embedding space on which you can do the same kind of concept editing/control as with word2vec.
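For readers who haven't seen the word2vec-style arithmetic being referred to, here is a minimal toy sketch. The embeddings are hand-made and hypothetical (real vectors are learned), and this only illustrates the analogy, not the method from the Concept Algebra paper:

```python
import math

# Hypothetical 3-d embeddings, hand-made so that the first axis encodes
# "gender" and the second encodes "royalty". Real word2vec vectors are learned.
emb = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [-1.0, 0.0, 0.2],
    "king":  [1.0, 1.0, 0.3],
    "queen": [-1.0, 1.0, 0.3],
    "apple": [0.0, -0.5, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word nearest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Swap the "gender" component of "king" while leaving "royalty" untouched.
result = analogy("king", "man", "woman")
```

The edit only works here because the toy concepts sit on separate axes, which is the kind of condition the next paragraph is about.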

Importantly, the paper comes with some theoretical investigation into why this might be the case, including articulating necessary assumptions/conditions (which this purely-empirical post does not).

I conjecture that <some activation additions in this post fail to have the desired effect> because they violate conditions analogous to those in Concept Algebra. Section E.1 of the paper's appendix gives me déjà vu: it shows empirical results that fail to behave as expected when the conditions of completeness and causal separability don't hold.

EIS V: Blind Spots In AI Safety Interpretability Research

David Reber, 1y

Also, just to make sure we share a common understanding of Schölkopf 2021: wouldn't you agree that asking "how do we do causality when we don't even know the level of abstraction at which to define causal variables?" goes beyond the "usual pearl causality story" as usually summarized in FFS posts? It certainly goes beyond Pearl's well-known works.

EIS V: Blind Spots In AI Safety Interpretability Research

David Reber, 1y

I don't think my claim is that "FFS is already subsumed by work in academia": as I acknowledge, FFS is a different theoretical framework than Pearl-based causality. I view them as two distinct approaches, but my claim is that they are motivated by the same question (that is, how to do causal representation learning).

It was intentional that the linked paper is an intro survey paper to the Pearl-ish approach to causal rep. learning: I mean to indicate that there are already lots of academic researchers studying the question "what does it mean to study causality if we don't have pre-defined variables?"

It may be that FFS ends up contributing novel insights above and beyond <Pearl-based causal rep. learning>, but a priori I expect this to occur only if FFS researchers are familiar with the existing literature, which I haven't seen mentioned in any FFS posts.

My line of thinking is: It's hard to improve on a field you aren't familiar with. If you're ignorant of the work of hundreds of other researchers who are trying to answer the same underlying question you are, odds are against your insights being novel / neglected.

EIS V: Blind Spots In AI Safety Interpretability Research

David Reber, 2y

Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas

Ditto. I've recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.

I see similar things happening with causality generally, where it seems to me that (as a first-order heuristic) much of the Alignment Forum's reference for causality is frozen at Pearl's 2008 textbook, missing what I consider to be most of the valuable recent contributions and expansions in the field.

Example: Finite Factored Sets seems to be reinventing causal representation learning [for a good intro, see Schölkopf 2021], where it seems to me that the broader field is outpacing FFS on its own goals. FFS promises some theoretical gains (apparently to infer causality where Pearl-esque frameworks can't) but I'm no longer as sure about the validity of this.
Counterexample(s): the Causal Incentives Working Group, and David Krueger's lab, for instance. Notably these are embedded in academia, where there's more culture (incentive) to thoroughly relate to previous work. (These aren't the only ones, just 2 that came to mind.)

Models Don't "Get Reward"

David Reber, 2y

Under the "reward as selection" framing, I find the behaviour much less confusing:
We use reward to select for actions that led to the agent reaching the coin.
This selects for models implementing the algorithm "move towards the coin".
However, it also selects for models implementing the algorithm "always move to the right".
It should therefore not be surprising you can end up with an agent that always moves to the right and not necessarily towards the coin.

I've also been reconsidering the CoinRun example recently from a causal perspective, and your articulation helped me crystallize my thoughts. Building on the points above, it seems clear that the core issue is one of causal confusion. The true causal model M is "move right" -> "get the coin" -> "get reward". However, if the variable "did you get the coin" is effectively latent (because model selection doesn't discriminate on it), then M is indistinguishable from M': "move right" -> "get reward". M' is not the true causal model governing the system, but it generates the same observational distribution.

In fact, the incorrect model M' has a shorter description length, so there may be a bias against learning the true causal model. If so, I believe we have a compelling explanation for the CoinRun phenomenon which does not require the existence of a mesa-optimizer, and which does indicate we should be more concerned about causal confusion.
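To make the indistinguishability concrete, here is a minimal simulation sketch. The function names and the simplification that the agent gets the coin exactly when it moves toward it are my own assumptions, not part of the actual CoinRun environment:

```python
import random
from collections import Counter

def true_model(move_right, coin_on_right=True):
    """M: move right -> get the coin -> get reward."""
    got_coin = (move_right == coin_on_right)  # toy assumption: moving toward the coin gets it
    return int(got_coin)

def confused_model(move_right, coin_on_right=True):
    """M': move right -> get reward. The coin variable is ignored entirely."""
    return int(move_right)

def observe(model, n, coin_on_right, seed=0):
    """Tabulate (action, reward) pairs for n random actions under the given model."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n):
        move_right = rng.random() < 0.5
        counts[(move_right, model(move_right, coin_on_right))] += 1
    return counts

# Training distribution: the coin is always on the right.
train_m = observe(true_model, 1000, coin_on_right=True)
train_m_prime = observe(confused_model, 1000, coin_on_right=True)

# Intervention: move the coin to the left.
shifted_m = observe(true_model, 1000, coin_on_right=False)
shifted_m_prime = observe(confused_model, 1000, coin_on_right=False)
```

On the training distribution the two tables of (action, reward) counts are identical, so no amount of observational data separates M from M'. They only come apart under the intervention on the coin's position.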

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

David Reber, 2y

I'd be interested in seeing other matrix factorizations explored as well. Specifically, I would recommend trying nonnegative matrix factorization. To quote the Wikipedia article:

This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered.

The added constraint may help eliminate spurious patterns: for instance, I suspect the positive/negative distinction along singular directions might be a red herring (based on past projects I've worked on).
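For concreteness, here is a minimal pure-Python sketch of nonnegative matrix factorization using the standard Lee-Seung multiplicative updates. The helper names and the small test matrix are hypothetical. The sketch assumes a nonnegative input, so applying it to real weight matrices, which have mixed signs, would need an extra step that it does not address:

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_error(v, w, h):
    """Frobenius norm of v - w @ h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5

def nmf(v, rank, iters=500, seed=0, eps=1e-9):
    """Factor a nonnegative matrix v ~= w @ h with Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialization keeps every entry nonnegative.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h

# Hypothetical nonnegative matrix built as a product of two nonnegative factors.
v = [[1, 2, 0], [0, 1, 3], [1, 3, 3], [2, 5, 3]]
w, h = nmf(v, rank=2)
err = frob_error(v, w, h)
```

Because the updates only ever multiply by nonnegative ratios, both factors stay nonnegative throughout, which is what makes the parts easier to inspect than signed singular vectors.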

Testing The Natural Abstraction Hypothesis: Project Update

David Reber, 2y

[Warning: "cyclic" overload. I think in this post it's referring to the dynamical systems definition, i.e. variables reattain the same state later in time. I'm referring to Pearl's causality definition: variable X is functionally dependent on variable Y, which is itself functionally dependent on variable X.]

Turns out Chaos is not Linear...

I think the bigger point (which is unaddressed here) is that chaos can't arise in acyclic causal models (SCMs). Chaos can only arise when there is feedback between the variables, right? That is why chaos is characterized by orbits of all periods being present in the system: you can't have an orbit at all without functional feedback. The linear approximations post works with an acyclic Bayes net.

I believe this sort of phenomenon [ chaos ] plays a central role in abstraction in practice: the “natural abstraction” is a summary of exactly the information which isn’t wiped out. So, my methods definitely needed to handle chaos.

Not all useful systems in the world are chaotic, and the Telephone Theorem doesn't rely on chaos as the mechanism for information loss. So it seems too strong to say "my methods definitely need to handle chaos". Surely there are useful footholds in between the extremes of "acyclic + linear" and "cyclic + chaos": for instance, "cyclic + linear".

At any rate, Foundations of Structural Causal Models with Cycles and Latent Variables could provide a good starting point for cyclic causal models (also called structural equation models). There are other formalisms as well, but I prefer this one because of how closely it matches Pearl.
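As a concrete picture of the "cyclic + linear" foothold, here is a minimal sketch of a two-variable linear SCM with feedback: x = b_xy * y + e_x and y = b_yx * x + e_y. The coefficients and function names are made up for illustration and are not taken from the cited paper:

```python
def solve_cyclic(b_xy, b_yx, e_x, e_y):
    """Closed-form equilibrium of x = b_xy*y + e_x, y = b_yx*x + e_y."""
    det = 1.0 - b_xy * b_yx
    if abs(det) < 1e-12:
        raise ValueError("no unique solution when b_xy * b_yx == 1")
    x = (e_x + b_xy * e_y) / det
    y = (e_y + b_yx * e_x) / det
    return x, y

def iterate_cyclic(b_xy, b_yx, e_x, e_y, steps=200):
    """Repeatedly apply both structural equations, starting from (0, 0)."""
    x = y = 0.0
    for _ in range(steps):
        # Simultaneous update: both right-hand sides use the previous (x, y).
        x, y = b_xy * y + e_x, b_yx * x + e_y
    return x, y

# Hypothetical coefficients with |b_xy * b_yx| < 1, so the iteration converges.
closed = solve_cyclic(0.5, -0.4, 1.0, 2.0)
iterated = iterate_cyclic(0.5, -0.4, 1.0, 2.0)
```

Here the feedback loop settles to a single equilibrium, so the system is cyclic in the Pearl sense without being chaotic.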

The Telephone Theorem: Information At A Distance Is Mediated By Deterministic Constraints

David Reber, 2y

As I understand it, the proof in the appendix only assumes we're working with Bayes nets (so just factorizations of probability distributions). That is, no assumption is made that the graphs are causal in nature (they're not necessarily assumed to be the causal diagrams of SCMs) although of course the arguments still port over if we make that stronger assumption.

Is that correct?
