Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.
I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.
If SAE features are the correct units of analysis (or at least more so than neurons), should we expect that patching in the feature basis is less susceptible to the interpretability illusion than in the neuron basis?
Maybe related: A paper likely to get an oral at ICLR 2024. I haven't read it, but I think it substantially improves on the Good Regulator Theorem. I think their Theorem 1 shows that from an optimal policy, you can identify (deduce) the exact causal model of the data generating process, and Theorem 2 shows that from a policy satisfying regret bounds, you can identify an approximate causal model. The assumptions are far weaker and more realistic than being the simplest policy that can perfectly regulate some variable.
Robust agents learn causal world models
[...]
We prove that agents that are capable of adapting to distributional shifts must have learned a causal model of their environment, establishing a formal equivalence between causality and transfer learning.
I think that interpretability research isn't going to be able to produce explanations that are very faithful explanations of what's going on in non-toy models (e.g. I think that no such explanation has ever been produced). Since I think faithful explanations are infeasible, measures of faithfulness of explanations don't seem very important to me now.
By "explanations" you mean labeled high-level causal graphs right? Do you also think it's infeasible to identify sparse, unlabeled circuits as "the part of the model that's doing the task", like in ACDC, in a way that gets good performance on some downstream task?
Can you define what you mean by consequentialism? It's clearly dangerous to have a system with a fixed utility function over configurations of the world, but this is not necessary for an AGI, or necessary to be dangerous. Weaker notions like "picks thoughts in part based on real-world consequences" do not obviously lead to danger.
There are some subskills to having consistent goals that I think will be selected for, at least when outcome-based RL starts working to get models to do long-horizon tasks. For example, the ability to not be distracted/nerdsniped into some different behavior by most stimuli while doing a task. The longer the horizon, the more selection-- if you have to do a 10,000 step coding project, then the probability you get irrecoverably distracted on one step has to be below 1/10,000.
I expect some pretty sophisticated goal-regulation circuitry to develop as models get more capable, because humans need it, and this makes me pretty scared.
Some thoughts:
"Nearly no data" is way too strong a statement, and relies on this completely binary distinction between things that are not AGI and things that are AGI.
The right question is, what level of dangerous consequentialist goals are needed for systems to reach certain capability levels, e.g. novel science? It could have been that to be as useful as LLMs, systems would be as goal-directed as chimpanzees. Animals display goal-directed behavior all the time, and to get them to do anything you mostly have to make the task instrumental to their goals e.g. offer them treats. However we can control LLMs way better than we can animals, and the concerns are of goal misgeneralization, misspecification, robustness, etc. rather than affecting the system's goals at all.
It remains to be seen what happens at higher capability levels, and alignment will likely get harder, but current LLMs are definitely significant evidence. Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality rather than instrumental reasons, and not eat your body if you die alone. This is not the whole alignment problem but seems like a decent chunk of it! It could have been much harder.
My current take is that we don't have good formalisms for consequentialist goal-directed systems that are weaker than expected utility maximization, and therefore don't really know how to reason about them. I think this is main cause of overemphasis on EUM.
For example, completeness as stated in the VNM assumptions is actually a really strong property. Aumann wrote a paper on removing completeness, but the utility function is no longer unique.
I now think the majority of impact of AI pause advocacy will come from the radical flank effect, and people should study it to decide whether pause advocacy is good or bad.