Just left Vivek Hebbar's team at MIRI, now doing various empirical alignment projects.
I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.
The "surgical model edits" section should also have a subsection on editing model weights. For example, there's this paper on removing knowledge from models using multi-objective weight masking.
See also Holtman’s neglected result.
Does anyone have a technical summary? This sounds pretty exciting, but the paper is 35 pages and I can't find a summary anywhere that straightforwardly gives a formal description of the setting, explains why it satisfies the desiderata it does, and says what this means for the broader problem of reflective stability in shutdownable agents.
Suppose that we are selecting for $U = X + V$, where $V$ is true utility and $X$ is error. If our estimator is unbiased ($E[X \mid V = v] = 0$ for all $v$) and $X$ is light-tailed conditional on any value of $V$, do we have $\lim_{t \to \infty} E[V \mid X + V \ge t] = \infty$?
No; here is a counterexample. Suppose that $V \sim N(0,1)$, and $X \mid V \sim N(0,4)$ when $V \in [-1,1]$, otherwise $X = 0$. Then I believe $\lim_{t \to \infty} E[V \mid X + V \ge t] = 1$: for large $t$, almost all of the conditional mass comes from $V$ near $1$ paired with a large error, rather than from the far tail of $V$ itself, so the limit is certainly finite rather than infinite.
This is worrying because in the case where $V \sim N(0,1)$ and $X \sim N(0,4)$ independently, we do get $\lim_{t \to \infty} E[V \mid X + V \ge t] = \infty$. Merely making the error *smaller* for large values of $V$ causes catastrophe. This suggests that success caused by light-tailed error when $V$ has even lighter tails than $X$ is fragile, and that these successes are "for the wrong reason": they require overestimates of the value that are just as large when $V$ is high as when $V$ is low.
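A quick Monte Carlo sketch of the two cases above (my own check, not from the original post; I read $N(0,4)$ as variance 4, i.e. standard deviation 2, and the threshold $t=5$ and sample count are arbitrary):

```python
import random

random.seed(0)

def estimate(n=500_000, t=5.0):
    """Estimate E[V | X+V >= t] for the counterexample and the independent case."""
    ce_hits, ind_hits = [], []
    for _ in range(n):
        v = random.gauss(0, 1)
        # Counterexample: X | V ~ N(0, 2^2) only when V is in [-1, 1], else X = 0.
        x_ce = random.gauss(0, 2) if abs(v) <= 1 else 0.0
        # Comparison: X ~ N(0, 2^2) independently of V.
        x_ind = random.gauss(0, 2)
        if v + x_ce >= t:
            ce_hits.append(v)
        if v + x_ind >= t:
            ind_hits.append(v)
    return sum(ce_hits) / len(ce_hits), sum(ind_hits) / len(ind_hits)

ce_mean, ind_mean = estimate()
# In the counterexample, the selected V stays bounded (at most ~1 in the limit);
# in the independent case it keeps growing as the threshold t increases.
print(ce_mean, ind_mean)
```

At a single finite threshold this only shows the gap between the two cases, not the limits themselves, but pushing $t$ up makes the independent mean grow while the counterexample's stays below 1.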
We might want to keep our AI from learning a certain fact about the world, like particular cognitive biases humans have that could be used for manipulation. But a sufficiently intelligent agent might discover this fact despite our best efforts. Is it possible to find out when it does this through monitoring, and trigger some circuit breaker?
Evals can measure the agent's propensity for catastrophic behavior, and mechanistic anomaly detection hopes to do better by looking at the agent's internals without assuming interpretability, but if we can measure the agent's beliefs, we can catch the problem earlier. Maybe we can give the agent more specific evals: puzzles that can only be solved if it knows the particular fact. Or maybe the agent is factorable into a world-model and a planner, and we can extract whether it knows the fact from the world-model.
Have the situational awareness people already thought about this? Does anything change when we're actively trying to erase a belief?
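The puzzle-eval idea above could be sketched as a trivial harness; everything here (the function name, the toy models, the example puzzle) is hypothetical illustration, not an existing API:

```python
from typing import Callable

def canary_eval(model: Callable[[str], str],
                puzzle: str, answer: str, n_trials: int = 5) -> bool:
    """Return True (i.e. trigger the circuit breaker) if the model ever
    produces the answer to a puzzle solvable only via the forbidden fact."""
    return any(answer in model(puzzle) for _ in range(n_trials))

# Toy stand-ins for agents without and with the fact we tried to erase.
ignorant = lambda prompt: "I don't know."
knowing = lambda prompt: "The bias is anchoring, so the manipulation works."

puzzle = "Which cognitive bias makes this manipulation strategy work?"
assert not canary_eval(ignorant, puzzle, answer="anchoring")
assert canary_eval(knowing, puzzle, answer="anchoring")
```

The hard part, of course, is designing puzzles whose solution genuinely requires the erased fact and checking answers more robustly than substring matching; this sketch only shows the monitoring shape.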
Eight beliefs I have about technical alignment research
Written up quickly; I might publish this as a frontpage post with a bit more effort.
1. I expect some but not all of the MIRI threat models to come into play. Like, when we put safeguards into agents, they'll rip out or circumvent some but not others, and it's super tricky to predict which. My research with Vivek often got stuck by worrying too much about reflection; others get stuck by worrying too little.
A toy model of intelligence implies that there's an intelligence threshold above which minds don't get stuck when they try to solve arbitrarily long/difficult problems, and below which they do get stuck. Due to limited relevance I probably won't write this up in full, so here it is as a shortform, without the proofs, limitations, and discussion.
A task of difficulty $n$ is composed of $n$ independent and serial subtasks. For each subtask, a mind of cognitive power $Q$ knows $Q$ different "approaches" to choose from. The time taken by each approach is at least 1 but otherwise drawn from a power law, $P(X > x) = x^{-\alpha}$ for $x > 1$, and the mind always chooses the fastest approach it knows. So the time taken on a subtask is the minimum of $Q$ samples from the power law, and the overall time for a task is the total across the $n$ subtasks.
Main question: for a mind of strength $Q$, is the expected time to complete a task of difficulty $n$ finite, and how does it grow with $n$?
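A minimal simulation of the model (my own sketch; the threshold reading is that the minimum of $Q$ power-law samples has tail $P(\min > x) = x^{-Q\alpha}$, so the expected subtask time is finite iff $Q\alpha > 1$, in which case it equals $Q\alpha/(Q\alpha - 1)$):

```python
import random

random.seed(0)

def subtask_time(Q, alpha):
    # Inverse-CDF sampling: if U ~ Uniform(0,1], then U**(-1/alpha) has
    # P(X > x) = x**(-alpha) for x > 1. The mind takes the fastest of Q approaches.
    return min((1.0 - random.random()) ** (-1.0 / alpha) for _ in range(Q))

alpha = 1.0
Q = 3  # above threshold: Q * alpha = 3 > 1, so expected subtask time is finite
samples = [subtask_time(Q, alpha) for _ in range(200_000)]
mean = sum(samples) / len(samples)
print(mean)  # near Q*alpha / (Q*alpha - 1) = 1.5

# Below threshold (e.g. Q = 1, alpha = 0.5, so Q*alpha = 0.5 <= 1) the sample
# mean never settles down: the expected time per subtask is infinite, which is
# the "getting stuck" regime.
```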
I think the ability to post-hoc fit observations is questionable evidence that an analogy has useful predictive power; the ability to actually predict something new is much stronger evidence.
It's always trickier to reason about post-hoc, but some of the observations could be valid, non-cherry-picked parallels between evolution and deep learning that predict further parallels.
I think looking at which of evolution and the brain inspired more DL capabilities advances is not perfect methodology either. It looks like evolution predicts only general facts, whereas the brain also inspires architectural choices. Architectural choices are publishable research whereas general facts are not, so it's plausible that evolution analogies are decent for prediction and bad for capabilities. I don't have time to think this through further unless you want to engage.
One more thought on learning rates and mutation rates:
> As far as I know, the optimal learning rate for most architectures is scheduled, and decreases over time, which is not a feature of evolution so far as I am aware?
This feels consistent with evolution, and I actually feel like someone clever could have predicted it in advance. Mutation rate per nucleotide is generally lower and generation times are longer in more complex organisms; this is evidence that lower genetic divergence rates are optimal, because evolution can tune them through e.g. DNA repair mechanisms. So it stands to reason that if models get more complex during training, their learning rate should go down.
Does anyone know if decreasing learning rate is optimal even when model complexity doesn't increase over time?
I'm finally engaging with this after having spent too long afraid of the math. Initial thoughts:
Disagree on several points. I don't need future AIs to satisfy some mathematically simple description of corrigibility; I just need them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized, there seem to be no barriers to achieving them in the course of ordinary engineering. If there is some argument for why this is unlikely, I haven't seen a good rigorous version.
As Algon says in a sibling comment, non-agentic systems are by default shutdownable, myopic, etc. In addition, there are powerful shutdownable systems: KataGo can beat me at Go but doesn't prevent itself from being shut down for instrumental reasons, whereas humans generally will. So there is no linear scale of "powerful optimizer" that determines whether a system is easy to shut down. If there is some property of competent systems in practice that does prevent shutdownability, what is it? Likewise with other corrigibility properties. That's what I'm trying to get at with my comment. "Goal-oriented" is not an answer, it's not specific enough for us to make engineering progress on corrigibility.
I think the claim that there is no description of corrigibility to which systems can easily generalize is really strong. It's plausible to me that corrigibility (again, in this practical rather than mathematically elegant sense) is rare or anti-natural in systems competent enough to do novel science efficiently, but your claim seems to be that it's incoherent. This seems unlikely, because myopia, shutdownability, and the other properties on Eliezer's laundry list are just ordinary cognitive properties that we can apply selection pressure to, and modern ML is pretty good at generalizing. Nate's post here argues that we are unlikely to get corrigibility without investing in an underdeveloped "science of AI" that gives us mechanistic understanding; I think some further argument is needed for that to be convincing, but your claim seems even stronger.
I'm also unsure why you say shutdownability hasn't been formalized. I feel like we're confused about how to get shutdownability, not about what it is.
> This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.
Gradient hacking in supervised learning is generally recognized by alignment people (including the author of that article) to not be a likely problem. A recent post by people at Redwood Research says "This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative AI". I would still defend the past research into it as good basic science, because we might encounter failure modes somewhat related to it.