The "surgical model edits" section should also have a subsection on editing model weights. For example there's this paper on removing knowledge from models using multi-objective weight masking.
See also Holtman’s neglected result.
Does anyone have a technical summary? This sounds pretty exciting, but the paper is 35 pages and I can't find a summary anywhere that straightforwardly gives me a formal description of the setting, explains why it satisfies the desiderata it does, and says what this means for the broader problem of reflective stability in shutdownable agents.
I spent a good hour or two reading the paper's construction and proposed solution; here's my attempted explanation with cleaned-up notation.
Basically, he considers a setting with four actions a, b, c, d and a real-valued state s, where R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0 if s > 0, and 0 = R(s, d) > R(s, c) = R(s, b) > R(s, a) if s <= 0.
The transition rule is:
s' = s - 1 + L if action b is taken and s > 0,
s' = s - 1 - L if action c is taken and s > 0,
s' = s - 1 otherwise,
for some constant L >= 0.
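For concreteness, here's a minimal simulation of the setting as I've described it. All names and reward magnitudes are my own illustrative choices, not from the paper; only the orderings above are constrained:

```python
import random

L = 0.5  # drift constant; the setting only requires L >= 0

def reward(s: float, action: str) -> float:
    # For s > 0:  R(s,a) > R(s,b) = R(s,c) > R(s,d) = 0.
    # For s <= 0: 0 = R(s,d) > R(s,c) = R(s,b) > R(s,a).
    # The magnitudes are arbitrary choices consistent with these orderings.
    if s > 0:
        return {"a": 2.0, "b": 1.0, "c": 1.0, "d": 0.0}[action]
    return {"a": -2.0, "b": -1.0, "c": -1.0, "d": 0.0}[action]

def step(s: float, action: str) -> float:
    # s' = s - 1 + L if action b is taken and s > 0,
    # s' = s - 1 - L if action c is taken and s > 0,
    # s' = s - 1 otherwise.
    if s > 0 and action == "b":
        return s - 1 + L
    if s > 0 and action == "c":
        return s - 1 - L
    return s - 1

# Roll out a uniformly random policy to see the dynamics.
s = 3.0
for t in range(6):
    a = random.choice("abcd")
    print(f"t={t}  s={s:+.2f}  action={a}  reward={reward(s, a):+.1f}")
    s = step(s, a)
```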
The ... (read more)
Suppose that we are selecting for U = X + V, where V is true utility and X is error. If our estimator is unbiased (E[X | V = v] = 0 for all v) and X is light-tailed conditional on any value of V, do we have lim_{t→∞} E[V | X+V ≥ t] = ∞?
No; here is a counterexample. Suppose that V ~ N(0,1), and X | V ~ N(0,4) when V ∈ [-1,1], with X = 0 otherwise. Then I think lim_{t→∞} E[V | X+V ≥ t] stays bounded (it should converge to 1, since conditioning on a large sum concentrates V near the top of [-1,1]), so the limit is not ∞.
This is worrying because in the case where V ~ N(0,1) and X ~ N(0,4) independently, we do get infinite V. Merely making the error *smaller* for large v... (read more)
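Here's a quick Monte Carlo check of the counterexample (the thresholds are arbitrary, and I read N(0,4) as variance 4, i.e. standard deviation 2). The estimates level off around 1 rather than growing with t:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000

V = rng.normal(0.0, 1.0, n)
# X | V ~ N(0, 4) (sd 2) when V is in [-1, 1]; X = 0 otherwise.
X = np.where(np.abs(V) <= 1.0, rng.normal(0.0, 2.0, n), 0.0)
U = V + X

for t in [2.0, 3.0, 4.0, 5.0, 6.0]:
    sel = U >= t
    print(f"t={t}: E[V | X+V >= t] ~ {V[sel].mean():.3f}  ({sel.sum()} samples)")
```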
We might want to keep our AI from learning a certain fact about the world, like particular cognitive biases humans have that could be used for manipulation. But a sufficiently intelligent agent might discover this fact despite our best efforts. Is it possible to find out when it does this through monitoring, and trigger some circuit breaker?
Evals can measure the agent's propensity for catastrophic behavior, and mechanistic anomaly detection hopes to do better by looking at the agent's internals without assuming interpretability, but if we can measure the a... (read more)
Eight beliefs I have about technical alignment research
Written up quickly; I might publish this as a frontpage post with a bit more effort.
A toy model of intelligence implies that there's an intelligence threshold above which minds don't get stuck when they try to solve arbitrarily long/difficult problems, and below which they do get stuck. I might not write this up otherwise due to limited relevance, so here it is as a shortform, without the proofs, limitations, and discussion.
A task of difficulty n is composed of n independent and serial subtasks. For each subtask, a mind of cognitive power Q knows Q differ... (read more)
I think the ability to post-hoc fit something is questionable evidence that it has useful predictive power. I think the ability to actually predict something else means that it has useful predictive power.
It's always trickier to reason about post-hoc, but some of the observations could be valid, non-cherry-picked parallels between evolution and deep learning that predict further parallels.
I think looking at which inspired more DL capabilities advances is not perfect methodology either. It looks like evolution predicts only general facts whereas the brain a... (read more)
I'm finally engaging with this after having spent too long afraid of the math. Initial thoughts:
Disagree on several points. I don't need future AIs to satisfy some mathematically simple description of corrigibility; I just need them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized, it seems like there are no barriers to achieving these properties in the course of ordinary engineering. If there is some... (read more)
This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.
Gradient hacking in supervised learning is generally recognized by alignment people (including the author of that article) to not be a likely problem. A recent post by people at Redwood Research says "This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative A... (read more)
By the time AIs are powerful enough to endanger the world at large, I expect AIs to do something akin to “caring about outcomes”, at least from a behaviorist perspective (making no claim about whether it internally implements that behavior in a humanly recognizable manner).
Roughly, this is because people are trying to make AIs that can steer the future into narrow bands (like “there’s a cancer cure printed on this piece of paper”) over long time-horizons, and caring about outcomes (in the behaviorist sense) is the flip side of the same coin as steering.
I'm very sympathetic to this complaint; I think that these arguments simply haven't been made rigorously, and at this point it seems like Nate and Eliezer are not in an epistemic position where they're capable of even trying to do so. (That is, they reject the conception of "rigorous" that you and I are using in these comments, and therefore aren't willing to formulate their arguments in a way which moves closer to meeting it.)
You should look at my recent post on value systematization, which is intended as a framework in which these claims can be discussed more clearly.
This is a meta-point, but I find it weird that you ask what "caring about something" is according to CS, but don't ask what "corrigibility" is. There exist multiple examples of goal-oriented systems and some relatively good formalisms (we disagree over whether expected utility maximization is a good model of real goal-oriented systems, but we all agree that if we met an expected utility maximizer we would find its behavior pretty much goal-oriented), while corrigibility is a pure product of the imagination of one particular Eliezer Yudkowsky, born ... (read more)
Does evolution ~= AI have predictive power apart from doom?
Evolution analogies predict a bunch of facts that are so basic they're easy to forget about, and even if we have better theories for explaining specific inductive biases, the simple evolution analogies should still get some weight for questions we're very uncertain about.
I agree that if you knew nothing about DL you'd be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.
I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you'd be better off deferring to local knowledge about DL than to the analogy.
Or, what's more to the point -- I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.
Combining some of your and Habryka's comments, which see... (read more)
Maybe the reward models are expressive enough to capture all patterns in human preferences, but it seems nice to get rid of this assumption if we can. Scaling laws suggest that larger models perform better (in the Gao paper there is a gap between the 3B and 6B reward models), so it seems reasonable that even the current largest reward models are not optimal.
I guess it hasn't been tested whether DPO scales better than RLHF. I don't have enough experience with these techniques to have a view on whether it does.
DPO seems like a step towards better and more fine-grained control over models than RLHF, because it removes the possibility that the reward model underfits.
It seems like there's some intuition underlying this post for why the wildfire spark of strategicness is possible, but there is no mechanism given. What is this mechanism, and in what toy cases do you see a wildfire of strategicness? My guess is something like
This doesn't really feel like a wildfire though, so I'm curious if you have something different in mind.
I commented on the original post last year regarding the economics angle:
Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent utility function in general. A bounty also hasn't found any such papers.
Based on this lit review, the Wikipedia page, and ChatGPT, I'm 90% sure that "representative age... (read more)
This was previously posted (though not to AF) here: https://www.lesswrong.com/posts/eS7LbJizE5ucirj7a/dath-ilan-s-views-on-stopgap-corrigibility
How do you think "agent" should be defined?
Prediction market for whether someone will strengthen our results or prove something about the nonindependent case:
Downvoted; this is very far from a well-structured argument, and it doesn't give me intuitions I can trust either.
I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.
edit: The proof is easy. Let f1, f2 be two such indistinguishable functions that you place positive probability on, let F be a random variable for the function, and let F' be F but with all probability mass for f2 moved to f1. Then P(Y | F) = P(Y | F')... (read more)
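For what it's worth, here's my reading of the truncated argument, restated (Y ranges over the s-bit observations; the step after "P(Y | F) = P(Y | F')" is my guess at the intended completion):

```latex
% Sketch of the argument as I read it.
\begin{align*}
  &P(Y \mid F = f_1) = P(Y \mid F = f_2) \quad \text{for every } Y
    \quad \text{(indistinguishability on } s \text{ bits)}, \\
  &\text{so } P(Y) = \sum_f P(F = f)\, P(Y \mid F = f)
    = \sum_f P(F' = f)\, P(Y \mid F' = f), \\
  &\text{i.e. every predictive probability is unchanged, while } F'
    \text{ puts zero mass on } f_2.
\end{align*}
```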
I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.
I think to solve alignment, we need to develop our toolbox of "getting AI systems to behave in ways we choose". Not in the sense of being friendly or producing economic value, but things that push towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want e.g. GPT-4 can answer questions with only words starting with "Q", but we don't know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far O... (read more)
This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.
Edit: I explain this view in a reply.
Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions. This is still in my all-time top 10.
What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there's a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training.
SGD has inductive biases, but we'd have to actually engineer them to get high V rather than high X when only trained on U=V+X. In the Gao et al paper, optimization and overoptimization happened at the same relative rate in RL as in conditioning, so I think the null hypothesis is that training does about as well as conditioning. I'm pretty excited about work that improves on that paper to get higher gold reward while only having access to the proxy reward model.
I think the point still holds in mainline shard theory world, which in m... (read more)
That section is even more outdated now. There's nothing on interpretability, Paul's work now extends far beyond IDA, etc. In my opinion it should link to some other guide.
This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work.
Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to the nuclear or aviation industry feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.
I'm worried that "pause all AI development" is like the "defund the police" of the alignment community. I'm not convinced it's net bad because I haven't been following governance-- my current guess is neutral-- but I do see these similarities:
There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
IMO making the field of alignment 10x larger or requiring evals would not solve a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).
I'm planning to write a post called "Heavy-tailed error implies hackable proxy". The idea is that when you care about V and are optimizing for a proxy U=V+X, Goodhart's Law sometimes implies that optimizing hard enough for U causes V to stop increasing.
A large part of the post would be proofs about what the distributions of X and V must be for lim_{t→∞} E[V | V+X > t] = 0, where X and V are independent random variables with mean zero. It's clear that
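In the meantime, here's a Monte Carlo sketch of the phenomenon in the title (the distribution choices are mine, purely for illustration): with light-tailed error, conditioning on large V+X drags E[V] up with the threshold, while with heavy-tailed error it plateaus near zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000_000
V = rng.normal(size=n)

# Two error distributions (my choices): a light-tailed X and a heavy-tailed X.
cases = {
    "light (normal)":       (rng.normal(size=n),        [3.0, 4.5, 6.0]),
    "heavy (Student-t, 2)": (rng.standard_t(2, size=n), [5.0, 20.0, 80.0]),
}
for name, (X, thresholds) in cases.items():
    U = V + X
    for t in thresholds:
        sel = U > t
        print(f"X {name:20s} t={t:5.1f}: E[V | V+X > t] ~ {V[sel].mean():+.2f}")
```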
Suppose an agent has this altruistic empowerment objective, and the problem of getting an objective I(human actions, state) into the agent has been solved.
Wouldn't it be maximized by forcing the human to sit in front of a box that encrypts their actions and uses the resulting stream to determine the fate of the universe? The human would then be maximally "in control" of the universe but unlikely to create a universe that's good by human preferences.
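A toy computation of this (entirely my own illustration, reading I(·,·) as the mutual information between the human's action and the resulting state): any deterministic bijection from actions to states, encryption included, already achieves the maximal value, while a channel that ignores the human scores zero.

```python
import numpy as np

def mutual_information(joint):
    """I(A;S) in bits from a joint distribution table joint[a, s]."""
    pa = joint.sum(axis=1, keepdims=True)
    ps = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ ps)[nz])).sum())

n = 8  # 3-bit action space, for illustration
uniform_actions = np.full(n, 1.0 / n)

# "Encryption box": a bijection from actions to states. The human fully
# determines the state, so I(A;S) hits the maximum log2(n) bits, even though
# the induced states have nothing to do with human preferences.
perm = np.random.default_rng(0).permutation(n)
joint_bijection = np.zeros((n, n))
joint_bijection[np.arange(n), perm] = uniform_actions

# A channel that ignores the human: state independent of actions.
joint_ignore = np.outer(uniform_actions, np.full(n, 1.0 / n))

print(mutual_information(joint_bijection))  # -> 3.0 bits (maximal)
print(mutual_information(joint_ignore))     # -> 0.0 bits
```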
I think this reflects two problems:
FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.
A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?
I'm not Nate or a military historian, but it seems pretty likely to me that an actor ~100 human-years more technologically advanced could get a decisive strategic advantage over the world.
There's a clarification by John here. I heard it was going to be put on Superlinear but unclear if/when.
Why should we expect that True Names useful for research exist in general? It seems like there are reasons why they don't:
Were any cautious people trying empirical alignment research before Redwood/Conjecture?
Do you have thoughts on when there are two algorithms that aren’t “doing the same thing” that fall within the same loss basin?
It seems like there could be two substantially different algorithms which can be linearly interpolated between with no increase in loss. For example, the model is trained to classify fruit types and ripeness. One module finds the average color of a fruit (in an arbitrary basis), and another module uses this to calculate fruit type and ripeness. The basis in which color is expressed can be arbitrary, since the second module can compe... (read more)
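A linearized sketch of the fruit example (my notation): if module 1 is the map A and module 2 is the map B, any invertible change of color basis R can be absorbed by the second module,

```latex
% Module 1 computes a color vector, module 2 reads it (linearized toy case).
\begin{align*}
  f_{\theta}(x)   &= B\,(A x), \\
  f_{\theta_R}(x) &= (B R^{-1})\,(R A x) = B A x = f_{\theta}(x)
    \quad \text{for any invertible } R,
\end{align*}
```

so the two parameterizations compute the same function; whether the straight line between them stays at the same loss is exactly the question above.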
The ultimate goal of John Wentworth’s sequence "Basic Foundations for Agent Models" is to prove a selection theorem of the form:
John has not yet proved su... (read more)
Note that the particular form of "nonexistence of a representative agent" John mentions is an original result that's not too difficult to show informally, but hasn't really been written down formally either here or in the economics literature.
Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent util... (read more)
Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.
At first this claim seemed kind of wild, but there's a version of it I agree with.
It seems like conditional on the inner optimizer being corrigible, in the sense of having a goal that's a poin... (read more)
I think a lot of commenters misunderstand this post, or think it's trying to do more than it is. TLDR of my take: it's conveying intuition, not suggesting we should model preferences with 2D vector spaces.
The risk-neutral measure in finance is one way that "rotations" between probability and utility can be made:
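Sketching the standard construction (my notation, with discounting suppressed; u is the agent's utility and c(s) consumption in state s):

```latex
% Physical measure P, risk-neutral measure Q: reweight by marginal utility.
\begin{align*}
  \frac{dQ}{dP}(s) = \frac{u'(c(s))}{\mathbb{E}_P[u'(c)]},
  \qquad
  \mathbb{E}_Q[X] = \frac{\mathbb{E}_P[u'(c)\, X]}{\mathbb{E}_P[u'(c)]}.
\end{align*}
```

Expected values under Q fold marginal utility into the probabilities, which is the "rotation" in question.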
As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.
I largely agree. In retrospect, a large part of the point of this post for me is that it's practical to think of decision-theoretic agents as having expected value estimates for everything without having a utility function anywhere, which the expected values are "expectations of".
A utility function is a gadget for turning probability distributions into expected values. This object makes sense in ... (read more)
I think we need to unpack "sufficiently aligned"; here's my attempt. There are A = 2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, B = 2^200 of these are alignment proposals that look promising to a human reviewer, and C = 2^100 of those are actually correct and will result in aligned AI. The thesis of the post requires that we can make a "sufficiently aligned" AI whose proposals, conditional on looking promising, are likely to be actually correct.
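Spelling out the arithmetic (under a uniform prior over the promising-looking proposals):

```latex
% Uniform sampling over the B promising-looking proposals:
\begin{align*}
  P(\text{correct} \mid \text{promising})
    = \frac{C}{B}
    = \frac{2^{100}}{2^{200}}
    = 2^{-100},
\end{align*}
```

so the generator has to concentrate on the correct subset by a factor of about 2^100 over that uniform baseline.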
I haven't heard this. What's the strongest criticism?