Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself.
A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.
I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story?
The important point (for current purposes) is that, as the things-the-system-is-capable-of-doing-or-building scale up, we want the system's ability to notice subtle problems to scale up with it. If the system is capable of designing complex machines way outside what hum... (read more)
An example might be helpful here: consider the fusion power generator scenario. In that scenario, a human thinking about what they want arrives at the wrong answer, not because of uncertainty about their own values, but because they don't think to ask the right questions about how the world works. That's the sort of thing I have in mind.
In order to handle that sort of problem, an AI has to be able to use human values somehow without carrying over other specifics of how a human would reason about the situation.
I don’t think “the human is deciding whether or
We don't necessarily need the AGI itself to have human-like drives, intuitions, etc. It just needs to be able to model the human reasoning algorithm well enough to figure out what values humans assign to e.g. an em.
(I expect an AI which relied heavily on human-like reasoning for things other than values would end up doing something catastrophically stupid, much as humans are prone to do.)
I expect that there will be concepts the AI finds useful which humans don't already understand. But these concepts should still be of the same type as human concepts - they're still the same kind of natural abstraction. Analogy: a human who grew up in a desert tribe with little contact with the rest of the world may not have any concept of "snow", but snow is still the kind-of-thing they're capable of understanding if they're ever exposed to it. When the AI uses concepts humans don't already have, I expect them to be like that.
As long as the concepts are the type of thing humans can recognize/understand, then it should be conceptually straightforward to model how humans would reason about those concepts or value them.
This part of Proof Strategy 1 is a basically-accurate description of what I'm working towards:
We try to come up with an unambiguous definition of what [things] are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
... it's just not necessarily about objects localized in 3D space.
Also, there are several possible paths, and they don't all require ... (read more)
Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis
to be paid out to anyone you think did a fine job of distilling the thing
Needing to judge submissions is the main reason I didn't offer a bounty myself. Read the distillation, and see if you yourself understand it. If "Coherence of Distributed Decisions With Different Inputs Implies Conditioning" makes sense as a description of the idea, then you've probably understood it.
If you don't understand it after reading an attempted distillation, then it wasn't distilled well enough.
I'd like to complain that this project sounds epistemically absolutely awful. It's offering money for arguments explicitly optimized to be convincing (rather than true), it offers prizes only for arguments making one particular side of the case (i.e. no money for arguments that AI risk is no big deal), and to top it off it's explicitly asking for one-liners.
I understand that it is plausibly worth doing regardless, but man, it feels so wrong having this on LessWrong.
Short answer: about one full day.
Longer answer: normally something like this would sit in my notebook for a while, only informing my own thinking. It would get written up as a post mainly if it were adjacent to something which came up in conversation (either on LW or in person). I would have the idea in my head from the conversation, already be thinking about how best to explain it, chew on it overnight, and then if I'm itching to produce something in the morning I'd bang out the post in about 3-4 hours.
Alternative paths: I might need this idea as backgrou... (read more)
I haven't put a distillation bounty on this, but if anyone else wants to do so, leave a comment and I'll link to it in the OP.
If you wouldn't mind one last question before checking out: where did that formula you're using come from?
Oh, melting the GPUs would not actually be a pivotal act. There would need to be some way to prevent new GPUs from being built in order for it to be a pivotal act.
Military capability is not strictly necessary; a pivotal act need not necessarily piss off world governments. AGI-driven propaganda, for instance, might avoid that.
Alternatively, an AGI could produce nanomachines which destroy GPUs, are extremely hard to eradicate, but otherwise don't do much of anything.
(Note that these aren't intended to be very good/realistic suggestions, they're just meant to point to different dimensions of the possibility space.)
+1 to the distinction between "Regulating AI is possible/impossible" vs "pivotal act framing is harmful/unharmful".
I'm sympathetic to a view that says something like "yeah, regulating AI is Hard, but it's also necessary because a unilateral pivotal act would be Bad". (TBC, I'm not saying I agree with that view, but it's at least coherent and not obviously incompatible with how the world actually works.) To properly make that case, one has to argue some combination of:
In fact, before you get to AGI, your company will probably develop other surprising capabilities, and you can demonstrate those capabilities to neutral-but-influential outsiders who previously did not believe those capabilities were possible or concerning. In other words, outsiders can start to help you implement helpful regulatory ideas...
It is not for lack of regulatory ideas that the world has not banned gain-of-function research.
It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-functi... (read more)
There are/could be crucial differences between GoF and some AGI examples. E.g., a convincing demonstration of the ability to overthrow the government. States are also agents, and also have convergent instrumental goals. GoF research seems much more threatening to individual humans, but not that threatening to states or governments.
Various thoughts that this inspires:
Gain of Function Ban as Practice-Run/Learning for relevant AI Bans
I have heard vague-musings-of-plans in the direction of "get the world to successfully ban Gain of Function research, as a practice-case for getting the world to successfully ban dangerous AI."
I have vague memories of the actual top bio people around not being too focused on this, because they thought there were easier ways to make progress on biosecurity. (I may be conflating a few different statements – they might have just been critiquing a particular ... (read more)
That is definitely a selection theorem, and sounds like a really cool one! Well done.
Update: I too have now spent like 1.5 hours reading about AC design and statistics, and I can now give a reasonable guess at exactly where the I-claim-obviously-ridiculous 20-30% number came from. Summary: the SACC/CEER standards use a weighted mix of two test conditions, with 80% of the weight on conditions in which outdoor air is only 3°F/1.6°C hotter than indoor air.
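To see how that weighting plays out, here's a toy calculation; the 80/20 weights and the 3°F condition are from the summary above, while the BTU numbers, the infiltration penalty, and the hot-day deltas are assumptions for illustration:

```python
def net_cooling(rated_btu, infiltration_btu_per_degF, delta_T):
    # cooling delivered after the infiltration penalty a single-hose
    # unit pays for pulling hot outdoor air back into the room
    return rated_btu - infiltration_btu_per_degF * delta_T

rated = 10_000      # nameplate BTU/hr (assumed)
infiltration = 200  # BTU/hr lost per degF of indoor/outdoor gap (assumed)

# 80% weight on the mild condition (outdoor 3 F above indoor),
# 20% weight on a hotter condition (assumed 15 F above indoor)
weighted = 0.8 * net_cooling(rated, infiltration, 3) \
         + 0.2 * net_cooling(rated, infiltration, 15)
hot_day = net_cooling(rated, infiltration, 20)  # an actually-hot day (assumed)

print(weighted / rated)  # ~0.89: the mild condition dominates the rating
print(hot_day / rated)   # ~0.60: the hot-day penalty is much larger
```

Under numbers like these, a rating built mostly from the 3°F condition would show only a modest single-hose penalty even when the hot-day penalty is large.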
The whole backstory of the DOE's SACC/CEER rating rules is here. Single-hose air conditioners take center stage. The comments on the DOE's rule proposals can basically be summarized as:
... and will both be reduced by a factor of about (exhaust - outdoor) / (exhaust - indoor) which will be much more than 50%.
I assume you mean much less than 50%, i.e. (T_outside - T_inside) averaged over the room will be less than 50% greater with two hoses than with one?
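Plugging illustrative numbers into that factor (the 130°F exhaust figure is discussed elsewhere in the thread; the outdoor and indoor temperatures are my assumptions):

```python
t_exhaust = 130.0  # exhaust air temperature (F), figure from the thread
t_outdoor = 95.0   # hot-day outdoor temperature (F), assumed
t_indoor = 75.0    # indoor temperature (F), assumed

factor = (t_exhaust - t_outdoor) / (t_exhaust - t_indoor)
print(factor)  # ~0.64, i.e. roughly a 36% reduction under these assumptions
```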
I'm open to such a bet in principle, pending operational details. $1k at even odds?
Operationally, I'm picturing the general plan I sketched four comments upthread. (In particular note the three bulleted conditions starting with "The day being hot enough and the room large enough that the A... (read more)
Or you could get to it before I do and I could perform a replication.
I bought my single-hose AC for the 2019 heat wave in Mountain View (which was presumably basically similar to Berkeley).
When I was in Vegas, summer was just three months of permanent extreme heat during the day; one does not stay somewhere without built-in AC in Vegas.
I'm curious what your BOTEC was / if you think 130 is too high an estimate for the exhaust temp?
I don't remember what calculation I did then, but here's one with the same result. Model the single-hose air conditioner as removing air from the room, and replacing it with a mix of air at two temperatures: T_C (the temperature of cold air coming from the air conditioner), and T_H (the temperature outdoors). If we assume that T_C is constant and that the cold and hot air are introduced in roughly 1:1 proportions (i.e. the flow rate from ... (read more)
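A minimal simulation of that mixing model; the 1:1 mixing is the assumption stated above, while T_C, T_H, and the air-replacement rate are numbers I picked for illustration:

```python
T_C = 55.0  # cold-output air temperature (F), assumed
T_H = 95.0  # outdoor air temperature (F), assumed

T_room = T_H  # start the room at outdoor temperature
for _ in range(200):
    # each step, 10% of the room's air is replaced by a 1:1 mix of
    # cold output and infiltrating outdoor air
    T_room = 0.9 * T_room + 0.1 * 0.5 * (T_C + T_H)

print(T_room)  # converges to (T_C + T_H) / 2 = 75 F
```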
... I actually already started a post titled "Preregistration: Air Conditioner Test (for AI Alignment!)". My plan was to use the one-hose AC I bought a few years ago during that heat wave, rig up a cardboard "second hose" for it, and try it out in my apartment both with and without the second hose next time we have a decently-hot day. Maybe we can have an air conditioner test party.
Predictions: the claim which I most do not believe right now is that going from one hose to two hose with the same air conditioner makes only a 20%-30% difference. The main metr... (read more)
Alright, I am more convinced than I was about the temperature issue, but the test setup still sounds pretty bad.
First, Boston does not usually get all that sweltering. I grew up in Connecticut (close to Boston and similar weather), summer days usually peaked in the low 80's. Even if they waited for a really hot week, it was probably in the 90's. A quick google search confirms this: typical July daily high temp is 82, and google says "Overall during July, you should expect about 4-6 days to reach or exceed 90 F (32C) while the all-time record high for Bosto... (read more)
The best thing we took away from our tests was the chance at a direct comparison between a single-hose design and a dual-hose design that were otherwise identical, and our experience confirmed our suspicions that dual-hose portable ACs are slightly more effective than single-hose models but not effective enough to make a real difference
I roll to disbelieve. I think it is much more likely that something is wrong with their test setup than that the difference between one-hose and two-hose is negligible.
Just on priors, the most obvious problem is that they're... (read more)
(Also, I expect it to seem like I am refusing to update in the face of any evidence, so I'd like to highlight that this model correctly predicted that the tests were run someplace where it was not hot outside. Had that evidence come out different, I'd be much more convinced right now that one hose vs two doesn't really matter.)
From how we tested:
Over the course of a sweltering summer week in Boston, we set up our five finalists in a roughly 250-square-foot space, taking notes and rating each model on the basic setup process, performance, portability, acces
On the physics: to be clear, I'm not saying the air conditioner does not work at all. It does make the room cooler than it started, at equilibrium.
I also am not surprised (in this particular example) to hear that various expert sources already account for the inefficiency in their evaluations; it is a problem which should be very obvious to experts. Of course that doesn't apply so well to e.g. the example of medical research replication failures. The air conditioner example is not meant to be an example of something which is really hard to notice for human... (read more)
In this particular case, I indeed do not think the conflict is worth the cost of exploring - it seems glaringly obvious that people are buying a bad product because they are unable to recognize the ways in which it is bad.
The Wirecutter recommendation for budget portable ACs is a single-hose model. Until very recently their overall recommendation was also a single-hose model.
The Wirecutter recommendations (and other pages discussing these tradeoffs) are based on a combination of "how cold does it make the room empirically?" and quantitative estimates of coo... (read more)
There is an important difference here between "obvious in advance" and "obvious in hindsight", but your basic point is fair, and the virus example is a good one. Humanity's current state is indeed so spectacularly incompetent that even the obvious problems might not be solved, depending on how things go.
Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.
This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which ... (read more)
I agree with the basic difference you point to between fast- and slow-takeoff worlds, but disagree that it has important strategic implications for the obviousness of takeover risk.
In slow takeoff worlds, many aspects of the alignment problem show up well before AGI goes critical. However, people will by-default train systems to conceal those problems. (This is already happening: RL from human feedback is exactly the sort of strategy which trains systems to conceal problems, and we've seen multiple major orgs embracing it within the past few months.) As a ... (read more)
I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.
Examples of small AI c... (read more)
Pithy one-sentence summary: to the extent that I value corrigibility, a system sufficiently aligned with my values should be corrigible.
A heuristic argument that says "evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it" seems obviously wrong to me.
Yup, that sounds like a crux. Bookmarked for later.
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.
A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the ... (read more)
That falls squarely under the "other reasons to think our models are not yet deceptive" - i.e. we have priors that we'll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.
Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:
Now, the key heuristic: in a high-dimen... (read more)
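To make the shape of that argument concrete, here is a toy calculation; both exponents are made up, since the argument only needs the looks-good set to be exponentially larger than the looks-good-and-is-good set:

```python
# log2 of the fraction of 10-page docs the human evaluator would pass (assumed)
log2_frac_looks_good = -1000
# log2 of the fraction which pass AND are actually good proposals (assumed)
log2_frac_looks_and_is_good = -1100

# P(actually good | looks good) is just the ratio of the two measures
log2_p = log2_frac_looks_and_is_good - log2_frac_looks_good
print(f"P(good | looks good) = 2**{log2_p}")  # 2**-100 under these assumptions
```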
I understand that deceptive models won't show signs of deception :) That's why I made the remark of models not showing signs of prerequisites to scary kinds of deception. Unless you think there are going to be no signs of deception or any prerequisites, for any models before we get deceptive ones?
It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models)
I think it's very unclear how big a problem Goodhart is for alignment research - it seems like a question about a particular technical domain.
Just a couple weeks ago I had this post talking about how, in some technical areas, we've been able to find very robust formulations of particular concepts (i.e. "True Names"). The domains where evaluation is much easier - math, physics, CS - are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we're in a domain where we don't have a robust mathematical f... (read more)
This seems to completely ignore the main problem with approaches which try to outsource alignment research to AGI: optimizing for alignment strategies which look promising to a human reviewer will also automatically incentivize strategies which fool the human reviewer. Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
I think it's very unclear how big a problem Goodhart is for alignment research - it seems like a question about a particular technical domain. There are domains where evaluation is much easier; most obviously mathematics, but also in e.g. physics or computer science, there are massive gaps between recognition and generation even if you don't have formal theorem statements. There are also domains where it's not much easier, where the whole thing rests on compl... (read more)
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you're being fooled). It's still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
If it turns out that evaluation of alignment proposals is not easier than generation, we're in pretty big trouble because we'll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we're not sure about those alignment proposals. But then you're still susceptible to the same kinds of problems.
Precision feels pretty far from the true name of the important feature of true names
You're right, I wasn't being sufficiently careful about the wording of a bolded sentence. I should have said "robust" where it said "precise". Updated in the post; thank you.
Also I basically agree that robustness to optimization is not the True Name of True Names, though it might be a sufficient condition.
The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.
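A sketch of what designing for zero mutual information can look like, on the analogy above; the point is that independence here follows from the program's structure rather than from computing anything:

```python
import secrets

def component_a() -> int:
    # draws fresh OS entropy; shares no inputs or state with component_b
    return secrets.randbelow(256)

def component_b() -> int:
    # likewise, independent of component_a by construction
    return secrets.randbelow(256)

# I(A; B) = 0 follows from the two components sharing no inputs or state,
# just as `for i in range(n): ...` provably halts without any general
# solution to the halting problem.
a, b = component_a(), component_b()
```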
There is no way to confirm zero mutual information, and even if there was there is zero probability that the mutual information was zero. Very small, perhaps. Zero, no.
Thanks for bringing this up; it raises a technical point which didn't make sense to include in the post but which I was hoping someone would raise in the comments.
The key point: Goodhart problems are about generalization, not approximation.
Suppose I have a proxy u′ for a true utility function u, and u′ is always within ϵ of u (i.e. |u′−u|<ϵ... (read more)
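To spell out the approximation half of that claim in the notation above (a standard argument, included here for completeness):

```latex
% If |u'(x) - u(x)| < \epsilon for all x, let x^* maximize the proxy u'
% and x_{opt} maximize the true utility u. Then:
\[
u(x^*) > u'(x^*) - \epsilon \geq u'(x_{\mathrm{opt}}) - \epsilon > u(x_{\mathrm{opt}}) - 2\epsilon .
\]
% So a uniformly good approximation loses at most 2\epsilon under
% optimization; serious Goodhart failures require the bound itself to
% fail off-distribution, i.e. a generalization problem.
```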
This is an interesting observation; I don't see how it addresses my point.
There is no exact solution to mutual information from two finite samples. There is no ϵ-approximation of mutual information from two finite samples, either.
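A quick illustration of the finite-sample point: the plug-in estimator comes out strictly positive even on genuinely independent variables (all numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, bins = 1000, 8
x = rng.integers(0, bins, size=n)  # independent of y, so true I(X;Y) = 0
y = rng.integers(0, bins, size=n)

joint, _, _ = np.histogram2d(x, y, bins=bins)
p_xy = joint / n
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
mask = p_xy > 0
mi = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask]))
print(f"plug-in MI estimate: {mi:.3f} bits")  # > 0 despite true MI of 0
```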
On the topic of said observation: beware that ϵ-approximations of many things are proven difficult to compute, and in some cases even are uncomputable. (The classic being Chaitin's Constant.)
In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable... (read more)
Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks.
I think you missed the point of that particular metaphor. The claim was not that revenue of a nail factory is a robust operationalization of nail value. The claim was that a competitive nail market plus nail-maker reputation tracking is a True Name for a p... (read more)
Imagine it's 1665 and we're trying to figure out the True Name of physical force - i.e. how hard it feels like something is pushing or pulling.
One of the first steps is to go through our everyday experience, paying attention to what causes stronger/weaker sensations of pushing and pulling, or what effects stronger/weaker sensations have downstream. We might notice, for instance, that heavier objects take more force to push, or that a stronger push accelerates things faster. So, we might expect to find some robust relationship between the True Names of forc... (read more)
It's not just human values that are messy and contingent, even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent.
What's the evidence for this claim?
When I look at e.g. nails, the economic value of a nail seems reasonably complicated. Yet the "pointers to nail value" which we use in practice - i.e. competitive markets and reputation systems - do have clean, robust mathematical formulations.
Furthermore, before the mid-20th century, I expect that most people would have expected that competitive market... (read more)
It's not clear to me that your metaphors are pointing at something in particular.
Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks. If this was all we needed, then yes, absolutely, I'm sure there's a similarly neat and simple way to instrumentalize human values - it's just going to fail if things are too ... (read more)
To be clear, I do not mean to use the label "mainline prediction" for this whole technique. Mainline prediction tracking is one way of implementing this general technique, and I claim that the usefulness of the general technique is the main reason why mainline predictions are useful to track.
(Also, it matches up quite well with Nate's model based on his comment here, and I expect it also matches how Eliezer wants to use the technique.)
As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal t
This came up with Aysajan about two months ago. An exercise which I recommended for him: first, pick a technical academic paper. Read through the abstract and first few paragraphs. At the end of each sentence (or after each comma, if the authors use very long sentences), pause and write/sketch a prototypical example of what you currently think they're talking about. The goal here is to get into the habit of keeping a "mental picture" (i.e. prototypical example) of what the authors are talking about as you read.
Other good sources on which to try this exerci... (read more)
I find that plausible, a priori. Mostly doesn't affect the stuff in the talk, since that would still come from the environment, and the same principles would apply to culturally-derived values as to environment-derived values more generally. Assuming the hardwired part is figured out, we should still be able to get an estimate of human values within the typical-human-value-distribution-for-a-given-culture from data which is within the typical-human-environment-distribution-for-that-culture.
No plans in motion. Thank you very much if you decide to do so! Also, you might want to message Rob to get the images.
You can think of everything I'm doing as occurring in a "God's eye" model. I expect that an agent embedded in this God's-eye model will only be able to usefully measure natural abstractions within the model. So, shifting to the agent's perspective, we could say "holding these abstractions fixed, what possible models are compatible with them?". And that is indeed a direction I plan to go. But first, I want to get the nicest math I possibly can for computing the abstractions within a model, because the cleaner that is the cleaner I expect that computing poss... (read more)
I think a lot of 10%ers could learn to do wedding-cake multiplication, if sufficiently well-paid as adults rather than being tortured in school, out to 6 digits
What is wedding-cake multiplication? A quick google just turned up a lot of people who want to sell me wedding cakes...
how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?
This question needs a whole essay (or several) on its own. If I don't get around to leaving a longer answer in the next few days, ping me.
Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?
How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more c