Yeah, I agree that I'm conveying the implicit prediction that, even though not all one-month tasks will fall at once, they'll fall closer together than you would expect if you weren't using this framework.
I think I still disagree with your point, as follows: I agree that AI will soon do passably well at summarizing 10k word books, because the task is not very "sharp" - i.e. you get gradual rather than sudden returns to skill differences. But I think it will take significantly longer for AI to beat the quality of summary produced by a median expert in 1 month, because that expert's summary will in fact explore a rich hierarchical interconnected space of concepts from the novel (novel concepts, if you will).
Seems like there's a bunch of interesting stuff here, though some of it is phrased overly strongly.
E.g. "mechanistic interpretability requires program synthesis, program induction, and/or programming language translation" seems possible but far from obvious to me. In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways. Perhaps it's appropriate to advocate for MI researchers to pay more attention to these fields, but calling this an example of "reinventing", "reframing" or "renami...
My default (very haphazard) answer: 10,000 seconds in a day; we're at 1-second AGI now; I'm speculating 1 OOM every 1.5 years, which suggests that coherence over multiple days is 6-7 years away.
The 1.5 years thing is just a very rough ballpark though, could probably be convinced to double or halve it by doing some more careful case studies.
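(Spelling out the arithmetic, taking the ~10,000-second day and the 1-OOM-per-1.5-years figure at face value:

$$\log_{10}\!\left(\frac{10^4\text{ s}}{1\text{ s}}\right) = 4\text{ OOMs}, \qquad 4\text{ OOMs} \times 1.5\ \tfrac{\text{years}}{\text{OOM}} = 6\text{ years},$$

with coherence over multiple days sitting a little past the 4-OOM mark, hence 6-7 years rather than exactly 6.)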
Thanks. For the record, my position is that we won't see progress that looks like "For t-AGI, t increases by +1 OOM every X years" but rather that the rate of OOMs per year will start off slow and then accelerate. So e.g. here's what I think t will look like as a function of years:
| Year | Richard (?) guess | Daniel guess |
|------|-------------------|--------------|
| 2023 | 1 | 5 |
| 2024 | 5 | 15 |
| 2025 | 25 | 100 |
| 2026 | 100 | 2,000 |
| 2027 | 500 | Infinity (singularity) |
| 2028 | 2,500 | |
| 2029 | 10,000 | |
| 2030 | 50,000 | |
| 2031 | 250,000 | |
| 2032 | 1,000,000 | |
I think this partly because of the way I think generalization works (I think e.g. once AIs have gotten...
Why is it cheating? That seems like the whole point of my framework - that we're comparing what AIs can do in any amount of time to what humans can do in a bounded amount of time.
Whatever. Maybe I was just jumping on an excuse to chit-chat about possible limitations of LLMs :) And maybe I was thread-hijacking by not engaging sufficiently with your post, sorry.
This part you wrote above was the most helpful for me:
if the task is "spend a month doing novel R&D for lidar", then my framework predicts that we'll need 1-month AGI for that
I guess I just want to state my opinion that (1) summarizing a 10,000-page book is a one-month task but could come pretty soon if indeed it’s not already possible, (2) spending a month doing novel R&D...
But then we could just ask the question: “Can you please pose a question about string theory that no AI would have any prayer of answering, and then answer it yourself?” That’s not cherry-picking, or at least not in the same way.
But can't we equivalently just ask an AI to pose a question that no human would have a prayer of answering in one second? It wouldn't even need to be a trivial memorization thing, it could also be a math problem complex enough that humans can't do it that quickly, or drawing a link between two very different domains of knowledge.
How long would it take (in months) to train a smart recent college graduate with no specialized training in my field to complete this task?
This doesn't seem like a great metric because there are many tasks that a college grad can do with 0 training that current AI can't do, including:
I do think that there's something important about this metric, but I think it's basically subsumed by my metric: if the task is "spend a month doing novel R&D for...
These are all arguments about the limit; whether or not they're relevant depends on whether they apply to the regime of "smart enough to automate alignment research".
For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help find an alignment solution.
2) Debate is a plausible basis for an alignment solution.
I generally don't think about things in terms of this dichotomy. To me, an "alignment solution" is anything that will align an AGI which is then capable of solving alignment for its successor. And so I don't think you can separate these two things.
(Of course I agree that debate is not an arbitrarily scalable alignment solution in the sense that you can just keep training...
To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events.
I think this depends sensitively on whether the "actor" and the "critic" in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the c...
In general if two possible models perform the same, then I expect the weights to drift towards the simpler one. And in this case they perform the same because of deceptive alignment: both are trying to get high reward during training in order to be able to carry out their misaligned goal later on.
Because of standard deceptive alignment reasons (e.g. "I should make sure gradient descent doesn't change my goal; I should make sure humans continue to trust me").
This doesn't seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning "X is my misaligned terminal goal, and therefore I'm going to deceptively behave as if I'm aligned" and then acts perfectly like an aligned agent from then on. My claims then would be:
a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up.
b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are...
So I'm imagining the agent doing reasoning like:
Misaligned goal --> I should get high reward --> Behavior aligned with reward function
and then I'm hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make "I should get high reward" the terminal goal. (I could imagine this being false though depending on the details of how terminal and instrumental goals are implemented.)
I could also imagine something more like:
Misaligned goal --> I should behave in al...
Deceptive alignment doesn't preserve goals.
A short note on a point that I'd been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misali...
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representati...
Supervised data seems way more fine-grained in what you are getting the AI to do. It's just that supervised fine-tuning is worse.
My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don't want it to do.
Note that the "without countermeasures" post consistently discusses both possibilities
Yepp, agreed, the thing I'm objecting to is how you mainly focus on the reward case, and then say "but the same dynamics apply in other cases too..."
I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that's distinct from being confident in the motivations that give rise to that policy.
The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).
I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible.
If I had to try to point to the crux here, it might be "how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?" Where we both agree that there's some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existi...
(Written quickly and not very carefully.)
I think it's worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover", and Cohen et al.'s "Advanced artificial agents intervene in the provision of reward". They focus on policies learning the goal of getting high reward. But I have two problems with this:
Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has "policy learns to care about reward directly" as a footnote; I can imagine updating it based on the outcome of this discussion though.
In general I think it's better to reason in terms of continuous variables like "how helpful is the iterative design loop" rather than "does it work or does it fail"?
My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that, conditional on the first being wrong, the second is not very action-guiding. E.g. conditional on the first, the most impactful thing is probably to aim towards worlds in which we hit or miss by only a little bit; and that might still be true if it's 5% of worlds rather than 50% of worlds.
Upon further thought, I have another hypothesis about why there seems like a gap here. You claim here that the distribution is bimodal, but your previous claim ("I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1") suggests you don't actually think there's significant probability on the lower mode, you essentially think it's unimodal on the "iterative design fails" worlds.
I personally disagree with both the "significant probability on both modes, but not in between" hypothesis, and the "unimodal ...
I think you're just doing the bimodal thing again. Sure, if you condition on worlds in which alignment happens automagically, then it's not valuable to advance the techniques involved. But there's a spectrum of possible difficulty, and in the middle parts there are worlds where RLHF works, but only because we've done a lot of research into it in advance (e.g. exploring things like debate); or where RLHF doesn't work, but finding specific failure cases earlier allowed us to develop better techniques.
in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF
In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn't RLHF be one of them?
In particular, the excerpts/claims from Get What You Measure are pretty cruxy.
It seems pretty odd to explain this by quoting someone who thinks that this effect is dramatically less important than you do (i.e. nowhere near causing a ~100% probability of iterative design failing). Not gonna debate this on the object level,...
...In worlds where the iterative design loop works for alignment, we probably survive AGI. So, if we want to improve humanity’s chances of survival, we should mostly focus on worlds where, for one reason or another, the iterative design loop fails. ... Among the most basic robust design loop failures is problem-hiding. It happens all the time in the real world, and in practice we tend to not find out about the hidden problems until after a disaster occurs. This is why RLHF is such a uniquely terrible strategy: unlike most other alignment schemes, it make
I believe this because of how the world looks "brittle" (e.g., nanotech exists) and because lots of technological progress seems cognition-constrained (such as, again, nanotech). This is a big part of why I think heavy-precedent-style justifications are doomed.
Apart from nanotech, what are the main examples or arguments you would cite in favor of these claims?
Separately, how close is your conception of nanotech to "atomically precise manufacturing", which seems like Drexler's preferred framing right now?
A couple of differences between Kolmogorov complexity/Shannon entropy and the loss function of autoregressive LMs (just to highlight them, not trying to say anything you don't already know):
I agree that we'll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as "reflecting back on that data" in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).
if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-share and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than that maybe an AGI would likely convergently become that way before taking over the world.
I weakly disagree here, mainly because Nate's argument for very high levels of risk goes through strong generalization/a "sharp left turn" towards being much more coherent + goal-directed. So I find it hard...
Broadly agree with this post. Couple of small things:
...Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-goo
I considered this, but it seems like the latter is 4x longer while covering fairly similar content?
Given the baseline classifier's 0.003% failure rate, you would have to sample and label 30,000 in-distribution examples to find a failure (which would cost about $10,000). With our tools, our contractors are able to find an adversarial example on the baseline classifier every 13 minutes (which costs about $8 – about 1000x cheaper).
This isn't comparing apples to apples, though? If you asked contractors to find adversarial examples without using the tools, they'd likely find them at a rate much higher than 0.003%.
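To make the apples-to-oranges worry concrete, here's a rough cost-per-failure sketch. Everything except `minutes_without_tools` comes from the quoted numbers; that one is a purely made-up placeholder standing in for the baseline the comparison doesn't report (contractors searching adversarially without the tools):

```python
# Back-of-envelope comparison of cost per discovered classifier failure.

# 1. Random in-distribution sampling + labelling until a failure shows up.
in_dist_failure_rate = 0.003 / 100                   # 0.003%
samples_per_failure = 1 / in_dist_failure_rate       # ~33,000 (quoted as ~30,000)
cost_per_label = 10_000 / 30_000                     # ~$0.33/label, implied by the quote
cost_random = samples_per_failure * cost_per_label   # ~$11,000 per failure

# 2. Contractors searching for adversarial examples *with* the tools.
cost_per_contractor_minute = 8 / 13                  # $8 per 13-minute find
cost_with_tools = 13 * cost_per_contractor_minute    # $8 per failure, by construction

# 3. Contractors searching *without* the tools -- the baseline I'd want to see.
minutes_without_tools = 120                          # placeholder guess, not a measurement
cost_without_tools = minutes_without_tools * cost_per_contractor_minute

print(f"random sampling: ~${cost_random:,.0f} per failure")
print(f"with tools:      ~${cost_with_tools:.0f} per failure")
print(f"without tools:   ~${cost_without_tools:.0f} per failure (placeholder rate)")
```

Whatever the real no-tools rate turns out to be, that third line is the comparison that tells us how much the tools themselves are buying, as opposed to how much adversarial search in general beats random sampling.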
None of the "hard takeoff people" or hard takeoff models predicted or would predict that the sorts of minor productivity advancements we are starting to see would lead to a FOOM by now.
The hard takeoff models predict that there will be fewer AI-caused productivity advancements before a FOOM than the soft takeoff models do. Therefore any AI-caused productivity advancements without a FOOM are evidence against the hard takeoff models, relative to the soft takeoff models.
You might say that this evidence is pretty weak; but it feels hard to discount the evidence too much when there are few concrete claims by hard-takeoff proponents about what advances would surprise them. Everything is kinda prosaic in hindsight.
Thanks for the comments Vika! A few responses:
It might be good to clarify that this is an example architecture and the claims apply more broadly.
Makes sense, will do.
Phase 1 and 2 seem to map to outer and inner alignment respectively.
That doesn't quite seem right to me. In particular:
How? E.g. Jacob left a comment here about his motivations; does that count as a falsification? Or, if you'd say that this is an example of rationalization, then what would the comment need to look like in order to falsify your claim? Does Paul's comment here mentioning the discussions that took place before launching the GPT-3 work count as a falsification? If not, why not?
Jacob's comment does not count, since it's not addressing the "actually consider whether the project will net decrease chance of extinction" or the "could the answer have plausibly been 'no' and then the project would not have happened" part.
Paul's comment does address both of those, especially this part at the end:
...To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arg
At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed like mainly rhetorical flourishing rather than a contentful claim about the structure of the difficult parts of the alignment problem.
Also, I think that "on our first try" thing isn't ...
RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples, but also intrinsic motivation functions like curiosity and empowerment) which are used to train agents in the absence of RLHF.
The smiley faces example feels confusing as a "classic" outer alignment problem because AGIs won't be trained on a reward function anywhere near as limited as smiley faces. An alternative like "AGIs are trained on a reward function in wh...
I think the smiling example is much more analogous than you are making it out here. I think the basic argument for "this just encourages taking control of the reward" or "this just encourages deception" goes through the same way.
Like, RLHF is not some magical "we have definitely figured out whether a behavior is really good or bad" signal, it's historically been just some contractors thinking for like a minute about whether a thing is fine. I don't think there is less bayesian evidence conveyed by people smiling (like, the variance in smiling is greater th...
I take this comment as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that's being made here is failing to recognize that reality doesn't grade on a curve when it comes to understanding the world - your arguments can be false even if nobody has refuted them. That's particularly true when it comes to very high-level abstractions, like the ones th...
Comments on parts of this other than the ITT thing (response to the ITT part is here)...
(and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment)
I don't usually focus much on the outer/inner abstraction, and when I do I usually worry about outer alignment. I consider RLHF to have been negative progress on outer alignment, same as inner alignment; I wasn't relying on that particular abstraction at all.
...Historically, the way that great scientists have gotten around
I take this comment as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is.
I don't want to pour a ton of effort into this, but here's my 5-paragraph ITT attempt.
"As an analogy for alignment, consider processor manufacturing. We didn't get to gigahertz clock speed and ten nanometer feature size by trying to tackle all the problems of 10 nm manufacturing processes right out the gate. That would never have worked; too many things independently go wrong to isola...
There's one attitude towards alignment techniques which is something like "do they prevent all catastrophic misalignment?" And there's another which is more like "do they push out the frontier of how advanced our agents need to be before we get catastrophic misalignment?" I don't think the former approach is very productive, because by that standard no work is ever useful. So I tend to focus on the latter, with the theory of victory being "push the frontier far enough that we get a virtuous cycle of automating alignment work".
Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).
I don't really know what "reliable empirical feedback" means in this context - if you have sufficiently reliable feedback mechanisms, then you've solved m...
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).
I do really feel like it would have been really extremely hard to build rockets if we had to get it right on the very first try.
I think that for rockets, the fact that it is so costly to experiment with stuff explains the majority of the difficulty of rocket engineering. I agree you also have very little chance to build a successful space rocket without having a g...
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).
I basically buy that argument, though I do still think lack of shots is the main factor which makes alignment harder than most other technical fields in their preparadigmatic stage.
Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late.
I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd still be hard to convert that into a solution for alignment. Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).
I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd be very hard to convert that into a solution for alignment.
If I imagine that we magically had a boxing setup which let us experiment with powerful AGI alignment without dying, I do agree it would still be hard to solve alignment. But it wouldn't be harder than the core problems of any other field of science/engineering. It wouldn't be unusually hard, by the standards of technical research.
Of course, "empirical ...
Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".
My model is that John is talking about "evidence on whether an AI alignment solution is sufficient", and you understood him to say "evidence on whether the AI Alignment problem is real/difficult". My guess is you both agree on the former, but I am not confident.
No, I'm thinking of cases where Alice>Bob, and trying to gesture towards the distinction between "Bob knows that Alice believes X" and "Bob can use X to make predictions".
For example, suppose that Bob is a mediocre physicist and Alice just invented general relativity. Bob knows that Alice believes that time and space are relative, but has no idea what that means. So when trying to make predictions about physical events, Bob should still use Newtonian physics, even when those calculations require assumptions that contradict Alice's known beliefs.
Interesting post! Two quick comments:
Sometimes we analyze agents from a logically omniscient perspective. ... However, this omniscient perspective eliminates Vingean agency from the picture.
Another example of this happening comes when thinking about utilitarian morality, which by default doesn't treat other agents as moral actors (as I discuss here).
...Bob has minimal use for attributing beliefs to Alice, because Bob doesn't think Alice is mistaken about anything -- the best he can do is to use his own beliefs as a proxy, and try to figure out what Alice will
...
- Another response is "The AI paralyzes your face into smiling."
- But this is actually a highly nontrivial claim about the internal balance of value and computation which this reinforcement schedule carves into the AI. Insofar as this response implies that an AI will primarily "care about" literally making you smile, that seems like a highly speculative and unsupported claim about the AI internalizing a single powerful decision-relevant criterion / shard of value, which also happens to be related to the way that humans conceive of the situation (i.e. som
I intend to convert this report to a nicely-formatted PDF with academic-style references. Please comment below, or message me, if you're interested in being paid to do this. EDIT: have now hired someone to do it.
More generally, I'll likely make a number of edits over the coming weeks, so comments and feedback would be very welcome.
Just skimmed the course. One suggestion (will make more later): add the Goal Misgeneralization paper from Langosco et al. as a core reading in the week on Detecting and Forecasting Emergent Behavior.
Hmm, perhaps clearer to say "reward does not automatically reinforce reward-focused thoughts into terminal values", given that we both agree that agents will have thoughts about reward either way.
But if you agree that reward gets reinforced as an instrumental value, then I think your claims here probably need to actually describe the distinction between terminal and instrumental values. And this feels pretty fuzzy - e.g. in humans, I think the distinction is actually not that clear-cut.
In other words, if everyone agrees that reward likely becomes a strong ...
Amplification can just be used as a method for making more and better common-sense improvements, though. You could also do all sorts of other stuff with it, but standard examples (like "catch agents when they lie to us") seem very much like common-sense improvements.
+1 on this comment, I feel pretty confused about the excerpt from Paul that Steve quoted above. And even without the agent deliberately deciding where to avoid exploring, incomplete exploration may lead to agents which learn non-reward goals before convergence - so if Paul's statement is intended to refer to optimal policies, I'd be curious why he thinks that's the most important case to focus on.
I just stumbled upon the Independence of Pareto dominated alternatives criterion; does the ROSE value have this property? I'm pattern-matching it as related to disagreement-point invariance, but haven't thought about this at all.
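(For reference, the statement I have in mind, paraphrased rather than taken from the ROSE posts: a bargaining solution $F$ is independent of Pareto-dominated alternatives if deleting only dominated points from the feasible set never moves the solution, i.e.

$$\mathrm{PO}(S) \cup \{d\} \subseteq S' \subseteq S \;\implies\; F(S', d) = F(S, d),$$

where $\mathrm{PO}(S)$ is the set of Pareto-optimal points of $S$ and $d$ is the disagreement point. If I've garbled that, corrections welcome.)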