Yep, I'd say I intuitively agree with all of that, though I'd add that if you want to specify the set of "outcomes" differently from the set of "goals", then that must mean you're implicitly defining a mapping from outcomes to goals. One analogy could be that an outcome is like a thermodynamic microstate (in the sense that it's a complete description of all the features of the universe) while a goal is like a thermodynamic macrostate (in the sense that it's a complete description of the features of the universe that the system can perceive).
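(To make the analogy concrete, here's a minimal Python sketch of a toy universe. The three binary features, the perceive function, and all the other names are illustrative assumptions of mine rather than anything from the post.)

```python
from collections import defaultdict
from itertools import product

# Toy universe: an "outcome" (microstate) is a full assignment to three binary
# features of the world. The system can only perceive the first feature, so the
# goal-relevant description (macrostate) is whatever that perception returns.
OUTCOMES = list(product([0, 1], repeat=3))  # 8 microstates

def perceive(outcome):
    """Project a complete outcome down to the features the system can observe."""
    return outcome[0]

# The implicit mapping from outcomes to goal-relevant states:
macrostate_of = {o: perceive(o) for o in OUTCOMES}

# Each macrostate lumps together many microstates, just as a thermodynamic
# macrostate lumps together many microstates.
fibers = defaultdict(list)
for o, m in macrostate_of.items():
    fibers[m].append(o)

print({m: len(outs) for m, outs in fibers.items()})  # {0: 4, 1: 4}
```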
This mapping fr...
Gotcha. I definitely agree with what you're saying about the effectiveness of incentive structures. And to be clear, I also agree that some of the affordances in the quote reasonably fall under "alignment": e.g., if you explicitly set a specific mission statement, that's a good tactic for aligning your organization around that specific mission statement.
But some of the other affordances aren't as clearly goal-dependent. For example, iterating quickly is an instrumentally effective strategy across a pretty broad set of goals a company might have. That (in m...
Thanks, great post.
These include formulating and repeating a clear mission statement, setting up a system for promotions that rewards well-calibrated risk taking, and iterating quickly at the beginning of the company in order to habituate a rhythm of quick iteration cycles.
I may be misunderstanding, but wouldn't these techniques fall more under the heading of capabilities rather than under alignment? These are tactics that should increase a company's effectiveness in general, for most reasonable mission statements or products the company could have.
This is fantastic. Really appreciate both the detailed deep-dive in the document, and the summary here. This is also timely, given that teams working on superscale models with concerning capabilities haven't generally been too forthcoming with compute estimates. (There are exceptions.)
As you and Alex point out in the sibling thread, the biggest remaining fudge factors seem to be:
It's simply because we each (myself more than her) have an inclination to apply a fair amount of adjustment in a conservative direction, for generic "burden of proof" reasons, rather than go with the timelines that seem most reasonable based on the report in a vacuum.
While one can sympathize with the view that the burden of proof ought to lie with advocates of shorter timelines when it comes to the pure inference problem ("When will AGI occur?"), it's worth observing that in the decision problem ("What should we do about it?") this situation is reversed. T...
This is an excellent point and it's indeed one of the fundamental limitations of a public tracking approach. Extrapolating trends in an information environment like this can quickly degenerate into pure fantasy. All one can really be sure of is that the public numbers are merely lower bounds — and plausibly, very weak ones.
Yeah, great point about Gopher, we noticed the same thing and included a note to that effect in Gopher's entry in the tracker.
I agree there's reason to believe this sort of delay could become a bigger factor in the future, and may already be a factor now. If we see this pattern develop further (and if folks start publishing "model cards" more consistently like DM did, which gave us the date of Gopher's training) we probably will begin to include training date as separate from publication date. But for now, it's a possible trend to keep an eye on.
Thanks again!
A more typical example: I can look at a chain of options on a stock, and use the prices of those options to back out market-implied probabilities for each possible stock price at expiry.
Gotcha, this is a great example. And the fundamental reasons why this works are 1) the immediate incentive that you can earn higher returns by pricing the option more correctly; combined with 2) the fact that the agents who are assigning these prices have (on a dollar-weighted-average basis) gone through multiple rounds of selection for higher returns.
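(For concreteness, here's a rough sketch of the mechanics you're describing: backing an implied distribution out of a chain of call prices via butterfly spreads, i.e. the second derivative of the call price with respect to strike. The prices are made up, and a real implementation would also handle discounting, bid-ask spreads, and interpolation across strikes.)

```python
import numpy as np

# Hypothetical chain of European call prices at evenly spaced strikes, same expiry.
# The numbers are made up purely for illustration.
strikes = np.array([80.0, 90.0, 100.0, 110.0, 120.0])
calls = np.array([21.0, 12.5, 6.0, 2.2, 0.6])
dK = strikes[1] - strikes[0]

# Breeden-Litzenberger: the risk-neutral density is the second derivative of the
# call price with respect to strike (discounting ignored here for simplicity).
# A butterfly spread gives a finite-difference estimate of that second derivative.
density = (calls[:-2] - 2 * calls[1:-1] + calls[2:]) / dK**2
bin_probs = density * dK  # approx. probability of expiring near each interior strike

# Note: these won't sum to 1, because the tails beyond the listed strikes are dropped.
for K, q in zip(strikes[1:-1], bin_probs):
    print(f"P(S_T ≈ {K:.0f}) ≈ {q:.3f}")
```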
(I wonder to what exte...
Okay, then to make sure I've understood correctly: what you were saying in the quoted text is that you'll often see an economist, etc., use coherence theorems informally to justify a particular utility maximization model for some system, with particular priors and conditionals. (As opposed to using coherence theorems to justify the idea of EU models generally, which is what I'd thought you meant.) And this is a problem because the particular priors and conditionals they pick can't be justified solely by the coherence theorem(s) they cite.
...
The problem with V...
Thanks so much for the feedback!
The ability to sort by model size etc would be nice. Currently sorting is alphabetical.
Right now the default sort is actually chronological by publication date. I just added the ability to sort by model size and compute budget at your suggestion. You can use the "⇅ Sort" button in the Models tab to try it out; the rows should now sort correctly.
...
Also the rows with long textual information should be more to the right and the more informative/tighter/numerical columns more to the left (like "deep learning" in almost all r...
I’m surprised by just how much of a blindspot goal-inputs seem to be for today’s economists, AI researchers, etc. The coherence theorems usually cited to justify expected utility maximization models imply a quite narrow range of inputs to those utility functions: utilities are only over the outcomes on which agents can bet. Yet practitioners use utility functions over entire (unobservable) world states, world state trajectories, MDP states, etc, often without any way for the agent to bet on all of the outcomes.
It's true that most of the agents we build can...
Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields.
Thanks for the clarification — definitely agree with this.
If you'd like to visualize trends though, you'll need more historical data points, I think.
Yeah, you're right. Our thinking was that we'd be able to do this with future data points or by increasing the "density" of points within the post-GPT-3 era, but ultimately it will probably be necessary (and more compelling) to include somewhat older examples too.
Interesting; I hadn't heard of DreamerV2. From a quick look at the paper, it looks like one might describe it as a step on the way to something like EfficientZero. Does that sound roughly correct?
it would be great to see older models incorporated as well
We may extend this to older models in the future. But our goal right now is to focus on these models' public safety risks as standalone (or nearly standalone) systems. And prior to GPT-3, it's hard to find models whose public safety risks were meaningful on a standalone basis — while an earlier model could have been used as part of a malicious act, for example, it wouldn't be as central to such an act as a modern model would be.
Yeah, these are interesting points.
Isn't it a bit suspicious that the thing-that's-discontinuous is hard to measure, but the-thing-that's-continuous isn't? I mean, this isn't totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.
I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I'm not sure I entirely agree that disc...
I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we've seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like "code a simple video game" and "summarize movies with emojis", they also include things like "break out of confinement and kill everyone". It's the latter capabil...
I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we've seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance.
This is a reasonable thesis, and if indeed it's the one Gwern intended, then I apologize for missing it!
That said, I have a few objections:
Good catch! I didn't check the form. Yes, you're right: the spoiler should say (1=Paul, 9=Eliezer), but the conclusion is the right way round.
(Not being too specific to avoid spoilers) Quick note: I think the direction of the shift in your conclusion might be backwards, given the statistics you've posted and that 1=Eliezer and 9=Paul.
Thanks for the kind words and thoughtful comments.
You're absolutely right that expected ROI ultimately determines scale of investment. I agree on your efficiency point too: scaling and efficiency are complements, in the sense that the more you have of one, the more it's worth investing in the other.
I think we will probably include some measure of efficiency as you've suggested. But I'm not sure exactly what that will be, since efficiency measures tend to be benchmark-dependent, so it's hard to get apples-to-apples here for a variety of reasons. (e.g., diffe...
Gotcha. Well, that seems right—certainly in the limit case.
Thanks, that helps. So actually this objection says: "No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you've built yourself an AGI. But since this myopic copying...
This is a great thread. Let me see if I can restate the arguments here in different language:
Eliezer's counterargument is "You don't get a high-fidelity copy of Bob that can be iterated and recursed to do arbitrary amounts of work a Bob-army could do, the way Bob could do it, until many years after the world otherwise ends. The imitated Bobs are imperfect, and if they scale to do vast amounts of work, kill you."
Abstracting out one step: there is a rough general argument that human-imitating AI is, if not perfectly safe, then at least as safe as the humans it's imitating. In particular, if it's imitating humans working on alignment, then it's at least as likely as we are to come up with an aligned AI. Its prospects are no worse than our prospects are already. (And plausibly better, since the simulated humans may have more time to solve the problem.)
For full strength, this argument requires that:
I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.
Yeah, very much agree with all of this. I even think there's an argument to be made that relatively narrow-yet-superhuman theorem provers (or other resear...
Great catch. For what it's worth, it actually seems fine to me intuitively that any finite pattern would be an optimizing system for this reason, though I agree most such patterns may not directly be interesting. But perhaps this is a hint that some notion of independence or orthogonality of optimizing systems might help to complete this picture.
Here's a real-world example: you could imagine a universe where humans are minding their own business over here on Earth, while at the same time, over there in a star system 20 light-years away, two planets are hur...
Extremely interesting — thanks for posting. Obviously there are a number of caveats which you carefully point out, but this seems like a very reasonable methodology and the actual date ranges look compelling to me. (Though they also align with my bias in favor of shorter timelines, so I might not be impartial on that.)
One quick question about the end of this section:
The expected number of bits in original encoding per bits in the compression equals the entropy of that language.
Wouldn't this be the other way around? If your language has low entropy, it shoul...
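(A quick numerical sanity check of the direction, under my own toy assumptions: an i.i.d. binary source, a 1-bit-per-symbol original encoding, and an idealized entropy-achieving compressor.)

```python
import math

# Toy "language": i.i.d. binary symbols where '0' occurs with probability p, so the
# original encoding is 1 bit per symbol and an ideal compressor needs ~H(p) bits/symbol.
p = 0.99
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # entropy in bits/symbol (~0.08)

n = 10_000                    # number of symbols
original_bits = n * 1         # fixed-width original encoding
compressed_bits = n * H       # ideal (entropy-achieving) compression

print(f"entropy H                  = {H:.4f} bits/symbol")
print(f"compressed / original bits = {compressed_bits / original_bits:.4f}  (= H)")
print(f"original / compressed bits = {original_bits / compressed_bits:.2f}  (= 1/H)")
# i.e. it's (compressed bits per original bit) that comes out to the entropy here,
# not the other way around.
```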
Thanks! I think this all makes sense.
Loved this post. This whole idea of using a deterministic dynamical system as a conceptual testing ground feels very promising.
A few questions / comments:
Very neat. It's quite curious that switching to L2 for the base optimizer doesn't seem to have resulted in the meta-initialized network learning the sine function. What sort of network did you use for the meta-learner? (It looks like the 4-layer network in your Methods refers to your base optimizer, but perhaps it's the same architecture for both?)
Also, do you know if you end up getting the meta-initialized network to learn the sine function eventually if you train for thousands and thousands of steps? Or does it just never learn no matter how hard you train it?
I see — perhaps I did misinterpret your earlier comment. It sounds like the transition you are more interested in is closer to (AI has ~free rein over the internet) => (AI invents nanotech). I don't think this is a step we should expect to be able to model especially well, but the best story/analogy I know of for it is probably the end part of That Alien Message. i.e., what sorts of approaches would we come up with, if all of human civilization was bent on solving the equivalent problem from our point of view?
If instead you're thinking more about a tran...
No problem, glad it was helpful!
And thanks for the APS-AI definition, I wasn't aware of the term.
Thanks! I agree with this critique. Note that Daniel also points out something similar in point 12 of his comment — see my response.
To elaborate a bit more on the "missing step" problem though:
See my response to point 6 of Daniel's comment — it's rather that I'm imagining competing hedge funds (run by humans) beginning to enter the market with this sort of technology.
Hey Daniel — thanks so much for taking the time to write this thoughtful feedback. I really appreciate you doing this, and very much enjoyed your "2026" post as well. I apologize for the delay and lengthy comment here, but wanted to make sure I addressed all your great points.
1. It would be great if you could pepper your story with dates, so that we can construct a timeline and judge for ourselves whether we think things are happening too quickly or not.
I've intentionally avoided referring to absolute dates, other than by indirect implication (e.g. "iOS 19...
I see. Okay, I definitely agree that makes sense under the "fails to generalize" risk model. Thanks Rohin!
Got it, thanks!
I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren't the ones that are actually created by AGI.
This helps, and I think it's the part I don't currently have a great intuition for. My best attempt at steel-manning would be something like: "It's plausible that an AGI will generalize correctly to distributions which it is itself responsible for bringing about." (Where "correctly" here means "in a way that's consistent with its builders' wishes.") And you could plausibly argue that an AGI would hav...
I agree with pretty much this whole comment, but do have one question:
But it still seems plausible that in practice we never hit those exotic circumstances (because those exotic circumstances never happen, or because we've retrained the model before we get to the exotic circumstances, etc), and it's intent aligned in all the circumstances the model actually encounters.
Given that this is conditioned on us getting to AGI, wouldn't the intuition here be that pretty much all the most valuable things such a system would do would fall under "exotic circumstances...
But in the context of superhuman systems, I think we need to be more concerned by the possibility that it’s performance-uncompetitive to restrict your system to only take actions that can be justified entirely with human-understandable reasoning.
Interestingly, this is already a well-known phenomenon in the hedge fund world. In fact, quant funds discovered about 25 years ago that the most consistently profitable trading signals are reliably the ones that are the least human-interpretable. It makes intuitive sense: any signal that can be understood by a huma...
One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that didn't hold below our own modest level of intelligence.
No problem! Glad it was helpful. I think your fix makes sense.
I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation.
Yeah, I figured maybe it was because the dummy variable was being used in the expected value to sum over outcomes, while the vector was being used to represent the probabilities associated with those outcomes. Because the two notations are similar, it's easy to conflate their meanings, and if you apply one to the wrong...
Thanks for writing this.
I have one point of confusion about some of the notation that's being used to prove Lemma 3. Apologies for the detail, but the mistake could very well be on my end so I want to make sure I lay out everything clearly.
First, we're defining an outcome permutation here: presumably a bijection from the set of outcomes to itself, which 1) maps each outcome to some (possibly different) outcome, and 2) admits a unique inverse. That makes sense.
We also define lotteries over outcomes, presumably as probability distributions that assign a weight to each outcome, where ...
Update: having now thought more deeply about this, I no longer endorse my above comment.
While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:
Everything in the above comment then still goes through, except with these definitions reversed.
On the one hand, the "per...
I'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories.
But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to th...
Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible in principle to do so in our universe.
To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make ...
I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.
Oh for sure. I wouldn't recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what could be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary...
I would further add that looking for difficulties created by the simplification seems very intellectually productive.
Yep, strongly agree. And a good first step to doing this is to actually build as robust a simplification as you can, and then see where it breaks. (Working on it.)
Ah I see! Thanks for clarifying.
Yes, the point about the Cartesian boundary is important. And it's completely true that any agent / environment boundary we draw will always be arbitrary. But that doesn't mean one can't usefully draw such a boundary in the real world — and unless one does, it's hard to imagine how one could ever generate a working definition of something like a mesa-objective. (Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")
Of course the right question will always be: "what is the whole unive...
which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment
I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agen...
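(To gesture at what I mean, here's a minimal, purely illustrative sketch of that carving. The classes and the toy reward are my own invention rather than any particular library's API; the point is just that the boundary is the interface across which observations and actions pass.)

```python
import random

class Environment:
    """Everything on the far side of the boundary: state the agent can't touch directly."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action            # world dynamics
        observation = self.state        # what crosses the boundary outward
        reward = -abs(self.state - 10)  # toy objective: drive the state to 10
        return observation, reward

class Agent:
    """Everything on the near side of the boundary: a policy from observations to actions."""
    def act(self, observation):
        return 1 if observation < 10 else random.choice([-1, 0])

# The same Agent object can be dropped into a different Environment without changing
# its code -- which is the sense in which the carving is real, even though where we
# draw the line is a modeling choice.
env, agent = Environment(), Agent()
obs = 0
for _ in range(15):
    obs, reward = env.step(agent.act(obs))
```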
If we wish, we could replace or re-define "capability robustness" with "inner robustness", the robustness of pursuit of the mesa-objective under distributional shift.
I strongly agree with this suggestion. IMO, tying capability robustness to the behavioral objective confuses a lot of things, because the set of plausible behavioral objectives is itself not robust to distributional shift.
One way to think about this from the standpoint of the "Objective-focused approach" might be: the mesa-objective is the thing the agent is revealed to be pursuing under arbit...
Ah yes, that's right. Yeah, I just wanted to make this part fully explicit to confirm my understanding. But I agree it's equivalent to just have the model ignore the extra component (or whatever we call it).
Thanks very much!
Late comment here, but I really liked this post and want to make sure I've fully understood it. In particular there's a claim near the end which says: if the variable in question is not fixed, then we can build equivalent models for which it is fixed. I'd like to formalize this claim to make sure I'm 100% clear on what it means. Here's my attempt at doing that:
For any pair of models satisfying the relevant condition, there exists a variable (of which the original variable is a subset) and a pair of models ...
Interesting. The specific idea you're proposing here may or may not be workable, but it's an intriguing example of a more general strategy that I've previously tried to articulate in another context. The idea is that it may be viable to use an AI to create a "platform" that accelerates human progress in an area of interest to existential safety, as opposed to using an AI to directly solve the problem or perform the action.
Essentially:
- A "platform" for work in domain X is something that removes key constraints that would otherwise have consumed human time an
... (read more)