All of Edouard Harris's Comments + Replies

AI Tracker: monitoring current and near-future risks from superscale models

Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields.

Thanks for the clarification — definitely agree with this.

If you'd like to visualize trends though, you'll need more historical data points, I think.

Yeah, you're right. Our thinking was that we'd be able to do this with future data points or by increasing the "density" of points within the post-GPT-3 era, but ultimately it will probably be necessary (and more compelling) to include somewhat older examples too.

AI Tracker: monitoring current and near-future risks from superscale models

Interesting; I hadn't heard of DreamerV2. From a quick look at the paper, it looks like one might describe it as a step on the way to something like EfficientZero. Does that sound roughly correct?

it would be great to see older models incorporated as well

We may extend this to older models in the future. But our goal right now is to focus on these models' public safety risks as standalone (or nearly standalone) systems. And prior to GPT-3, it's hard to find models whose public safety risks were meaningful on a standalone basis — while an earlier model could have been used as part of a malicious act, for example, it wouldn't be as central to such an act as a modern model would be.

2 maximkazhenkov (7d): Yes. They don't share a common lineage, but are similar in that they're both recent advances in efficient model-based RL. Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields. I see. If you'd like to visualize trends though, you'll need more historical data points, I think.
Yudkowsky and Christiano discuss "Takeoff Speeds"

Yeah, these are interesting points.

Isn't it a bit suspicious that the thing-that's-discontinuous is hard to measure, but the-thing-that's-continuous isn't? I mean, this isn't totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.

I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I'm not sure I entirely agree that disc... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we've seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like "code a simple video game" and "summarize movies with emojis", they also include things like "break out of confinement and kill everyone". It's the latter capabil... (read more)

I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we've seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance.

This is a reasonable thesis, and if indeed it's the one Gwern intended, then I apologize for missing it!

That said, I have a few objections:

  • Isn't it a bit suspicious that the thing-that's-discontinuous is hard to measure, but the-thing-that's-continuous isn't? I mean, this isn't totally suspicious, because
... (read more)
Yudkowsky and Christiano discuss "Takeoff Speeds"

Good catch! I didn't check the form. Yes, you're right: the spoiler should say (1=Paul, 9=Eliezer), but the conclusion is the right way round.

4 Rafael Harth (8d): Yeah, it's fixed now. Thanks for pointing it out.
Yudkowsky and Christiano discuss "Takeoff Speeds"

(Not being too specific to avoid spoilers) Quick note: I think the direction of the shift in your conclusion might be backwards, given the statistics you've posted and that 1=Eliezer and 9=Paul.

4 Lukas Finnveden (8d): No, the form says that 1=Paul. It's just the first sentence under the spoiler that's wrong.
AI Tracker: monitoring current and near-future risks from superscale models

Thanks for the kind words and thoughtful comments.

You're absolutely right that expected ROI ultimately determines scale of investment. I agree on your efficiency point too: scaling and efficiency are complements, in the sense that the more you have of one, the more it's worth investing in the other.

I think we will probably include some measure of efficiency as you've suggested. But I'm not sure exactly what that will be, since efficiency measures tend to be benchmark-dependent, which makes it hard to get apples-to-apples comparisons here for a variety of reasons. (e.g., diffe... (read more)

A positive case for how we might succeed at prosaic AI alignment

Gotcha. Well, that seems right—certainly in the limit case.

A positive case for how we might succeed at prosaic AI alignment

Thanks, that helps. So actually this objection says: "No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you've built yourself an AGI. But since this myopic copying... (read more)

4 Eliezer Yudkowsky (13d): Closer, yeah. In the limit of doing insanely complicated things with Bob you will start to break him even if he is faithfully simulated, you will be doing things that would break the actual Bob; but I think HCH schemes fail long before they get to that point.
A positive case for how we might succeed at prosaic AI alignment

This is a great thread. Let me see if I can restate the arguments here in different language:

  1. Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob's brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, "You have a million subjective years to think of an effective pivotal act i
... (read more)

Eliezer's counterargument is "You don't get a high-fidelity copy of Bob that can be iterated and recursed to do arbitrary amounts of work a Bob-army could do, the way Bob could do it, until many years after the world otherwise ends.  The imitated Bobs are imperfect, and if they scale to do vast amounts of work, kill you."

Abstracting out one step: there is a rough general argument that human-imitating AI is, if not perfectly safe, then at least as safe as the humans it's imitating. In particular, if it's imitating humans working on alignment, then it's at least as likely as we are to come up with an aligned AI. Its prospects are no worse than our prospects are already. (And plausibly better, since the simulated humans may have more time to solve the problem.)

For full strength, this argument requires that:

  • It emulate the kind of alignment research which the actual humans woul
... (read more)
Ngo and Yudkowsky on alignment difficulty

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yeah, very much agree with all of this. I even think there's an argument to be made that relatively narrow-yet-superhuman theorem provers (or other resear... (read more)

Optimization Concepts in the Game of Life

Great catch. For what it's worth, it actually seems fine to me intuitively that any finite pattern would be an optimizing system for this reason, though I agree most such patterns may not directly be interesting. But perhaps this is a hint that some notion of independence or orthogonality of optimizing systems might help to complete this picture.

Here's a real-world example: you could imagine a universe where humans are minding their own business over here on Earth, while at the same time, over there in a star system 20 light-years away, two planets are hur... (read more)

Forecasting progress in language models

Extremely interesting — thanks for posting. Obviously there are a number of caveats which you carefully point out, but this seems like a very reasonable methodology and the actual date ranges look compelling to me. (Though they also align with my bias in favor of shorter timelines, so I might not be impartial on that.)

One quick question about the end of this section:

The expected number of bits in original encoding per bits in the compression equals the entropy of that language.

Wouldn't this be the other way around? If your language has low entropy it shoul... (read more)
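If I'm reading the claim correctly, the standard source-coding relationship would look like the following hedged sketch (my notation, not the post's; b is the bits-per-symbol cost of the original, uncompressed encoding):

```latex
% Shannon source coding: an optimal compressor spends about H(X) bits per symbol,
% where H(X) is the entropy of the source. With b bits per symbol in the original encoding:
\[
  \frac{\text{bits of original encoding}}{\text{bits of compressed encoding}}
  \;\approx\; \frac{b}{H(X)},
\]
% so a lower-entropy language gives *more* original bits per compressed bit, not fewer.
```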

Optimization Concepts in the Game of Life

Thanks! I think this all makes sense.

  1. Oh yeah, I definitely agree with you that the empty board would be an optimizing system in the GoL context. All I meant was that the "Death" square in the examples table might not quite correspond to it in the analogy, since the death property is perhaps not an optimization target by the definition. Sorry if that wasn't clear.
  2. :)
  3.  
  4.  
  5. Got it, thanks! So if I've understood correctly, you are currently only using the mask as a way to separate the agent from its environment at instantiation, since that is all you real
... (read more)
3 Vika (1mo): 1. Actually, we realized that if we consider an empty board an optimizing system, then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.
Optimization Concepts in the Game of Life

Loved this post. This whole idea of using a deterministic dynamical system as a conceptual testing ground feels very promising.

A few questions / comments:

  1. About the examples: do you think it's strictly correct to say that entropy / death is an optimizing system? One of the conditions of the Flint definition is that the set of target states ought to be substantially smaller than the basin of attraction, by some measure on the configuration space. Yet neither high entropy nor death seem like they satisfy this: there are too many ways to be dead, and (tautolog
... (read more)
4 Ramana Kumar (1mo): Nice comment - thanks for the feedback and questions!
  1. I think the specific example we had in mind has a singleton set of target states: just the empty board. The basin is larger: boards containing no groups of more than 3 live cells. This is a refined version of "death" where even the noise is gone. But I agree with you that "high entropy" or "death", intuitively, could be seen as a large target, and hence maybe not an optimization target. Perhaps compare to the black hole [https://www.alignmentforum.org/posts/znfkdCoHMANwqc2WE/the-ground-of-optimization-1?commentId=ybf3fcMEuMTqacSL6].
  2. Great suggestion - I think the "macrostate" terminology may indeed be a good fit / worth exploring more.
  3. Thanks! I think there are probably external perturbations that can't be represented as embedded perturbations.
  4. Thanks!
  5. The mask applies only at the instant of instantiation, and then is irrelevant for the rest of the computation, in the way we've set things up. (This is because once you've used the mask to figure out what the initial state for the computation is, you then have just an ordinary state to roll out.) If we wanted to be able to find the agent again later on in the computation then yes indeed some kind of un-instantiation operation might need a mask to do that - haven't thought about it much but could be interesting.
Meta learning to gradient hack

Very neat. It's quite curious that switching to L2 for the base optimizer doesn't seem to have resulted in the meta-initialized network learning the sine function. What sort of network did you use for the meta-learner? (It looks like the 4-layer network in your Methods refers to your base optimizer, but perhaps it's the same architecture for both?)

Also, do you know if you end up getting the meta-initialized network to learn the sine function eventually if you train for thousands and thousands of steps? Or does it just never learn no matter how hard you train it?

AI takeoff story: a continuation of progress by other means

I see — perhaps I did misinterpret your earlier comment. It sounds like the transition you are more interested in is closer to (AI has ~free rein over the internet) => (AI invents nanotech). I don't think this is a step we should expect to be able to model especially well, but the best story/analogy I know of for it is probably the end part of That Alien Message. i.e., what sorts of approaches would we come up with, if all of human civilization was bent on solving the equivalent problem from our point of view?

If instead you're thinking more about a tran... (read more)

AI takeoff story: a continuation of progress by other means

No problem, glad it was helpful!

And thanks for the APS-AI definition, I wasn't aware of the term.

AI takeoff story: a continuation of progress by other means

Thanks! I agree with this critique. Note that Daniel also points out something similar in point 12 of his comment — see my response.

To elaborate a bit more on the "missing step" problem though:

  1. I suspect many of the most plausible risk models have features that make it undesirable for them to be shared too widely. Please feel free to DM me if you'd like to chat more about this.
  2. There will always be some point between Step 1 and Step 3 at which human-legible explanations fail. i.e., it would be extremely surprising if we could tell a coherent story about the
... (read more)
2 Aegeus (2mo): Is it something like the AI-box argument? "If I share my AI breakout strategy, people will think 'I just won't fall for that strategy' instead of noticing the general problem that there are strategies they didn't think of"? I'm not a huge fan of that idea, but I won't argue it further.
I'm not expecting a complete explanation, but I'd like to see a story that doesn't skip directly to "AI can reformat reality at will" without at least one intermediate step. Like, this is the third time I've seen an author pull this trick and I'm starting to wonder if the real AI-safety strategy is "make sure nobody invents grey-goo nanotech." If you have a ball of nanomachines that can take over the world faster than anyone can react to it, it doesn't really matter if it's an AI or a human at the controls, as soon as it's invented everyone dies. It's not so much an AI-risk problem as it is a problem with technological progress in general. (Fortunately, I think it's still up for debate whether it's even possible to create grey-goo-style nanotech.)
AI takeoff story: a continuation of progress by other means

See my response to point 6 of Daniel's comment — it's rather that I'm imagining competing hedge funds (run by humans) beginning to enter the market with this sort of technology.

AI takeoff story: a continuation of progress by other means

Hey Daniel — thanks so much for taking the time to write this thoughtful feedback. I really appreciate you doing this, and very much enjoyed your "2026" post as well. I apologize for the delay and lengthy comment here, but wanted to make sure I addressed all your great points.

1. It would be great if you could pepper your story with dates, so that we can construct a timeline and judge for ourselves whether we think things are happening too quickly or not.

I've intentionally avoided referring to absolute dates, other than by indirect implication (e.g. "iOS 19... (read more)

5 Daniel Kokotajlo (2mo): Thanks, this was a load of helpful clarification and justification! APS-AI means Advanced, Planning, Strategically-Aware AI. Advanced means superhuman at some set of tasks (such as persuasion, strategy, etc.) that combines to enable the acquisition of power and resources, at least in today's world. The term & concept is due to Joe Carlsmith (see his draft report on power-seeking AI, he blogged about it a while ago).
The theory-practice gap

I see. Okay, I definitely agree that makes sense under the "fails to generalize" risk model. Thanks Rohin!

The theory-practice gap

Got it, thanks!

I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren't the ones that are actually created by AGI.

This helps, and I think it's the part I don't currently have a great intuition for. My best attempt at steel-manning would be something like: "It's plausible that an AGI will generalize correctly to distributions which it is itself responsible for bringing about." (Where "correctly" here means "in a way that's consistent with its builders' wishes.") And you could plausibly argue that an AGI would hav... (read more)

4 Rohin Shah (2mo): It's nothing quite so detailed as that. It's more like "maybe in the exotic circumstances we actually encounter, the objective does generalize, but also maybe not; there isn't a strong reason to expect one over the other". (Which is why I only say it is plausible that the AI system works fine, rather than probable.) You might think that the default expectation is that AI systems don't generalize. But in the world where we've gotten an existential catastrophe, we know that the capabilities generalized to the exotic circumstance; it seems like whatever made the capabilities generalize could also make the objective generalize in that exotic circumstance.
The theory-practice gap

I agree with pretty much this whole comment, but do have one question:

But it still seems plausible that in practice we never hit those exotic circumstances (because those exotic circumstances never happen, or because we've retrained the model before we get to the exotic circumstances, etc), and it's intent aligned in all the circumstances the model actually encounters.

Given that this is conditioned on us getting to AGI, wouldn't the intuition here be that pretty much all the most valuable things such a system would do would fall under "exotic circumstances... (read more)

4 Rohin Shah (2mo): There are lots and lots of exotic circumstances. We might get into a nuclear war. We might invent time travel. We might become digital uploads. We might decide democracy was a bad idea. I agree that AGI will create exotic circumstances. But not all exotic circumstances will be created by AGI. I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren't the ones that are actually created by AGI.
The alignment problem in different capability regimes

But in the context of superhuman systems, I think we need to be more concerned by the possibility that it’s performance-uncompetitive to restrict your system to only take actions that can be justified entirely with human-understandable reasoning.

Interestingly, this is already a well known phenomenon in the hedge fund world. In fact, quant funds discovered about 25 years ago that the most consistently profitable trading signals are reliably the ones that are the least human-interpretable. It makes intuitive sense: any signal that can be understood by a huma... (read more)

The alignment problem in different capability regimes

One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that didn't hold below our own modest level of intelligence.

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

No problem! Glad it was helpful. I think your fix makes sense.

I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation.

Yeah, I figured maybe it was because the dummy variable  was being used in the EV to sum over outcomes, while the vector  was being used to represent the probabilities associated with those outcomes. Because  and  are similar it's easy to conflate their meanings, and if you apply  to the wrong... (read more)

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

Thanks for writing this.

I have one point of confusion about some of the notation that's being used to prove Lemma 3. Apologies for the detail, but the mistake could very well be on my end so I want to make sure I lay out everything clearly.

First,  is being defined here as an outcome permutation. Presumably this means that 1)  for some ; and 2)  admits a unique inverse . That makes sense.

We also define lotteries over outcomes, presumably as, e.g., , where  is ... (read more)

3 Alex Turner (3mo): Thanks! I think you're right. I think I actually should have defined ≻_ϕ differently, because writing it out, it isn't what I want. Having written out a small example, intuitively, L ≻_ϕ M should hold iff ϕ(L) ≻ ϕ(M), which will also induce u(ϕ(o_i)) as we want. I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation. Probably it's more natural to represent E_{ℓ∼ϕ⁻¹(L)}[u(ℓ)] as u⊤(P_{ϕ⁻¹} l) = (u⊤ P_{ϕ⁻¹}) l, which makes your insight obvious. The post is edited and the issues should now be fixed.
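For readability, here is the key identity from that reply in display form (a reconstruction of the garbled notation: u is the utility vector, l the probability vector of the lottery L, and P_{ϕ⁻¹} the permutation matrix of ϕ⁻¹):

```latex
% Expected utility of the pulled-back lottery, grouped two ways:
\[
  \mathbb{E}_{\ell \sim \phi^{-1}(L)}\!\left[ u(\ell) \right]
  \;=\; u^{\top} \left( P_{\phi^{-1}} \, l \right)
  \;=\; \left( u^{\top} P_{\phi^{-1}} \right) l .
\]
```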
Re-Define Intent Alignment?

Update: having now thought more deeply about this, I no longer endorse my above comment.

While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:

  1. The behavioral objective is the thing the agent is revealed to be pursuing under arbitrary distributional shifts.
  2. The mesa-objective is something the agent is revealed to be pursuing under some subset of possible distributional shifts.

Everything in the above comment then still goes through, except with these definitions reversed.

On the one hand, the "per... (read more)

Re-Define Intent Alignment?

I'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories.

But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to th... (read more)

Re-Define Intent Alignment?

Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible in principle to do so in our universe.

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make ... (read more)

1 Jack Koch (4mo): Probably something like the last one, although I think "even in principle" is probably doing something suspicious in that statement. Like, sure, "in principle," you can pretty much construct any demarcation you could possibly imagine, including the Cartesian one, but what I'm trying to say is something like, "all demarcations, by their very nature, exist only in the map, not the territory." Carving reality is an operation that could only make sense within the context of a map, as reality simply is. Your concept of "agent" is defined in terms of other representations that similarly exist only within your world-model; other humans have a similar concept of "agent" because they have a similar representation built from correspondingly similar parts. If an AI is to understand the human notion of "agency," it will need to also understand plenty of other "things" which are also only abstractions or latent variables within our world models, as well as what those variables "point to" [https://www.lesswrong.com/posts/gQY6LrTWJNkTv8YJR/the-pointers-problem-human-values-are-a-function-of-humans] (at least, what variables in the AI's own world model they 'point to,' as by now I hope you're seeing the problem with trying to talk about "things they point to" in external/'objective' reality!).
Re-Define Intent Alignment?

I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.

Oh for sure. I wouldn't recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what could be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary... (read more)

Re-Define Intent Alignment?

I would further add that looking for difficulties created by the simplification seems very intellectually productive.

Yep, strongly agree. And a good first step to doing this is to actually build as robust a simplification as you can, and then see where it breaks. (Working on it.)

Re-Define Intent Alignment?

Ah I see! Thanks for clarifying.

Yes, the point about the Cartesian boundary is important. And it's completely true that any agent / environment boundary we draw will always be arbitrary. But that doesn't mean one can't usefully draw such a boundary in the real world — and unless one does, it's hard to imagine how one could ever generate a working definition of something like a mesa-objective. (Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

Of course the right question will always be: "what is the whole unive... (read more)

1 Jack Koch (4mo): I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions like "what am 'I' optimizing for?," and then try to figure out exactly what the demarcation is between "you" and "everything else" in order to answer that question, you're gonna have a real tough time finding anything close to a satisfactory answer.
3 Abram Demski (4mo): I would further add that looking for difficulties created by the simplification seems very intellectually productive. (Solving "embedded agency problems" seems to genuinely allow you to do new things, rather than just soothing philosophical worries.) But yeah, I would agree that if we're defining mesa-objective anyway, we're already in the business of assuming some agent/environment boundary.
Re-Define Intent Alignment?

which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment

 

I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agen... (read more)
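To make the carving concrete, here's a minimal toy sketch (hypothetical classes written for illustration, not any particular RL library): the agent is just a policy object with a fixed interface, and the very same object can be dropped into different environments to see what it does.

```python
import random

class Agent:
    """A fixed policy: maps an observation to an action, regardless of which environment it's in."""
    def act(self, observation):
        # Toy policy: action 0 when the observation is negative, action 1 otherwise.
        return 0 if observation < 0 else 1

class EnvA:
    """Toy environment A: observations drawn from [-1, 1]; action 0 is 'correct' for negative states."""
    def reset(self):
        self.state = random.uniform(-1, 1)
        return self.state
    def step(self, action):
        reward = 1.0 if (action == 0) == (self.state < 0) else 0.0
        self.state = random.uniform(-1, 1)
        return self.state, reward

class EnvB:
    """Toy environment B: same interface, but the observation distribution is shifted to [0, 2]."""
    def reset(self):
        self.state = random.uniform(0, 2)
        return self.state
    def step(self, action):
        reward = 1.0 if (action == 0) == (self.state < 1) else 0.0
        self.state = random.uniform(0, 2)
        return self.state, reward

def rollout(agent, env, steps=100):
    """Drop the *same* agent object into a given environment and measure its return."""
    obs = env.reset()
    total = 0.0
    for _ in range(steps):
        obs, reward = env.step(agent.act(obs))
        total += reward
    return total

agent = Agent()
print("Return in EnvA:", rollout(agent, EnvA()))  # the carved-out agent, placed in environment A
print("Return in EnvB:", rollout(agent, EnvB()))  # the very same agent, placed in environment B
```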

4 Abram Demski (4mo): Ah, I wasn't aware of this! I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.
For example, AIXI assumes a hard boundary between agent and environment. One manifestation of this assumption is how AIXI doesn't predict its own future actions the way it predicts everything else, and instead, must explicitly plan its own future actions. This is necessary because AIXI is not computable, so treating the future self as part of the environment (and predicting it with the same predictive capabilities as usual) would violate the assumption of a computable environment.
But this is unfortunate for a few reasons. First, it forces AIXI to have an arbitrary finite planning horizon, which is weird for something that is supposed to represent unbounded intelligence. Second, there is no reason to carry this sort of thing over to finite, computable agents; so it weakens the generality of the model, by introducing a design detail that's very dependent on the specific infinite setting.
Another example would be game-theoretic reasoning. Suppose I am concerned about cooperative behavior in deployed AI systems. I might work on something like the equilibrium selection problem in game theory, looking for rationality concepts which can select cooperative equilibria where they exist. However, this kind of work will typically treat a "game" as something which inherently comes with a pointer to the other agents. This limits the real-world applicability of such results, because to apply it to real AI systems, those systems would need "agent pointers" as well. This is a difficult engineering problem (creating an AI system which identifies "agents" in its environment); and even assuming away the engineering challenges, there are serious philosophical difficulties (what really counts as an "agent"?).
We could try to tac
2 Jack Koch (4mo): I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, because "agents" and "environments" can only exist in a map, not the territory [https://www.readthesequences.com/Reductive-Reference]. The idea of trying to e.g. separate "your atoms" or whatever from those of "your environment," so that you can drop them into those of "another environment," is only a useful fiction, as in reality they're entangled with everything else. I'm not aware of formal proof of this point that I'm trying to make; it's just a pretty strongly held intuition.
Isn't this also kind of one of the key motivations for thinking about embedded agency?
Re-Define Intent Alignment?

If we wish, we could replace or re-define "capability robustness" with "inner robustness", the robustness of pursuit of the mesa-objective under distributional shift.

I strongly agree with this suggestion. IMO, tying capability robustness to the behavioral objective confuses a lot of things, because the set of plausible behavioral objectives is itself not robust to distributional shift.

One way to think about this from the standpoint of the "Objective-focused approach" might be: the mesa-objective is the thing the agent is revealed to be pursuing under arbit... (read more)

[This comment is no longer endorsed by its author]
1 Edouard Harris (4mo): Update: having now thought more deeply about this, I no longer endorse my above comment. While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:
  1. The behavioral objective is the thing the agent is revealed to be pursuing under arbitrary distributional shifts.
  2. The mesa-objective is something the agent is revealed to be pursuing under some subset of possible distributional shifts.
Everything in the above comment then still goes through, except with these definitions reversed.
On the one hand, the "perfect IRL" definition [https://intelligence.org/learned-optimization/#glossary] of the behavioral objective seems more naturally consistent with the omnipotent experimenter setting in the IRL unidentifiability [https://arxiv.org/pdf/1601.06569.pdf] paper cited downthread. As far as I know, perfect IRL isn't defined anywhere other than by reference to this reward modelling paper [https://arxiv.org/pdf/1811.07871.pdf], which introduces the term but doesn't define it either. But the omnipotent experimenter setting seems to capture all the properties implied by perfect IRL, and does so precisely enough that one can use it to make rigorous statements about the behavioral objective of a system in various contexts.
On the other hand, it's actually perfectly possible for a mesa-optimizer to have a mesa-objective that is inconsistent with its own actions under some subset of conditions (the key conceptual error I was making was in thinking this was not possible). For example, a human being is a mesa-optimizer from the point of view of evolution. A human being may have something like "maximize happiness" as their mesa-objective. And a human being may, and frequently does, do things that do not maximize for their happiness.
A few consequences of the above:
  • Under an "omnipotent experimenter" definition, the behavioral objective (and not the mesa-objective) is a reliable invariant of the age
3 Jack Koch (4mo): This is the right sort of idea; in the OOD robustness literature you try to optimize worst-case performance over a perturbation set of possible environments. The problem I have with what I understand you to be saying is with the assumption that there is any possible reliable invariant of the agent over every possible environment that could be a mesa-objective, which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment, so why shouldn't we talk about the mesa-objective being over a perturbation set, too, just that it has to be some function of the model's internal features?
Utility Maximization = Description Length Minimization

Ah yes, that's right. Yeah, I just wanted to make this part fully explicit to confirm my understanding. But I agree it's equivalent to just let  ignore the extra  (or whatever) component.

Thanks very much!

Utility Maximization = Description Length Minimization

Late comment here, but I really liked this post and want to make sure I've fully understood it. In particular there's a claim near the end which says: if  is not fixed, then we can build equivalent models  for which it is fixed. I'd like to formalize this claim to make sure I'm 100% clear on what it means. Here's my attempt at doing that:

For any pair of models  where , there exists a variable  (of which  is a subset) and a pair of models ... (read more)

4 johnswentworth (4mo): The construction is correct. Note that for M2, conceptually we don't need to modify it; we just need to use the original M2 but apply it only to the subcomponents of the new X-variable which correspond to the original X-variable. Alternatively, we can take the approach you do: construct M2′, which has a distribution over the new X but "doesn't say anything" about the new components, i.e. it's just maxentropic over the new components. This is equivalent to ignoring the new components altogether.
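A minimal sketch of that construction (Z is a hypothetical label for the added components, not notation from the post, and its domain is assumed finite): extend X to X′ = (X, Z) and make the new model maxentropic over Z.

```latex
% New model over X' = (X, Z): keep the original distribution on X, put a uniform
% (maximum-entropy) distribution on the added component Z.
\[
  P_{M_2'}(X, Z) \;=\; P_{M_2}(X) \cdot \frac{1}{|\mathcal{Z}|},
\]
% so marginalizing over Z recovers P_{M_2}(X) exactly; the new components are "ignored".
```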
BASALT: A Benchmark for Learning from Human Feedback

That makes sense, though I'd also expect that LfLH benchmarks like BASALT could turn out to be a better fit for superscale models in general. (e.g. a BASALT analogue might do a better job of capturing the flexibility of GPT-N or DALL-E type models than current benchmarks do, though you'd probably need to define a few hundred tasks for that to be useful. It's also possible this has already been done and I'm unaware of it.)

3 Rohin Shah (5mo): Oh yeah, it totally is, and I'd be excited for that to happen. But I think that will be a single project, whereas the benchmark reporting process is meant to apply for things where there will be lots of projects that you want to compare in a reasonably apples-to-apples way, so when designing the reporting process I'm focused more on the small-scale projects that aren't GPT-N-like. I'm pretty confident that there's nothing like this that's been done and publicly released.
BASALT: A Benchmark for Learning from Human Feedback

Love this idea. From the linked post on the BAIR website, the idea of "prompting" a Minecraft task with e.g. a brief sequence of video frames seems especially interesting.

Would you anticipate the benchmark version of this would ask participants to disclose metrics such as "amount of task-specific feedback or data used in training"? Or does this end up being too hard to quantify because you're explicitly expecting folks to use a variety of feedback modalities to train their agents?

3 Rohin Shah (5mo): Probably not, just because it's pretty niche -- I expect the vast majority of papers (at least in the near future) will have only task-specific feedback, so the extra data isn't worth the additional hassle. (The prompting approach seems like it would require a lot of compute.) Tbc, "amount of task-specific feedback" should still be inferable from research papers, where you are meant to provide enough details that others could reproduce your work. It just wouldn't be as simple as looking up the "BASALT evaluation table" for your method of choice.
Formal Inner Alignment, Prospectus

Great post.

I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection.

Strong agree. In fact I believe developing the tools to make this connection could be one of the most productive focus areas of inner alignment research.

What I'd like to have would be several specific formal definitions, together with several specific informal concepts, and strong stories connectin

... (read more)
Clarifying inner alignment terminology

Sure, makes sense! Though to be clear, I believe what I'm describing should apply to optimizers other than just gradient descent — including optimizers one might think of as reward-maximizing agents.

Clarifying inner alignment terminology

Great post. Thanks for writing this — it feels quite clarifying. I'm finding the diagram especially helpful in resolving the sources of my confusion.

I believe everything here is consistent with the definitions I proposed recently in this post (though please do point out any inconsistencies if you see them!), with the exception of one point.

This may be a fundamental confusion on my part — but I don't see objective robustness, as defined here, as being a separate concept at all from inner alignment. The crucial point, I would argue, is that we ought to be tr... (read more)

2 Evan Hubinger (1y): I agree that what you're describing is a valid way of looking at what's going on—it's just not the way I think about it, since I find that it's not very helpful to think of a model as a subagent of gradient descent, as gradient descent really isn't itself an agent in a meaningful sense, nor do I think it can really be understood as "trying" to do anything in particular.
Biextensional Equivalence

Really interesting!

I think there might be a minor typo in Section 2.2:

For transitivity, assume that for 

I think this should be  based on the indexing in the rest of the paragraph.

2 Scott Garrabrant (1y): Fixed. Thanks.
Defining capability and alignment in gradient descent

Thanks for the kind words, Adam! I'll follow up over DM about early drafts — I'm interested in getting feedback that's as broad as possible and really appreciate the kind offer here.

Typo is fixed — thanks for pointing it out!

At first I wondered why you were taking the sum instead of just , but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in T at some point. That might be interesting to put in the po

... (read more)
Defining capability and alignment in gradient descent

Thanks for the comment!

Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

I think this is a reasonable objection. I don't make this very clear in the post, but the "true objective" I've written down in the example indeed isn't unique: like any measure of utility or loss, it's only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, f... (read more)