All of evhub's Comments + Replies

evhub's Shortform

It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against it for advanced AI systems.

2Alex Turner2d
No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?
A transparency and interpretability tech tree

I find this post interesting as a jumping-off point; seems like the kind of thing which will inspire useful responses via people going "no that's totally wrong!".

Yeah, I'd be very happy if that were the result of this post—in fact, I'd encourage you and others to just try to build your own tech trees so that we have multiple visions of progress that we can compare.

A transparency and interpretability tech tree

Yep, that's right—if you take a look at my discussion under (6), my claim is that we should only need (6), not (8).

evhub's Shortform
  • Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
  • Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
  • One way to think about the core problem with relaxed adversarial training is that when we gener
... (read more)
evhub's Shortform

This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.

  • What would be necessary to build a good audit
... (read more)
6Alex Turner7d
Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.
3Evan Hubinger13d
  • Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
  • Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
  • One way to think about the core problem with relaxed adversarial training [https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment] is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don't know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.
AGI Ruin: A List of Lethalities

Sure—that's easy enough. Just off the top of my head, here's five safety concerns that I think are important that I don't think you included:

  • The fact that there exist functions that are easier to verify than satisfy ensures that adversarial training can never guarantee the absence of deception.

  • It is impossible to verify a model's safety—even given arbitrarily good transparency tools—without access to that model's training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates i

... (read more)

Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for.

(I do think there's a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.)

AGI Ruin: A List of Lethalities

I'm not sure what you mean by missing points? I only included your technical claims, not your sociological ones, if that's what you mean.

AGI Ruin: A List of Lethalities

It's very clear to me I could have written this if I had wanted to—and at the very least I'm sure Paul could have as well. As evidence: it took me ~1 hour to list off all the existing sources that cover every one of these points in my comment.

Well, there's obviously a lot of points missing!  And from the amount this post was upvoted, it's clear that people saw the half-assed current form as valuable.

Why don't you start listing out all the missing further points, then?  (Bonus points for any that don't trace back to my own invention, though I realize a lot of people may not realize how much of this stuff traces back to my own invention.)

AGI Ruin: A List of Lethalities

That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for th

... (read more)

Eliezer's post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you'd see elsewhere, and it's frank on this point.

I think Eliezer wishes these sorts of artifacts were not just things he wrote, like this and "There is no fire alarm".

Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.

11Eliezer Yudkowsky18d
Well, my disorganized list sure wasn't complete, so why not go ahead and list some of the foreseeable difficulties I left out? Bonus points if any of them weren't invented by me, though I realize that most people may not realize how much of this entire field is myself wearing various trenchcoats.

I agree this list doesn't seem to contain much unpublished material, and I think the main value of having it in one numbered list is that "all of it is in one, short place", and it's not an "intro to computers can think" and instead is "these are a bunch of the reasons computers thinking is difficult to align".

The thing that I understand to be Eliezer's "main complaint" is something like: "why does it seem like No One Else is discovering new elements to add to this list?". Like, I think Risks From Learned Optimization was great, and am glad you and others ... (read more)

Thank you, Evan, for living the Virtue of Scholarship. Your work is appreciated.

Reshaping the AI Industry

I was referring to stuff like this, this, and this.

I haven't finished it yet, but I've so far very much enjoyed the Pragmatic AI Safety sequence, though I certainly have disagreements with it.

Reshaping the AI Industry

As someone who has really not been a fan of a lot of the recent conversations on LessWrong that you mentioned, I thought this was substantially better in an actually productive way with some really good analysis.

Also, if you or anyone else has a good concrete idea along these lines, feel free to reach out to me and I can help you get support, funding, etc. if I think the idea is a good one.

(Moderation note: added to the Alignment Forum from LessWrong.)

1Robert Kirk1mo
I'd be curious to hear what your thoughts are on the other conversations, or at least specifically which conversations you're not a fan of?
Risks from Learned Optimization: Introduction

Sure—I just edited it to be maybe a bit less jarring for those who know Greek.

[Link] A minimal viable product for alignment

I think this concern is only relevant if your strategy is to do RL on human evaluations of alignment research. If instead you just imitate the distribution of current alignment research, I don't think you get this problem, at least anymore than we have it now--and I think you can still substantially accelerate alignment research with just imitation. Of course, you still have inner alignment issues, but from an outer alignment perspective I think imitation of human alignment research is a pretty good thing to try.

A Longlist of Theories of Impact for Interpretability

The reason other possible reporter heads work is because they access the model's concept for X and then do something with that (where the 'doing something' might be done in the core model or in the head).

I definitely think there are bad reporter heads that don't ever have to access X. E.g. the human imitator only accesses X if X is required to model humans, which is certainly not the case for all X.

A Longlist of Theories of Impact for Interpretability

E.g. simplicity of explanation of a particular computation is a bit more like a speed prior.

I don't think that the size of an explanation/proof of correctness for a program should be very related to how long that program runs—e.g. it's not harder to prove something about a program with larger loop bounds, since you don't have to unroll the loop, you just have to demonstrate a loop invariant.
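
As a toy illustration of the loop-invariant point (my own sketch, not from the thread): the correctness argument for the loop below is a single invariant whose statement and proof don't grow with the loop bound n, even though the runtime does.

```python
def sum_to(n: int) -> int:
    """Return 0 + 1 + ... + n for n >= 0."""
    total, i = 0, 0
    while i < n:
        i += 1
        total += i
        # Loop invariant: total == i * (i + 1) // 2.
        # Establishing this once covers every iteration, no matter how large n is.
        assert total == i * (i + 1) // 2
    # On exit i == n, so the invariant gives total == n * (n + 1) // 2.
    return total
```

The program's runtime scales with n, but the proof obligation stays the same size—which is the sense in which explanation/proof length can come apart from running time.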

Gears-Level Mental Models of Transformer Interpretability

(Moderation note: added to the Alignment Forum from LessWrong.)

Understanding “Deep Double Descent”

SGD is meant as a shorthand that includes other similar optimizers like Adam.

A Longlist of Theories of Impact for Interpretability

Use the length of the shortest interpretability explanation of behaviours of the model as a training loss for ELK - the idea is that models with shorter explanations are less likely to include human simulations / you can tell if they do.

Maybe this is not the right place to ask this, but how does this not just give you a simplicity prior?

3Ryan Greenblatt2mo
By explanation, I think we mean 'reason why a thing happens' in some intuitive (and underspecified) sense. Explanation length gets at something like "how can you cluster/compress a justification for the way the program responds to inputs" (where justification is doing a lot of work). So, while the program itself is a great way to compress how the program responds to inputs, it doesn't justify why the program responds this way to inputs. Thus program length/simplicity prior isn't equivalent. Here are some examples demonstrating where (I think) these priors differ:
  • The axioms of arithmetic don't explain why the primes have a certain frequency - there is a short justification for this, but it's longer than just the axioms and has to include the axioms.
  • The explanation of why code golfed programs work is often longer than the programs (at least in English).
  • The shortest explanation for 'the SHA-512 hash of the first 2000 primes is x' probably has to include a full (long) computation trace despite the fact that a program which computes/checks this can be short.

Here's a short and bad explanation for why this is maybe useful for ELK. The reason the good reporter works is because it accesses the model's concept for X and directly outputs it. The reason other possible reporter heads work is because they access the model's concept for X and then do something with that (where the 'doing something' might be done in the core model or in the head). So, the explanation for why the other heads work still has to go through the concept for X, but then has some other stuff tacked on and must be longer than the good reporter.
1Beth Barnes2mo
Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don't understand exactly what's meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/or long-running computation - e.g. if under a relevant input distribution, one input almost always determines the output of a complicated computation.
1Neel Nanda4mo
Honestly, I don't understand ELK well enough (yet!) to meaningfully comment. That one came from Tao Lin, who's a better person to ask.
Musings on the Speed Prior

This is not true in the circuit-depth complexity model. Remember that an arbitrary lookup table is O(log n) circuit depth. If the function I'm trying to memorize is f(x) = (x & 1), the fastest circuit is O(1), whereas a lookup table is O(log n).

Certainly, I'm assuming that the intended function is not in O(log n), though I think that's a very mild assumption for any realistic task.
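
As a rough sketch of that depth comparison (my own illustration, under the assumed cost model that each 2-to-1 multiplexer contributes one layer of depth): a lookup table over n entries needs a mux tree of depth about log2(n), while f(x) = (x & 1) just forwards the low input bit.

```python
import math

def lookup_table_depth(n_entries: int) -> int:
    # Assumed cost model: select among n stored bits with a binary tree of
    # 2-to-1 multiplexers, each adding one layer of depth.
    return math.ceil(math.log2(n_entries))

def low_bit_depth() -> int:
    # f(x) = x & 1 wires the lowest input bit straight to the output: constant depth.
    return 0

for k in (4, 10, 20):
    print(f"2^{k}-entry lookup table: depth {lookup_table_depth(2 ** k)}; "
          f"f(x) = x & 1: depth {low_bit_depth()}")
```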


I think the prior you're suggesting is basically a circuit size prior. How do you think it differs from that?

3TLW4mo
In t time, the brain (or any realistic agent) can do O(t) processing... but receives O(t) sensory data.

Realizable-speed priors are certainly correlated with circuit size priors to some extent, but there are some important differences:
  • The naive circuit size prior assumes gates take O(1) space and wiring takes zero space, and favors circuits that take less space.
    • There are more complex circuit size priors that e.g. assign O(1) space to a gate and O(length) space to wiring.
  • The O(log(n)) variant of the realizable-speed prior has no simple analog in the circuit size prior, but roughly corresponds to the circuit-depth prior.
  • The O(n^(1/2)) variant of the realizable-speed prior has no simple analog in the circuit size prior.
  • The O(n^(1/3)) variant of the realizable-speed prior roughly corresponds to the complex circuit-size prior described above, with differences described below.
  • Circuit size priors ignore the effects of routing congestion.
    • A circuit-size prior will prefer one complex circuit of N-1 gates over two simpler circuits of N/2 gates.
    • A realizable-speed prior will tend to prefer the two simpler circuits, as, essentially, they are easier to route (read: lower overall latency due to shorter wiring).
    • My (untested) intuition here is that a realizable-speed prior will be better at structuring and decomposing problems than a circuit-size prior, as a result.
  • Circuit size priors prefer deeper circuits than realizable-speed priors.
    • A circuit-size prior will prefer a circuit of max depth 10D and N gates over a circuit of max depth D and N-1 gates.
    • A realizable-speed prior will (typically) prefer the slightly larger far shallower circuit.
    • Note that the human brain is surprisingly shallow, when you consider speed of neuron activation versus human speed of response. But also very wide...
Challenges with Breaking into MIRI-Style Research

One of my hopes with the SERI MATS program is that it can help fill this gap by providing a good pipeline for people interested in doing theoretical AI safety research (be that me-style, MIRI-style, Paul-style, etc.). We're not accepting public applications right now, but the hope is definitely to scale up to the point where we can run many of these every year and accept public applications.

2021 AI Alignment Literature Review and Charity Comparison

(Moderation note: added to the Alignment Forum from LessWrong.)

More Christiano, Cotra, and Yudkowsky on AI progress

And what other EAs reading it are thinking, I expect, is plain old Robin-Hanson-style reference class tennis of "Why would you expect intelligence to scale differently from bridges, where are all the big bridges?"

I find these sorts of characterizations very strange, since I feel like I know quite a lot of EAs, but approximately nobody that's really into that sort of reference class forecasting (at least not more so than where Paul and Eliezer agree that superforecaster-style methodology is sound). I'm curious who specifically you're thinking of other th... (read more)

Are minimal circuits deceptive?

This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure?

Yes, inner alignment is a subset of robustness. See the discussion here and my taxonomy here.


If you train network N on dataset D and optimization pressure causes N to internally develop a planning system (mesa-optimizer) M, aren't all questions of whether M is aligned with N's optimization objective just generalization questions?

This reflects a misunderstanding of what a mesa-optimizer is—as we say in Risks from Learne... (read more)

Theoretical Neuroscience For Alignment Theory

To the extent Steve is right that “[understanding] the algorithms in the human brain that give rise to social instincts and [putting] some modified version of those algorithms into our AGIs” is a worthwhile safety proposal, I think we should be focusing our attention on instantiating the relevant algorithms that underlie affective and cognitive ToM + affective empathy.

It seems to me like you would very likely get both cognitive and affective theory of mind “for free” in the sense that they're necessary things to understand for predicting humans well. If... (read more)

5Steve Byrnes6mo
For my part, I strongly agree with the first part, and I said something similar in my comment [https://www.lesswrong.com/posts/ZJY3eotLdfBPCLP3z/theoretical-neuroscience-for-alignment-theory?commentId=bfjsLXdx5TJRGQxQA].

For the second part, if we're talking about within-lifetime brain learning / thinking, we're talking about online-learning. For example, if I'm having a conversation with someone, and they tell me their name is Fred, and then 2 minutes later I say "Well Fred, this has been a lovely conversation", I can thank online-learning for my remembering their name. Another example: the math student trying to solve a homework problem (and learning from the experience) is using the same basic algorithms as the math professor trying to prove a new theorem—even if the first is vaguely analogous to "training" and the second to "deployment".

So then you can say: "Well fine, but online learning is pretty unfashionable in ML today. Can we talk about what the brain's within-lifetime learning algorithms would look like without online learning?" And I would say: "Ummmm, I don't know. I'm not sure that's a coherent or useful thing to talk about. A brain without online-learning would look like unusually severe retrograde [oops I meant anterograde] amnesia."

That's not a criticism of what you said. Just a warning that "non-online-learning versions of brain algorithms" is maybe an incoherent notion that we shouldn't think too hard about. :)
Moore's Law, AI, and the pace of progress

(Moderation note: added to the Alignment Forum from LessWrong.)

Hard-Coding Neural Computation

(Moderation note: added to the Alignment Forum from LessWrong.)

larger language models may disappoint you [or, an eternally unfinished draft]

It's totally possible to do ecological evaluation with large LMs. (Indeed, lots of people are doing it.) For example, you can:

  • Take an RL environment with some text in it, and make an agent that uses the LM as its "text understanding module."
    • If the LM has a capability, and that capability is helpful for the task, the agent will learn to elicit it from the LM as needed. See e.g. this paper.
  • Just do supervised learning on a capability you want to probe.

I don't understand why you think this would actually give you an ecological evaluation. It seems ... (read more)

4nostalgebraist7mo
It will still only provide a lower bound, yes, but only in the trivial sense that presence is easier to demonstrate than absence. All experiments that try to assess a capability suffer from this type of directional error, even prototype cases like "giving someone a free-response math test."
  • They know the material, yet they fail the test: easy to imagine (say, if they are preoccupied by some unexpected life event)
  • They don't know the material, yet they ace the test: requires an astronomically unlikely coincidence

The distinction I'm meaning to draw is not that there is no directional error, but that the RL/SL tasks have the right structure: there is an optimization procedure which is "leaving money on the table" if the capability is present yet ends up unused.
Behavior Cloning is Miscalibrated

(Moderation note: added to the Alignment Forum from LessWrong.)

Yudkowsky and Christiano discuss "Takeoff Speeds"

But after the 10^10 point, something interesting happens: the score starts growing much faster (~N).

And for some tasks, the plot looks like a hockey stick (a sudden change from ~0 to almost-human).

Seems interestingly similar to the grokking phenomenon.

A positive case for how we might succeed at prosaic AI alignment

are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?

No—I'm separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you're going to get it. My step (1) above, which is what I understand that we're talking about, is just about that first piece: understanding what we're going to be shooting for when we set up our training process (and then once we know what we're shooting for we can think about h... (read more)

A positive case for how we might succeed at prosaic AI alignment

To be clear, I agree with this as a response to what Edouard said—and I think it's a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don't think it's a response to what I'm advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn't be too surprised that it's not fully clear).

In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but have no other guarantees about it, that's clearly not a safe model to amplify,... (read more)

All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary

I'm confused from several directions here.  What is a "robust" Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate "an agent that optimizes an objective" are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?

A positive case for how we might succeed at prosaic AI alignment

If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?

Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you'll get deception. The solution isn't to somehow “filter out the unwanted instrumental behavio... (read more)

This is a great thread. Let me see if I can restate the arguments here in different language:

  1. Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob's brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, "You have a million subjective years to think of an effective pivotal act i
... (read more)
A positive case for how we might succeed at prosaic AI alignment

How does a "myopic optimizer" successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about?

It just reasons about them, using deduction, prediction, search, etc., the same way any optimizer would.

To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?

The sense that it's still myopic is in the sense that it's non-deceptive, which is the o... (read more)

[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don't anti-endorse them either, or else I wouldn't be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer's, but instea... (read more)

A positive case for how we might succeed at prosaic AI alignment

I mean, that's because this is just a sketch, but a simple argument for why myopia is more natural than “obey humans” is that if we don't care about competitiveness, we already know how to build myopic optimizers, whereas we don't know how to build an optimizer to “obey humans” at any level of capabilities.

Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency. I suspect we can get much better upper bounds on the complexity than that, though.

6Joe_Collman7mo
It's an interesting idea, but are you confident that LCDT actually works? E.g. have you thought more about the issues I talked about here [https://www.lesswrong.com/posts/Y76durQHrfqwgwM5o/lcdt-a-myopic-decision-theory?commentId=crfQECLrYRaP225ne] and concluded they're not serious problems?

I still don't see how we could get e.g. an HCH simulator without agentic components (or the simulator's qualifying as an agent). As soon as an LCDT agent expects that it may create agentic components in its simulation, it's going to reason horribly about them (e.g. assuming that any adjustment it makes to other parts of its simulation can't possibly impact their existence or behaviour, relative to the prior).

I think LCDT does successfully remove the incentives you're aiming to remove. I just expect it to be too broken to do anything useful. I can't currently see how we could get the good parts without the brokenness.
4Richard Ngo7mo
What are you referring to here?
A positive case for how we might succeed at prosaic AI alignment

(1) a (good) pivotal act is probably a non-myopic problem, and (2) you can't solve a nontrivial nonmyopic problem with a myopic solver. [...] My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it output whatever HCH would do, for instance).

Yeah, that's right, I definitely agree with (1) and disagree with (2).

And then Eliezer would probably reply that the non-myopia has been wrapped up somewhere else (e.g. in HCH), and that has become the dangerous part (or, more realistically, the insufficiently capable

... (read more)

It still doesn't seem to me like you've sufficiently answered the objection here.

I tend to think that HCH is not dangerous, but I agree that it's likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful.

What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed... (read more)

A positive case for how we might succeed at prosaic AI alignment

To be clear, I agree with this also, but don't think it's really engaging with what I'm advocating for—I'm not proposing any sort of assemblage of reasoners; I'm not really sure where that misconception came from.

I don't think the assemblage is the point. I think the idea here is that "myopia" is a property of problems: a non-myopic problem is (roughly) one which inherently requires doing things with long time horizons. I think Eliezer's claim is that (1) a (good) pivotal act is probably a non-myopic problem, and (2) you can't solve a nontrivial nonmyopic problem with a myopic solver. Part (2) is what I think TekhneMakr is gesturing at and Eliezer is endorsing.

My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it out... (read more)

A positive case for how we might succeed at prosaic AI alignment

Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has?

To be clear, I was only talking about (1) here, which is just about what it might look like for an agent to be myopic, not how to actually get an agent that satisfies (1). I agree that you would most likely get a proxy-aligned model if you just trained on “imitate HCH”—but just training on “imitat... (read more)

2Richard Ngo7mo
That all makes sense. But I had a skim of (2), (3), (4), and (5) and it doesn't seem like they help explain why myopia is significantly more natural than "obey humans"?
A positive case for how we might succeed at prosaic AI alignment

I suspect that there were a lot of approaches that would have produced similar results to how we ended up doing language modeling. I believe that the main advantage of Transformers over LSTMs is just that LSTMs have exponentially decaying ability to pay attention to prior tokens while Transformers can pay constant attention to all tokens in the context. I suspect that it would have been possible to fix the exponential decay problem with LSTMs and get them to scale like Transformers, but Transformers came first, so nobody tried. And that's not to say that M... (read more)
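
A toy numeric sketch of that decay claim (my own illustration; the 0.9 gate value is an arbitrary assumption): in a recurrent cell a token's influence passes through one gate multiplication per step and so shrinks geometrically with distance, whereas attention gives every past token a direct softmax weight that doesn't depend on how far back it sits.

```python
import numpy as np

forget_gate = 0.9                        # hypothetical per-step retention factor
distances = np.array([1, 10, 100, 1000])

# Recurrent path: influence is attenuated once per intervening step.
recurrent_influence = forget_gate ** distances

# Attention path: with equal scores, each of the four positions gets weight 1/4,
# regardless of how far back it is in the context.
attention_weight = 1.0 / len(distances)

for d, r in zip(distances, recurrent_influence):
    print(f"distance {d:4d}: recurrent influence {r:.2e}, attention weight {attention_weight:.2f}")
```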

4Quintin Pope7mo
I agree that transformers vs other architectures is a better example of the field “following the leader” because there are lots of other strong architectures (perceiver, mlp mixer, etc). In comparison, using self supervised transfer learning is just an objectively good idea you can apply to any architecture and one the brain itself almost surely uses. The field would have converged to doing so regardless of the dominant architecture.

One hopeful sign is how little attention the ConvBERT language model [https://huggingface.co/transformers/model_doc/convbert.html] has gotten. It mixes some convolution operations with self attention to allow self attention heads to focus on global patterns as opposed to local patterns better handled by convolution. ConvBERT is more compute efficient than a standard transformer, but hasn’t made much of a splash. It shows the field can ignore low profile advances made by smaller labs.

For your point about the value of alignment: I think there’s a pretty big range of capabilities where the marginal return on extra capabilities is higher than the marginal return on extra alignment. Also, you seem focused on avoiding deception/treacherous turns, which I think are a small part of alignment costs until near human capabilities. I don’t know what sort of capabilities penalty you pay for using a myopic training objective, but I don’t think there’s much margin available before voluntary mass adoption becomes implausible.
A positive case for how we might succeed at prosaic AI alignment

The notion of (1) seems like the cat-belling problem here; the other steps don't seem interesting by comparison, the equivalent of talking about all the neat things to do after belling the cat.

I'm surprised that you think (1) is the hard part—though (1) is what I'm currently working on, since I think it's necessary to make a lot of the other parts go through, I expect it to be one of the easiest parts of the story to make work.

What pivotal act is this AGI supposed to be executing? Designing a medium-strong nanosystem?

I left this part purposefully v... (read more)

The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do.

Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has? That seems like a much more natural thing for it to learn than the concept of imitating HCH, because in the process of imitating HCH it still ha... (read more)

Certainly it doesn't matter what substrate the computation is running on.

I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implement... (read more)

How do we become confident in the safety of a machine learning system?

This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying conceptual alignment, to read this and think about it deeply.

Glad you think so! I definitely agree and am planning on using this framework in my own research going forward.

"story" makes technical people feel uncomfortable. We immediately fear weird justification and biases towards believing interesting stories. And we should be wary of this when working on alignment, while acknowledging that mo

... (read more)
What exactly is GPT-3's base objective?

First, the problem is only with outer/inner alignment—the concept of unintended mesa-optimization is still quite relevant and works just fine.

Second, the problems with applying Risks from Learned Optimization terminology to GPT-3 have nothing to do with the training scenario, the fact that you're doing unsupervised learning, etc.

The place where I think you run into problems is that, for cases where mesa-optimization is intended in GPT-style training setups, inner alignment in the Risks from Learned Optimization sense is usually not the goal. Most of the op... (read more)

What exactly is GPT-3's base objective?

This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot... now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?

To be clear, that's definitely not what I'm arguing. I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it's talking about. The problem is just that it's not general enough to handle all possible ways of training a... (read more)

4DanielFilan8mo
GPT-3 was trained using supervised learning, which I would have thought was a pretty standard way of training a model using machine learning. What training scenarios do you think the Risks from Learned Optimization terminology can handle, and what's the difference between those and the way GPT-3 was trained?
What exactly is GPT-3's base objective?

My current position is that this is the wrong question to be asking—instead, I think the right question is just “what is GPT-3's training story?” Then, we can just talk about to what extent the training rationale is enough to convince us that we would get the desired training goal vs. some other model, like a deceptive model, instead—rather than having to worry about what technically counts as the base objective, mesa-objective, etc.

6Daniel Kokotajlo8mo
I was wondering if that was the case, haha. Thanks!

This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot... now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?

I do like your new thing and it seems better to me in some ways, but worse in others. I feel like I expect a failure mode where people exploit ambiguity and norm-laden concepts to convince themselves of happy fairy tales. I should think more about this and write a comment.

ETA: Here's an attempt to salvage the original inner/outer alignment problem framing: We admit up front that it's a bit ambiguous what the base objective is, and thus there will be cases where it's ambiguous whether a mesa-optimizer is aligned to the base objective. However, we say this isn't a big deal. We give a handful of examples of "reasonable construals" of the base objective, like I did in the OP, and say that all the classic arguments are arguments for the plausibility of cases where a mesa-optimizer is misaligned with every reasonable construal of the base objective. Moreover, we make lemons out of lemonade, and point out that the fact there are multiple reasonable construals is itself reason to think inner alignment problems are serious and severe.

I'm imagining an interlocutor who thinks "bah, it hasn't been established yet that inner-alignment problems are even a thing; it still seems like the default hypothesis is that you get what you train for, i.e. you get an agent that is trying to maximize predictive accuracy or whatever." And then we say "Oh? What exactly is it trying to maximize? Predictive accuracy full stop? Or predictive accuracy conditional on dataset D? Or is it instead trying to maximize reward, in which case it'd hack its reward channel if it could? Whichever one you think it is, would you not agree that it's plausible that it might inste
1Charlie Steiner8mo
Yeah, agreed. It's true that GPT obeys the objective "minimize the cross-entropy loss between the output and the distribution of continuations in the training data." But this doesn't mean it doesn't also obey objectives like "write coherent text", to the extent that we can tell a useful story about how the training set induces that behavior. (It is amusing to me how our thoughts immediately both jumped to our recent hobbyhorses.)
Forecasting progress in language models

(Moderation note: added to the Alignment Forum from LessWrong.)

Towards Deconfusing Gradient Hacking

(Moderation note: added to the Alignment Forum from LessWrong.)

[AN #166]: Is it crazy to claim we're in the most important century?

Note though that it does not defuse all such uneasiness -- you can still look at how early we appear to be (given the billions of years of civilization that could remain in the future), and conclude that the simulation hypothesis is true, or that there is a Great Filter in our future that will drive us extinct with near-certainty. In such situations there would be no extraordinary impact to be had today by working on AI risk.

I don't think I agree with this—in particular, it seems like even given the simulation hypothesis, there could still be quite a lo... (read more)

5Rohin Shah9mo
Yeah, I agree the statement is false as I literally wrote it, though what I meant was that you could easily believe you are in the kind of simulation where there is no extraordinary impact to have.