All of Richard_Ngo's Comments + Replies

I believe this because of how the world looks "brittle" (e.g., nanotech exists) and because lots of technological progress seems cognition-constrained (such as, again, nanotech). This is a big part of why I think heavy-precedent-style justifications are doomed.

Apart from nanotech, what are the main examples or arguments you would cite in favor of these claims?

Separately, how close is your conception of nanotech to "atomically precise manufacturing", which seems like Drexler's preferred framing right now?

1Vojtech Kovarik13d
One way in which the world seems brittle / has free energy an AI could use to gain an advantage: We haven't figured out good communication practices for the digital age. We don't have good collective epistemics. And we don't seem to be on track to have this solved in the next 20 years. As a result I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. ("Enough" is meant to stand for "a lot, but within reach of an early AGI". Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)
5Thomas Kwa15d
not Nate or a military historian, but to me it seems pretty likely for a ~100 human-years more technologically advanced actor to get decisive strategic advantage over the world.

  • In military history it seems pretty common for some tech advance to cause one side to get a big advantage. This seems to be true today [https://www.lesswrong.com/posts/bffJJvCC78LZjFa3Z/what-a-20-year-lead-in-military-tech-might-look-like] as well with command-and-control and various other capabilities
  • I would guess pure fusion weapons [https://en.wikipedia.org/wiki/Pure_fusion_weapon] are technologically possible, which means an AI sophisticated enough to design one can get nukes without uranium
  • Currently on the cutting edge, the most advanced actors have large multiples over everyone else in important metrics. This is due to either a few years' lead or better research practices still within the human range
    • SMIC is mass producing the 14nm node whereas Samsung is at 3nm, which is something like 5x better FLOPS/watt
    • algorithmic improvements driven by cognitive labor of ML engineers have caused multiple OOM improvement in value/FLOPS
    • SpaceX gets 10x better cost per ton [https://aerospace.csis.org/data/space-launch-to-low-earth-orbit-how-much-does-it-cost/] to orbit than the next cheapest space launch provider, and this is before Starship. Also their internal costs are lower

This seems sufficient for "what failure looks like" scenarios, with faster disempowerment through hard takeoff likely to depend on other pathways like nanotech, social engineering, etc. As for the whole argument against "heavy precedent", I'm not convinced either way and haven't thought about it a ton.

A couple of differences between Kolmogorov complexity/Shannon entropy and the loss function of autoregressive LMs (just to highlight them, not trying to say anything you don't already know):

  • The former are (approximately) symmetric, the latter isn't (it can be much harder to predict a string front-to-back than back-to-front)
  • The former calculate compression values as properties of a string (up to choice of UTM). The latter calculates compression values as properties of a string, a data distribution, and a model (and even then doesn't strictly determine the r
... (read more)
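
To make the first bullet concrete (a sketch in my own notation, not from the original comment): Kolmogorov complexity is invariant under reversal up to an additive constant, whereas the autoregressive loss carries no such guarantee, and additionally depends on the model (and, via training, on the data distribution) rather than on the string alone:

```latex
% Reversal changes Kolmogorov complexity by at most a constant,
% since a fixed-size program can reverse any string:
K(x^{R}) = K(x) + O(1)

% The autoregressive LM loss decomposes front-to-back and need not
% equal the loss of the reversed string; it is also a function of the
% model parameters \theta:
\mathcal{L}_{\theta}(x) = -\sum_{i=1}^{|x|} \log p_{\theta}(x_i \mid x_{<i}),
\qquad \mathcal{L}_{\theta}(x) \neq \mathcal{L}_{\theta}(x^{R}) \text{ in general}
```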

I agree that we'll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as "reflecting back on that data" in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).

4Eliezer Yudkowsky1mo
The cogitation here is implicitly hypothesizing an AI that's explicitly considering the data and trying to compress it, having been successfully anchored on that data's compression as identifying an ideal utility function. You're welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn't arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.

if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-share and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than that maybe an AGI would likely convergently become that way before taking over the world.


I weakly disagree here, mainly because Nate's argument for very high levels of risk goes through strong generalization/a "sharp left turn" towards being much more coherent + goal-directed. So I find it hard... (read more)

Broadly agree with this post. Couple of small things:

Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-goo

... (read more)
5Eliezer Yudkowsky1mo
By "were the humans pointing me towards..." Nate is not asking "did the humans intend to point me towards..." but rather "did the humans actually point me towards..." That is, we're assuming some classifier or learning function that acts upon the data actually input, rather than a succesful actual fully aligned works-in-real-life DWIM which arrives at the correct answer given wrong data.
3davidad1mo
For the record, I have a convergently similar intuition [https://twitter.com/davidad/status/1582691978520633344?s=20&t=lXiuvapBGXmQVuXLoODbhg] : FDT removes the Cartesian specialness of the ego at the decision nodes (by framing each decision as a mere logical consequence of an agent-neutral nonphysical fact about FDT itself), but retains the Cartesian specialness of the ego at the utility node(s). I’ve thought about this for O(10 hours), and I also believe it could be crazy, but it does align quite well with the conclusions of Compassionate Moral Realism [https://ndpr.nd.edu/reviews/compassionate-moral-realism/].

I considered this, but it seems like the latter is 4x longer while covering fairly similar content?

2Evan Hubinger2mo
I've found that people often really struggle to understand the content from the former but got it when I gave them the latter—and also I think the latter post covers a lot of newer stuff that's not in the old one (e.g. different models of inductive biases).

Given the baseline classifier's 0.003% failure rate, you would have to sample and label 30,000 in-distribution examples to find a failure (which would cost about $10,000). With our tools, our contractors are able to find an adversarial example on the baseline classifier every 13 minutes (which costs about $8 – about 1000x cheaper).

This isn't comparing apples to apples, though? If you asked contractors to find adversarial examples without using the tools, they'd likely find them at a rate much higher than 0.003%.
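
As a quick arithmetic check on the quoted comparison (a sketch; the per-label cost and contractor rate are inferred from the quoted totals rather than stated in the original):

```python
# Baseline: finding a failure by labeling random in-distribution examples.
failure_rate = 0.00003                                 # the quoted 0.003% failure rate
labels_per_failure = 1 / failure_rate                  # ~33,000 labels per failure found
implied_cost_per_label = 10_000 / labels_per_failure   # implies ~$0.30 per label (inferred)

# Tool-assisted contractors: one adversarial example per 13 minutes at ~$8,
# i.e. an implied rate of roughly $37/hour (inferred).
tool_cost_per_failure = 8
cost_ratio = 10_000 / tool_cost_per_failure            # ~1,250x, the quoted "about 1000x"

# The caveat above: the baseline is random sampling, while the tool condition is
# adversarial search, so unassisted-but-adversarial contractors would find failures
# far more often than 0.003%, shrinking the true ratio.
print(round(labels_per_failure), round(implied_cost_per_label, 2), round(cost_ratio))
```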

1dmz2mo
That's right. We did some followup experiments doing the head-to-head comparison: the tools seem to speed up the contractors by 2x for the weak adversarial examples they were finding (and anecdotally speed us up a lot more when we use them to find more egregious failures). See https://www.lesswrong.com/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood#Quick_followup_results; an updated arXiv paper with those experiments is appearing on Monday.

None of the "hard takeoff people" or hard takeoff models predicted or would predict that the sorts of minor productivity advancements we are starting to see would lead to a FOOM by now.

The hard takeoff models predict that there will be fewer AI-caused productivity advancements before a FOOM than soft takeoff models do. Therefore any AI-caused productivity advancements without a FOOM are relative evidence against the hard takeoff models.

You might say that this evidence is pretty weak; but it feels hard to discount the evidence too much when there are few concrete claims by hard-takeoff proponents about what advances would surprise them. Everything is kinda prosaic in hindsight.

3Daniel Kokotajlo2mo
I'm not sure about that actually. Hard takeoff and soft takeoff disagree about the rate of slope change, not about the absolute height of the line. I guess if you are thinking about the "soft takeoff means shorter timelines" effect, then yeah, it also means higher AI progress prior to takeoff, and in particular predicts more stuff happening now. But people generally agree that despite that effect, the overall correlation between short timelines and fast takeoff is positive.

Anyhow, even if you are right, I definitely think the evidence is pretty weak. Both sides make pretty much the exact same retrodictions and were in fact equally unsurprised by the last few years. I agree that Yudkowsky deserves spanking for not working harder to make concrete predictions/bets with Paul, but he did work somewhat hard, and also it's not like Paul, Ajeya, etc. are going around sticking their necks out much either.

Finding concrete stuff to bet on (amongst this group of elite futurists) is hard. I speak from experience here; I've talked with Paul and Ajeya and tried to find things in the next 5 years we disagree on and it's not easy, EVEN THOUGH I HAVE 5-YEAR TIMELINES. We spent about an hour probably. I agree we should do it more. (Think about you vs. me. We both thought in detail about what our median futures look like. They were pretty similar, especially in the next 5 years!)

Thanks for the comments Vika! A few responses:

It might be good to clarify that this is an example architecture and the claims apply more broadly.

Makes sense, will do.

Phase 1 and 2 seem to map to outer and inner alignment respectively. 

That doesn't quite seem right to me. In particular:

  • Phase 3 seems like the most direct example of inner misalignment; I basically think of "goal misgeneralization" as a more academically respectable way of talking about inner misalignment.
  • Phase 1 introduces the reward misspecification problem (which I treat as synonymous
... (read more)

How? E.g. Jacob left a comment here about his motivations, does that count as a falsification? Or, if you'd say that this is an example of rationalization, then what would the comment need to look like in order to falsify your claim? Does Paul's comment here mentioning the discussions that took place before launching the GPT-3 work count as a falsification? If not, why not?

Jacob's comment does not count, since it's not addressing the "actually consider whether the project will net decrease chance of extinction" or the "could the answer have plausibly been 'no' and then the project would not have happened" part.

Paul's comment does address both of those, especially this part at the end:

To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arg

... (read more)

At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed like mainly rhetorical flourishing rather than a contentful claim about the structure of the difficult parts of the alignment problem.

Also, I think that "on our first try" thing isn't ... (read more)

RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples, but also intrinsic motivation functions like curiosity and empowerment) which are used to train agents in the absence of RLHF.

The smiley faces example feels confusing as a "classic" outer alignment problem because AGIs won't be trained on a reward function anywhere near as limited as smiley faces. An alternative like "AGIs are trained on a reward function in wh... (read more)

I think the smiling example is much more analogous than you are making it out here. I think the basic argument for "this just encourages taking control of the reward" or "this just encourages deception" goes through the same way.

Like, RLHF is not some magical "we have definitely figured out whether a behavior is really good or bad" signal, it's historically been just some contractors thinking for like a minute about whether a thing is fine. I don't think there is less bayesian evidence conveyed by people smiling (like, the variance in smiling is greater th... (read more)

I take this comment as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that's being made here is failing to recognize that reality doesn't grade on a curve when it comes to understanding the world - your arguments can be false even if nobody has refuted them. That's particularly true when it comes to very high-level abstractions, like the ones th... (read more)

Comments on parts of this other than the ITT thing (response to the ITT part is here)...

(and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment)

I don't usually focus much on the outer/inner abstraction, and when I do I usually worry about outer alignment. I consider RLHF to have been negative progress on outer alignment, same as inner alignment; I wasn't relying on that particular abstraction at all.

Historically, the way that great scientists have gotten around

... (read more)

I take this comment as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is.

I don't want to pour a ton of effort into this, but here's my 5-paragraph ITT attempt.

"As an analogy for alignment, consider processor manufacturing. We didn't get to gigahertz clock speed and ten nanometer feature size by trying to tackle all the problems of 10 nm manufacturing processes right out the gate. That would never have worked; too many things independently go wrong to isola... (read more)

2Oliver Habryka3mo
I am confused. How does RLHF help with outer alignment? Isn't optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces) I don't think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned, I just don't understand where the inner/outer alignment distinction comes from in this context)

There's one attitude towards alignment techniques which is something like "do they prevent all catastrophic misalignment?" And there's another which is more like "do they push out the frontier of how advanced our agents need to be before we get catastrophic misalignment?" I don't think the former approach is very productive, because by that standard no work is ever useful. So I tend to focus on the latter, with the theory of victory being "push the frontier far enough that we get a virtuous cycle of automating alignment work".

2Joe_Collman3mo
Ok, thanks - I can at least see where you're coming from then. Do you think debate satisfies the latter directly - or is the frontier only pushed out if it helps in the automation process? Presumably the latter (??) - or do you expect e.g. catastrophic out-with-a-whimper dynamics before deceptive alignment?

I suppose I'm usually thinking of a "do we learn anything about what a scalable alignment approach would look like?" framing. Debate doesn't seem to get us much there (whether for scalable ELK, or scalable anything else), unless we can do the automating alignment work thing - and I'm quite sceptical of that (weakly, and with hand-waving).

I can buy a safe-AI-automates-quite-a-bit-of-busy-work argument; once we're talking about AI that's itself coming up with significant alignment breakthroughs, I have my doubts. What we want seems to be unhelpfully complex, such that I expect we need to target our helper AIs precisely (automating capabilities research seems much simpler). Since we currently can't target our AIs precisely, I imagine we'd use (amplified-)human-approval. My worry is that errors/noise compounds for indirect feedback (e.g. deep debate tree on a non-crisp question), and that direct feedback is only as good as our (non-amplified) ability to do alignment research. AI that doesn't have these problems seems to be the already dangerous variety (e.g. AI that can infer our goal and optimize for it).

I'd be more optimistic if I thought we were at/close-to a stage where we know the crisp high-level problems we need to solve, and could ask AI assistants for solutions to those crisp problems.

That said, I certainly think it's worth thinking about how we might get to a "virtuous cycle of automating alignment work". It just seems to me that it's bottlenecked on the same thing as our direct attempts to tackle the problem: our depth of understanding.

Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I don't really know what "reliable empirical feedback" means in this context - if you have sufficiently reliable feedback mechanisms, then you've solved m... (read more)

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I do really feel like it would have been really extremely hard to build rockets if we had to get it right on the very first try.

I think for rockets the fact that it is so costly to experiment with stuff, explains the majority of the difficulty of rocket engineering. I agree you also have very little chance to build a successful space rocket without having a g... (read more)

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I basically buy that argument, though I do still think lack of shots is the main factor which makes alignment harder than most other technical fields in their preparadigmatic stage.

Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late.

I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd still be hard to convert that into a solution for alignment. Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).

I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd be very hard to convert that into a solution for alignment.

If I imagine that we magically had a boxing setup which let us experiment with powerful AGI alignment without dying, I do agree it would still be hard to solve alignment. But it wouldn't be harder than the core problems of any other field of science/engineering. It wouldn't be unusually hard, by the standards of technical research.

Of course, "empirical ... (read more)

Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".

My model is that John is talking about "evidence on whether an AI alignment solution is sufficient", and you understood him to say "evidence on whether the AI Alignment problem is real/difficult". My guess is you both agree on the former, but I am not confident.

No, I'm thinking of cases where Alice>Bob, and trying to gesture towards the distinction between "Bob knows that Alice believes X" and "Bob can use X to make predictions".

For example, suppose that Bob is a mediocre physicist and Alice just invented general relativity. Bob knows that Alice believes that time and space are relative, but has no idea what that means. So when trying to make predictions about physical events, Bob should still use Newtonian physics, even when those calculations require assumptions that contradict Alice's known beliefs.

3Abram Demski3mo
I think Bob still doesn't really need a two-part strategy in this case. Bob knows that Alice believes "time and space are relative", so Bob believes this proposition, even though Bob doesn't know the meaning of it. Bob doesn't need any special-case rule to predict Alice. The best thing Bob can do in this case still seems like, predict Alice based off of Bob's own beliefs.

(Perhaps you are arguing that Bob can't believe something without knowing what that thing means? But to me this requires bringing in extra complexity which we don't know how to handle anyway, since we don't have a bayesian definition of "definition" to distinguish "Bob thinks X is true but doesn't know what X means" from a mere "Bob thinks X is true".)

A similar example would be an auto mechanic. You expect the mechanic to do things like pop the hood, get underneath the vehicle, grab a wrench, etc. However, you cannot predict which specific actions are useful for a given situation. We could try to use a two-part model as you suggest, where we (1) maintain an incoherent-but-useful model of car-specific beliefs mechanics have, such as "wrenches are often needed"; (2) use the best of our own beliefs where that model doesn't apply. However, this doesn't seem like it's ever really necessary or like it saves processing power for bounded reasoners, because we also believe that "wrenches are sometimes useful". This belief isn't specific enough that we could reproduce the mechanic's actions by acting on these beliefs; but, that's fine, that's just because we don't know enough.

(Perhaps you have in mind a picture where we can't let incoherent beliefs into our world-model -- our limited understanding of Alice's physics, or of the mechanic's work, means that we want to maintain a separate, fully coherent world-model, and apply our limited understanding of expert knowledge only as a patch. If this is what you are getting at, this seems reasonable, so long as we can still count the whole resulting thing "m

Interesting post! Two quick comments:

Sometimes we analyze agents from a logically omniscient perspective. ... However, this omniscient perspective eliminates Vingean agency from the picture.

Another example of this happening comes when thinking about utilitarian morality, which by default doesn't treat other agents as moral actors (as I discuss here).

Bob has minimal use for attributing beliefs to Alice, because Bob doesn't think Alice is mistaken about anything -- the best he can do is to use his own beliefs as a proxy, and try to figure out what Alice will

... (read more)
2Abram Demski3mo
Interesting point! It sounds to me like you're thinking of cases on my spectrum, somewhere between Alice>Bob and Bob>Alice. If Bob thinks Alice knows strictly more than Bob, then Bob can just use Bob's own beliefs, even when specific-things-bob-knows-Alice-believes are relevant -- because Bob also already believes those things, by hypothesis. So it's only in intermediate cases that Bob might get a benefit from a split strategy like the one you describe.
  • Another response is "The AI paralyzes your face into smiling." 
    • But this is actually a highly nontrivial claim about the internal balance of value and computation which this reinforcement schedule carves into the AI. Insofar as this response implies that an AI will primarily "care about" literally making you smile, that seems like a highly speculative and unsupported claim about the AI internalizing a single powerful decision-relevant criterion / shard of value, which also happens to be related to the way that humans conceive of the situation (i.e. som
... (read more)
2Alex Turner4mo
I don't know? Seems like a representative kind of "potential risk" I've read about before, but I'm not going to go dig it up right now. (My post also isn't primarily about who said what, so I'm confused by your motivation for posting this question?)

I intend to convert this report to a nicely-formatted PDF with academic-style references. Please comment below, or message me, if you're interested in being paid to do this. EDIT: have now hired someone to do it.

More generally, I'll likely make a number of edits over the coming weeks, so comments and feedback would be very welcome.

3Robert Kirk3mo
Me, modelling skeptical ML researchers who may read this document: It felt to me that "Large-scale goals are likely to incentivize misaligned power-seeking" and "AGIs' behavior will eventually be mainly guided by goals they generalize to large scales" were the least well-argued sections (in that while reading them I felt less convinced, and the arguments were more hand-wavy than before).

In particular, the argument that we won't be able to use other AGIs to help with supervision because of collusion is entirely contained in footnote 22, and doesn't feel that robust to me - or at least it seems easier for a skeptical reader to dismiss that, and hence not think the rest of section 3 is well-founded. Maybe it's worth adding another argument for why we probably can't just use other AGIs to help with alignment, or at least that we don't currently have good proposals for doing so that we're confident will work (e.g. how do we know the other AGIs are aligned and are hence actually helping).

Also seems to be saying that positive goals won't generalise correctly because we need to get the positive goals exactly correct on the first try. I don't know if that is exactly an argument for why positive goals won't generalise correctly. It feels like this paragraph is trying to preempt the counterargument to this section that goes something like "Why wouldn't we just interactively adjust the objective if we see bad behaviour?", by justifying why we would need to get it right robustly and on the first try and throughout training, because the AGI will stop us doing this modification later on. Maybe it would be better to frame it that way if that was the intention.

Note that I agree with the document and I'm in favour of producing more ML-researcher-accessible descriptions of and motivations for the alignment problem, hence this effort to make the document more robust to skeptical ML researchers.

Just skimmed the course. One suggestion (will make more later): adding the Goal Misgeneralization paper from Langosco et al. as a core reading in the week on Detecting and Forecasting Emergent Behavior.

Hmm, perhaps clearer to say "reward does not automatically reinforce reward-focused thoughts into terminal values", given that we both agree that agents will have thoughts about reward either way.

But if you agree that reward gets reinforced as an instrumental value, then I think your claims here probably need to actually describe the distinction between terminal and instrumental values. And this feels pretty fuzzy - e.g. in humans, I think the distinction is actually not that clear-cut.

In other words, if everyone agrees that reward likely becomes a strong ... (read more)

Amplification can just be used as a method for making more and better common-sense improvements, though. You could also do all sorts of other stuff with it, but standard examples (like "catch agents when they lie to us") seem very much like common-sense improvements.

+1 on this comment, I feel pretty confused about the excerpt from Paul that Steve quoted above. And even without the agent deliberately deciding where to avoid exploring, incomplete exploration may lead to agents which learn non-reward goals before convergence - so if Paul's statement is intended to refer to optimal policies, I'd be curious why he thinks that's the most important case to focus on.

I have relatively little idea how to "improve" a reward function so that it improves the inner cognition chiseled into the policy, because I don't know the mapping from outer reward schedules to inner cognition within the agent.

You don't need to know the full mapping in order to suspect that, when we reward agents for doing undesirable things, we tend to get more undesirable cognition. For example, if we reward agents for lying to us, then we'll tend to get less honest agents. We can construct examples where this isn't true but it seems like a pretty reaso... (read more)

3Alex Turner4mo
This specific point is why I said "relatively" little idea, and not zero idea. You have defended the common-sense version of "improving" a reward function (which I agree with, don't reward obvious bad things), but I perceive you to have originally claimed a much more aggressive and speculative claim, which is something like "'amplified' reward signals are improvements over non-'amplified' reward signals" (which might well be true, but how would we know?).

The way I attempt to avoid confusion is to distinguish between the RL algorithm's optimization target and the RL policy's optimization target, and then avoid talking about the "RL agent's" optimization target, since that's ambiguous between the two meanings. I dislike the title of this post because it implies that there's only one optimization target, which exacerbates this ambiguity. I predict that if you switch to using this terminology, and then start asking a bunch of RL researchers questions, they'll tend to give broadly sensible answers (conditional ... (read more)

4Alex Turner4mo
Actually, while I did recheck the Reward is Enough paper, I think I did misunderstand part of it in a way which wasn't obvious to me while I reread, which makes the paper much less egregious. I am updating that you are correct and I am not spending enough effort on favorably interpreting existing discourse. I still disagree with parts of that essay and still think Sutton & co don't understand the key points. I still think you underestimate how much people don't get these points. I am provisionally retracting the comment you replied to while I compose a more thorough response (may be a little while).

Agreed on both counts for your first sentence. The "and" in "reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts" is doing important work; "magically" is meant to apply to the conjunction of the clauses. I added the second clause in order to pre-empt this objection. Maybe I should have added "reinforce those reward-focused thoughts into terminal values." Would that have been clearer? (I also have gone ahead and replaced "magically" with "automatically.")

Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won't have them changed by small amounts of training text.

1Ofer4mo
Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise. The relevant texts I'm thinking about here are:

  1. Descriptions of certain tricks to evade our safety measures.
  2. Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might "hijack" the model's logic).

Worrying about which alignment writing ends up in the training data feels like a very small lever for affecting alignment; my general heuristic is that we should try to focus on much bigger levers.

1Ofer4mo
Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?

Great post. Two comments:

That can be folded into the utility function, however. Just make the ratings of the deferential person mostly copy the ratings of their partner.

Can you say more specifically how this is done?

the axiom of Independence of Irrelevant Alternatives....is not really a desiderata at all, it's actually an extremely baffling property.

The reason it's a desideratum is because it makes bargaining more robust to variation in how the game is defined. I agree it's counterintuitive within the context of a given game though. So maybe the best appro... (read more)

  1. Stop worrying about finding “outer objectives” which are safe to maximize.[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function). 
    1. Instead, focus on building good cognition within the agent. 
    2. In my ontology, there's only an inner alignment problem: How do we grow good cognition inside of the trained agent?

This feels very strongly reminiscent of an update I made a while back, and which I tried to convey in this section of AGI safety from first principle... (read more)

2Alex Turner4mo
I think that few people understand these points already. If RL professionals did understand this point, there would be pushback on Reward is Enough from RL professionals pointing out that reward is not the optimization target. After 15 minutes of searching, I found no [https://robotic.substack.com/p/reward-is-not-enough] one [https://www.reddit.com/r/MachineLearning/comments/pwju6t/what_are_your_thoughts_on_the_reward_is_enough/] making [https://arxiv.org/abs/2112.15422] the [https://venturebeat.com/2021/07/10/building-artificial-intelligence-reward-is-not-enough/] counterpoint [https://www.lesswrong.com/posts/frApEhpyKQAcFvbXJ/reward-is-not-enough]. I mean, that thesis is just so wrong, and it's by famous researchers, and no one points out the obvious error. RL researchers don't get it.[1] It's not complicated to me. (Do you know of any instance at all of someone else (outside of alignment) making the points in this post?)

Currently not convinced by / properly understanding Paul's counterpoints.

1. ^ Although I flag that we might be considering different kinds of "getting it", where by my lights, "getting it" means "not consistently emitting statements which contravene the points of this post", while you might consider "if pressed on the issue, will admit reward is not the optimization target" to be "getting it."

When you say things like "Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported", this assumes that the people doing this reasoning were using the premise in the mistaken way

I have considered the hypothesis that most alignment researchers do understand this post already, while also somehow reliably emitting statements which, to me, indicate that they do not understand it. I deem this hypothesis unlikely. I have also considered that I may be misunderstanding them, and think in some small fraction of instances I ... (read more)

Yeah, my comment was sloppily phrased; I agree with "I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain."

I feel confused about the difference between your "attempt to formalize" step and Paul's "attempt to concretize" step. It feels like you can view either as a step towards the other - if you successfully formalize, then presumably you'll be able to concretize; but also one valuable step towards formalizing is by finding concrete examples and then generalizing from them. I think everyone agrees that it'd be great to end up with a formalism for the problem, and then disagrees on how much that process should involve "finding concrete examples of the problem". ... (read more)

This comment made me notice a kind of duality:
- Paul wants to focus on finding concrete problems, and claims that Nate/Eliezer aren't being very concrete with their proposed problems.
- Nate/Eliezer want to focus on finding concrete solutions, and claim that Paul/other alignment researchers aren't being very concrete with their proposed solutions.

It seems like "how well do we understand the problem" is one a crux here. I disagree with John's comment because it feels like he's assuming too much about our understanding of the problem. If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn't exist.

I don't feel like this is right (though I think this duality feels like a real thing that is important sometimes and is interesting to think about, so appreciated the comment).

ARC is spending its time right now (i) trying to write down concrete algorithms that solve ELK using heuristic arguments, and then trying to produce concrete examples in which they do the wrong thing, (ii) trying to write down concrete formalizations of heuristic arguments that have the desiderata needed for those algorithms to work, and trying to identify cases in which our algorith... (read more)

6johnswentworth5mo
I don't think that's how this works? The strategy I'm recommending explicitly contains two parts where we gain evidence about whether a part of the problem actually exists:

  • noticing an intuitive pattern in the failure-modes of some strategies
  • attempting to formalize (which presumably includes backpropagating our mathematics into our intuitions)

... so if a part of the problem doesn't exist, then (a) we probably don't notice a pattern in the first place, but even if our notoriously unreliable human pattern-matchers over-match, then (b) while we're attempting to formalize we have plenty of opportunity to notice that maybe the pattern doesn't actually exist the way we thought it did.

It feels like you're looking for a duality which does not exist. I mean, the duality between "look for concrete solutions" and "look for concrete problems" I buy (and that would indeed cause one side to be over-optimistic and the other over-pessimistic in exactly the pattern we actually see between Paul and Nate/Eliezer). But it feels like you're also looking for a duality between how-Paul's-recommended-search-order-just-fails and how-mine-just-fails. And the reason that duality does not exist is because my recommended search order is using strictly more evidence; Paul is basically advocating ignoring a whole class of very useful evidence, and that makes his strategy straightforwardly suboptimal. If we were both picking different points on a pareto frontier, then yeah, there'd be a trade-off. But Paul just isn't on the pareto frontier.

Thanks for the post, I agree with a lot of it. A few quick comments on your dialogue with imaginary me/Rohin, which highlight the main points of disagreement:

And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don't see why you're worried.

More accurate to say "I don't see why you're so confident". I think I see why you're worried, and I'm worried too for the same reasons. Indeed, I wrote a similar post recently which lists out research directions and re... (read more)

The post is phrased pretty strongly (e.g. it makes claims about things being "inaccessible" and "intractable").

Especially given the complexity of the topic, I expect the strength of these claims to be misleading. What one person thinks of as "roundabout methods" another might consider "directly specifying". I find it pretty hard to tell whether I actually disagree with your and Alex's views, or just the way you're presenting them.

5Alex Turner5mo
I think the strongest claim is in the title, which does concisely describe my current worldview and also Quintin's point that "the genome faces similar inaccessibility issues as us wrt to learned world models." I went back and forth several times on whether to title the post "Human values & biases seem inaccessible to the genome", but I'm presently sticking to the current title, because I think it's true&descriptive&useful in both of the above senses, even though it has the cost of (being interpreted as) stating as fact an inference which I presently strongly believe. Beyond that, I think I did a pretty good job of demarcating inference vs observation, of demarcating fact vs model? I'm open to hearing suggested clarifications. I meant for the following passage to resolve that ambiguity: But I suppose it still leaves some room to wonder. I welcome suggestions for further clarifying the post (although it's certainly not your responsibility to do so!). I'm also happy to hop on a call / meet up with you sometime, Richard.

I'm feeling very excited about this agenda. Is there currently a publicly-viewable version of the living textbook? Or any more formal writeup which I can include in my curriculum? (If not I'll include this post, but I expect many people would appreciate a more polished writeup.)

1Diffractor5mo
If you're looking for curriculum materials, I believe that the most useful reference would probably be my "Infra-exercises", a sequence of posts containing all the math exercises you need to reinvent a good chunk of the theory yourself. Basically, it's the textbook's exercise section, and working through interesting math problems and proofs on one's own has a much better learning feedback loop and retention of material than slogging through the old posts. The exercises are short on motivation and philosophy compared to the posts, though, much like how a functional analysis textbook takes for granted that you want to learn functional analysis and doesn't bother motivating it.

The primary problem is that the exercises aren't particularly calibrated in terms of difficulty, and in order for me to get useful feedback, someone has to actually work through all of them, so feedback has been a bit sparse. So I'm stuck in a situation where I keep having to link everyone to the infra-exercises over and over and it'd be really good to just get them out and publicly available, but if they're as important as I think, then the best move is something like "release them one at a time and have a bunch of people work through them as a group" like the fixpoint exercises, instead of "just dump them all as public documents". I'll ask around about speeding up the publication of the exercises and see what can be done there.

I'd strongly endorse linking this introduction even if the exercises are linked as well, because this introduction serves as the table of contents to all the other applicable posts.

I do expect this to happen. The question is merely: what's the best predictor of how hard it is to find inference algorithms more efficient than gradient descent? Is it whether those inference algorithms are more complex than gradient descent? Or is it whether those inference algorithms run for longer than gradient descent? Since gradient descent is very simple but takes a long time to run, my bet is the latter: there are many simple ways to convert compute to optimization, but few compute-cheap ways to convert additional complexity to optimization.

1Charlie Steiner5mo
Faster than gradient descent is not a selective pressure, at least if we're considering typical ML training procedures. What is a selective pressure is regularization, which functions much more like a complexity prior than a speed prior. So (again sticking to modern day ML as an example, if you have something else in mind that would be interesting) of course there will be a cutoff in terms of speed, excluding all algorithms that don't fit into the neural net. But among algorithms that fit into the NN, the penalty on their speed will be entirely explainable as a consequence of regularization that e.g. favors circuits that depend on fewer parameters, and would therefore be faster after some optimization steps. If examples of successful parameters were sparse and tended to just barely fit into the NN, then this speed cutoff will be very important. But in the present day we see that good parameters tend to be pretty thick on the ground, and you can fairly smoothly move around in parameter space to make different tradeoffs.

No, I wasn't advocating adding a speed penalty, I was just pointing at a reason to think that a speed prior would give a more accurate answer to the question of "which is favored" than the bounded simplicity prior you're assuming:

Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it's true of transformers)

But now I realise that I don't understand why you think this is true of transformers. Could you explain? It seems to me that there are many very simple hypotheses which take a long time to calculate, and which transformers therefore can't be representing.

2Vanessa Kosoy5mo
The word "bounded" in "bounded simplicity prior" referred to bounded computational resources. A "bounded simplicity prior" is a prior which involves either a "hard" (i.e. some hypotheses are excluded) or a "soft" (i.e. some hypotheses are down-weighted) bound on computational resources (or both), and also inductive bias towards simplicity (specifically it should probably behave as ~ 2^{-description complexity}). For a concrete example, see the prior I described here [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=ovBmi2QFikE6CRWtj] (w/o any claim to originality).

In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.

Hence, the honest-imitation hypothesis is heavily penalized compared to hypotheses that are in themselves agents which are more "epistemically sophisticated" than the outer loop of the AI.

In a deep learning context, the latter hypothesis seems much more heavily favored when using a simplicity prior (since gradient descent is simple to specify) than a speed prior (since gradient descent takes a lot of computation). So as long as the compute costs of inference remain smaller than the compute costs of training, a speed prior seems more appropriate for evaluating how easily hypotheses can become more epistemically sophisticated than the outer loop.
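
To spell out the contrast being drawn here (standard definitions, my own sketch rather than anything from the original comments): a simplicity prior weights hypotheses only by description length, while a Levin-style speed prior also charges for runtime:

```latex
% Simplicity prior: weight depends only on description length K(h)
P_{\text{simplicity}}(h) \;\propto\; 2^{-K(h)}

% Speed prior (Levin-style): also penalize the runtime t(h)
P_{\text{speed}}(h) \;\propto\; 2^{-\left(K(h) + \log_2 t(h)\right)} \;=\; \frac{2^{-K(h)}}{t(h)}
```

Gradient descent is simple to specify (small K) but slow to run (large t), so the two priors rank it, and hypotheses that internally run comparable optimization processes, very differently.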

4Vanessa Kosoy5mo
Not quite sure what you're saying here. Is the claim that speed penalties would help shift the balance against mesa-optimizers? This kind of solution is worth looking into, but I'm not too optimistic about it atm.

First, the mesa-optimizer probably won't add a lot of overhead compared to the considerable complexity of emulating a brain. In particular, it need not work by anything like our own ML algorithms. So, if it's possible to rule out mesa-optimizers like this, it would require a rather extreme penalty. Second, there are limits on how much you can shape the prior while still having feasible learning, and I suspect that such an extreme speed penalty would not cut it. Third, depending on the setup, an extreme speed penalty might harm generalization[1]. But we definitely need to understand it more rigorously.

1. The most appealing version is Christiano's "minimal circuits", but that only works for inputs of fixed size. It's not so clear what the variable-input-size ("transformer") version of that would be.
1Charlie Steiner5mo
This seems like a good thing to keep in mind, but also sounds too pessimistic about the ability of gradient descent to find inference algorithms that update more efficiently than gradient descent.

Quick brainstorm:

  1. Context-sensitivity: the goals that a corrigible AGI pursues should depend sensitively on the intentions of its human users when it’s run.
  2. Default off: a corrigible AGI run in a context where the relevant intentions or instructions aren’t present shouldn’t do anything.
  3. Explicitness: a corrigible AGI should explain its intentions at a range of different levels of abstraction before acting. If its plan stops being a central example of the explained intentions (e.g. due to unexpected events), it should default to a pre-specified fallback.
  4. Goal r
... (read more)

I think it's less about how many holes there are in a given plan, and more like "how much detail does it need before it counts as a plan?" If someone says that their plan is "Keep doing alignment research until the problem is solved", then whether or not there's a hole in that plan is downstream of all the other disagreements about how easy the alignment problem is. But it seems like, separate from the other disagreements, Eliezer tends to think that having detailed plans is very useful for making progress.

Analogy for why I don't buy this: I don't think that the Wright brothers' plan to solve the flying problem would count as a "plan" by Eliezer's standards. But it did work.

Strong +1s to many of the points here. Some things I'd highlight:

  1. Eliezer is not doing the type of reasoning that can justifiably defend the level of confidence he claims to have. If he were, he'd have much more to say about the specific details of consequentialism, human evolution, and the other key intuitions shaping his thinking. In my debate with him he mentioned many times how difficult he's found it to explain these ideas to people. I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn't have found th
... (read more)

Meta-level: +1 for actually writing a thing.

Also meta-level: -1 because when I read this I get the sense that you started from a high-level intuition and then constructed a set of elaborate explanations of your intuition, but then phrased it as an argument.

I personally find this frustrating because I keep seeing people being super confident in their high-level intuitive metaphorical view of consequentialism and then never doing the work of actually digging beneath those metaphors. (Less a criticism of this post, more a criticism of everyone upvoting this p... (read more)

Thanks for the post, I think it's a useful framing. Two things I'd be interested in understanding better:

In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”).

As I said in a reply to Eliezer... (read more)

Sorry, I should have been clearer. Let's suppose that a copy of you spent however long it takes to write an honest textbook with the solution to alignment (let's call it N Yudkowsky-years), and an evil copy of you spent N Yudkowsky-years writing a deceptive textbook trying to make you believe in a false solution to alignment, and you're given one but not told which. How long would it take you to reach 90% confidence about which you'd been given? (You're free to get a team together to run a bunch of experiments and implementations, I'm just asking that you ... (read more)

Depends what the evil clones are trying to do.

Get me to adopt a solution wrong in a particular direction, like a design that hands the universe over to them?  I can maybe figure out the first time through who's out to get me, if it's 200 Yudkowsky-years.  If it's 200,000 Yudkowsky-years I think I'm just screwed.

Get me to make any lethal mistake at all?  I don't think I can get to 90% confidence period, or at least, not without spending an amount of Yudkowsky-time equivalent to the untrustworthy source.

Hmm, okay,  here's a variant. Assume it would take N Yudkowsky-years to write the textbook from the future described above. How many Yudkowsky-years does it take to evaluate a textbook that took N Yudkowsky-years to write, to a reasonable level of confidence (say, 90%)?

3Eliezer Yudkowsky6mo
If I know that it was written by aligned people? I wouldn't just be trying to evaluate it myself; I'd try to get a team together to implement it, and understanding it well enough to implement it would be the same process as verifying whatever remaining verifiable uncertainty was left about the origins, where most of that uncertainty is unverifiable because the putative hostile origin is plausibly also smart enough to sneak things past you.

Thanks for writing this, I agree that people have underinvested in writing documents like this. I agree with many of your points, and disagree with others. For the purposes of this comment, I'll focus on a few key disagreeements.

My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the pheno

... (read more)

Maybe one way to pin down a disagreement here: imagine the minimum-intelligence AGI that could write this textbook (including describing the experiments required to verify all the claims it made) in a year if it tried. How many Yudkowsky-years does it take to safely evaluate whether following a textbook which that AGI spent a year writing will kill you?

Infinite?  That can't be done?

Small request: given that it's plausible that a bunch of LW material on this topic will end up quoted out of context, would you mind changing the headline example in section 5 to something less bad-if-quoted-out-of-context?
