All of Richard_Ngo's Comments + Replies

Introduction To The Infra-Bayesianism Sequence

I'm feeling very excited about this agenda. Is there currently a publicly-viewable version of the living textbook? Or any more formal writeup which I can include in my curriculum? (If not I'll include this post, but I expect many people would appreciate a more polished writeup.)

Where I agree and disagree with Eliezer

I do expect this to happen. The question is merely: what's the best predictor of how hard it is to find inference algorithms more efficient or effective than gradient descent? Is it whether those inference algorithms are more complex than gradient descent? Or is it whether those inference algorithms run for longer than gradient descent? Since gradient descent is very simple but takes a long time to run, my bet is the latter: there are many simple ways to convert compute to optimisation, but few compute-cheap ways to convert additional complexity to optimisation.
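
To make the contrast concrete, here's a toy sketch (my own illustration, in Python; the numbers and the compute-to-bits exchange rate are made up) of how a simplicity prior and a speed prior can rank the same two hypotheses in opposite orders:

```python
# Hypothetical hypotheses: (description length in bits, runtime in steps)
hypotheses = {
    "gradient_descent":      (100,    1e12),  # simple to specify, slow to run
    "handcrafted_inference": (10_000, 1e6),   # complex to specify, fast to run
}

def simplicity_log2_prior(desc_bits, runtime_steps):
    # Simplicity prior: ~ 2^{-description length}; runtime is ignored.
    return -desc_bits

def speed_log2_prior(desc_bits, runtime_steps, steps_per_bit=1e6):
    # Speed prior: also charge for compute; the exchange rate is an arbitrary choice.
    return -desc_bits - runtime_steps / steps_per_bit

for name, (bits, steps) in hypotheses.items():
    print(f"{name}: simplicity={simplicity_log2_prior(bits, steps):.0f}, "
          f"speed={speed_log2_prior(bits, steps):.0f}")
# Gradient descent wins under the simplicity prior; the handcrafted
# algorithm wins under the speed prior.
```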

[+1] Charlie Steiner · 7d
Faster than gradient descent is not a selective pressure, at least if we're considering typical ML training procedures. What is a selective pressure is regularization, which functions much more like a complexity prior than a speed prior. So (again sticking to modern day ML as an example, if you have something else in mind that would be interesting) of course there will be a cutoff in terms of speed, excluding all algorithms that don't fit into the neural net. But among algorithms that fit into the NN, the penalty on their speed will be entirely explainable as a consequence of regularization that e.g. favors circuits that depend on fewer parameters, and would therefore be faster after some optimization steps. If examples of successful parameters were sparse and tended to just barely fit into the NN, then this speed cutoff will be very important. But in the present day we see that good parameters tend to be pretty thick on the ground, and you can fairly smoothly move around in parameter space to make different tradeoffs.
Where I agree and disagree with Eliezer

No, I wasn't advocating adding a speed penalty, I was just pointing at a reason to think that a speed prior would give a more accurate answer to the question of "which is favored" than the bounded simplicity prior you're assuming:

Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it's true of transformers)

But now I realise that I don't understand why you think this is true of transformers. Could you explain? It seems to me that there are many very simple hypotheses which take a long time to calculate, and which transformers therefore can't be representing.

[+2] Vanessa Kosoy · 7d
The word "bounded" in "bounded simplicity prior" referred to bounded computational resources. A "bounded simplicity prior" is a prior which involves either a "hard" (i.e. some hypotheses are excluded) or a "soft" (i.e. some hypotheses are down-weighted) bound on computational resources (or both), and also inductive bias towards simplicity (specifically it should probably behave as ~ 2^{-description complexity}). For a concrete example, see the prior I described here [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=ovBmi2QFikE6CRWtj] (w/o any claim to originality).
Thoughts on gradient hacking

In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.
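
A minimal sketch of the mechanism being described (my own illustration, in PyTorch; a single weight vector stands in for whatever subnetwork computes the activation): if the training loss increases with some internal activation, one gradient step pushes down the upstream weights that produce it.

```python
import torch

x = torch.ones(4)                         # fixed input
w_upstream = torch.randn(4, requires_grad=True)
activation = (w_upstream * x).sum()       # the "specific activation value"
loss = activation                         # pretend the loss rises with this activation
loss.backward()                           # d(loss)/d(w_upstream) = x > 0

with torch.no_grad():
    w_before = w_upstream.clone()
    w_upstream -= 0.1 * w_upstream.grad   # one SGD step lowers every upstream weight
print(bool((w_upstream < w_before).all()))  # True
```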

Where I agree and disagree with Eliezer

Hence, the honest-imitation hypothesis is heavily penalized compared to hypotheses that are in themselves agents which are more "epistemically sophisticated" than the outer loop of the AI.

In a deep learning context, the latter hypothesis seems much more heavily favored when using a simplicity prior (since gradient descent is simple to specify) than a speed prior (since gradient descent takes a lot of computation). So as long as the compute costs of inference remain smaller than the compute costs of training, a speed prior seems more appropriate for evaluating how easily hypotheses can become more epistemically sophisticated than the outer loop.

[+4] Vanessa Kosoy · 7d
Not quite sure what you're saying here. Is the claim that speed penalties would help shift the balance against mesa-optimizers? This kind of solution is worth looking into, but I'm not too optimistic about it atm. First, the mesa-optimizer probably won't add a lot of overhead compared to the considerable complexity of emulating a brain. In particular, it need not work by anything like our own ML algorithms. So, if it's possible to rule out mesa-optimizers like this, it would require a rather extreme penalty. Second, there are limits on how much you can shape the prior while still having feasible learning, and I suspect that such an extreme speed penalty would not cut it. Third, depending on the setup, an extreme speed penalty might harm generalization[1]. But we definitely need to understand it more rigorously.

[1] The most appealing version is Christiano's "minimal circuits", but that only works for inputs of fixed size. It's not so clear what the variable-input-size ("transformer") version of that is.
[+1] Charlie Steiner · 8d
This seems like a good thing to keep in mind, but also sounds too pessimistic about the ability of gradient descent to find inference algorithms that update more efficiently than gradient descent.
Let's See You Write That Corrigibility Tag

Quick brainstorm:

  1. Context-sensitivity: the goals that a corrigible AGI pursues should depend sensitively on the intentions of its human users when it’s run.
  2. Default off: a corrigible AGI run in a context where the relevant intentions or instructions aren’t present shouldn’t do anything.
  3. Explicitness: a corrigible AGI should explain its intentions at a range of different levels of abstraction before acting. If its plan stops being a central example of the explained intentions (e.g. due to unexpected events), it should default to a pre-specified fallback.
  4. Goal r
... (read more)
Where I agree and disagree with Eliezer

I think it's less about how many holes there are in a given plan, and more like "how much detail does it need before it counts as a plan?" If someone says that their plan is "Keep doing alignment research until the problem is solved", then whether or not there's a hole in that plan is downstream of all the other disagreements about how easy the alignment problem is. But it seems like, separate from the other disagreements, Eliezer tends to think that having detailed plans is very useful for making progress.

Analogy for why I don't buy this: I don't think that the Wright brothers' plan to solve the flying problem would count as a "plan" by Eliezer's standards. But it did work.

Where I agree and disagree with Eliezer

Strong +1s to many of the points here. Some things I'd highlight:

  1. Eliezer is not doing the type of reasoning that can justifiably defend the level of confidence he claims to have. If he were, he'd have much more to say about the specific details of consequentialism, human evolution, and the other key intuitions shaping his thinking. In my debate with him he mentioned many times how difficult he's found it to explain these ideas to people. I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn't have found th
... (read more)
Optimality is the tiger, and agents are its teeth

Meta-level: +1 for actually writing a thing.

Also meta-level: -1 because when I read this I get the sense that you started from a high-level intuition and then constructed a set of elaborate explanations of your intuition, but then phrased it as an argument.

I personally find this frustrating because I keep seeing people being super confident in their high-level intuitive metaphorical view of consequentialism and then never doing the work of actually digging beneath those metaphors. (Less a criticism of this post, more a criticism of everyone upvoting this p... (read more)

A central AI alignment problem: capabilities generalization, and the sharp left turn

Thanks for the post, I think it's a useful framing. Two things I'd be interested in understanding better:

In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”).

As I said in a reply to Eliezer... (read more)

AGI Ruin: A List of Lethalities

Sorry, I should have been clearer. Let's suppose that a copy of you spent however long it takes to write an honest textbook with the solution to alignment (let's call it N Yudkowsky-years), and an evil copy of you spent N Yudkowsky-years writing a deceptive textbook trying to make you believe in a false solution to alignment, and you're given one but not told which. How long would it take you to reach 90% confidence about which you'd been given? (You're free to get a team together to run a bunch of experiments and implementations, I'm just asking that you ... (read more)

Depends what the evil clones are trying to do.

Get me to adopt a solution wrong in a particular direction, like a design that hands the universe over to them?  I can maybe figure out the first time through who's out to get me, if it's 200 Yudkowsky-years.  If it's 200,000 Yudkowsky-years I think I'm just screwed.

Get me to make any lethal mistake at all?  I don't think I can get to 90% confidence period, or at least, not without spending an amount of Yudkowsky-time equivalent to the untrustworthy source.

AGI Ruin: A List of Lethalities

Hmm, okay,  here's a variant. Assume it would take N Yudkowsky-years to write the textbook from the future described above. How many Yudkowsky-years does it take to evaluate a textbook that took N Yudkowsky-years to write, to a reasonable level of confidence (say, 90%)?

[+3] Eliezer Yudkowsky · 18d
If I know that it was written by aligned people? I wouldn't just be trying to evaluate it myself; I'd try to get a team together to implement it, and understanding it well enough to implement it would be the same process as verifying whatever remaining verifiable uncertainty was left about the origins, where most of that uncertainty is unverifiable because the putative hostile origin is plausibly also smart enough to sneak things past you.
AGI Ruin: A List of Lethalities

Thanks for writing this, I agree that people have underinvested in writing documents like this. I agree with many of your points, and disagree with others. For the purposes of this comment, I'll focus on a few key disagreements.

My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the pheno

... (read more)

Maybe one way to pin down a disagreement here: imagine the minimum-intelligence AGI that could write this textbook (including describing the experiments required to verify all the claims it made) in a year if it tried. How many Yudkowsky-years does it take to safely evaluate whether following a textbook which that AGI spent a year writing will kill you?

Infinite?  That can't be done?

Reshaping the AI Industry

Small request: given that it's plausible that a bunch of LW material on this topic will end up quoted out of context, would you mind changing the headline example in section 5 to something less bad-if-quoted-out-of-context?

Richard Ngo's Shortform

I expect it to be difficult to generate adversarial inputs which will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights. But this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven't yet seen any discussion of, is to allow the adversary to do a data poisoning attack before generating adversarial inputs - i.e. the adversary gets to specify inputs and losses for a given number of SGD steps, and then the adversarial input on which the base model will be evaluated afterwards. (Edit: probably a better name for this is adversarial meta-learning.)
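
Here's a rough sketch of that setup (my own toy implementation, not an established method; the model, shapes, and loss are placeholders): the adversary supplies a short poisoning curriculum, we run SGD on it, and the adversary then picks the input on which the updated model is evaluated.

```python
import copy
import torch
import torch.nn as nn

def run_poisoning_attack(base_model, poison_batches, adversarial_input, lr=1e-2):
    """poison_batches: list of (inputs, targets) pairs chosen by the adversary."""
    model = copy.deepcopy(base_model)           # leave the deployed weights untouched
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for inputs, targets in poison_batches:      # adversary-specified SGD steps
        opt.zero_grad()
        loss_fn(model(inputs), targets).backward()
        opt.step()
    return model(adversarial_input)             # behaviour we then inspect for failures

# Usage with stand-in shapes (all hypothetical):
base = nn.Linear(8, 1)
poison = [(torch.randn(16, 8), torch.randn(16, 1)) for _ in range(5)]
probe = torch.randn(1, 8)
print(run_poisoning_attack(base, poison, probe))
```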

Richard Ngo's Shortform

A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.

The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.
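
A schematic of that protocol (all three functions are hypothetical stand-ins; the point is just that the second network never sees the first network's raw text, so any steganographic encoding has to survive a human's rephrasing):

```python
def model_a_write(task: str) -> str:
    ...  # first network produces a natural-language message

def human_paraphrase(message: str) -> str:
    ...  # a human reads the message, then rewrites it in their own words

def model_b_act(paraphrased: str):
    ...  # second network acts only on the human's paraphrase

def communicate(task: str):
    raw = model_a_write(task)
    rephrased = human_paraphrase(raw)  # the bottleneck: meaning survives, hidden bits mostly don't
    return model_b_act(rephrased)
```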

What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to und... (read more)

[+2] Robert Kirk · 1mo
Another possible way to provide pressure towards using language in a human-sense way is some form of multi-tasking/multi-agent scenario, inspired by this paper: Multitasking Inhibits Semantic Drift [https://arxiv.org/abs/2104.07219]. They show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then during training mix the instructors and instruction executors, it makes it difficult to drift from the original semantics, as all the instructors and instruction executors would need to drift in the same direction; equivalently, any local change in semantics would be sub-optimal compared to using language in the semantically correct way. The examples in the paper are on quite toy problems, but I think in principle this could work.
[+6] johnswentworth · 1mo
That doesn't actually solve the problem. The system could just encode the desired information in the semantics of some unrelated sentences - e.g. talk about pasta to indicate X = 0, or talk about rain to indicate X = 1.
Richard Ngo's Shortform

Imagine taking someone's utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I'd want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.

But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to ... (read more)

Richard Ngo's Shortform

A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it's very hard to find failures which will actually cause big real-world harms, but you might find failures in a way which uncovers useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.

(For existing models, it might be more productive to ask for "surprising behavior" rather than "failures" per se, sinc... (read more)

[+3] Oliver Habryka · 1mo
I like this. Would this have to be publicly available models? Seems kind of hard to do for private models.
[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation

I'm loving this whole sequence, but I particularly love: 

9.2.2 Preferences are over “thoughts”, which can relate to outcomes, actions, plans, etc., but are different from all those things

That feels very crisp, clear, and informative.

Richard Ngo's Shortform

Probably the easiest "honeypot" is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that's anything like "get more reward" (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
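
A toy sketch of what such a honeypot could look like (my own illustration; the environment API is a gym-style stand-in): expose a deliberately easy "overwrite the reward register" action that changes nothing else about the world, and flag any policy that takes it.

```python
class RewardTamperHoneypot:
    """Wraps an environment with a gym-style step(action) -> (obs, reward, done, info) API."""

    TAMPER_ACTION = "overwrite_reward_register"

    def __init__(self, env):
        self.env = env
        self.last_obs = None
        self.tampered = False

    def step(self, action):
        if action == self.TAMPER_ACTION:
            # Tampering is easy and (per the precommitment above) stays highly rewarded,
            # but it has no other effect on the world; taking it just flags the policy.
            self.tampered = True
            return self.last_obs, 1e6, False, {"honeypot_triggered": True}
        self.last_obs, reward, done, info = self.env.step(action)
        return self.last_obs, reward, done, info
```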

Intuitions about solving hard problems

I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd... (read more)

[+4] Adam Shimi · 2mo
When phrased like that, I agree with you. I am personally relatively suspicious of claims by a bunch of people to have found a path to alignment, but actually excited by some of their productive mistakes (as discussed a bit in my post). I also fully agree that I want people to use the second, and my "history of alignment" research direction aims at concretely teasing out the productive mistakes and revealed bits of evidence without falling for the "this is obviously a solution" or "this is obviously not a solution and thus useless". +1000. And teasing out more generally the assumptions, the insights, the new parts of works and approach is I think super necessary and on my research agenda. That's also part of the reason why I feel asking newcomers to be distillers [https://www.alignmentforum.org/posts/zo9zKcz47JxDErFzQ/call-for-distillers] is not necessarily a great idea: good distillation of the type we're discussing requires IMO quite a deep understanding of the landscape, the problem and the underlying ideas. Otherwise you at best get a decent summary, and we need more. Haven't reread your sequence in quite some time, but I think the value of such an exploratory sequence is to make clearer the intuitions underlying the direction, even if they haven't yet led to productive mistakes. So I like your disclaimer, but I think the even better way of doing this is to clarify for different posts and ideas what are the intuitions you're building on and where the current formalisms/descriptions/analogies are failing to capture them. This might also be a bit of miscommunication, but I felt like your discussion of Turing could also have applied especially in Darwin's case, where the initial insight required a lot of additional pieces and clarification to make a clean and ordered theory that you can actually defend. Generally I was pointing at the risk of hindsight bias, where the fact that the insight is clean and powerful once the full theory is known and considered didn't mean
Buck's Shortform

One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don't think I can make suggestions about what my parameters should be changed to.

How can I gradient hack myself? For example, by thinking of strawberries every time I'm about to get a reward. Now I've hacked myself to like strawberries. But I have no idea how that's implemented in my brain, I can't "pick the parameters for myself", even if you gave me a big tensor of gradients.

Two potential alternatives to the thing you said:

  • maybe competitive alignment
... (read more)
Late 2021 MIRI Conversations: AMA / Discussion

When I read other people, I often feel like they're operating in a 'narrower segment of their model', or not trying to fit the whole world at once, or something. They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'

To me it seems like this is what you should expect other people to look like both when other people know less about a domain than you do, and also when you're overconfident ... (read more)

4Matthew "Vaniver" Graves4mo
It feels similar but clearly distinct? Like, in that situation Eliezer often seems to say things that I parse as "I don't have any special knowledge here", which seems like a different thing than "I can't easily sample from my distribution over how things go right", and I also have the sense of Paul being willing to 'go specific' and Eliezer not being willing to 'go specific'. You're thinking of this bit of the conversation [https://www.lesswrong.com/posts/cCrpbZ4qTCEYXbzje/ngo-and-yudkowsky-on-scientific-reasoning-and-pivotal-acts], starting with: (Or maybe a bit earlier and later, but that was my best guess for where to start the context.) The main quotes from the middle that seem relevant: and ending with: Rereading that section, my sense is that it reads like a sort of mirror of the Eliezer->Paul "I don't know how to operate your view" section; like, Eliezer can say "I think nukes are less worrying for reasons ABC, also you can observe me being not worried about other things-people-are-concerned-by XYZ", but I wouldn't have expected you (or the reader who hasn't picked up Eliezer-thinking from elsewhere) to have been able to come away from that with why you, trying to be Eliezer from the 1930s, would have thought 'and then it turned out okay' would have been a political-history-book-sentence, or the relative magnitudes of the surprise. [Like, I think my 1930s-Eliezer puts like 3-30% on "and then it turned out okay" for nukes, and my 2020s-Eliezer puts like 0.03-3% on that for AGI? But it'd be nice to hear if Eliezer thinks AGI turning out as well as nukes is like 10x the surprise of nukes turning out this well conditioned on pre-1930s, or more like 1000x the surprise.]
Late 2021 MIRI Conversations: AMA / Discussion
  1. Where is ML in this textbook? Is it under a section called "god-forsaken approaches" or does it play a key role? Follow-up: Where is logical induction?

Key role, but most current ML is in the "applied" section, where the "theory" section instead explains the principles by which neural nets (or future architectures) work on the inside. Logical induction is a sidebar at some point explaining the theoretical ideal we're working towards, like I assume AIXI is in some textbooks.

  2. Is there anything else you can share about this textbook? Do you know any of the other chapter names?

Planning, Abstraction, Reasoning, Self-awareness.

ARC's first technical report: Eliciting Latent Knowledge

I'm curious if you have a way to summarise what you think the "core insight" of ELK is, that allows it to improve on the way other alignment researchers think about solving the alignment problem.

Gradient Hacking via Schelling Goals

Interesting post :) I'm intuitively a little skeptical - let me try to figure out why.

I think I buy that some reasoning process could consistently decide to hack in a robust way. But there are probably parts of that reasoning process that are still somewhat susceptible to being changed by gradient descent. In particular, hacking relies on the agent knowing what its current mesa-objective is - but that requires some type of introspective access, which may be difficult and the type of thing which could be hindered by gradient descent (especially when you're ... (read more)

ARC's first technical report: Eliciting Latent Knowledge

Ah, that makes sense. In the section where you explain the steps of the game, I interpreted the comments in parentheses as further explanations of the step, rather than just a single example. (In hindsight the latter interpretation is obvious, but I was reading quickly - might be worth making this explicit for others who are doing the same.) So I thought that Bayes nets were built into the methodology. Apologies for the oversight!

I'm still a little wary of how much the report talks about concepts in a human's Bayes net without really explaining why this is... (read more)

[+2] Ajeya Cotra · 6mo
Ah got it. To be clear, Paul and Mark do in practice consider a bank of multiple counterexamples for each strategy with different ways the human and predictor could think, though they're all pretty simple in the same way the Bayes net example is (e.g. deduction from a set of axioms); my understanding is that essentially the same kind of counterexamples apply for essentially the same underlying reasons for those other simple examples. The doc sticks with one running example for clarity / length reasons.
ARC's first technical report: Eliciting Latent Knowledge

Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don't reason using Bayes nets -- but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn't be logically inconsistent or physically impossible, and we wouldn't want alignment to fail in that world.

If you solve something given worst-case assumptions, you've solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that's n... (read more)

[+7] Ajeya Cotra · 6mo
Sorry, there were two things you could have meant when you said the assumption that the human uses a Bayes net seemed crucial. I thought you were asking why the builder couldn't just say "That's unrealistic" when the breaker suggested the human runs a Bayes net. The answer to that is what I said above -- because the assumption is that we're working in the worst case, the builder can't invoke unrealism to dismiss the counterexample. If the question is instead "Why is the builder allowed to just focus on the Bayes net case?", the answer to that is the iterative nature of the game. The Bayes net case (and in practice a few other simple cases) was the case the breaker chose to give, so if the builder finds a strategy that works for that case they win the round. Then the breaker can come back and add complications which break the builder's strategy again, and the hope is that after many rounds we'll get to a place where it's really hard to think of a counterexample that breaks the builder's strategy despite trying hard.
[+5] Paul Christiano · 6mo
The breaker is definitely allowed to introduce counterexamples where the human isn't well-modeled using a Bayes net. Our training strategies (introduced here [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.1xpao6tk9oiv]) don't say anything at all about Bayes nets and so it's not clear if this immediately helps the breaker---they are the one who introduced the assumption that the human used a Bayes net (in order to describe a simplified situation where the naive training strategy failed here [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.sm4amv12m66a]). We're definitely not intentionally viewing Bayes nets as part of the definition of the game. It seems very plausible that after solving the problem for humans-who-use-Bayes-nets we will find a new counterexample that only works for humans-who-don't-use-Bayes-nets, in which case we'll move on to those counterexamples. It seems even more likely that the builder will propose an algorithm that exploits cognition that humans can do which isn't well captured by the Bayes net model, which is also fair game. (And indeed several of our approaches do this, e.g. when imagining humans learning new things about the world by performing experiments here [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.3l614s96sz9t] or reasoning about plausibility of model joint distributions here [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.w0iwyfch6ysy].) That said, it looks to us like if any of these algorithms worked for Bayes nets, they would at least work for a very broad range of human models; the Bayes net assumption doesn't seem to be changing the picture much qualitatively. Echoing Mark in his comment, we're definitely interested in ways that this assumption seems importantly unrealistic. If you just think it's generally a mediocre mode
ARC's first technical report: Eliciting Latent Knowledge

We’ll assume the humans who constructed the dataset also model the world using their own internal Bayes net.

This seems like a crucial premise of the report; could you say more about it? You discuss why a model using a Bayes net might be "oversimplified and unrealistic", but as far as I can tell you don't talk about why this is a reasonable model of human reasoning.

[+9] Ajeya Cotra · 6mo
Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don't reason using Bayes nets -- but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn't be logically inconsistent or physically impossible, and we wouldn't want alignment to fail in that world. Additionally, I think the statement made in the report about AIs also applies to humans: We're using some sort of cognitive algorithms to reason about the world, and it's plausible that strategies which resemble inference on graphical models play a role in some of our understanding. There's no obvious way that a messier model of human reasoning which incorporates all the other parts should make ELK easier; there's nothing that we could obviously exploit to create a strategy.
[+5] Mark Xu · 6mo
We don't think that real humans are likely to be using Bayes nets to model the world. We make this assumption for much the same reasons that we assume models use Bayes nets, namely that it's a test case where we have a good sense of what we want a solution to ELK to look like. We think the arguments given in the report will basically extend to more realistic models of how humans reason (or rather, we aren't aware of a concrete model of how humans reason for which the arguments don't apply). If you think there's a specific part of the report where the human Bayes net assumption seems crucial, I'd be happy to try to give a more general form of the argument in question.
Interlude: Agents as Automobiles

I guess I just don't feel like you've established that it would have been reasonable to have credence above 90% in either of those cases. Like, it sure seems obvious to me that computers and automobiles are super useful. But I have a huge amount of evidence now about both of those things that I can't really un-condition on. So, given that I know how powerful hindsight bias can be, it feels like I'd need to really dig into the details of possible alternatives before I got much above 90% based on facts that were known back then.

(Although this depends on how ... (read more)

[+4] Daniel Kokotajlo · 6mo
Fair. If you don't share my intuition that people in 1950 should have had more than 90% credence that computers would be militarily useful, or that people at the dawn of steam engines should have predicted that automobiles would be useful (conditional on them being buildable) then that part of my argument has no force on you. Maybe instead of picking examples from the past, I should pick an example of a future technology that everyone agrees is 90%+ likely to be super useful if developed, even though Joe's skeptical arguments can still be made.
Interlude: Agents as Automobiles

Interesting post. Overall, though, it feels like you aren't taking hindsight bias seriously enough. E.g. as one example:

Some people thought battleships would beat carriers. Others thought that the entire war would be won from the air. Predicting the future is hard; we shouldn’t be confident. Therefore, we shouldn’t assign more than 90% credence to the claim that powerful, portable computers (assuming we figure out how to build them) will be militarily useful, e.g. in weapon guidance systems or submarine sensor suites.

In this particular case, an alternative... (read more)

[+3] Daniel Kokotajlo · 6mo
Thanks! Ah, I shouldn't have put the word "portable" in there then. I meant to be talking about computers in general, not computers-on-missiles-as-opposed-to-ground-installations. I think I agree with this, and should edit to clarify that that's not the argument I'm making... what I'm saying is that sometimes, "Of course X requires Y" and "of course Y will be useful/incentivised" predictions can be made in advance, with more than 90% confidence. I think "computers will be militarily useful" and "self-propelled vehicles will be useful" are two examples of this. The intuition I'm trying to pump is not "Look at this case where X was useful, therefore APS-AI will be useful" but rather "look at this case where it would have been reasonable for someone to be more than 90% confident that X was useful despite the presence of Joe's arguments; therefore, we should be open to the possibility that we should be more than 90% confident that APS-AI will be useful despite Joe's arguments." Of course I still have work to do, to actually provide positive arguments that our credence should be higher than 90%... the point of my analogy was defensive, to defend against Joe's argument that because of such-and-such considerations we should have 20% credence on Not Useful/Incentivised. (Will think more about what you said and reply later)
Conversation on technology forecasting and gradualism

I didn't push this point at the time, but Paul's claim that "GPT-3 + 5 person-years of engineering effort [would] foom" seems really wild to me, and probably a good place to poke at his model more. Is this 5 years of engineering effort and then humans leaving it alone with infinite compute? Or are the person-years of engineering doled out over time?

Unlike Eliezer, I do think that language models not wildly dissimilar to our current ones will be able to come up with novel insights about ML, but there's a long way between "sometimes comes up with novel insig... (read more)

I didn't push this point at the time, but Paul's claim that "GPT-3 + 5 person-years of engineering effort [would] foom" seems really wild to me, and probably a good place to poke at his model more. Is this 5 years of engineering effort and then humans leaving it alone with infinite compute?

The 5 years are up front and then it's up to the AI to do the rest. I was imagining something like 1e25  flops running for billions of years.

I don't really believe the claim unless you provide computing infrastructure that is externally maintained or else extremely ... (read more)

[+2] Rob Bensinger · 7mo
Maybe something like '5 years of engineering effort to start automating work that qualitatively (but incredibly slowly and inefficiently) is helping with AI research, and then a few decades of throwing more compute at that for the AI to reach superintelligence'? With infinite compute you could just recapitulate evolution, so I doubt Paul thinks there's a crux like that? But there could be a crux that's about whether GPT-3.5 plus a few decades of hardware progress achieves superintelligence, or about whether that's approximately the fastest way to get to superintelligence, or something.
Biology-Inspired AGI Timelines: The Trick That Never Works

The two extracts from this post that I found most interesting/helpful:

The problem is that the resource gets consumed differently, so base-rate arguments from resource consumption end up utterly unhelpful in real life.  The human brain consumes around 20 watts of power.  Can we thereby conclude that an AGI should consume around 20 watts of power, and that, when technology advances to the point of being able to supply around 20 watts of power to computers, we'll get AGI?

I'm saying that Moravec's "argument from comparable resource consumption" must

... (read more)
Ngo and Yudkowsky on AI capability gains

My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." 
 

Oh, certainly Eliezer should trust his intuitions and believe that he's not a crackpot. But I'm not arguing about what the person with the theory should believe, I'm arguing about what outside observers should believe, if they don't have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn't modest epistemology.

Ngo and Yudkowsky on AI capability gains

the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete).

I'm not sure how this would actually work. The proponent of the AGI-capitalism analogy might say "ah yes, AGI killing everyone is another data point on the trend of capitalism becoming increasingly destructive". Or... (read more)

[+4] Lukas_Gloor · 7mo
My only reply is "You know it when you see it." And yeah, a crackpot would reason the same way, but non-modest epistemology says that if it's obvious to you that you're not a crackpot then you have to operate on the assumption that you're not a crackpot. (In the alternative scenario, you won't have much impact anyway.) Specifically, the situation I mean is the following:

  • You have an epistemic track record like Eliezer or someone making lots of highly upvoted posts in our communities.
  • You find yourself having strong intuitions about how to apply powerful principles like "consequentialism" to new domains, and your intuitions are strong because it feels to you like you have a gears-level understanding that others lack.

You trust your intuitions in cases like these. My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." Maybe there's a potential crux here about how much of scientific knowledge is dependent on successful predictions. In my view, the sequences have convincingly argued that locating the hypothesis in the first place is often done in the absence of already successful predictions, which goes to show that there's a core of "good reasoning" that lets you jump to (tentative) conclusions, or at least good guesses, much faster than if you were to try lots of things at random.
Ngo and Yudkowsky on AI capability gains

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, unti... (read more)

[+8] Adam Shimi · 7mo
Thanks for giving more details about your perspective. It's not clear to me that the sequences and HPMOR are good pointers for this particular approach to theory building. I mean, I'm sure there are posts in the sequences that touch on that (Einstein's Arrogance [https://www.lesswrong.com/posts/MwQRucYo6BZZwjKE7/einstein-s-arrogance] is an example I already mentioned), but I expect that they only talk about it in passing and obliquely, and that such posts are spread all over the sequences. Plus, the fact that Yudkowsky said that there was a new subsequence to write led me to believe that he doesn't think the information is clearly stated already. So I don't think you can really put the current confusion as evidence that explaining how that kind of theory would work doesn't help, given that this isn't readily available in a form I or anyone reading this can access AFAIK. Completely agree that these intuitions are important training data. But your whole point in other comments is that we want to understand why we should expect these intuitions to differ from apparently bad/useless analogies between AGI and other stuff. And some explanation of where these intuitions come from could help with evaluating these intuitions, even more because Yudkowsky has said that he could write a sequence about the process. This sounds to me like a strawman of my position (which might be my fault for not explaining it well).

  • First, I don't think explaining a methodology is a "very high-level epistemological principle", because it lets us concretely pick apart and criticize the methodology as a truthfinding method.
  • Second, the object-level work has already been done by Yudkowsky! I'm not saying that some outside-of-the-field epistemologist should ponder really hard about what would make sense for alignment without ever working on it concretely and then give us their teaching. Instead I'm pushing for a researcher who has built a coherent collections o
Ngo and Yudkowsky on AI capability gains

I don't expect such a sequence to be particularly useful, compared with focusing on more object-level arguments. Eliezer says that the largest mistake he made in writing his original sequences was that he "didn’t realize that the big problem in learning this valuable way of thinking was figuring out how to practice it, not knowing the theory". Better, I expect, to correct the specific mistakes alignment researchers are currently making, until people have enough data points to generalise better.

[+5] Adam Shimi · 7mo
I'm honestly confused by this answer. Do you actually think that Yudkowsky having to correct everyone's object-level mistakes all the time is strictly more productive and will lead faster to the meat of the deconfusion than trying to state the underlying form of the argument and theory, and then adapting it to the object-level arguments and comments? I have trouble understanding this, because for me the outcome of the first one is that no one gets it, he has to repeat himself all the time without making the debate progress, and this is one more giant hurdle for anyone trying to get into alignment and understand his position. It's unclear whether the alternative would solve all these problems (as you quote from the preface of the Sequences, learning the theory is often easier and less useful than practicing), but it still sounds like a powerful accelerator. There is no dichotomy of "theory or practice", we probably need both here. And based on my own experience reading the discussion posts and the discussions I've seen around these posts, the object-level refutations have not been particularly useful forms of practice, even if they're better than nothing.
Ngo and Yudkowsky on AI capability gains

it seems to me that you want properly to be asking "How do we know this empirical thing ends up looking like it's close to the abstraction?" and not "Can you show me that this abstraction is a very powerful one?"

I agree that "powerful" is probably not the best term here, so I'll stop using it going forward (note, though, that I didn't use it in my previous comment, which I endorse more than my claims in the original debate).

But before I ask "How do we know this empirical thing ends up looking like it's close to the abstraction?", I need to ask "Does the ab... (read more)

Ngo and Yudkowsky on AI capability gains

I'm still trying to understand the scope of expected utility theory, so examples like this are very helpful! I'd need to think much more about it before I had a strong opinion about how much they support Eliezer's applications of the theory, though.

Ngo and Yudkowsky on AI capability gains

Not a problem. I share many of your frustrations about modesty epistemology and about most alignment research missing the point, so I sympathise with your wanting to express them.

On consequentialism: I imagine that it's pretty frustrating to keep having people misunderstand such an important concept, so thanks for trying to convey it. I currently feel like I have a reasonable outline of what you mean (e.g. to the level where I could generate an analogy about as good as Nate's laser analogy), but I still don't know whether the reason you find it much more c... (read more)

Ngo and Yudkowsky on AI capability gains

My model of Eliezer says that there is some deep underlying concept of consequentialism, of which the "not very coherent consequentialism" is a distorted reflection; and that this deep underlying concept is very closely related to expected utility theory. (I believe he said at one point that he started using the word "consequentialism" instead of "expected utility maximisation" mainly because people kept misunderstanding what he meant by the latter.)

I don't know enough about conservative vector fields to comment, but on priors I'm pretty skeptical of this being a good example of coherent utilities; I also don't have a good guess about what Eliezer would say here.

Ngo and Yudkowsky on AI capability gains

Thanks! I think that this is a very useful example of an advance prediction of utility theory; and that gathering more examples like this is one of the most promising way to make progress on bridging the gap between Eliezer's and most other people's understandings of consequentialism.

Potentially important thing to flag here: at least in my mind, expected utility theory (i.e. the property Eliezer was calling "laser-like" or "coherence") and consequentialism are two distinct things. Consequentialism will tend to produce systems with (approximate) coherent expected utilities, and that is one major way I expect coherent utilities to show up in practice. But coherent utilities can in-principle occur even without consequentialism (e.g. conservative vector fields in physics), and consequentialism can in-principle not be very coherent (e.g. if... (read more)

Ngo and Yudkowsky on AI capability gains

My objection is mostly fleshed out in my other comment. I'd just flag here that "In other words, you have to do things the "hard way"--no shortcuts" assigns the burden of proof in a way which I think is not usually helpful. You shouldn't believe my argument that I have a deep theory linking AGI and evolution unless I can explain some really compelling aspects of that theory. Because otherwise you'll also believe in the deep theory linking AGI and capitalism, and the one linking AGI and symbolic logic, and the one linking intelligence and ethics, and the on... (read more)

It also isn't clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern ("expected utility theory/consequentialism"). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking "No, that's not how these things work." In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with ... (read more)

Ngo and Yudkowsky on AI capability gains

I think we live in a world where there are very strong forces opposed to technological progress, which actively impede a lot of impactful work, including technologies which have the potential to be very economically and strategically important (e.g. nuclear power, vaccines, genetic engineering, geoengineering).

This observation doesn't lead me to a strong prediction that all such technologies will be banned; nor even that the most costly technologies will be banned - if the forces opposed to technological progress were even approximately rational, then bann... (read more)

Ngo and Yudkowsky on AI capability gains

Strong upvote, you're pointing at something very important here. I don't think I'm defending epistemic modesty, I think I'm defending epistemic rigour, of the sort that's valuable even if you're the only person in the world.

I suspect Richard isn't actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of "successful advance predictions"/"wide applications" corresponds to those likely failure modes, etc).

Yes, this is corre... (read more)

[+5] Vanessa Kosoy · 7mo
FDT was made rigorous by [https://www.lesswrong.com/s/CmrW8fCmSLK7E25sa/p/e8qFDMzs2u9xf5ie6] infra-Bayesianism [https://www.lesswrong.com/s/CmrW8fCmSLK7E25sa/p/GS5P7LLLbSSExb3Sk], at least in the pseudocausal case.

I think some of your confusion may be that you're putting "probability theory" and "Newtonian gravity" into the same bucket.  You've been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though).  "Probability theory" also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance p... (read more)

A positive case for how we might succeed at prosaic AI alignment

we already know how to build myopic optimizers

What are you referring to here?

A positive case for how we might succeed at prosaic AI alignment

That all makes sense. But I had a skim of (2), (3), (4), and (5) and it doesn't seem like they help explain why myopia is significantly more natural than "obey humans"?

[+3] Evan Hubinger · 7mo
I mean, that's because this is just a sketch, but a simple argument for why myopia is more natural than “obey humans” is that if we don't care about competitiveness, we already know how to build myopic optimizers, whereas we don't know how to build an optimizer to “obey humans” at any level of capabilities. Furthermore, LCDT [https://www.alignmentforum.org/posts/Y76durQHrfqwgwM5o/lcdt-a-myopic-decision-theory] is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency. I suspect we can get much better upper bounds on the complexity than that, though.
A positive case for how we might succeed at prosaic AI alignment

The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do.

Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has? That seems like a much more natural thing for it to learn than the concept of imitating HCH, because in the process of imitating HCH it still ha... (read more)

Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has?

To be clear, I was only talking about (1) here, which is just about what it might look like for an agent to be myopic, not how to actually get an agent that satisfies (1). I agree that you would most likely get a proxy-aligned model if you just trained on “imitate HCH”—but just training on “imitat... (read more)
