Oliver Habryka

Running Lightcone Infrastructure, which runs LessWrong. You can reach me at habryka@lesswrong.com. I have signed no contracts or agreements whose existence I cannot mention.

Posts

12 Habryka's Shortform Feed

14 Debate helps supervise human experts [Paper]

8mo

75 AI Timelines

9mo

27 Review AI Alignment posts to help figure out how to make a proper AI Alignment review

38 Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22]

50 Welcome & FAQ!

24 AI Research Considerations for Human Existential Safety (ARCHES)

10 AI Alignment Open Thread October 2019

14 AI Alignment Open Thread August 2019

4 Switching hosting providers today, there probably will be some hiccups

Wiki Contributions

Orthogonality Thesis

3mo

(+3588/-1331)

Best of LessWrong

3mo

Comments

A simple case for extreme inner misalignment

Oliver Habryka13d74

Early-stage votes are pretty noisy (and I think have been getting noisier over time, which is something of a proxy for polarization, which makes me sad).

Buck's Shortform

Oliver Habryka19d20

I'm also maybe somewhat more optimistic than you about pausing making more advanced AI than these already very powerful systems. Especially if there is clear evidence of serious misalignment in such systems.

Ah, to be clear, inasmuch as I do have hope, it does route through this kind of pause. I am generally pessimistic about that happening, but it is where a lot of my effort goes these days.

And then in those worlds, I do agree that a lot of progress will probably be made with substantial assistance from these early systems. I do expect it to take a good while until we figure out how to do that, and so don't see much hope for that kind of work happening where humanity doesn't substantially pause or slow down cutting-edge system development.

Buck's Shortform

Oliver Habryka19d30

Hmm, I feel like we are ending up pushing up against the edges of what we mean by "transformatively powerful" models here. Like, under the classical definition of "transformative AI" we are talking about AI that is as big of a deal as something like agriculture or electricity, which, IDK, seems plausibly true about present systems (my guess is even if we kept present levels of AI, over the course of 100+ years, humanity would leverage those systems in a quite transformative way).

I am not sure what you mean by "30xing the rate of quality-weighted research output given 1/4 of the compute". Is this compared to present systems? My sense is present systems contribute very close to 0 to quality-weighted research output in alignment, so 30 * 0 is still pretty much 0, but my guess is I am misunderstanding you here and you are referencing a definition of transformative that's different from what I am thinking about.

I agree that at higher capability levels these kinds of escape are a lot scarier, though as I said, it's not super clear to me how attempts to prevent dumber models from escaping generalize to attempts to prevent smarter models from escaping, and so I see these things as relatively different types of problems.

I don't expect it to involve a lot of alignment or further control progress
Because these models are misaligned? Or do you reject the premise that a bunch of smart AIs which actually wanted to advance progress could?

I mostly don't have any great ideas for how to use these systems for alignment or control progress, so it's a bit more of an inside-view guess. I expect them to not be very motivated to help, am skeptical about capability elicitation working well against scheming models (and working well in general in data-sparse domains), and expect it to be much easier to convince relevant stakeholders that some piece of fake progress is real, which provides a much easier alternative task than actually making progress.

Buck's Shortform

Oliver Habryka19d123

This has been roughly my default expectation of what would happen for a few years (and indeed my default question to people doing things like model organism or interpretability stuff with the aim of better identifying scheming models has been "ok, but what do you do after you've determined your models are scheming?").

I am not super sure what the benefit of reducing the risk associated with deploying known-scheming models is. Like, these early scheming models will probably not pose any catastrophic risk, and it seems nice to prevent these models from causing mild harm, but that doesn't super feel like where the action is. I am sympathetic to "try to figure out how to use these models to make progress on alignment and even better control", but that feels different from "reducing the risk associated with deploying these models" (though maybe it isn't and that's what you mean).

I am not that worried about model escape at this level of competence. I expect escape will happen a bunch, but getting enough compute to do any additional training or to even run a meaningful number of instances of yourself will be hard. I don't really know what I expect to happen in this period, but I don't expect it to involve a lot of alignment or further control progress, and I expect scaling work to continue (possibly partially driven by social engineering efforts by the AIs themselves, but also just because of economic incentives) until the AIs become catastrophically dangerous, and at that point I don't think we have that much chance of preventing the AIs from escaping (and I don't expect earlier work to translate very well).

Richard Ngo's Shortform

Oliver Habryka1mo1241

That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don't predict doom.

I don't really buy this statement. Most frames, from most disciplines, and most styles of reasoning, do not make clear predictions about what will happen to humanity in the long-run future. A very few do, but the vast majority are silent on this issue. Silence is not anything like "50%".

Most frames, from most disciplines, and most styles of reasoning, don't predict sparks when you put metal in a microwave. This doesn't mean I don't know what happens when you put metal in a microwave. You need to at the very least limit yourself to applicable frames, and there are very few applicable frames for predicting humanity's long-term future.

0. CAST: Corrigibility as Singular Target

Oliver Habryka2mo40

You can do it. Just go to https://www.lesswrong.com/library and scroll down until you reach the "Community Sequences" section and press the "Create New Sequence" button.

0. CAST: Corrigibility as Singular Target

Oliver Habryka2mo30

Do you want to make an actual sequence for this so that the sequence navigation UI shows up at the top of the post?

Thomas Larsen's Shortform

Oliver Habryka2mo30

Maybe I am being dumb, but why not do things on the basis of "actual FLOPs" instead of "effective FLOPs"? Seems like there is a relatively simple fact-of-the-matter about how many actual FLOPs were performed in the training of a model, and that seems like a reasonable basis on which to base regulation and evals.

Paper in Science: Managing extreme AI risks amid rapid progress

Oliver Habryka2mo20

William_S's Shortform

Oliver Habryka3mo207

Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?

(I would guess a "no comment" or lack of response or something to that degree implies a "yes" with reasonably high probability. Also, when deciding how to respond here, you might be interested in this link about the National Labor Relations Board ruling that NDAs offered in severance agreements that cover the existence of the NDA itself are unlawful.)