All of habryka's Comments + Replies

Mod note: It felt fine to do this once or twice, but it's not an intended use-case of AI Alignment Forum membership to post to the AI Alignment Forum with content that you didn't write. 

I would have likely accepted this submission to the AI Alignment Forum anyways, so it seems best to just go via the usual submission channels. I don't want to set a precedent of weirdly confusing co-authorship for submission purposes. You can also ping me on Intercom in advance if you want advance notice of whether the post fits on the AIAF, or want to make sure it goes live there immediately. 

1Dan H6d
I asked for permission via Intercom to post this series on March 29th. Later, I asked for permission to use the [Draft] indicator and said it was written by others. I got permission for both of these, but the same person didn't give permission for both of these requests. Apologies this was not consolidated into one big ask with lots of context. (Feel free to get rid of any undue karma.)

Mod note: I removed Dan H as a co-author since it seems like that was more used as convenience for posting it to the AI Alignment Forum. Let me know if you want me to revert.

If the difference between these papers is: we do activations, they do weights, then I think that warrants more conceptual and empirical comparisons.

Yeah, it's totally possible that, as I said, there is a specific other paper that is important to mention or where the existing comparison seems inaccurate. This seems quite different from a generic "please have more thorough related work sections" request like the one you make in the top-level comment (which my guess is was mostly based on your misreading of the post and thinking the related work section only spans two paragraphs). 

5Dan H16d
Yes, I'll tend to write up comments quickly so that I don't feel as inclined to get in detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 things mentioned in the related works until Daniel pointed out the formatting choice, and when I skimmed the post I didn't easily see comparisons or discussion that I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to include that I found it, but it was oddly not emphasized. A more ideal comment would have been "It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as ..."

The level of comparison between the present paper and this paper seems about the same as I see in papers you have been a co-author in. 

E.g. in the Related Works section is basically just a list of papers, with maybe half a sentence describing their relation to the paper. This seems normal and fine, and I don't see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any s... (read more)

2Dan H16d
In many of my papers, there aren't fairly similar works (I strongly prefer to work in areas before they're popular), so there's a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper [] on learning the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs, and compare to these two methods for around half of the entire paper. The extent of an adequate comparison depends on the relatedness. I'm of course not saying every paper in the related works needs its own paragraph. If they're fairly similar approaches, usually there also need to be empirical juxtapositions as well. If the difference between these papers is: we do activations, they do weights, then I think that warrants more in-depth conceptual comparisons or, preferably, many empirical comparisons.

I don't understand this comment. I did a quick count of related works that are mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so seems like this is meeting your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.

I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of ... (read more)

4Dan H16d
Background for people who understandably don't habitually read full empirical papers: Related Works sections in empirical papers tend to include many comparisons in a coherent place. This helps contextualize the work and helps busy readers quickly identify if this work is meaningfully novel relative to the literature. Related works must therefore also give a good account of the literature. This helps us more easily understand how much of an advance this is. I've seen a good number of papers steering with latent arithmetic in the past year, but I would be surprised if this is the first time many readers of AF/LW have seen it, which would make this paper seem especially novel. A good related works section would more accurately and quickly communicate how novel this is. I don't think this norm is gatekeeping nor pedantic; it becomes essential when the number of papers becomes high. The total number of cited papers throughout the paper is different from the number of papers in the related works. If a relevant paper is buried somewhere randomly in a paper and not contrasted with explicitly in the related works section, that is usually penalized in peer review.

Yeah, does sure seem like we should update something here. I am planning to spend more time on AIAF stuff soon, but until then, if someone has a drop-in paragraph, I would probably lightly edit it and then just use whatever you send me/post here.

This is not commenting on the substance of this post, but I really feel like the title of this post should be "The self-alignment problem". 

Like, we talk about "The alignment problem" not "The unalignment problem". The current title makes me think that the problem is that I somehow have to unalign myself, which doesn't really make sense.

I don't know / talked with a few people before posting, and it seems opinions differ. We also talk about e.g. "the drought problem" where we don't aim to get the landscape dry. Also, as Kaj wrote, the problem also isn't how to get self-unaligned.
3Kaj Sotala2mo
But then, "the self-alignment problem" would likewise make it sound like it's about how you need to align yourself with yourself. And while it is the case that increased self-alignment is generally very good and that not being self-aligned causes problems for the person in question, that's not actually the problem the post is talking about.

Direct optimizers typically have a very specific architecture requiring substantial iteration and search. Luckily, it appears that our current NN architectures, with a fixed-length forward pass and a lack of recurrence or support for branching computations as is required in tree search makes the implementation of powerful mesa-optimizers inside the network quite challenging.

I think this is being too confident on what "direct optimizers" require. 

There is an ontology, mostly inherited from the graph-search context, in which "direct optimizers" require ... (read more)

Perhaps I've simply been misreading John, and he's been intending to say "I have some beliefs, and separately I have some suggestive technical results, and they feel kinda related to me! Which is not to say that any onlooker is supposed to be able to read the technical results and then be persuaded of any of my claims; but it feels promising and exciting to me!".

For what it's worth, I ask John about once every month or two about his research progress and his answer has so far been (paraphrased) "I think I am making progress. I don't think I have anything to... (read more)

John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results even so.

Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem... (read more)

This is just false, because it is not taking into account the cost of doing expected value maximization, since giving consistent preferability scores is just very expensive and hard to do reliably.

I do really want to put emphasis on the parenthetical remark "(at least in some situations, though they may not arise)". Katja is totally aware that the coherence arguments require a bunch of preconditions that are not guaranteed to be the case for all situations, or even any situation ever, and her post is about how there is still a relevant argument here.

Crossposting this comment from the EA Forum: 

Nuno says: 

I appreciate the whole post. But I personally really enjoyed the appendix. In particular, I found it informative that Yudkowsky can speak/write with that level of authoritativeness, confidence, and disdain for others who disagree, and still be wrong (if this post is right).

I respond:

(if this post is right)

The post does actually seem wrong though. 

I expect someone to write a comment with the details at some point (I am pretty busy right now, so can only give a quick meta-level gleam), but... (read more)

Copying my response from the EA forum:

(if this post is right)

The post does actually seem wrong though. 

Glad that I added the caveat.

Also, the title of "there are no coherence arguments" is just straightforwardly wrong. The theorems cited are of course real theorems, they are relevant to agents acting with a certain kind of coherence, and I don't really understand the semantic argument that is happening where it's trying to say that the cited theorems aren't talking about "coherence", when like, they clearly are.

Well, part of the semantic nuance is tha... (read more)

I’m following previous authors in defining ‘coherence theorems’ as

theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.

On that definition, there are no coherence theorems. VNM is not a coherence theorem, nor is Savage’s Theorem, nor is Bolker-Jeffrey, nor are Dutch Book Arguments, nor is Cox’s Theorem, nor is the Complete Class Theorem.
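To make "strategies that are dominated by some other available strategy" concrete, here is a minimal toy sketch of a money pump. It assumes a hypothetical agent with cyclic strict preferences (B over A, C over B, A over C) that myopically pays a small fee for every swap it prefers. The function name, the item labels, and the fee are all illustrative assumptions, and the sketch only shows what being dominated looks like; it does not show that any of the theorems named above implies it.

```python
def run_money_pump(start_item, start_money, fee, rounds):
    """Simulate a hypothetical agent that always pays `fee` to swap to the item it prefers."""
    # Assumed cyclic preferences: holding A it prefers B, holding B it prefers C,
    # holding C it prefers A.
    preferred_swap = {"A": "B", "B": "C", "C": "A"}
    item, money = start_item, start_money
    for _ in range(rounds):
        item = preferred_swap[item]  # accept the offered swap
        money -= fee                 # pay the fee for it
    return item, money


# Three swaps bring the agent back to the item it started with, minus three fees.
final_item, final_money = run_money_pump("A", 10, 1, 3)
print(final_item, final_money)
```

After one full cycle the agent holds the same item with less money, so "never trade" would have left it strictly better off. Whether realistic agents can be pushed into this position is exactly what the surrounding discussion disputes.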

there are theorems that are relevant to the question of agent coherence

I'd have no proble... (read more)

Yep, I think it's pretty plausible this is just a data-quality issue, though I find myself somewhat skeptical of this. Maybe worth a bet? 

I would be happy to bet that conditional on them trying to solve this with more supervised training and no RLHF, we are going to see error modes substantially more catastrophic than current Chat-GPT. 

Yeah, this is basically my point. Not sure whether you are agreeing or disagreeing. I was specifically quoting Paul's comment saying "I've seen only modest qualitative differences" in order to disagree and say "I think we've now seen substantial qualitative differences". 

We have had 4chan play around with Chat-GPT for a while, with much less disastrous results than what happened when they got access to Sydney.

It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-dis

... (read more)
I was elaborating in more ML-y jargon, and also highlighting that there are a lot of wildcards omitted from Paul's comparison: retrieval especially was an interesting dynamic.

I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).

Ok, I think we might now have some additional data on this debate. It does indeed look to me like Sydney was trained with the next best available technology after RLHF, for a few months, at least based on Gwern's guesses here: (read more)

Benchmarking on static datasets on ordinary tasks (typically not even adversarially collected in the first place) may not be a good way to extrapolate to differences in level of abuse for PR-sensitive actors like megacorps, especially for abusers that are attacking the retrieval functionality (as Sydney users explicitly were trying to populate Bing hits to steer Sydney), a functionality not involved in said benchmarking at all. Or to put it another way, the fact that text-davinci-003 does only a little better than text-davinci-002 in terms of accuracy % may tell you little about how profitable in $ each will be once 4chan & the coomers get their hands on it... It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.
4Lawrence Chan3mo
For what it's worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don't see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF. 

Relevant piece of data: 

Feb 1 (Reuters) - ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history, according to a UBS study on Wednesday.

The report, citing data from analytics firm Similarweb, said an average of about 13 million u

... (read more)

I didn't realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).

Makes sense, sorry for the miscommunication. I really didn't feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you. 

I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do ... (read more)

2Kaj Sotala4mo
Maybe - but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario. A sudden wave of destabilizing AI breakthroughs - with DALL-E/Midjourney/Stable Diffusion suddenly disrupting art and Chat-GPT who-knows-how-many-things - can also make people on the street concerned and both more supportive of AI regulation in general, as well as more inclined to take AGI scenarios seriously in particular. I recently saw a blog post from someone [] speculating that this might cause a wide variety of actors - M & G included - with a desire to slow down AI progress to join forces to push for widespread regulation.

How much total investment do you think there is in AI in 2023?

My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.

How much variance do you think there is in the level o

... (read more)
2Matthew "Vaniver" Gray4mo
IMO it's much easier to support high investment numbers in "AI" if you consider lots of semiconductor / AI hardware startup stuff as "AI investments". My suspicion is that while GPUs were primarily a crypto thing for the last few years, the main growth outlook driving more investment is them being an AI thing. 
I'd be interested to know how you estimate the numbers here, they seem quite inflated to me. If 4 big tech companies were to invest $50B each in 2023 then, assuming an average salary of $300k and a 2:1 capital-to-salary ratio, that investment would mean hiring about 50B/900K ≈ 55,000 people each to work on this stuff. For reference the total headcount at these orgs is roughly 100-200K. 50B/yr is also around 25-50% of the size of the total income, and greater than profits for most, which again seems high. Perhaps my capital ratio is way too low but I would find it hard to believe that these companies can meaningfully put that level of capital into action so quickly. I would guess more on the order of $50B between the major companies in 2023. Agree with Paul's comment above that timeline shifts are the most important variable.
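The back-of-the-envelope headcount estimate above can be reproduced with a short sketch. It uses only the figures stated in the comment ($50B per company, $300k average salary, 2:1 capital to salary); the function name and the reading that each hire costs salary plus twice the salary in capital are illustrative assumptions.

```python
def implied_headcount(investment_usd, salary_usd, capital_to_salary_ratio):
    """Number of people an investment could fund if each hire costs salary plus capital."""
    # With a 2:1 capital-to-salary ratio, each hire costs salary * (1 + 2).
    cost_per_person = salary_usd * (1 + capital_to_salary_ratio)
    return investment_usd / cost_per_person


# Figures taken from the comment: $50B invested, $300k salary, 2:1 capital to salary.
per_company = implied_headcount(50e9, 300e3, 2.0)
print(round(per_company))
```

Under these assumptions the result lands a little above 55,000 hires per company, matching the rough figure in the comment. A higher capital ratio would shrink the implied headcount proportionally.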

I didn't realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).
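As a quick check on the "+2-5%" figure, here is a minimal sketch that computes the share a $10B increase represents against each end of the $200-500B range quoted in this thread. The function name is an illustrative assumption; the dollar figures are the ones given in the comments.

```python
def share_percent(part_billion, total_billion):
    """Return `part_billion` as a percentage of `total_billion`."""
    return 100.0 * part_billion / total_billion


increase = 10.0                        # $10B attributed to ChatGPT in the thread
low_total, high_total = 200.0, 500.0   # estimated range of total AI investment, in $B

# The smaller total gives the larger share, and vice versa.
upper_share = share_percent(increase, low_total)
lower_share = share_percent(increase, high_total)
print(lower_share, upper_share)
```

The two ends of the range come out at 2% and 5%, which is where the "+2-5%" in the comment comes from.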

I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc. If you mean +2-5% investment in ... (read more)

I think it's unlikely that the reception of ChatGPT increased OpenAI's valuation by $10B, much less investment in OpenAI, even before thinking about replaceability.

Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increases investment in AI by $10B, not that it increased investment into specifically OpenAI. Companies generally don't have perfect moats. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry.

I think the effect would have been very similar if it had been trained via supervised learning on good dialogs

I don't currently think this is the case, and seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in vario... (read more)

5Oliver Habryka4mo
Relevant piece of data: []  I had some decent probability on this outcome but I have increased my previous estimate of the impact of Chat-GPT by 50%, since I didn't expect something this radical ("the single fastest growing consumer product in history").

I don't currently think this is the case, and seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-tuned ways (including preventing the AI from saying controversial things), which had been the biggest problem

... (read more)

RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line.

I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF. 

My current best guess is that Chat-GPT alone, via sparking an arms-race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled a... (read more)

I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF. 

I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).

I think the much more important differences are:

  1. It was trained to interact directly with the end user as a conversational assistant rather than in an API intended to be use
... (read more)

my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3

I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.

I think this is my second-favorite post in the MIRI dialogues (for my overall review see here). 

I think this post was valuable to me in a much more object-level way. I think this post was the first post that actually just went really concrete on the current landscape of efforts in the domain of AI Notkilleveryonism and talked concretely about what seems feasible for different actors to achieve, and what isn't, in a way that parsed for me, and didn't feel either like something obviously political, or delusional. 

I didn't find the part about differ... (read more)

I feel like this post is the best current thing to link to for understanding the point of coherence arguments in AI Alignment, which I think are really crucial, and even in 2023 I still see lots of people make bad arguments either overextending the validity of coherence arguments, or dismissing coherence arguments completely in an unproductive way.

I wrote up a bunch of my high-level views on the MIRI dialogues in this review, so let me say some things that are more specific to this post. 

Since the dialogues were written, I keep coming back to the question of the degree to which consequentialism is a natural abstraction that will show up in AI systems we train, and while this dialogue had some frustrating parts where communication didn't go perfectly, I still think it has some of the best intuition pumps for how to think about consequentialism in AI systems. 

The other part I liked the most w... (read more)

This was quite a while ago, probably over 2 years, though I do feel like I remember it quite distinctly. I guess my model of you has updated somewhat here over the years, and now is more interested in heads-down work.

3Rohin Shah5mo
Yeah, that sounds entirely plausible if it was over 2 years ago, just because I'm terrible at remembering my opinions from that long ago.

I think I was actually helping Robby edit some early version of this post a few months before it was posted on LessWrong, so I think my exposure to it was actually closer to ~18-20 months ago.

I do think that still means I set a lot of my current/recent plans into motion before this was out, and your post is appreciated.

I think this post might be the best one of all the MIRI dialogues. I also feel confused about how to relate to the MIRI dialogues overall.

A lot of the MIRI dialogues consist of Eliezer and Nate saying things that seem really important and obvious to me, and a lot of my love for them comes from a feeling of "this actually makes a bunch of the important arguments for why the problem is hard". But the nature of the argument is kind of closed off. 

Like, I agree with these arguments, but like, if you believe these arguments, having traction on AI Alignment... (read more)

If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.

I've thought a good amount about Finite Factored Sets in the past year or two, but I do sure keep going back to thinking about the world primarily in the form of Pearlian causal influence diagrams, and I am not really sure why. 

I do think this one line by Scott at the top gave me at least one pointer towards what was happening: 

but I'm trained as a combinatorialist, so I'm giving a combinatorics talk upfront.

In the space of mathematical affinities, combinatorics is among the branches of math I feel most averse to, and I think that explains a good... (read more)

2Scott Garrabrant5mo
This [] underrated post is pretty good at explaining how to translate between FFSs and DAGs.

I think this is a fun idea, but also, I think these explanations are mostly actually pretty bad, and at least my inner Eliezer is screaming at most of these rejected outputs, as well as the reasoning behind them.

I also don't think it provides any more substantial robustness guarantees than the existing fine-tuning, though I do think if we train the model to be a really accurate Eliezer-simulator, that this approach has more hope (but that's not the current training objective of either base-GPT3 or the helpful assistant model).

2Seb Farquhar6mo
I also predict that real Eliezer would say about many of these things that they were basically not problematic outputs themselves, just represent how hard it is to stop outputs conditioned on having decided they are problematic. The model seems to totally not get this. Meta level: let's use these failures to understand how hard alignment is, but not accidentally start thinking that alignment=='not providing information that is readily available on the internet but that we think people shouldn't use'.

Promoted to curated: I found engaging with this post quite valuable. I think in the end I disagree with the majority of arguments in it (or at least think they omit major considerations that have previously been discussed on LessWrong and the AI Alignment Forum), but I found thinking through these counterarguments and considering each one of them seriously a very valuable thing to do to help me flesh out my models of the AI X-Risk space.

IMO a big part of why mechanistic interp is getting a lot of attention in the x-risk community is that neural networks are surprisingly more interpretable than we might have naively expected and there's a lot of shovel-ready work in this area. I think if you asked many people three years ago, they would've said that we'd never find a non-trivial circuit in GPT-2-small, a 125m parameter model; yet Redwood has reverse engineered the IOI circuit in GPT-2-small. Many people were also surprised by Neel Nanda's modular addition work.

I don't think I've seen ma... (read more)

1Vivek Hebbar7mo
I have seen one person be surprised (I think twice in the same convo) about what progress had been made. ETA: Our observations are compatible.  It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.

Oh, huh, I think this moderation action makes me substantially less likely to comment further on your posts, FWIW. It's currently well within your rights to do so, and I am on the margin excited about more people moderating things, but I feel hesitant participating with the current level of norm-specification + enforcement.

I also turned my strong-upvote into a small-upvote, since I have less trust in the comment section surfacing counterarguments, which feels particularly sad for this post (e.g. I was planning to respond to your comment with examples of pa... (read more)

I appreciate the effort and strong-upvoted this post because I think it's following a good methodology of trying to build concrete gear-level models and concretely imagining what will happen, but also think this is really very much not what I expect to happen, and in my model of the world is quite deeply confused about how this will go (mostly by vastly overestimating the naturalness of the diamond abstraction, underestimating convergent instrumental goals and associated behaviors, and relying too much on the shard abstraction). I don't have time to write a whole response, but in the absence of a "disagreevote" on posts am leaving this comment.

Thanks. Am interested in hearing more at some point. 

I also want to note that insofar as this extremely basic approach ("reward the agent for diamond-related activities") is obviously doomed for reasons the community already knew about, then it should be vulnerable to a convincing linkpost comment which points out a fatal, non-recoverable flaw in my reasoning (like: "TurnTrout, you're ignoring the obvious X and Y problems, linked here:"). I'm posting this comment as an invitation for people to reply with that, if appropriate![1]

And if there is nothing... (read more)

Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.

I agree that "train a system with internet access, but then remove it, then hope that it's safe", doesn't really make much sense. In general, I expect bad things to happen during training, and separately, a lot of the problems that I have with training things on the internet is that it's an environment that seems like it would incentivize a lot of agency and make supervision really hard because you have a ton of permanent side effects.

3Rohin Shah9mo
Oh you're making a claim directly about other people's approaches, not about what other people think about their own approaches. Okay, that makes sense (though I disagree). I was suggesting that the plan was "train a system without Internet access, then add it at deployment time" (aka "box the AI system during training"). I wasn't at any point talking about WebGPT.

Here is an example quote from the latest OpenAI blogpost on AI Alignment:

Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.

This sounds super straig... (read more)

3Rohin Shah9mo
The immediately preceding paragraph is: I would have guessed the claim is "boxing the AI system during training will be helpful for ensuring that the resulting AI system is aligned", rather than "after training, the AI system might be trying to pursue its own goals, but we'll ensure it can't accomplish them via boxing". But I can see your interpretation as well.

I think the smiling example is much more analogous than you are making it out here. I think the basic argument for "this just encourages taking control of the reward" or "this just encourages deception" goes through the same way.

Like, RLHF is not some magical "we have definitely figured out whether a behavior is really good or bad" signal, it's historically been just some contractors thinking for like a minute about whether a thing is fine. I don't think there is less bayesian evidence conveyed by people smiling (like, the variance in smiling is greater th... (read more)

and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment

I am confused. How does RLHF help with outer alignment? Isn't optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces)

I don't think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned, I just don't understand where the inner/outer alignment distinction comes from in this context)

2Richard Ngo9mo
RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples, but also intrinsic motivation functions like curiosity and empowerment) which are used to train agents in the absence of RLHF. The smiley faces example feels confusing as a "classic" outer alignment problem because AGIs won't be trained on a reward function anywhere near as limited as smiley faces. An alternative like "AGIs are trained on a reward function in which all behavior on a wide range of tasks is classified by humans as good or bad" feels more realistic, but also lacks the intuitive force of the smiley face example - it's much less clear in this example why generalization will go badly, given the breadth of the data collected.

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I do really feel like it would have been really extremely hard to build rockets if we had to get it right on the very first try.

I think for rockets the fact that it is so costly to experiment with stuff, explains the majority of the difficulty of rocket engineering. I agree you also have very little chance to build a successful space rocket without having a g... (read more)

1David Scott Krueger9mo
Well you could probably build a rocket that looks like it works, anyways.  Could you build one you would want to try to travel to the moon in?  (Are you imagining you get to fly in these rockets?  Or just launch and watch from ground?  I was imagining the 2nd...)

At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed like mainly a rhetorical flourish rather than a contentful claim about the structure of the difficult parts of the alignment problem.

Also, I think that "on our first try" thing isn't ... (read more)

I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI's activities).

Huh, I definitely expect it to drive >0.1% of OpenAI's activities. Seems like the WebGPT stuff is pretty close to commercial application, and is consuming much more than 0.1% of OpenAI's research staff, while probably substantially increasing OpenAI's ability to generally solve reinforcement learning problems. I am confused why you would estimate it at below 0.1%. 1% seems more reasonable to m... (read more)

I think the direct risk of OpenAI's activities is overwhelmingly dominated by training new smarter models and by deploying the public AI that could potentially be used in unanticipated ways.

I agree that if we consider indirect risks broadly (including e.g. "this helps OpenAI succeed or raise money and OpenAI's success is dangerous") then I'd probably move back towards "what % of OpenAI's activities is it."

I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don't understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it's an important disagreement.

I agree with this! But I feel like this kind of reinforcement learning on a basically unsupervisable action-space while interfacing with humans and getting direct reinforcement on approval i... (read more)

But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can't access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic.

I do think we are likely to be in a bad spot, and talking to people at OpenAI, DeepMind and Anthropic (e.g. the places where most of the heavily-applied prosaic alignment work is happening), I... (read more)

6Rohin Shah9mo
... Who are you talking to? I'm having trouble naming a single person at either of OpenAI or Anthropic who seems to me to be interested in extensive boxing (though admittedly I don't know them that well). At DeepMind there's a small minority who think about boxing, but I think even they wouldn't think of this as a major aspect of their plan. I agree that they aren't aiming for a "much more comprehensive AI alignment solution" in the sense you probably mean it but saying "they rely on boxing" seems wildly off. My best-but-still-probably-incorrect guess is that you hear people proposing schemes that seem to you like they will obviously not work in producing intent aligned systems and so you assume that the people proposing them also believe that and are putting their trust in boxing, rather than noticing that they have different empirical predictions about how likely those schemes are to produce intent aligned systems.

If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs.

Yeah, my current model is that WebGPT feels like some of the most timelines-reducing work that I've seen (as has most of OpenAI's work). In general, OpenAI seems to have been the organization that has most shortened timelines in the last 5 years, with the average researcher seeming ~10x more efficient at shortening timelines than even researc... (read more)

I think almost all of the acceleration comes from either products that generate $ and hype and further investment, or more directly from scaleup to more powerful models. I think "We have powerful AI systems but haven't deployed them to do stuff they are capable of" is a very short-term kind of situation and not particularly desirable besides.

I'm not sure what you are comparing RLHF or WebGPT to when you say "paradigm of AIs that are much harder to align." I think I probably just think this is wrong, in that (i) you are comparing to pure generative modeling... (read more)

I moved that thread over to the AIAF as well!

Yeah, I agree that I am doing reasoning on people's motivations here, which is iffy and given the pushback I will be a bit more hesitant to do, but also like, in this case reasoning about people's motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a bunch on their motivations.

I am honestly a bit surprised to see that WebGPT was as much driven by people who I do know reasonably well and who seem to be driven primarily by sa... (read more)

I don't think "your AI wants to kill you but it can't get out of the box so it helps you with alignment instead" is the mainline scenario. You should be building an AI that wouldn't stab you if your back was turned and it was holding a knife, and if you can't do that then you should not build the AI.

That's interesting. I do think this is true about your current research direction (which I really like about your research and I do really hope we can get there), but when I e.g. talk to Carl Shulman he (if I recall correctly) said things like "we'll just h... (read more)

6Paul Christiano9mo
Even in those schemes, I think the AI systems in question will have much better levers for causing trouble than access to the internet, including all sorts of internal access and their involvement in the process of improving your AI (and that trying to constrain them so severely would mean increasing their intelligence far enough that you come out behind). The mechanisms making AI uprising difficult are not mostly things like "you are in a secure box and can't get out," they are mostly facts about all the other AI systems you are dealing with. That said, I think you are overestimating how representative these are of the "mainline" hope most places, I think the goal is primarily that AI systems powerful enough to beat all of us combined come after AI systems powerful enough to greatly improve the situation. I also think there are a lot of subtle distinctions about how AI systems are trained that are very relevant to a lot of these stories (e.g. WebGPT is not doing RL over inscrutable long-term consequences on the internet---just over human evaluations of the quality of answers or browsing behavior).

WebGPT is approximately "reinforcement learning on the internet".

There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think "reinforcement learning on the internet" is approximately the worst direction for modern AI to go in terms of immediate risks.

I don't think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions f... (read more)

The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying "well, seems like the cost is kinda high so we won't do it" seems like exactly the kind of attitude that I am worried will cause humanity to go extinct. 

  • When you say "good things to keep an AI safe" I think you are referring to a goal like "maximize capability while minimizing catastrophic alignment risk." But in my opinion "don't give your models access to the internet or anything equally risky" is a
... (read more)

The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic.

I think this is literally true, but at least as far as I know is not really conveying the underlying dynamics and so I expect readers to walk away with the wrong impression.

Again, I might be... (read more)

1Jacob Hilton9mo
Without commenting on the specifics, I have edited the post to mitigate potential confusion: "this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here".

Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".

My model is that John is talking about "evidence on whether an AI alignment solution is sufficient", and you understood him to say "evidence on whether the AI Alignment problem is real/difficult". My guess is you both agree on the former, but I am not confident.

6Richard Ngo9mo
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work). I don't really know what "reliable empirical feedback" means in this context - if you have sufficiently reliable feedback mechanisms, then you've solved most of the alignment problem. But, out of the things John listed: I expect that we'll observe a bunch of empirical examples of each of these things happening (except for the hard takeoff phase change), and not know how to fix them.

WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons)

(Edit: M... (read more)

like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons

This also seems like an odd statement - it seems reasonable to say "I think the net effect of InstructGPT is to boost capabilities" or even "If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT". But it feels like you're assuming some deep insight into the intention behind the people w... (read more)

2Neel Nanda9mo
That seems weirdly strong. Why do you think that?

I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).

A while ago I got most of the way to setting up a feature on LW/AIAF that would export LW/AIAF posts to a nicely formatted, academic-looking PDF that is linkable. I ended up running into a hurdle somewhat close to the end and shelved the feature, but if there is a lot of demand here, I could probably finish up the work, which would make this process even easier.

A while ago I made a very quick Python script to pull Markdown from LW, then use pandoc to export to a PDF (because I prefer reading physical papers and LaTeX formatting). I used it somewhat regularly for ~6 months and found that it was good enough for my purposes. I assume the LW developers could write something much better, but I've thrown it into this GitHub repo in case it's of help or interest.
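
For anyone curious what the pandoc step can look like, here is a minimal sketch. It assumes the post's Markdown has already been saved to a local file (fetching it from LW is not shown), and the function names `build_pandoc_command` and `convert` are hypothetical, not taken from the script mentioned above. It also covers the `.tex` case, since pandoc picks the output format from the file extension.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(markdown_path, output_format="pdf"):
    """Return the pandoc argument list for converting a Markdown file.

    Hypothetical helper for illustration; output_format is 'pdf' or 'tex'.
    """
    if output_format not in ("pdf", "tex"):
        raise ValueError("output_format must be 'pdf' or 'tex'")
    source = Path(markdown_path)
    target = source.with_suffix("." + output_format)
    # --standalone asks pandoc for a complete document (preamble included),
    # which matters for .tex output that is meant to be edited by hand.
    return ["pandoc", str(source), "--standalone", "-o", str(target)]


def convert(markdown_path, output_format="pdf"):
    """Run pandoc on the file; needs pandoc (and a LaTeX engine for PDF) on PATH."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    command = build_pandoc_command(markdown_path, output_format)
    subprocess.run(command, check=True)
    return Path(command[-1])
```

Calling `convert("post.md", "tex")` would write `post.tex` next to the input, which could then be edited before compiling.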

I would especially especially love it if it popped out a .tex file that I could edit, since I'm very likely to be using different language on LW than I would in a fancy academic paper.
2Neel Nanda9mo
I would love this! I'm currently paying someone ~$200 to port my grokking post to LaTeX, getting a PDF automatically would be great

Yeah, I think OpenAI tried to do some empirical work, but approximately just produced capability progress, in my current model of the world (though I also think the incentive environment there was particularly bad). I feel confused about the "learning to summarize from human feedback" work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).

I think Rohin Shah doesn't think of himself as having produced empirical work that helps with AI Alignment, but only to ha... (read more)

I'm pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn't expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.

The main channel of value that I see for doing work like "learning to summarize" and the critiques project and various interpret... (read more)
