All of habryka's Comments + Replies

I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).

Makes sense, sorry for the miscommunication. I really didn't feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you. 

I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do ... (read more)

1 · Kaj Sotala · 8h
Maybe - but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario. A sudden wave of destabilizing AI breakthroughs - with DALL-E/Midjourney/Stable Diffusion suddenly disrupting art and Chat-GPT who-knows-how-many-things - can also make people on the street concerned, and both more supportive of AI regulation in general and more inclined to take AGI scenarios seriously in particular. I recently saw a blog post from someone [https://www.jonstokes.com/p/the-story-so-far-ai-makes-for-strange] speculating that this might cause a wide variety of actors - M & G included - with a desire to slow down AI progress to join forces to push for widespread regulation.

How much total investment do you think there is in AI in 2023?

My guess is total investment was in the $200B-$500B range, with about $100B of that going into new startups and organizations, and around $100B-$400B of it into organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested into AI, how much Tesla is investing in AI, how much various governments are investing, etc.

How much variance do you think there is in the level o

... (read more)
1 · Matthew "Vaniver" Gray · 3d
IMO it's much easier to support high investment numbers in "AI" if you consider lots of semiconductor / AI hardware startup stuff as "AI investments". My suspicion is that while GPUs were primarily a crypto thing for the last few years, the main growth outlook driving more investment is them being an AI thing.
2 · Hoagy · 3d
I'd be interested to know how you estimate the numbers here; they seem quite inflated to me. If four big tech companies were to invest $50B each in 2023, then, assuming an average salary of $300k and a 2:1 ratio of capital to salary, that investment would amount to hiring about $50B/$900k ≈ 55,000 people to work on this stuff. For reference, the total headcount at these orgs is roughly 100-200K. $50B/yr is also around 25-50% of their total income, and greater than profits for most, which again seems high. Perhaps my capital ratio is way too low, but I would find it hard to believe that these companies can meaningfully put that level of capital into action so quickly. I would guess more on the order of $50B between the major companies in 2023. Agree with Paul's comment above that timeline shifts are the most important variable.
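A quick back-of-the-envelope version of that arithmetic as code (a sketch; the figures are just the assumptions stated above):

```python
# Headcount implied by the investment estimate, under the assumptions above
n_companies = 4
investment_per_company = 50e9    # $50B per company per year (assumed above)
avg_salary = 300e3               # $300k average salary (assumed above)
capital_to_salary = 2            # 2:1 capital-to-salary ratio (assumed above)

cost_per_employee = avg_salary * (1 + capital_to_salary)   # $900k fully loaded
implied_headcount = investment_per_company / cost_per_employee
print(f"{implied_headcount:,.0f} people per company")      # ~55,556
```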
4 · Paul Christiano · 3d
I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).

I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc. If you mean +2-5% investment in a single year then I would guess the impact is < 1 week.

I haven't thought about it much, but my all-things-considered estimate for the expected timelines slowdown if you just hadn't done the ChatGPT release is probably between 1-4 weeks. Is that the kind of effect size you are imagining here? I guess the more important dynamic is probably more people entering the space rather than timelines per se?

One thing worth pointing out in defense of your original estimate is that variance should add up to 100%, not effect sizes, so e.g. if the standard deviation is $100B then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for ±$10B effect sizes after the fact).
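A minimal sketch of that last point (the $10B/$100B figures are just the ones from this thread, used purely for illustration): summing 100 independent factors that each have a standard deviation of $10B gives a total standard deviation of about $100B, because variances add while standard deviations do not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_samples = 100, 50_000

# 100 independent factors, each with a standard deviation of $10B (units: $B)
factors = rng.normal(loc=0.0, scale=10.0, size=(n_samples, n_factors))
total = factors.sum(axis=1)

print(round(total.std(), 1))           # ~100.0, i.e. roughly a $100B standard deviation overall
print(round((100 * 10**2) ** 0.5, 1))  # 100.0, the analytic value: sqrt(sum of variances)
```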

I think it's unlikely that the reception of ChatGPT increased OpenAI's valuation by $10B, much less investment in OpenAI, even before thinking about replaceability.

Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increased investment in AI by $10B, not that it increased investment into specifically OpenAI. Companies generally don't have perfect moats. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry.

I think the effect would have been very similar if it had been trained via supervised learning on good dialogs

I don't currently think this is the case, and this seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in vario... (read more)

I don't currently think this is the case, and this seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-tuned ways (including preventing the AI from saying controversial things), which had been the biggest problem

... (read more)

RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line.

I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF. 

My current best guess is that Chat-GPT alone, via sparking an arms race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled a... (read more)

I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF. 

I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).

I think the much more important differences are:

  1. It was trained to interact directly with the end user as a conversational assistant rather than in an API intended to be use
... (read more)

my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3

I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.

I think this is my second-favorite post in the MIRI dialogues (for my overall review see here). 

I think this post was valuable to me in a much more object-level way. I think this post was the first post that actually just went really concrete on the current landscape of efforts in the domain of AI Notkilleveryonism and talked concretely about what seems feasible for different actors to achieve, and what isn't, in a way that parsed for me, and didn't feel either like something obviously political, or delusional.

I didn't find the part about differ... (read more)

I feel like this post is the best current thing to link to for understanding the point of coherence arguments in AI Alignment, which I think are really crucial, and even in 2023 I still see lots of people make bad arguments either overextending the validity of coherence arguments, or dismissing coherence arguments completely in an unproductive way.

I wrote up a bunch of my high-level views on the MIRI dialogues in this review, so let me say some things that are more specific to this post. 

Since the dialogues were written, I keep coming back to the question of the degree to which consequentialism is a natural abstraction that will show up in AI systems we train, and while this dialogue had some frustrating parts where communication didn't go perfectly, I still think it has some of the best intuition pumps for how to think about consequentialism in AI systems.

The other part I liked the most w... (read more)

This was quite a while ago, probably over 2 years, though I do feel like I remember it quite distinctly. I guess my model of you has updated somewhat here over the years, and now is more interested in heads-down work.

3 · Rohin Shah · 19d
Yeah, that sounds entirely plausible if it was over 2 years ago, just because I'm terrible at remembering my opinions from that long ago.

I think I was actually helping Robby edit some early version of this post a few months before it was posted on LessWrong, so I think my exposure to it was actually closer to ~18-20 months ago.

I do think that still means I set a lot of my current/recent plans into motion before this was out, and your post is appreciated.

I think this post might be the best one of all the MIRI dialogues. I also feel confused about how to relate to the MIRI dialogues overall.

A lot of the MIRI dialogues consist of Eliezer and Nate saying things that seem really important and obvious to me, and a lot of my love for them comes from a feeling of "this actually makes a bunch of the important arguments for why the problem is hard". But the nature of the argument is kind of closed off. 

Like, I agree with these arguments, but like, if you believe these arguments, having traction on AI Alignment... (read more)

If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.

I've thought a good amount about Finite Factored Sets in the past year or two, but I do sure keep going back to thinking about the world primarily in the form of Pearlian causal influence diagrams, and I am not really sure why. 

I do think this one line by Scott at the top gave me at least one pointer towards what was happening: 

but I'm trained as a combinatorialist, so I'm giving a combinatorics talk upfront.

In the space of mathematical affinities, combinatorics is among the branches of math I feel most averse to, and I think that explains a good... (read more)

2 · Scott Garrabrant · 23d
This [https://www.lesswrong.com/s/kxs3eeEti9ouwWFzr/p/hxuKtHH4jTdtmEAbK] underrated post is pretty good at explaining how to translate between FFSs and DAGs.

I think this is a fun idea, but also, I think these explanations are mostly actually pretty bad, and at least my inner Eliezer is screaming at most of these rejected outputs, as well as the reasoning behind them.

I also don't think it provides any more substantial robustness guarantees than the existing fine-tuning, though I do think if we train the model to be a really accurate Eliezer-simulator, that this approach has more hope (but that's not the current training objective of either base-GPT3 or the helpful assistant model).

2 · Seb Farquhar · 2mo
I also predict that the real Eliezer would say about many of these things that they were basically not problematic outputs themselves, but just represent how hard it is to stop outputs conditioned on having decided they are problematic. The model seems to totally not get this. Meta level: let's use these failures to understand how hard alignment is, but not accidentally start thinking that alignment == 'not providing information that is readily available on the internet but that we think people shouldn't use'.

Promoted to curated: I found engaging with this post quite valuable. I think in the end I disagree with the majority of arguments in it (or at least think they omit major considerations that have previously been discussed on LessWrong and the AI Alignment Forum), but I found thinking through these counterarguments and considering each one of them seriously a very valuable thing to do to help me flesh out my models of the AI X-Risk space.

IMO a big part of why mechanistic interp is getting a lot of attention in the x-risk community is that neural networks are surprisingly more interpretable than we might have naively expected and there's a lot of shovel-ready work in this area. I think if you asked many people three years ago, they would've said that we'd never find a non-trivial circuit in GPT-2-small, a 125m parameter model; yet Redwood has reverse engineered the IOI circuit in GPT-2-small. Many people were also surprised by Neel Nanda's modular addition work.

I don't think I've seen ma... (read more)

1 · Vivek Hebbar · 3mo
I have seen one person be surprised (I think twice in the same convo) about what progress had been made. ETA: Our observations are compatible. It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.

Oh, huh, I think this moderation action makes me substantially less likely to comment further on your posts, FWIW. It's currently well within your rights to do so, and I am on the margin excited about more people moderating things, but I feel hesitant about participating with the current level of norm-specification + enforcement.

I also turned my strong-upvote into a small-upvote, since I have less trust in the comment section surfacing counterarguments, which feels particularly sad for this post (e.g. I was planning to respond to your comment with examples of pa... (read more)

I appreciate the effort and strong-upvoted this post because I think it's following a good methodology of trying to build concrete gear-level models and concretely imagining what will happen, but also think this is really very much not what I expect to happen, and in my model of the world is quite deeply confused about how this will go (mostly by vastly overestimating the naturalness of the diamond abstraction, underestimating convergent instrumental goals and associated behaviors, and relying too much on the shard abstraction). I don't have time to write a whole response, but in the absence of a "disagreevote" on posts am leaving this comment.

Thanks. Am interested in hearing more at some point. 

I also want to note that insofar as this extremely basic approach ("reward the agent for diamond-related activities") is obviously doomed for reasons the community already knew about, then it should be vulnerable to a convincing linkpost comment which points out a fatal, non-recoverable flaw in my reasoning (like: "TurnTrout, you're ignoring the obvious X and Y problems, linked here:"). I'm posting this comment as an invitation for people to reply with that, if appropriate![1]

And if there is nothing... (read more)

Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.

I agree that "train a system with internet access, but then remove it, then hope that it's safe" doesn't really make much sense. In general, I expect bad things to happen during training, and separately, much of my problem with training things on the internet is that it's an environment that seems like it would incentivize a lot of agency and make supervision really hard, because you have a ton of permanent side effects.

3 · Rohin Shah · 5mo
Oh you're making a claim directly about other people's approaches, not about what other people think about their own approaches. Okay, that makes sense (though I disagree). I was suggesting that the plan was "train a system without Internet access, then add it at deployment time" (aka "box the AI system during training"). I wasn't at any point talking about WebGPT.

Here is an example quote from the latest OpenAI blogpost on AI Alignment:

Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.

This sounds super straig... (read more)

3 · Rohin Shah · 5mo
The immediately preceding paragraph is: I would have guessed the claim is "boxing the AI system during training will be helpful for ensuring that the resulting AI system is aligned", rather than "after training, the AI system might be trying to pursue its own goals, but we'll ensure it can't accomplish them via boxing". But I can see your interpretation as well.

I think the smiling example is much more analogous than you are making it out here. I think the basic argument for "this just encourages taking control of the reward" or "this just encourages deception" goes through the same way.

Like, RLHF is not some magical "we have definitely figured out whether a behavior is really good or bad" signal, it's historically been just some contractors thinking for like a minute about whether a thing is fine. I don't think there is less bayesian evidence conveyed by people smiling (like, the variance in smiling is greater th... (read more)

and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment

I am confused. How does RLHF help with outer alignment? Isn't optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces)

I don't think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned, I just don't understand where the inner/outer alignment distinction comes from in this context)

2 · Richard Ngo · 5mo
RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples [https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity] , but also intrinsic motivation functions like curiosity and empowerment [https://arxiv.org/abs/1908.06976]) which are used to train agents in the absence of RLHF. The smiley faces example feels confusing as a "classic" outer alignment problem because AGIs won't be trained on a reward function anywhere near as limited as smiley faces. An alternative like "AGIs are trained on a reward function in which all behavior on a wide range of tasks is classified by humans as good or bad" feels more realistic, but also lacks the intuitive force of the smiley face example - it's much less clear in this example why generalization will go badly, given the breadth of the data collected.

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I do really feel like it would have been really extremely hard to build rockets if we had to get it right on the very first try.

I think for rockets, the fact that it is so costly to experiment with stuff explains the majority of the difficulty of rocket engineering. I agree you also have very little chance to build a successful space rocket without having a g... (read more)

1 · David Scott Krueger · 5mo
Well you could probably build a rocket that looks like it works, anyways. Could you build one you would want to try to travel to the moon in? (Are you imagining you get to fly in these rockets? Or just launch and watch from ground? I was imagining the 2nd...)

At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed like mainly rhetorical flourishing rather than a contentful claim about the structure of the difficult parts of the alignment problem.

Also, I think that "on our first try" thing isn't ... (read more)

I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI's activities).

Huh, I definitely expect it to drive >0.1% of OpenAI's activities. Seems like the WebGPT stuff is pretty close to commercial application, and is consuming much more than 0.1% of OpenAI's research staff, while probably substantially increasing OpenAI's ability to generally solve reinforcement learning problems. I am confused why you would estimate it at below 0.1%. 1% seems more reasonable to m... (read more)

I think the direct risk of OpenAI's activities is overwhelmingly dominated by training new smarter models and by deploying the public AI that could potentially be used in unanticipated ways.

I agree that if we consider indirect risks broadly (including e.g. "this helps OpenAI succeed or raise money and OpenAI's success is dangerous") then I'd probably move back towards "what % of OpenAI's activities is it."

I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don't understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it's an important disagreement.

I agree with this! But I feel like this kind of reinforcement learning on a basically unsupervisable action-space while interfacing with humans and getting direct reinforcement on approval i... (read more)

But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can't access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic.

I do think we are likely to be in a bad spot, and talking to people at OpenAI, Deepmind and Anthropic (e.g. the places where most of the heavily-applied prosaic alignment work is happening), I... (read more)

6 · Rohin Shah · 5mo
... Who are you talking to? I'm having trouble naming a single person at either of OpenAI or Anthropic who seems to me to be interested in extensive boxing (though admittedly I don't know them that well). At DeepMind there's a small minority who think about boxing, but I think even they wouldn't think of this as a major aspect of their plan. I agree that they aren't aiming for a "much more comprehensive AI alignment solution" in the sense you probably mean it but saying "they rely on boxing" seems wildly off. My best-but-still-probably-incorrect guess is that you hear people proposing schemes that seem to you like they will obviously not work in producing intent aligned systems and so you assume that the people proposing them also believe that and are putting their trust in boxing, rather than noticing that they have different empirical predictions about how likely those schemes are to produce intent aligned systems.

If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs.

Yeah, my current model is that WebGPT feels like some of the most timelines-reducing work that I've seen (as has most of OpenAI's work). In general, OpenAI seems to have been the organization that has most shortened timelines in the last 5 years, with the average researcher seeming ~10x more efficient at shortening timelines than even researc... (read more)

I think almost all of the acceleration comes from either products that generate $ and hype and further investment, or more directly from scaleup to more powerful models. I think "We have powerful AI systems but haven't deployed them to do stuff they are capable of" is a very short-term kind of situation and not particularly desirable besides.

I'm not sure what you are comparing RLHF or WebGPT to when you say "paradigm of AIs that are much harder to align." I think I probably just think this is wrong, in that (i) you are comparing to pure generative modeling... (read more)

I moved that thread over to the AIAF as well!

Yeah, I agree that I am doing reasoning on people's motivations here, which is iffy and given the pushback I will be a bit more hesitant to do, but also like, in this case reasoning about people's motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a bunch on their motivations.

I am honestly a bit surprised to see that WebGPT was as much driven by people who I do know reasonably well and who seem to be driven primarily by sa... (read more)

I don't think "your AI wants to kill you but it can't get out of the box so it helps you with alignment instead" is the mainline scenario. You should be building an AI that wouldn't stab you if your back was turned and it was holding a knife, and if you can't do that then you should not build the AI.

That's interesting. I do think this is true about your current research direction (which I really like about your research and I do really hope we can get there), but when I e.g. talk to Carl Shulman he (if I recall correctly) said things like "we'll just h... (read more)

6 · Paul Christiano · 5mo
Even in those schemes, I think the AI systems in question will have much better levers for causing trouble than access to the internet, including all sorts of internal access and their involvement in the process of improving your AI (and that trying to constrain them so severely would mean increasing their intelligence far enough that you come out behind). The mechanisms making AI uprising difficult are not mostly things like "you are in a secure box and can't get out," they are mostly facts about all the other AI systems you are dealing with. That said, I think you are overestimating how representative these are of the "mainline" hope most places, I think the goal is primarily that AI systems powerful enough to beat all of us combined come after AI systems powerful enough to greatly improve the situation. I also think there are a lot of subtle distinctions about how AI systems are trained that are very relevant to a lot of these stories (e.g. WebGPT is not doing RL over inscrutable long-term consequences on the internet---just over human evaluations of the quality of answers or browsing behavior).

WebGPT is approximately "reinforcement learning on the internet".

There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think "reinforcement learning on the internet" is approximately the worst direction for modern AI to go in terms of immediate risks.

I don't think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions f... (read more)

The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying "well, seems like the cost is kinda high so we won't do it" seems like exactly the kind of attitude that I am worried will cause humanity to go extinct. 

  • When you say "good things to keep an AI safe" I think you are referring to a goal like "maximize capability while minimizing catastrophic alignment risk." But in my opinion "don't give your models access to the internet or anything equally risky" is a
... (read more)

The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic.

I think this is literally true, but at least as far as I know is not really conveying the underlying dynamics and so I expect readers to walk away with the wrong impression.

Again, I might be... (read more)

1 · Jacob Hilton · 5mo
Without commenting on the specifics, I have edited the post to mitigate potential confusion: "this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here".

Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".

My model is that John is talking about "evidence on whether an AI alignment solution is sufficient", and you understood him to say "evidence on whether the AI Alignment problem is real/difficult". My guess is you both agree on the former, but I am not confident.

6 · Richard Ngo · 5mo
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work). I don't really know what "reliable empirical feedback" means in this context - if you have sufficiently reliable feedback mechanisms, then you've solved most of the alignment problem. But, out of the things John listed: I expect that we'll observe a bunch of empirical examples of each of these things happening (except for the hard takeoff phase change), and not know how to fix them.

WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons)

(Edit: M... (read more)

like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons

This also seems like an odd statement - it seems reasonable to say "I think the net effect of InstructGPT is to boost capabilities" or even "If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT". But it feels like you're assuming some deep insight into the intention behind the people w... (read more)

2 · Neel Nanda · 5mo
That seems weirdly strong. Why do you think that?

I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).

A while ago I got most of the way to set up a feature on LW/AIAF that would export LW/AIAF posts to a nicely formatted academic-looking PDF that is linkable. I ended up running into a hurdle somewhat close to the end and shelved the feature, but if there is a lot of demand here, I could probably finish up the work, which would make this process even easier.

A while ago I made a very quick Python script to pull Markdown from LW, then use pandoc to export to a PDF (because I prefer reading physical papers and LaTeX formatting). I used it somewhat regularly for ~6 months and found that it was good enough for my purposes. I assume the LW developers could write something much better, but I've thrown it into this GitHub repo [https://github.com/juesato/lw_pdf_exporter/tree/main] in case it's of help or interest.
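For anyone who wants to do something similar, here is a minimal sketch of just the pandoc step (it assumes the post's Markdown has already been saved to a local file; the linked repo also handles fetching the Markdown from LW, which is omitted here, and the filenames below are hypothetical):

```python
import subprocess

def markdown_to_pdf(md_path: str, pdf_path: str) -> None:
    """Convert a locally saved Markdown file to PDF via pandoc.

    Requires pandoc (and a LaTeX engine such as pdflatex) to be installed.
    """
    subprocess.run(["pandoc", md_path, "-o", pdf_path], check=True)

# Example (hypothetical filenames):
# markdown_to_pdf("grokking_post.md", "grokking_post.pdf")
```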

2 · Alex_Altair · 5mo
I would especially love it if it popped out a .tex file that I could edit, since I'm very likely to be using different language on LW than I would in a fancy academic paper.
2 · Neel Nanda · 5mo
I would love this! I'm currently paying someone ~$200 to port my grokking post to LaTeX, getting a PDF automatically would be great

Yeah, I think OpenAI tried to do some empirical work, but approximately just produced capability progress, in my current model of the world (though I also think the incentive environment there was particularly bad). I feel confused about the "learning to summarize from human feedback" work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).

I think Rohin Shah doesn't think of himself as having produced empirical work that helps with AI Alignment, but only to ha... (read more)

I'm pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn't expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.

The main channel of value that I see for doing work like "learning to summarize" and the critiques project and various interpret... (read more)

Hmm, there might be some mismatch of words here. Like, most of the work so far on the problem has been theoretical. I am confused how you could not be excited about the theoretical work that established the whole problem, the arguments for why it's hard, and that helped us figure out at least some of the basic parameters of the problem. Given that (I think) you currently think AI Alignment is among the global priorities, you presumably think the work that allowed you to come to believe that (and that allowed others to do the same) was very valuable and imp... (read more)

I was mainly talking about the current margin when I talked about how excited I am about the theoretical vs empirical work I see "going on" right now and how excited I tend to be about currently-active researchers who are doing theory vs empirical research. And I was talking about the future when I said that I expect empirical work to end up with the lion's share of credit for AI risk reduction.

Eliezer, Bostrom, and co certainly made a big impact in raising the problem to people's awareness and articulating some of its contours. It's kind of a matter of s... (read more)

I mostly disagreed with bullet point two. The primary result of "empirical AI Alignment research" that I've seen in the last 5 years has been a lot of capabilities gain, with approximately zero in terms of progress on any AI Alignment problems. I agree more with the "in the long run there will be a lot of empirical work to be done", but right now on the margin, we have approximately zero traction on useful empirical work, as far as I can tell (outside of transparency research).

4 · Ajeya Cotra · 6mo
I agree that in an absolute sense there is very little empirical work that I'm excited about going on, but I think there's even less theoretical work going on that I'm excited about, and when people who share my views on the nature of the problem work on empirical work I feel that it works better than when they do theoretical work.
1 · Thomas Kwa · 6mo
Were any cautious people trying empirical alignment research before Redwood/Conjecture?

FWIW, I had a mildly negative reaction to this title. I agree with you, but I feel like the term "PSA" should be reserved for things that are really very straightforward and non-controversial, and I feel like it's a bit of a bad rhetorical technique to frame your arguments as a PSA. I like the overall content of the post, but feel like a self-summarizing post title like "Most AI Alignment research is not parallelizable" would be better.

I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I've read so far.

If Australia was pursuing a strategy of "lock down irrespective of cost", then I don't think it makes sense to describe the initial response as competent. It just happened to be right in this case, but in order for the overall response to be helpful, it has to be adaptive to the actual costs. I agree that the early response on its own would have indicated a potentially competent decision-making algorithm, but the later followup showed that the algorithm seems to have mostly been correct by accident, and not on purpose.

I do appreciate the link to the GDP cost article. I would have to look into the methodology more to comment on that, but it certainly seems like an interesting analysis and a suggestive result.

Australia seems to have suffered a lot more from the pandemic than the U.S., paying much more in the cost of lockdown than even a relatively conservative worst-case estimate would have been for the costs of an uncontrolled COVID pandemic. I don't know about the others, but given that you put Australia on this list, I don't currently trust the others to have acted sensibly.

1 · Jan_Kulveit · 7mo
I'm not sure if you actually read carefully what you are commenting on. I emphasized early response, or initial governmental-level response in both comments in this thread. Sure, multiple countries on the list made mistakes later, some countries sort of become insane, and so on. Later, almost everyone made mistakes with vaccines, rapid tests, investments in contact tracing, etc. Arguing that the early lockdown was more costly than "an uncontrolled pandemic" would be pretty insane position (cf GDP costs [https://www.abs.gov.au/articles/international-economic-comparisons-after-year-pandemic] , Italy had the closest thing to an uncontrolled pandemic). (Btw the whole notion of "an uncontrolled pandemic" is deeply confused - unless you are a totalitarian dictatorship, you cannot just order people "live as normally" during a pandemic when enough other people are dying; you get spontaneous "anarchic lockdowns" anyway, just later and in a more costly way)

(Mod note: I turned on two-factor voting. @Evan: Feel free to ask me to turn it off.)

Ok, but why isn't it better to have Godzilla fighting Mega-Godzilla instead of leaving Mega-Godzilla unchallenged?

Because Tokyo still gets destroyed.

Important thing to bear in mind here: the relevant point for comparison is not the fantasy-world where the Godzilla-vs-Mega-Godzilla fight happens exactly the way the clever elaborate scheme imagined. The relevant point for comparison is the realistic-world where something went wrong, and the elaborate clever scheme fell apart, and now there's monsters rampaging around anyway.

Yeah, OK, I think this distinction makes sense, and I do feel like this distinction is important. 

Having settled this, my primary response is: 

Sure, I guess it's the most prototypical catastrophic action until we have solved it, but like, even if we solve it, we haven't solved the problem where the AI does actually get a lot smarter than humans and takes a substantially more "positive-sum" action and kills approximately everyone with the use of a bioweapon, or launches all the nukes, or develops nanotechnology. We do have to solve this problem fi... (read more)

I'm using "catastrophic" in the technical sense of "unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time", rather than "very bad thing that happens because of AI", apologies if this was confusing.

My guess is that you will wildly disagree with the frame I'm going to use here, but I'll just spell it out anyway: I'm interested in "catastrophes" as a remaining problem after you have solved the scalable oversight problem. If your action is able to do one of these "positive-sum" pivotal acts in a single ac... (read more)

Mod note: I activated two-axis voting on this post, since it seemed like it would make the conversation go better.

it’s the AI doing something relatively easy that is entirely a zero-sum action that removes control of the situation from humans.

I also don't really understand the emphasis on zero-sum action. It seems pretty plausible that the AI will trade for compute with some other person around the world. My guess is, in general, it will be easier to get access to the internet or have some kind of side-channel attack than to get root access to the datacenter the AI runs on. I do think most likely the AI will then be capable of just taking resources instead of payin... (read more)

1 · Buck Shlegeris · 8mo
Whether this is what I'm trying to call a zero-sum action depends on whose resources it's trading. If the plan is "spend a bunch of the capital that its creators have given it on compute somewhere else", then I think this is importantly zero-sum--the resources are being taken from the creators of AI, which is why the AI was able to spend so many resources. If the plan was instead "produce some ten trillion dollar invention, then sell it, then use the proceeds to buy compute elsewhere", this would seem less zero-sum, and I'm saying that I expect the first kind of thing to happen before the second.

I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for them, or to execute some other plan for them (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere in order to give them power or compute of some kind. Also, all of course selected for plans that are least like... (read more)

2 · Buck Shlegeris · 8mo
Except for maybe "producing a custom chip", I agree with these as other possibilities, and I think they're in line with the point I wanted to make, which is that the catastrophic action involves taking someone else's resource such that it can prevent humans from observing it or interfering with it, rather than doing something which is directly a pivotal act. Does this distinction make sense? Maybe this would have been clearer if I'd titled it "AI catastrophic actions are mostly not pivotal acts"?

I like this. Would this have to be publicly available models? Seems kind of hard to do for private models.

2 · Ramana Kumar · 8mo
What kind of access might be needed to private models? Could there be a secure multi-party computation approach that is sufficient?

One of the primary questions that comes to mind for me is "well, did this whole thing actually work?". If I understand the paper correctly, while we definitely substantially decreased the fraction of random samples that got misclassified (which always seemed very likely to happen, and I am indeed a bit surprised at only getting it to move ~3 OOMs, which I'd guess is mostly capability-related, since you used small models), we only doubled the amount of effort necessary to generate an adversarial counterexample.

A doubling is still pretty substantial, an... (read more)

Excellent question -- I wish we had included more of an answer to this in the post.

I think we made some real progress on the defense side -- but I 100% was hoping for more and agree we have a long way to go.

I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven't actually tried it on the final classifier, but my guess is it takes at least several hours to find a crisp failure unassisted (whereas almost all ML models you can find are trivially breakable). We're interested in people giving it a shot! :)

Pa... (read more)

0 · Charbel-Raphael Segerie · 9mo
I do not understand the "we only doubled the amount of effort necessary to generate an adversarial counterexample" part. Aren't we talking about 3 OOMs?

This article explains the difference: https://www.consumeranalysis.com/guides/portable-ac/best-portable-air-conditioner/

For example, a 14,000 BTU model that draws 1,400 watts of power on maximum settings would have an EER of 10.0 as 14,000/1,400 = 10.0.

A 14,000 BTU unit that draws 1200 watts of power would have an EER of 11.67 as 14,000/1,200 = 11.67.

Taken at face value, this looks like a good and proper metric to use for energy efficiency. The lower the power draw (watts) compared to the cooling capacity (BTUs/hr), the higher the EER. And the higher the E

... (read more)
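A quick sketch of the quoted EER calculation as code (the BTU and wattage figures are just the 14,000 BTU / 1,400 W and 1,200 W numbers from the examples above):

```python
def eer(cooling_btu_per_hr: float, power_draw_watts: float) -> float:
    """Energy Efficiency Ratio: cooling capacity (BTU/hr) divided by power draw (W)."""
    return cooling_btu_per_hr / power_draw_watts

print(eer(14_000, 1_400))            # 10.0
print(round(eer(14_000, 1_200), 2))  # 11.67
```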