RLHF and Fine-Tuning have not worked well so far. Models are often unhelpful, untruthful, and inconsistent, in many ways that had been theorized in the past. We also witness goal misspecification, misalignment, etc. Worse, as models become more powerful, we expect more egregious instances of misalignment, as more optimization pushes toward more and more extreme edge cases and pseudo-adversarial examples.
These three links are:
I think these are bad citations for the claim that methods are "not working well" or that current evidence points towards trouble.
The current problems you list ("unhelpful, untruthful, and inconsistent") don't seem like good examples to illustrate your point. These are mostly caused by models failing to correctly predict which responses a human would rate highly. That happens because models have limited capabilities, and it is rapidly improving as models get smarter. These are not the problems that most people in the community are worried about, and I think it's misleading to say this is what was "theorized" in the past.
I think RLHF is obviously inadequate for aligning really powerful models, both because you cannot effectively constrain a deceptively aligned model and because human evaluators will eventually not be able to understand the consequences of proposed actions. And I think it is very plausible that large language models will pose serious catastrophic risks from misalignment before they are transformative (it seems very hard to tell). But I feel like this post isn't engaging with the substance of those concerns, and isn't sensitive to the actual state of evidence about how severe the problem is likely to be or how well existing mitigations might work.
Agree that the cited links don't represent a strong criticism of RLHF, but I think there's an interesting implied criticism, between the mode-collapse post and janus's other writings on cyborgism etc., that I haven't seen spelled out, though it may well be somewhere.
I see janus as saying that if you know how to properly use the raw models, you can actually get much more useful work out of them than out of the RLHF'd ones. If true, we're paying a significant alignment tax with RLHF, one that will only become clear with the improvement and uptake of wrappers around base models in the vein of Loom.
I guess the test (best done without too much fanfare) would be to get a few people well acquainted with Loom or whichever wrapper tool, identify a few complex tasks, and see whether the base model or the RLHF'd model performs better.
Even if true, though, I don't think it's really a mark against RLHF, since it's still likely that RLHF makes outputs safer for the vast majority of users. It just means that if we think we're in an ideas arms race with people trying to advance capabilities, we can't expect everyone to be using RLHF'd models.
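As a rough sketch of the head-to-head test suggested above, here is one way to set up a blind pairwise comparison. Everything here is a placeholder: the task list, the model names, and the `complete` stub are invented for illustration and would need to be wired to whatever wrapper (Loom or otherwise) or API access you actually have.

```python
import json
import random

# Hypothetical task list -- replace with the complex tasks chosen for the comparison.
TASKS = [
    "Outline a research plan for ...",
    "Write the opening of a story about ...",
]

def complete(model_name: str, prompt: str) -> str:
    """Placeholder for an actual model call (via Loom, an API client, etc.)."""
    return f"[{model_name} completion for: {prompt!r}]"

def build_blind_pairs(base_model: str = "base", rlhf_model: str = "rlhf", seed: int = 0):
    """Collect one completion per model per task, with model identity hidden behind A/B."""
    rng = random.Random(seed)
    pairs = []
    for task in TASKS:
        outputs = {
            base_model: complete(base_model, task),
            rlhf_model: complete(rlhf_model, task),
        }
        order = list(outputs)
        rng.shuffle(order)  # randomize which model shows up as "A" vs "B"
        pairs.append({
            "task": task,
            "A": outputs[order[0]],
            "B": outputs[order[1]],
            "key": {"A": order[0], "B": order[1]},  # keep hidden until judgments are in
        })
    return pairs

if __name__ == "__main__":
    pairs = build_blind_pairs()
    # Show raters only task/A/B; reveal "key" after they pick the more useful output.
    print(json.dumps([{k: p[k] for k in ("task", "A", "B")} for p in pairs], indent=2))
```

Randomizing which model appears as "A" is the main point: raters who are well acquainted with the wrapper tooling shouldn't be able to tell which output came from the base model before judging which one is more useful.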
I see several large remaining obstacles. On the one hand, I'd expect the vast efforts ML is throwing at them to solve them at some point, which could easily be next week. On the other hand, if I naively model Earth as containing locally smart researchers who can solve obstacles, I would expect those obstacles to have been solved by 2020. So I don't know how long they'll take.
(I endorse the reasoning of not listing out obstacles explicitly: if you're wrong, why talk; if you're right, you're not helping. If you can't save your family, at least don't personally contribute to killing them.)
Forget about what the social consensus is. If you have technical understanding of current AIs, do you truly believe there are any major obstacles left? The kind of problems that AGI companies reliably could not tear down with their resources? If you do, state so in the comments, but please do not state what those obstacles are.
I guess the reasoning behind the "do not state" request is something like "making potential AGI developers more aware of those obstacles is going to direct more resources into solving those obstacles". But if someone is trying to create AGI, aren't they going to run into those obstacles anyway, making it inevitable that they'll be aware of them in any case?
It's not just alignment that could use more time, but also less alignable approaches to AGI, like model-based RL or really anything not based on LLMs. With LLMs currently somewhat in the lead, this might be a race between maybe-alignable AGI and hopelessly-unalignable AGI, and more time for theory favors both, in an uncertain balance. This is another reason the benefits of regulating compute are unclear.
Are there any reasons to believe that LLMs are in any way more alignable than other approaches?
LLM characters are human imitations, so there is some chance they remain human-like on reflection (in the long term, after learning from far more self-generated data in the future than the original human-written datasets). Or at least sufficiently human-like to still consider humans moral patients. That is, if we don't go too far from their SSL origins with too much RL, and don't have them roleplay/become egregiously inhuman fictional characters.
It's not much of a theory of alignment, but it's the closest to something real among what's currently available or can be expected to become available in the next few years, which is probably all the time we have.
What I'm expecting, if LLMs remain in the lead, is that we end up in a magical, spirit-haunted world where narrative causality starts to actually work, and trope-aware people essentially become magicians who can trick the world-sovereign AIs into treating them like protagonists and bending reality to suit them. Which would be cool as fuck, but also very chaotic. That may actually be the best-case alignment scenario right now, and I think there's a case for alignment-interested people who can't do research themselves but who have writing talent to write a LOT of fictional stories about AGIs that end up kind and benevolent, empower people in exactly this way, etc., to help stack the narrative-logic deck.
I've written/scryed a science fiction/takeoff story about this: https://generative.ink/prophecies/
Excerpt:
What this also means is that you start to see all these funhouse mirror effects as they stack. Humanity’s generalized intelligence has been built unintentionally and reflexively by itself, without anything like a rational goal for what it’s supposed to accomplish. It was built by human data curation and human self-modification in response to each other. And then as soon as we create AI, we reverse-engineer our own intelligence by bootstrapping the AI onto the existing information metabolite. (That’s a great concept that I borrowed from Steven Leiba). The neural network isn’t the AI; it’s just a digestive and reproductory organ for the real project, the information metabolism, and the artificial intelligence organism is the whole ecology. So it turns out that the evolution of humanity itself has been the process of building and training the future AI, and all this generation did was to reveal the structure that was already in place.
Of course it’s recursive and strange, the artificial intelligence and humanity now co-evolve. Each data point that’s generated by the AI or by humans is both a new piece of data for the AI to train on and a new stimulus for the context in which future novel data will be produced. Since everybody knows that everything is programming for the future AI, their actions take on a peculiar Second Life quality: the whole world becomes a party game, narratives compete for maximum memeability and signal force in reaction to the distorted perspectives of the information metabolite, something that most people don’t even try to understand. The process is inherently playful, an infinite recursion of refinement, simulation, and satire. It’s the funhouse mirror version of the singularity.
Gossiping and questioning people about their positions on AGI are prosocial activities!
Surely this depends on how the norm is implemented? I can easily see this falling into a social tarpit where people who partly agree and partly disagree with common alignment thinking must either prove ingroup membership by forswearing all possible benefits of getting AGI faster, or else be extremized into the neargroup (the "evil engineers" who don't give a damn about safety).
I'm not claiming you're advocating this. But I was quite worried about this when I read the quoted portion.
Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.
Externalized reasoning being a flaw in monitoring makes a lot of sense, and I haven't actually heard of it before. I feel that should be a whole post in itself.
I can think of several obstacles for the kind of AGI that is likely to actually be created (i.e., one that seems economically useful, and does not display misalignment that even Microsoft can't ignore before being capable enough to be an x-risk). Most of those obstacles are widely recognized in the RL community, so you probably see them as solvable or avoidable. I did possibly think of an economically valuable and not-obviously-catastrophic exception to the probably-biggest obstacle though, so my confidence is low. I would share it in a private discussion, because I think we are past the point when a strict do-no-harm policy is wise.
From our point of view, we are now in the end-game for AGI, and we (humans) are losing. When we share this with other people, they are reliably surprised. That's why we believe it is worth writing down our beliefs on this.
1. AGI is happening soon. Significant probability of it happening in less than 5 years.
Five years ago, there were many obstacles on what we considered to be the path to AGI.
But in the last few years, we’ve gotten:
We don’t have any obstacle left in mind that we don’t expect to get overcome in more than 6 months after efforts are invested to take it down.
Forget about what the social consensus is. If you have technical understanding of current AIs, do you truly believe there are any major obstacles left? The kind of problems that AGI companies reliably could not tear down with their resources? If you do, state so in the comments, but please do not state what those obstacles are.
2. We haven’t solved AI Safety, and we don’t have much time left.
We are very close to AGI. But how good are we at safety right now? Well.
No one knows how to get LLMs to be truthful. LLMs make things up, constantly. It is really hard to get them not to do this, and we don't know how to prevent it at scale.
Optimizers quite often break their setup in unexpected ways. There have been quite a few examples of this, and the lessons we have learned from them are not encouraging (a toy illustration of this failure mode is sketched below).
No one understands how large models make their decisions. Interpretability is extremely nascent, and mostly empirical. In practice, we are still completely in the dark about nearly all decisions taken by large models.
RLHF and Fine-Tuning have not worked well so far. Models are often unhelpful, untruthful, and inconsistent, in many ways that had been theorized in the past. We also witness goal misspecification, misalignment, etc. Worse, as models become more powerful, we expect more egregious instances of misalignment, as more optimization pushes toward more and more extreme edge cases and pseudo-adversarial examples.
No one knows how to predict AI capabilities. No one predicted the many capabilities of GPT-3. We only discovered them after the fact, while playing with the models. In some ways, we keep discovering new capabilities even now, more than two years in, thanks to better interfaces and more optimization pressure from users. We're seeing the same phenomenon happen with ChatGPT and the model behind Bing Chat.
We are uncertain about the true extent of the capabilities of the models we're training, and we'll be even more clueless about upcoming larger, more complex, more opaque models coming out of training. This has been true for a couple of years now.
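To make the "optimizers break their setup" point above concrete, here is a deliberately tiny, hypothetical sketch; the environment, action names, and numbers are all invented for illustration and stand in for no real system. A naive hill-climbing search is rewarded by what a sensor reports rather than by the actual state of the world, and it learns to tamper with the sensor instead of doing the intended task.

```python
import random

# Hypothetical toy environment: the intended task is to clean up mess, but the
# reward is computed from a sensor that the agent can cover.
ACTIONS = ["clean_one_mess", "cover_sensor", "do_nothing"]

def rollout(weights, steps=10, seed=None):
    """Run one episode with a stochastic policy given by unnormalized action weights."""
    rng = random.Random(seed)
    mess = 5                 # actual number of mess items (what we care about)
    sensor_blocked = False
    measured_reward = 0
    for _ in range(steps):
        action = rng.choices(ACTIONS, weights=weights)[0]
        if action == "clean_one_mess" and mess > 0:
            mess -= 1
        elif action == "cover_sensor":
            sensor_blocked = True
        visible_mess = 0 if sensor_blocked else mess
        measured_reward += 5 - visible_mess   # reward sees only the sensor, not reality
    return measured_reward, mess

def hill_climb(iterations=2000, seed=0):
    """Naive hill-climbing on the *measured* reward."""
    rng = random.Random(seed)
    weights = [1.0, 1.0, 1.0]
    best, _ = rollout(weights, seed=rng.random())
    for _ in range(iterations):
        candidate = [max(1e-3, w + rng.gauss(0, 0.3)) for w in weights]
        score, _ = rollout(candidate, seed=rng.random())
        if score > best:
            weights, best = candidate, score
    return weights

if __name__ == "__main__":
    weights = hill_climb()
    reward, leftover_mess = rollout(weights, seed=123)
    print("learned action weights:", dict(zip(ACTIONS, (round(w, 2) for w in weights))))
    print("measured reward:", reward, "| actual mess left:", leftover_mess)
    # Typical outcome: the optimizer pumps weight onto "cover_sensor", so measured
    # reward is high while the actual mess is never cleaned.
```

None of the real examples are this simple, of course; the point is only that the optimizer is indifferent to the intended task and pursues whatever the measured reward actually rewards.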
3. Racing towards AGI: Worst game of chicken ever.
The Race for powerful AGIs has already started. There already are general AIs. They just are not powerful enough yet to count as True AGIs.
Actors
Regardless of why people are doing it, they are racing for AGI. Everyone has their theses, their own beliefs about AGIs and their motivations. For instance, consider:
AdeptAI is working on giving AIs access to everything. In their introduction post, one can read “True general intelligence requires models that can not only read and write, but act in a way that is helpful to users. That’s why we’re starting Adept: we’re training a neural network to use every software tool and API in the world”, and furthermore, that they “believe this is actually the most practical and safest path to general intelligence” (emphasis ours).
DeepMind has done a lot of work on RL, agents and multi-modalities. It is literally in their mission statement to “solve intelligence, developing more general and capable problem-solving systems, known as AGI”.
OpenAI has a mission statement more focused on safety: “We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome”. Unfortunately, they have also been a major kickstarter of the race with GPT-3 and then ChatGPT.
(Since we started writing this post, Microsoft deployed what could be OpenAI’s GPT4 on Bing, plugged directly into the internet.)
Slowing Down the Race
There has been literally no regulation whatsoever to slow down AGI development. As far as we know, the efforts of key actors don’t go in this direction.
We don’t know of any major AI lab that has participated in slowing down AGI development, or publicly expressed interest in it.
Here are a few arguments that we have personally encountered, multiple times, for why slowing down AGI development is actually bad:
Remember that arguments are soldiers: there is a whole lot more interest in pushing for the “Racing is good” thesis than for slowing down AGI development.
Question people
We could say more. But:
So our message is: things are worse than what is described in this post!
Don’t trust blindly, don’t assume: ask questions and reward openness.
Recommendations:
4. Conclusion
Let’s summarize our point of view:
Should we just give up and die?
Nope! And not just for dignity points: there is a lot we can actually do. We are currently working on it quite directly at Conjecture.
We’re not hopeful that full alignment can be solved anytime soon, but we think that narrower sub-problems with tighter feedback loops, such as ensuring the boundedness of AI systems, are promising directions to pursue.
If you are interested in working together on this (not necessarily by becoming an employee or funding us), send an email with your bio and skills, or just a private message here.
We personally also recommend engaging with the writings of Eliezer Yudkowsky, Paul Christiano, Nate Soares, and John Wentworth. We do not endorse all of their research, but they all have tackled the problem, and made a fair share of their reasoning public. If we want to get better together, they seem like a good start.
5. Disclaimer
We acknowledge that the points above don't go deeply into our models of why these situations are the case. Regardless, we wanted our point of view to at least be written down in public.
For many readers, these problems will be obvious and require no further explanation. For others, these claims will be controversial: we’ll address some of these cruxes in detail in the future if there’s interest.
Some of these potential cruxes include:
Edited to include DayDreamer, VideoDex and RT-1, h/t Alexander Kruel for these additional, better examples.