AI ALIGNMENT FORUM


AI Alignment Posts

Popular Comments

2248
GPTs are Predictors, not Imitators
Best of LessWrong 2023

GPTs are being trained to predict text, not imitate humans. This task is actually harder than being human in many ways. You need to be smarter than whoever generated the text to perfectly predict their output, and some text is the result of complex processes (e.g. scientific results, news) that even humans couldn't predict.

GPTs are solving a fundamentally different and often harder problem than just "be human-like". This means we shouldn't expect them to think like humans.

by Eliezer Yudkowsky
Raemon · 6d · 1644
Please, Don't Roll Your Own Metaethics
What are you supposed to do other than roll your own metaethics?
cousin_it · 9d · 1013
Problems I've Tried to Legibilize
I'm worried about the approach of "making decisionmakers realize stuff". In the past couple years I've switched to a more conflict-theoretic view: the main problem to me is that the people building AI don't want to build aligned AI. Even if we solved metaethics and metaphilosophy tomorrow, and gave them the solution on a plate, they wouldn't take it. This is maybe easiest to see by looking at present harms. An actually aligned AI would politely decline to do such things as putting lots of people out of jobs or filling the internet with slop. So companies making AI for the market have to make it misaligned in at least these ways, otherwise it'll fail in the market. Extrapolating into the future, even if we do lots of good alignment research, markets and governments will pick out only those bits that contribute to market-aligned or government-aligned AI. Which (as I've been saying over and over) will be really bad for most people, because markets and governments don't necessarily need most people. So this isn't really a comment on the list of problems (which I think is great), but more about the "theory of change" behind it. I no longer have any faith in making decisionmakers understand something it's not profitable for them to understand. I think we need a different plan.
Vladimir_Nesov · 10d · 169
Comparing Payor & Löb
I would term □x→x "hope for x" rather than "reliability", because it's about willingness to enact x in response to belief in x, but if x is no good, you shouldn't do that. Indeed, for bad x, having the property of □x→x is harmful fatalism, following along with destiny rather than choosing it. In those cases, you might want to □x→¬x or something, though that only prevents x from being believed, so that you won't need to face □x in actuality; it doesn't prevent the actual x. So □x→x reflects a value judgement about x reflected in the agent's policy, something downstream of endorsement of x, a law of how the content of the world behaves according to an embedded agent's will.

Payor's Lemma then talks about belief in hope, □(□x→x); that is, hope itself is exogenous and needs to be judged (endorsed or not). This is reasonable for games, since what the coalition might hope for is not anyone's individual choice: the details of this hope couldn't have been hardcoded into any agent a priori and need to be negotiated during the decision that forms the coalition. A functional coalition should be willing to act on its own hope (which is again something we need to check for a new coalition, though it might've already been the case for a singular agent), that is, we need to check that □(□x→x) is sufficient to motivate the coalition to actually x. This is again a value judgement about whether this coalition's tentative aspirations, being a vehicle for hope that x, are actually endorsed by it.

Thus I'd term □(□x→x) "coordination" rather than "trust", the fact that this particular coalition would tentatively intend to coordinate on a hope for x. Hope □x→x is a value judgement about x, and in this case it's the coalition's hope, rather than any one agent's hope, and the coalition is a temporary nascent agency thing that doesn't necessarily know what it wants yet. The coalition asks: "If we find ourselves hoping for x together, will we act on it?" So we start with coordination about hope, seeing if this particular hope wants to settle as the coalition's actual values, and judging if it should by enacting x if at least coordination on this particular hope is reached, which should happen only if x is a good thing.

(One intuition pump, with some limitations outside the provability formalism, is treating □x as "probably x", perhaps according to what some prediction market tells you. If "probably x" is enough to prompt you to enact x, that's some kind of endorsement, and it's a push towards increasing the equilibrium-on-reflection value of the probability of x, pushing "probably x" closer to reality. But if x is terrible, then enacting it in response to its high probability is following along with self-fulfilling doom, rather than doing what you can to push the equilibrium away from it.)

Löb's Theorem then says that if we merely endorse a belief by enacting the believed outcome, this is sufficient for the outcome to actually happen, a priori and without that belief yet being in evidence. And Payor's Lemma says that if we merely endorse a coalition's coordinated hope by enacting the hoped-for outcome, this is sufficient for the outcome to actually happen, a priori and without the coordination around that hope yet being in evidence. The use of Löb's Theorem or Payor's Lemma is that the condition (belief in x, or coordination around hope for x) should help in making the endorsement, that is, it should be easier to decide to x if you already believe that x, or if you already believe that your coalition is hoping for x.
For coordination, this is important because every agent can only unilaterally enact its own part in the joint policy, so it does need some kind of premise about the coalition's nature (in this case, about the coalition's tentative hope for what it aims to achieve) in order to endorse playing its part in the coalition's joint policy. It's easier to decide to sign an assurance contract than to unconditionally donate to a project, and the role of Payor's Lemma is to say that if everyone does sign the assurance contract, then the project will in fact get funded sufficiently.
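For reference, here are the standard formal statements being compared, as usually given in provability logic (a paraphrase for orientation, not text from the post under discussion):

```latex
% Standard provability-logic statements, paraphrased; \Box reads "it is provable/believed that".
% Löb's Theorem: endorsing mere belief in x is already enough to yield x.
\text{If } \vdash \Box x \to x \text{, then } \vdash x.
% Payor's Lemma: it suffices to endorse the weaker premise of belief in the hope for x.
\text{If } \vdash \Box(\Box x \to x) \to x \text{, then } \vdash x.
```

On the comment's reading, Löb asks an agent to enact x given only the belief □x, while Payor asks the coalition to enact x given only the coordination □(□x→x), the premise that can actually be checked and negotiated when the coalition forms.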
8 · Martin Randall

Does this look like a motte-and-bailey to you?

  1. Bailey: GPTs are Predictors, not Imitators (nor Simulators).
  2. Motte: The training task for GPTs is a prediction task.

The title and the concluding sentence both plainly advocate for (1), but it's not really touched by the overall post, and I think it's up for debate (related: reward is not the optimization target). Instead there is an argument for (2). Perhaps the intention of the final sentence was to oppose Simulators? If that's the case, cite it, be explicit. This could be a really easy thing for an editor to fix.

Does this look like a motte-and-bailey to you?

  1. Bailey: The task that GPTs are being trained on is ... harder than the task of being a human.
  2. Motte: Being an actual human is not enough to solve GPT's task.

As I read it, (1) is false: the task of being a human doesn't cap out at human intelligence. More intelligent humans are better at minimizing prediction error, achieving goals, inclusive genetic fitness, whatever you might think defines "the task of being a human". In the comments, Yudkowsky retreats to (2), which is true. But then how should I understand this whole paragraph from the post? If we're talking about how natural selection trained my genome, why are we talking about how well humans perform the human task? Evolution is optimizing over generations. My human task is optimizing over my lifetime. Also, if we're just arguing for different thinking, surely it mostly matters whether the training task is different, not whether it is harder?

Overall I think "Is GPT-N bounded by human capabilities? No." is a better post on the mottes and avoids staking out unsupported baileys. This entire topic is becoming less relevant because AIs are getting all sorts of synthetic data and RLHF and other training techniques thrown at them. The 2022 question of the capabilities of a hypothetical GPT-N that was only t

Recent Discussion

Problems I've Tried to Legibilize
34
Wei Dai
9d

Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.

  1. Philosophical problems
    1. Probability theory
    2. Decision theory
    3. Beyond astronomical waste (possibility of influencing vastly larger universes beyond our own)
    4. Interaction between bargaining and logical uncertainty
    5. Metaethics
    6. Metaphilosophy: 1, 2
  2. Problems with specific philosophical and alignment ideas
    1. Utilitarianism: 1, 2
    2. Solomonoff induction
    3. "Provable" safety
    4. CEV
    5. Corrigibility
    6. IDA (and many scattered comments)
    7. UDASSA
    8. UDT
  3. Human-AI safety (x- and s-risks arising from the interaction between human nature and
...
7 · Eliezer Yudkowsky · 2d
Has anyone else, or anyone outside the tight MIRI cluster, made progress on any of the problems you've tried to legibilize for them?
Wei Dai · 2d · 60

To give a direct answer, not a lot come to mind outside of the MIRI cluster. I think the Center on Long-Term Risk cluster did a bunch of work on decision theory and acausal trade, but it was mostly after I had moved on to other topics, so I'm not sure how much of it constituted progress. Christiano acknowledged some of the problems I pointed out with IDA and came up with some attempted solutions, which I'm not convinced really work.

However, in my previous post, Legible vs. Illegible AI Safety Problems, I explained my latest thinking that the most important...

3 · orthonormal · 3d
When it specifically comes to loss-of-control risks killing or sidelining all of humanity, I don't believe Sam or Dario or Demis or Elon want that to happen, because it would happen to them too. (Larry Page is different on that count, of course.) You do have conflict theory over the fact that some of them would like ASI to make them god-emperor of the universe, but all of them would definitely take a solution to "loss of control" if it were handed to them on a silver platter.
3 · cousin_it · 3d
I think AI offers a chance of getting huge power over others, so it would create competitive pressure in any case. In case of a market economy it's market pressure, but in case of countries it would be a military arms race instead. And even if the labs didn't get any investors and raced secretly, I think they'd still feel under a lot of pressure. The chance of getting huge power is what creates the problem, that's why I think spreading out power is a good idea. There would still be competition of course, but it would be normal economic levels of competition, and people would have some room to do the right things.
Please, Don't Roll Your Own Metaethics
50
Wei Dai
6d

One day, when I was an intern at the cryptography research department of a large software company, my boss handed me an assignment to break a pseudorandom number generator passed to us for review. Someone in another department had invented it and planned to use it in their product, and wanted us to take a look first. This person must have had a lot of political clout or been especially confident in himself, because he rejected the standard advice that anything an amateur comes up with is very likely to be insecure and that he should instead use one of the established, off-the-shelf cryptographic algorithms that have survived extensive cryptanalysis (code-breaking) attempts.

My boss thought he had to demonstrate the insecurity of the PRNG by coming up...

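To make the "amateur PRNG" point concrete, here is a minimal, hypothetical sketch in Python. The generator in the story isn't described, so this uses a textbook linear congruential generator rather than the one under review: a few consecutive outputs are enough to recover the secret parameters and predict everything that follows.

```python
# A minimal, hypothetical sketch (not the PRNG from the story, which isn't
# described): recovering the secret parameters of a textbook linear
# congruential generator from three consecutive outputs, then predicting
# the fourth. Assumes full-state outputs and a known prime modulus.

M = 2**31 - 1  # assumed-known modulus (a Mersenne prime, so nonzero values are invertible)

def lcg(seed, a, c, m=M):
    """Textbook linear congruential generator: x_{n+1} = (a*x_n + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

# The "amateur" generator, with parameters the attacker doesn't know.
stream = lcg(seed=123456789, a=1103515245, c=12345)
x0, x1, x2, x3 = (next(stream) for _ in range(4))

# From x1 = a*x0 + c and x2 = a*x1 + c (mod M): a = (x2 - x1) / (x1 - x0).
a_rec = ((x2 - x1) * pow(x1 - x0, -1, M)) % M   # modular inverse needs Python 3.8+
c_rec = (x1 - a_rec * x0) % M
prediction = (a_rec * x2 + c_rec) % M

assert prediction == x3  # every future output is now predictable
print(f"recovered a={a_rec}, c={c_rec}, next output={prediction}")
```

Real attacks on homemade designs are usually messier than this, but the lesson is the same: without extensive cryptanalysis, this kind of predictability goes unnoticed.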
16 · Richard_Ngo · 3d
"Please don't roll your own crypto" is a good message to send to software engineers looking to build robust products. But it's a bad message to send to the community of crypto researchers, because insofar as they believe you, then you won't get new crypto algorithms from them. In the context of metaethics, LW seems much more analogous to the "community of crypto researchers" than the "software engineers looking to build robust products". Therefore this seems like a bad message to send to LessWrong, even if it's a good message to send to e.g. CEOs who justify immoral behavior with metaethical nihilism.
Wei Dai · 3d · 50

You may have missed my footnote, where I addressed this?

To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way". Similarly "don't roll your own crypto" doesn't mean never try to invent new cryptography, but rather don't deploy it unless there has been extensive review, and consensus that it is likely to be secure.

3 · Wei Dai · 3d
By "metaethics" I mean "the nature of values/morality", which I think is how it's used in academic philosophy. Of course the nature of values/morality has a strong influence on "how humans should think about their values" so these are pretty closely connected, but definitionally I do try to use it the same way as in philosophy, to minimize confusion. This post can give you a better idea of how I typically use it. (But as you'll see below, this is actually not crucial for understanding my post.) So in the paragraph that you quoted (and the rest of the post), I was actually talking about philosophical fields/ideas in general, not just metaethics. While my title has "metaethics" in it, the text of the post talks generically about any "philosophical questions" that are relevant for AI x-safety. If we substitute metaethics (in my or the academic sense) into my post, then you can derive that I mean something like this: Different metaethics (ideas/theories about the nature of values/morality) have different implications for what AI designs or alignment approaches are safe, and if you design an AI assuming that one metaethical theory is true, it could be disastrous if a different metaethical theory actually turns out to be true. For example, if moral realism is true, then aligning the AI to human values would be pointless. What you really need to do is design the AI to be able to determine and follow objective moral truths. But this approach would be disastrous if moral realism is actually false. Similarly, if moral noncognitivism is true, that means that humans can't be wrong about their values, and implies "how humans should think about their values" is of no importance. If you design AI under this assumption, that would be disastrous if actually humans can be wrong about their values and they really need AIs to help them think about their values and avoid moral errors. I think in practice a lot of alignment researchers may not even have explicit metaethical theories
1 · lemonhope · 4d
The WWDSC is nearly a consensus. Certainly a plurality.
Lessons from building a model organism testbed
6
joshc, sarun0, Annie Sorkin, michaelwaves
1d

I often read interpretability papers and I come away thinking “ok, but what’s the point? What problem does this help us solve?” So last winter, I organized a MATS/Pivotal stream to build examples of deceptive models (aka “model organisms”). The goal was to build a diverse ‘zoo’ of these model organisms and empirically test whether white-box methods could help us detect their deceptive reasoning.

Unfortunately, I don’t think our empirical results were very informative. Our model organisms were too toy for me to expect that our results will transfer to powerful AI (or even current state-of-the-art AI models). But I think we still developed methodological details that might help people build better model organism testbeds in the future.

I’ll first explain what model organism testbeds are and why...

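As one purely illustrative example of the kind of white-box check such a testbed can run (not the authors' actual setup; the data, layer choice, and labels below are synthetic stand-ins): a linear probe on cached hidden activations, trained to flag responses labeled as deceptive reasoning.

```python
# A minimal, hypothetical sketch of one common white-box method: a linear probe
# over hidden activations, trained to flag "deceptive reasoning". The activations
# and labels here are synthetic stand-ins, not the testbed's actual data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for residual-stream activations at some layer: rows are prompts,
# columns are hidden dimensions; label 1 marks a deceptive response.
acts = rng.normal(size=(2000, 512))
labels = rng.integers(0, 2, size=2000)
acts[labels == 1] += 0.3  # inject a toy signal so the probe has something to find

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))
```

Whether a probe like this transfers beyond toy model organisms is exactly the question the post says its results were too toy to settle.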
Legible vs. Illegible AI Safety Problems
105
Wei Dai
14d

Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)

From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the...

Alex Mallen · 3d · 10

"Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems."

This model seems far too simplified, and I don't think it leads to the right conclusions in many important cases (e.g., Joe's):

  • Many important and legible safety problems don't slow development. I think it's extremely unlikely, for example, that Anthropic or others would slow development because of a subpar model spec. I think in the counterfactual where Joe doesn't work on the model spec (1) the mod
...
Evaluation Avoidance: How Humans and AIs Hack Reward by Disabling Evaluation Instead of Gaming Metrics
6
Johannes C. Mayer
5d

Have you ever binge-watched a TV series? Binge-watching puts you in a very peculiar mental state.

Assuming you don't reflectively endorse your binge-watching behavior, you'd probably feel pretty bad if you were to reflect on your situation. You might think:

"Man, I am wasting my time. I still need to do my tax return. But doing my tax return is so boring. But it's definitely something I need to do. This is clearly what I should be doing."

Binge-watching is an escape mechanism; it decouples you from your reward system. When bingeing a series, some part of your mind turns off. That part includes exactly the reflective circuitry that would realize you are wasting your time.

When that circuit fires, you'll feel bad.

Usually brains learn to optimize such that they...

6 · Lessons from building a model organism testbed · joshc, sarun0, Annie Sorkin, michaelwaves · 1d · 0 comments
5 · Will AI systems drift into misalignment? · joshc · 4d · 0 comments
6 · Evaluation Avoidance: How Humans and AIs Hack Reward by Disabling Evaluation Instead of Gaming Metrics · Johannes C. Mayer · 5d · 0 comments
3 · Self-interpretability: LLMs can describe complex internal processes that drive their decisions · Adam Morris, Dillon Plunkett · 5d · 0 comments
14 · Supervised fine-tuning as a method for training-based AI control · Emil Ryd, Joe Benton, Vivek Hebbar · 5d · 0 comments
50 · Please, Don't Roll Your Own Metaethics · Wei Dai · 6d · 14 comments
36 · Steering Language Models with Weight Arithmetic · Fabien Roger, constanzafierro · 7d · 2 comments
3 · Strengthening Red Teams: A Modular Scaffold for Control Evaluations · Chloe Loughridge · 8d · 0 comments
23 · Ontology for AI Cults and Cyborg Egregores · Jan_Kulveit · 8d · 1 comment
64 · Condensation · abramdemski · 9d · 7 comments