Wei Dai

I think I need more practice talking with people in real time (about intellectual topics). (I've gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.

Posts

6 Wei Dai's Shortform

7mo

21 AI doing philosophy = AI generating hands?

9mo

79 UDT shows that decision theory is more puzzling than ever

52 Meta Questions about Metaphilosophy

42 A broad basin of attraction around human values?

64 Morality is Scary

38 List of resolved confusions about IDA

32 Counterfactual Oracles = online supervised learning with random selection of training episodes

49 AI Safety "Success Stories"

29 Six AI Risk/Strategy Ideas

Wiki Contributions

(-35)

(+3/-3)

(+2/-2)

Updateless Decision Theory

11y

(+62)

The Hanson-Yudkowsky AI-Foom Debate

12y

(+23/-12)

Updateless Decision Theory

12y

(+172)

Signaling

12y

(+35)

Updateless Decision Theory

13y

(+22)

Comments

Mark Xu's Shortform

Wei Dai 7h 85

Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.

When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve the parts of the alignment problem that are easy to verify, but alignment as a whole will still be bottlenecked on the philosophically confusing, hard-to-verify parts.

Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn't.

So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I'm afraid that some people just don't feel philosophically confused about much of anything, and will push forward on that basis.) But you do seem to worry about philosophical problems, which makes me confused about the position you take here.

BTW I have similar objections to working on relatively easy forms of (i.e., unscalable) alignment solutions, and using the resulting aligned AIs to solve alignment for more powerful AIs. But at least there, one might gain some insights into the harder alignment problems from working on the easy problems, potentially producing some useful strategic information or making it easier to verify future proposed alignment solutions. So while I don't think that's a good plan, this plan seems even worse.

Wei Dai's Shortform

Wei Dai 9d 30

As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:

Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook.

When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write-up their responses—a complete violation of Scale’s raison d’être.

So they detected the cheating that time, but in RLHF how would they know if contractors used AI to select which of two AI responses is more preferred?

BTW here's a poem(?) I wrote for Twitter, actually before coming across the above story:

The people try to align the board.
The board tries to align the CEO.
The CEO tries to align the managers.
The managers try to align the employees.
The employees try to align the contractors.
The contractors sneak the work off to the AI.
The AI tries to align the AI.

Wei Dai's Shortform

Wei Dai 14d 166

What is going on with Constitutional AI? Does anyone know why no LLM aside from Claude (at least none that I can find) has used it? One would think that if it works about as well as RLHF (which it seems to), AI companies would be flocking to it to save on the cost of human labor?

Also, apparently ChatGPT doesn't know that Constitutional AI is RLAIF (until I reminded it) and Gemini thinks RLAIF and RLHF are the same thing. (Apparently not a fluke as both models made the same error 2 out of 3 times.)

Wei Dai's Shortform

Wei Dai 17d 110

About a week ago FAR.AI posted a bunch of talks at the 2024 Vienna Alignment Workshop to its YouTube channel, including Supervising AI on hard tasks by Jan Leike.

The Obliqueness Thesis

Wei Dai 17d 5-3

What do you think about my positions on these topics as laid out in Six Plausible Meta-Ethical Alternatives and Ontological Crisis in Humans?

My overall position can be summarized as being uncertain about a lot of things, and wanting (some legitimate/trustworthy group, i.e., not myself as I don't trust myself with that much power) to "grab hold of the whole future" in order to preserve option value, in case grabbing hold of the whole future turns out to be important. (Or some other way of preserving option value, such as preserving the status quo / doing AI pause.) I have trouble seeing how anyone can justifiably conclude "so don’t worry about grabbing hold of the whole future" as that requires confidently ruling out various philosophical positions as false, which I don't know how to do. Have you reflected a bunch and really think you're justified in concluding this?

E.g. in Ontological Crisis in Humans I wrote "Maybe we can solve many ethical problems simultaneously by discovering some generic algorithm that can be used by an agent to transition from any ontology to another?" which would contradict your "not expecting your preferences to extend into the distant future with many ontology changes" and I don't know how to rule this out. You wrote in the OP "Current solutions, such as those discussed in MIRI’s Ontological Crises paper, are unsatisfying. Having looked at this problem for a while, I’m not convinced there is a satisfactory solution within the constraints presented." but to me this seems like very weak evidence for the problem being actually unsolvable.

The Obliqueness Thesis

Wei Dai 17d 84

As long as all mature superintelligences in our universe don't necessarily have (end up with) the same values, and only some such values can be identified with our values or what our values should be, AI alignment seems as important as ever. You mention "complications" from obliqueness, but haven't people like Eliezer recognized similar complications pretty early, with ideas such as CEV?

It seems to me that from a practical perspective, as far as what we should do, your view is much closer to Eliezer's view than to Land's view (which implies that alignment doesn't matter and we should just push to increase capabilities/intelligence). Do you agree/disagree with this?

It occurs to me that maybe you mean something like "Our current (non-extrapolated) values are our real values, and maybe it's impossible to build or become a superintelligence that shares our real values so we'll have to choose between alignment and superintelligence." Is this close to your position?

The Checklist: What Succeeding at AI Safety Will Involve

Wei Dai 1mo 1711

Unfortunately this ignores 3 major issues:

1. race dynamics (also pointed out by Akash)

2. human safety problems - given that alignment is defined "in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy", why should we believe that AI developers and/or parts of governments that can coerce AI developers will steer the AI systems in a good direction? E.g., that they won't be corrupted by power or persuasion or distributional shift, and are benevolent to begin with.

3. philosophical errors or bottlenecks - there's a single mention of "wisdom" at the end, but nothing about how to achieve/ensure the unprecedented amount of wisdom or speed of philosophical progress that would be needed to navigate something this novel, complex, and momentous. The OP seems to suggest punting such problems to "outside consensus" or "institutions or processes", with apparently no thought towards whether such consensus/institutions/processes would be up to the task or what AI developers can do to help (e.g., by increasing AI philosophical competence).

Like others I also applaud Sam for writing this, but the actual content makes me more worried, as it's evidence that AI developers are not thinking seriously about some major risks and risk factors.

TurnTrout's shortform feed

Wei Dai 2mo 50

Can you sketch out some ideas for showing/proving premises 1 and 2? More specifically:

For 1, how would you rule out future distributional shifts increasing the influence of "bad" circuits beyond ϵ?

For 2, it seems that you actually need to show a specific K, not just that there exists K>0, otherwise how would you be able to show that x-risk is low for a given curriculum? But this seems impossible, because the "bad" subset of circuits could constitute a malign superintelligence strategically manipulating the overall AI's output while staying within a logit variance budget of ϵ (i.e., your other premises do not rule this out), and how could you predict what such a malign SI might be able to accomplish?

In Defense of Open-Minded UDT

Wei Dai 2mo 141

But if UDT starts with a broad prior, it will probably not learn, because it will have some weird stuff in its prior which causes it to obey random imperatives from imaginary Gods.

Are you suggesting that this is a unique problem for UDT, or affects it more than other decision theories? It seems like Bayesian decision theories can have the same problem, for example a Bayesian agent might have a high prior that an otherwise non-interventionist God will reward them after death for not eating apples, and therefore not eat apples throughout their life. How is this different in principle from UDT refraining from paying the counterfactual mugger in your scenario to get reward from God in the other branch? Why wouldn't this problem be solved automatically given "good" or "reasonable" priors (whatever that means), which presumably would assign such gods low probabilities to begin with?
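To make the apple example concrete, here is a minimal sketch of the expected-utility comparison for an ordinary Bayesian agent. The function name, parameters, and numbers are all hypothetical and chosen only for illustration; nothing here is specific to UDT.

```python
def abstains_from_apples(prior_god, afterlife_reward, apple_utility):
    """Return True if a simple expected-utility maximizer gives up apples.

    prior_god:        assumed prior probability that a god rewards abstainers
    afterlife_reward: assumed utility of that reward, if the god exists
    apple_utility:    assumed lifetime utility of eating apples
    """
    eu_abstain = prior_god * afterlife_reward  # reward only arrives if the god exists
    eu_eat = apple_utility                     # apples pay off either way
    return eu_abstain > eu_eat


# A "weird" prior makes the agent obey the imagined imperative...
print(abstains_from_apples(prior_god=0.5, afterlife_reward=1000, apple_utility=10))
# ...while a prior that assigns such gods low probability does not.
print(abstains_from_apples(prior_god=1e-9, afterlife_reward=1000, apple_utility=10))
```

Under these assumed numbers the behavior is driven entirely by the prior, which is the sense in which the problem does not look unique to UDT.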

Interlocutor: The prior is subjective. An agent has no choice but to trust its own prior. From its own perspective, its prior is the most accurate description of reality it can articulate.

I wouldn't say this, because I'm not sure that the prior is subjective. From my current perspective I would say that it is part of the overall project of philosophy to figure out the nature of our priors and the contents of what they should be (if they're not fully subjective or have some degree of normativity).

So I think there are definitely problems in this area, but I'm not sure it has much to do with "learning" as opposed to "philosophy" and the examples / thought experiments you give don't seem to pump my intuition in that direction much. (How UDT works in iterated counterfactual mugging also seems fine to me.)

Decision theory does not imply that we get to have nice things

Wei Dai 2mo 31

my objection here is mostly to analogizing the creation of ASI to a prisoner’s dilemma like this.

The reason why it is disanalogous is because humanity has no ability to make our strategy conditional on the strategy of our opponent.

It's not part of the definition of PD that players can condition on each other's strategies. In fact PD was specifically constructed to prevent this (i.e., specifying that each prisoner has to act without observing how the other acted). It was Eliezer's innovation to suggest that the two players can still condition on each other's strategies by simulation or logical inference, but it's not sensible to say that inability to do this makes a game not a PD! (This may not be a crux in the current discussion, but seems like too big of an error/confusion to leave uncorrected.)
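As a rough illustration of the point about definitions, the sketch below characterizes a PD purely by its payoff ordering, with each move chosen without seeing the other's. The payoff numbers and function names are hypothetical; only the ordering T > R > P > S matters.

```python
# Hypothetical payoffs (negated prison years, so higher is better).
# Keys are (my_move, their_move); values are (my_payoff, their_payoff).
PAYOFFS = {
    ("C", "C"): (-1, -1),  # R: both cooperate
    ("C", "D"): (-3, 0),   # S for me, T for them
    ("D", "C"): (0, -3),   # T for me, S for them
    ("D", "D"): (-2, -2),  # P: both defect
}


def is_prisoners_dilemma(payoffs):
    """Check the standard ordering T > R > P > S (symmetric game assumed)."""
    r = payoffs[("C", "C")][0]
    s = payoffs[("C", "D")][0]
    t = payoffs[("D", "C")][0]
    p = payoffs[("D", "D")][0]
    return t > r > p > s


def best_response(payoffs, their_move):
    """My payoff-maximizing move against a fixed move by the other player."""
    return max("CD", key=lambda mine: payoffs[(mine, their_move)][0])


print(is_prisoners_dilemma(PAYOFFS))
print(best_response(PAYOFFS, "C"), best_response(PAYOFFS, "D"))
```

Nothing in this check refers to either player observing or predicting the other's strategy: the game counts as a PD because defection is the best response to both moves while mutual cooperation beats mutual defection.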

However, we have no ability to do so, and doing this sounds like it would require making enormous progress on our ability to predict the actions of future AI systems in a way that seems like it could be genuinely harder than just aligning it directly to our values

My recall of early discussions with Eliezer is that he was too optimistic about our ability to make predictions like this, and this seems confirmed by my recent review of his comments in the thread I linked. See also my parallel discussion with Eliezer. (To be honest, I thought I was making a fairly straightforward, uncontroversial claim, and now somewhat regret causing several people to spend a bunch of time back and forth on what amounts to a historical footnote.)