Wei Dai's Comments

AI Alignment Open Thread October 2019

When I listen to old recordings of right-wing talk show hosts from decades ago, they seem to be saying the same things people are saying today, about political correctness and being forced out of academia for saying things deemed harmful by the social elite, or about the Left being obsessed with equality and identity. So I would definitely say that a lot of people predicted this would happen.

I think what's surprising is that although academia has been left-leaning for decades, the situation had been relatively stable until the last few years, when things escalated very quickly, to the extent that even professors who firmly belong on the Left are being silenced or driven out of academia for disagreeing with an ever-changing party line. (It used to be that universities at least paid lip service to open inquiry, overt political correctness was confined to non-STEM fields, and there was relatively open discussion among people who managed to get into academia in the first place. At least that's my impression.) Here are a couple of links for you if you haven't been following the latest developments:

A quote from the second link:

Afterward, several faculty who had attended the gathering told me they were afraid to speak in my defense. One, a full professor and past chair, told me that what had happened was very wrong but he was scared to talk.

Another faculty member, who was originally from China and lived through the Cultural Revolution, told me it was exactly like the shaming sessions of Maoist China, with young Red Guards criticizing and shaming elders they wanted to embarrass and remove.

(BTW I came across this without specifically searching for "cultural revolution".) Note that the author is in favor of carbon taxes in general and supported past attempts to pass carbon taxes, and was punished for disagreeing with a specific proposal that he took issue with. How many people (if any) predicted that things like this would be happening on a regular basis at this point?

AI Alignment Open Thread October 2019

Ahh. To be honest, I read that, but then responded to something different. I assumed you were just expressing general pessimism, since there’s no guarantee that we would converge on good values upon a long reflection (and you recently viscerally realized that values are very arbitrary).

I guess I was also expressing a more general update towards pessimism: even if nothing happens during the Long Reflection that causes humanity to prematurely build an AGI, other new technologies that will be available/deployed during the Long Reflection could also invalidate the historical tendency for "Cultural Revolutions" to dissipate over time and for moral evolution to continue along longer-term trends.

though I personally am skeptical that anything like the long reflection will ever happen.

Sure, I'm skeptical of that too, but given my pessimism about more direct routes to building an aligned AGI, I thought it might be worth pushing for it anyway.

AI Alignment Open Thread October 2019

I think it’s likely that another cultural revolution could happen, and this could adversely affect the future if it happens simultaneously with a transition into an AI-based economy.

This seems to be ignoring the part of my comment at the top of this sub-thread, where I said "[...] has also made me more pessimistic about non-AGI or delayed-AGI approaches to a positive long term future (e.g., the Long Reflection)." In other words, I'm envisioning a long period of time in which humanity has the technical ability to create an AGI but is deliberately holding off to better figure out our values or otherwise perfect safety/alignment. I'm worried about something like the Cultural Revolution happening in this period, and you don't seem to be engaging with that concern?

AI Alignment Open Thread October 2019

I could be wrong here, but the stuff you mentioned appears either ephemeral or too particular. The “last few years” of political correctness is hardly enough time to judge world trends by, right? By contrast, the stuff I mentioned (end of slavery, explicit policies against racism and war) seems likely to stick and stay with us for decades, if not centuries.

It sounds like you think that something like another Communist Revolution or Cultural Revolution could happen (that emphasizes some random virtues at the expense of others), but the effect would be temporary and after it's over, longer term trends will reassert themselves. Does that seem fair?

In the context of AI strategy, though (specifically something like the Long Reflection), I would worry that a world in the grips of another Cultural Revolution would be very tempted to abandon (or find it impossible to refrain from abandoning) the plan to delay AGI, and would instead build a superintelligent AI and lock its values in ASAP, even if that involved more safety risk. Predictability of longer-term moral trends (even if true) doesn't seem to help with this concern.

AI Alignment Open Thread October 2019

By unpredictable I mean that nobody really predicted:

(Edit: 1-3 removed to keep a safer distance from object-level politics, especially on AF)

4. Russia and China adopted communism even though they were extremely poor. (Due to that, they were ahead of the US in gender equality and income equality for a time, even though they were much poorer.)

None of these seem well-explained by your "rich society" model. My current model is that social media and a decrease in the perception of external threats relative to internal threats both favor more virtue signaling, which starts spiraling out of control after some threshold is crossed. But the actual virtue(s) that end up being signaled/reinforced (often at the expense of other virtues) are historically contingent and hard to predict.

AI Alignment Open Thread October 2019

Studying recent cultural changes in the US and the ideas of virtue signaling and preference falsification more generally has also made me more pessimistic about non-AGI or delayed-AGI approaches to a positive long term future (e.g., the Long Reflection). I used to think that if we could figure out how to achieve strong global coordination on AI, or build a stable world government, then we'd be able to take our time, centuries or millennia if needed, to figure out how to build an aligned superintelligent AI. But it seems that human cultural/moral evolution often happens through some poorly understood but apparently quite unstable dynamics, rather than by philosophers gradually making progress on moral philosophy and ultimately converging to moral truth as I may have imagined or implicitly assumed. (I did pay lip service to concerns about "value drift" back then but I guess it just wasn't that salient to me.)

Especially worrying is that no country or culture seems immune to these unpredictable dynamics. My father used to tell me to look out for the next Cultural Revolution (having lived through one himself), and I always thought that it was crazy to worry about something like that happening in the West. Well I don't anymore.

Outer alignment and imitative amplification

I may have asked this already somewhere, but do you know if there's a notion of "outer aligned" that is applicable to oracles/predictors in general, as opposed to trying to approximate/predict HCH specifically? Basically the problem is that I don't know what "aligned" or "trying to do what we want" could mean in the general case. Is "outer alignment" meant to be applicable in the general case?

This post talks about outer alignment of the loss function. Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that, for example, if there is a security hole in the hardware or software environment and the model takes advantage of it to hack its loss/reward, we'd call that an "outer alignment failure"? Or would it make more sense to use different terminology for that?

Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want.

So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?
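
To make the relativization explicit, here is a minimal formalization in my own notation (not the post's), where $\mathcal{M}$ is the model class, $L$ the loss function, and $\mathrm{Aligned}(m)$ stands in for the informal "$m$ is at least trying to do what we want":

$$\forall\, m \in \operatorname*{arg\,min}_{m' \in \mathcal{M}} L(m') :\ \mathrm{Aligned}(m)$$

On this reading the definition is parameterized by $\mathcal{M}$: enlarging or shrinking the model class can change which models count as loss-optimal, and hence whether the loss function is outer aligned at optimum.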

Also, related to Ofer's comment, can you clarify whether it's intended for this definition that the loss function only looks at the model's input/output behavior, or can it also take into account other information about the model?

HCH is just a bunch of humans after all and if you can instruct them not to do dumb things like instantiate arbitrary Turing machines

I believe the point about Turing machines was that given a low-bandwidth overseer (LBO), it's not clear how to get HCH/IA to do complex tasks without making it instantiate arbitrary Turing machines. But other issues arise with a high-bandwidth overseer (HBO), as William Saunders wrote in the above-linked post:

The reason for this system [LBO] being introduced is wanting to avoid security issues as the system scales. The fear is that there would be an “attack” on the system: an input that could be shown to an overseer that would cause the overseer to become corrupted and try to sabotage the system. This could be some kind of misleading philosophical argument, some form of blackmail, a human adversarial example, etc. If an input like this exists, then as soon as the first agent is corrupted, it can try to spread the attack to other agents. The first agent could be corrupted either by chance, or through an attack being included in the input.

I understand you don't want to go into details about whether theoretical HCH is aligned or not here, but I still want to flag that "instruct them not to do dumb things like instantiate arbitrary Turing machines" seems rather misleading. I'm also curious whether you have HBO or LBO in mind for this post.

Outer alignment and imitative amplification

Aside from some quibbles, this matches my understanding pretty well, but may leave the reader wondering why Paul Christiano and Ought decided to move away from imitative amplification to approval-based amplification. To try to summarize my understanding of their thinking (mostly from an email conversation in September of last year between me, you (Evan), Paul Christiano, and William Saunders):

  • William (and presumably Paul) think approval-based amplification can also be outer aligned. (I do not have a good understanding of why they think this, and William said "still have an IOU pending to provide a more fleshed out argument why it won't fail.")
  • Paul thinks imitative amplification has a big problem when the overseer gets amplified beyond the capacity of the model class that's being trained. (Approximating HCH as closely as possible wouldn't lead to good results in that case unless we had a rather sophisticated notion of "close".)
  • I replied that we could do research into how the overseer could effectively dumb itself down, similar to how a teacher would dumb themselves down to teach a child. One approach is a trial-and-error process: for example, ramping up the difficulty of what it's trying to teach, backing down if the model stops learning well, trying a different way of improving task performance and checking if the model can learn that, and so on. (I didn't get a reply on this point; a toy sketch of this loop appears after this list.)
  • William also wrote, "RL-IA is easier to run human experiments in, because of the size of trees needed to complete tasks, and the access to human experts with full knowledge of the tree (e.g. the Ought reading comprehension experiments). I'd lean towards taking the position that we should try to use SL-IA where possible, but some tasks might just be much easier to work with in RL-IA."
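
As a concrete illustration of the trial-and-error idea above, here is a minimal toy sketch. Everything in it (the skill/difficulty model, the thresholds, the update rules) is invented for illustration and is not from any actual amplification codebase:

```python
# Toy sketch of an overseer "dumbing itself down" by trial and error.
# All numbers and the learning model are made up for illustration.

import random

random.seed(0)

def learning_score(model_skill, difficulty):
    # Stand-in for "how well the model learns at this difficulty":
    # learning succeeds when the material is within reach of the model.
    return 1.0 if difficulty <= model_skill + 1.0 else 0.5 * random.random()

def adaptive_curriculum(rounds=50):
    model_skill = 0.0  # grows as the model successfully learns
    difficulty = 1.0   # how far ahead of the model the overseer pitches its teaching
    for _ in range(rounds):
        if learning_score(model_skill, difficulty) > 0.9:
            model_skill += 0.5  # the model learned something...
            difficulty *= 1.2   # ...so the overseer can ramp up
        else:
            difficulty *= 0.7   # the model stopped learning well: back down
                                # (the real proposal would also try a different
                                # teaching strategy at this point)
    return model_skill, difficulty

print(adaptive_curriculum())
```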

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

It seems that the interviewees here either:

  1. Use "AI risk" in a narrower way than I do.
  2. Neglected to consider some sources/forms of AI risk (see above link).
  3. Have considered other sources/forms of AI risk but do not find them worth addressing.
  4. Are worried about other sources/forms of AI risk but they weren't brought up during the interviews.

Can you talk about which of these is the case for yourself (Rohin) and for anyone else whose thinking you're familiar with? (Or if any of the other interviewees would like to chime in for themselves?)

Is the term mesa optimizer too narrow?

When the brain makes a decision, it usually considers at most three or four alternatives for each action it does. Most of the actual work is therefore done at the heuristics stage, not the selection part. And even at the selection stage, I have little reason to believe that it is actually comparing alternatives against an explicit objective function.

Assuming this, it seems to me that the heuristics are being continuously trained by the selection stage, so the selection stage is the most important part even if the heuristics are doing most of the immediate work in making each decision. And I'm not sure what you mean by "explicit objective function". I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function, and instead only counting a symbolically represented function as "explicit"? If so, why would not being "explicit" disqualify humans as mesa optimizers? If not, could you explain more about what you mean? (A toy illustration of the distinction I have in mind is sketched below.)
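
Here is a toy sketch of that distinction, entirely in my own framing (the weights, features, and objective are all made up): in the first function the agent evaluates a stated objective when comparing alternatives, while in the second, whatever objective shaped the weights exists only implicitly in the training history.

```python
# Toy contrast between an "explicit" and an "implicit" objective function.
# Both the weights and the feature encoding are invented for illustration.

# Explicit: the selection stage scores each candidate against a stated objective.
def explicit_select(candidates, objective):
    return max(candidates, key=objective)

# Implicit: a tiny linear "network" maps features straight to an action;
# no objective is evaluated at decision time.
WEIGHTS = [0.3, -1.2, 0.7]  # pretend these came out of some training process

def implicit_policy(features):
    score = sum(w * f for w, f in zip(WEIGHTS, features))
    return "act" if score > 0 else "wait"

# The explicit agent visibly compares alternatives against a goal...
print(explicit_select([1, 4, 7], objective=lambda x: -abs(x - 4)))  # prints 4
# ...while the implicit one just maps inputs to outputs.
print(implicit_policy([1.0, 0.2, 0.5]))  # prints "act"
```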

Since this can all be done in a simple feedforward neural network, I find it hard to see why the best model of its behavior should be an optimizer.

I take your point that a model can look like an optimizer at first glance and yet, on closer inspection, not really be one. But this doesn't answer my question: "Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here."

ETA: If you don't have a realistic example in mind, and just think that we shouldn't currently rule out the possibility that a non-optimizer might generalize in a way that is more dangerous than total failure, I think that's a good thing to point out too. (I had already upvoted your post based on that.)
