Wei Dai

I think I need more practice talking with people in real time (about intellectual topics). (I've gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.

www.weidai.com

Comments

Some potential risks stemming from trying to increase philosophical competence of humans and AIs, or doing metaphilosophy research. (1 and 2 seem almost too obvious to write down, but I think I should probably write them down anyway.)

  1. Philosophical competence is dual use, like much else in AI safety. It may for example allow a misaligned AI to make better decisions (by developing a better decision theory), and thereby take more power in this universe or cause greater harm in the multiverse.
  2. Some researchers/proponents may be overconfident, and cause flawed metaphilosophical solutions to be deployed or spread, which in turn derail our civilization's overall philosophical progress.
  3. Increased philosophical competence may cause many humans and AIs to realize that various socially useful beliefs have weak philosophical justifications (e.g., that all humans are created equal, have equal moral worth, or have natural inalienable rights; or moral codes based on theism). In many cases the only justifiable philosophical positions in the short to medium run may be states of high uncertainty and confusion, and it seems unpredictable what effects would come from many people adopting such positions.
  4. Maybe the nature of philosophy is very different from my current guesses, such that greater philosophical competence or orientation is harmful even in aligned humans/AIs and even in the long run. For example maybe philosophical reflection, even if done right, causes a kind of value drift, and by the time you've clearly figured that out, it's too late because you've become a different person with different values.

Is it something like "during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or humans might be worse than on-distribution)"?

Yes, but my concern also includes this happening during training of the debaters, when the simulated or actual humans can also go out of distribution. For example, the actual human is asked a type of question they have never considered before and either answers in a confused way or has to spend a lot of time on philosophical reasoning to try to answer, or one of the debaters effectively "jailbreaks" a human via some sort of out-of-distribution input.

The solution in the sketch is to keep the question distribution during deployment similar + doing online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won't work?

This intuitively seems hard to me, but since Geoffrey mentioned that you have a doc coming out related to this, I'm happy to read it to see if it changes my mind. But this still doesn't solve the whole problem, because as Geoffrey also wrote, "Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there's no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn't competitive."
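To make my intuition about the difficulty a bit more concrete, here is a minimal sketch (the names, embedder, and thresholds are hypothetical, not anything from your safety case) of what "keep the deployment question distribution similar" might look like operationally: embed incoming questions, compare them to the training distribution, and gate online training on the result.

```python
import numpy as np

def embed(question: str, dim: int = 64) -> np.ndarray:
    """Toy embedder: hashed character-trigram counts (stand-in for a real model)."""
    v = np.zeros(dim)
    for i in range(len(question) - 2):
        v[hash(question[i:i + 3]) % dim] += 1.0
    return v

class OODMonitor:
    """Toy distribution-shift gate for judge questions during deployment."""

    def __init__(self, train_questions: list[str], threshold: float = 3.0):
        train = np.stack([embed(q) for q in train_questions])
        # Summarize the training distribution by per-dimension mean and spread.
        self.mean = train.mean(axis=0)
        self.std = train.std(axis=0) + 1e-8
        self.threshold = threshold  # flag questions this many "sigmas" away on average

    def is_out_of_distribution(self, question: str) -> bool:
        z = np.abs((embed(question) - self.mean) / self.std)
        return float(z.mean()) > self.threshold

    def route(self, question: str) -> str:
        # OOD questions get escalated instead of silently feeding online training.
        if self.is_out_of_distribution(question):
            return "escalate_to_real_humans"   # slow path: actual human deliberation
        return "use_simulated_judges"          # fast path: online training continues
```

Even granting something like this, the hard parts are exactly what the sketch assumes away: an embedding that tracks philosophical novelty rather than surface novelty, a defensible threshold, and a story for what the slow "escalate to real humans" path does when the questions are the kind that humans need philosophy (and a lot of time) to answer.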

For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch.

I think this could be part of a viable approach, for example if we figure out in detail how humans invented philosophy and use that knowledge to design/train an AI that we can have high justified confidence will be philosophically competent. I'm worried that in actual development of brain-like AGI, we will skip this part (because it's too hard, or nobody pushes for it), and end up just assuming that the AGI will invent or learn philosophy because it's brain-like. (And then it ends up not doing that because we didn't give it some "secret sauce" that humans have.) And this does look fairly hard to me, because we don't yet understand the nature of philosophy or what constitutes correct philosophical reasoning or philosophical competence, so how do we study these things in either humans or AIs?

But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress].

I find it pretty plausible that wireheading is what we'll end up wanting, after sufficient reflection. (This could be literal wireheading, or something more complex like VR with artificial friends, connections, etc.) This seems to me to be the default, unless we have reasons to want to avoid it. Currently my reasons are 1. value uncertainty (maybe we'll eventually find good intrinsic reasons not to want to wirehead) and 2. opportunity costs (if I wirehead now, it'll cost me in terms of both quality and quantity compared to waiting until I have more resources and security). But it seems foreseeable that both of these reasons will go away at some point, and we may not have found other good reasons to avoid wireheading by then.

(Of course if the right values for us are to max out our wireheading, then we shouldn't hand the universe off to AGIs that will want to max out their own wireheading. Also, this is just the simplest example of how brain-like AGIs' values could conflict with ours.)

I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective.

You're prioritizing instilling the AGI with the right social instincts, and arguing that's very important for getting the AGI to eventually converge on values that we'd consider good. But under different meta-ethical views, you should perhaps be prioritizing something else. For example, under moral realism, it doesn't matter much what social instincts the AGI starts out with; what matters more is that it has the capacity to eventually find, and be motivated by, objective moral truths.

From my perspective, in order to not bet the universe on a particular meta-ethical view, it seems that we need to either hold off on building AGI until we definitively solve metaethics (e.g., it's no longer a contentious subject in academic philosophy), or have an approach/plan that will work out well regardless of which meta-ethical view turns out to be correct.

By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”.

Thanks, I appreciate this, but of course I still feel compelled to speak up when I see areas where you seem overly optimistic and/or to be missing some potential risks/failure modes.

Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there’s no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn’t competitive.

This is my concern, and I'm glad it's at least on your radar. How do you / your team think about competitiveness in general? (I did a simple search and the word doesn't appear in this post or the previous one.) How much competitiveness are you aiming for? Will there be a "competitiveness case" later in this sequence, or later in the project? Etc.?

But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.

Because of the "slowness of philosophy" issue I talked about in my post, we have no way of quickly reaching high confidence that any such formalization is correct, and we have a number of negative examples in which a proposed formal solution to some philosophical problem initially looked good but turned out to be flawed upon deeper examination (see decision theory and Solomonoff induction). AFAIK we don't really have any positive examples of such formalizations that have stood the test of time, so I feel like this is basically not a viable approach.
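For concreteness, the Solomonoff prior is the kind of formalization I have in mind. In its standard form,

$$M(x) \;=\; \sum_{p\,:\,U(p)\,=\,x*} 2^{-\ell(p)},$$

where $U$ is a universal prefix machine, the sum ranges over programs $p$ whose output begins with $x$, and $\ell(p)$ is the length of $p$ in bits. It initially looks like a complete formal account of induction, yet it is uncomputable, depends on the choice of $U$, and its handling of first-person/anthropic questions is contested; none of this is obvious from the definition itself.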

Wei Dai

I'm curious if your team has any thoughts on my post Some Thoughts on Metaphilosophy, which was in large part inspired by the Debate paper, and also seems relevant to "Good human input" here.

Specifically, I'm worried about this kind of system driving the simulated humans out of distribution, either gradually or suddenly, accidentally or intentionally. And distribution shift could cause problems either with the simulation (presumably similar to or based on LLMs instead of low-level neuron-by-neuron simulation), or with the human(s) themselves. In my post, I talked about how philosophy seems to be a general way for humans to handle OOD inputs, but tends to be very slow and may be hard for ML to learn (or needs extra care to implement correctly). I wonder if you agree with this line of thought, or have some other ideas/plans to deal with this problem.

Aside from the narrow focus on "good human input" in this particular system, I'm worried about social/technological change being accelerated by AI faster than humans can handle it (due to similar OOD / slowness of philosophy concerns), and wonder if you have any thoughts on this more general issue.

We humans also align with each other via organic alignment.

This kind of "organic alignment" can fail in catastrophic ways, e.g., produce someone like Stalin or Mao. (They're typically explained by "power corrupts" but can also be seen as instances of "deceptive alignment".)

Another potential failure mode is that "organically aligned" AIs start viewing humans as parasites rather than as important/useful parts of their "greater whole". This also has plenty of parallels in biological systems and human societies.

Both of these seem like very obvious risks/objections, but I can't seem to find any material by Softmax that addresses or even mentions them.  @emmett

Wei Dai

If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes.

  1. A counterexample to this: if humans and AIs both tend to conclude, after a lot of reflection, that they should be axiologically selfish but decision-theoretically cooperative (with other strong agents), then if we hand off power to AIs, they'll cooperate with each other (and with any other powerful agents in the universe or multiverse) to serve their own collective values, but we humans will be screwed. (See the toy illustration after this list.)
  2. Another problem is that we're relatively confident that at least some humans can reason "successfully", in the sense of making philosophical progress, but we don't know the same about AI. There seem to be reasons to think that this kind of reasoning might be especially hard for AI to learn, and that AI could easily learn something undesirable instead, like optimizing for how persuasive its philosophical arguments are to (certain) humans.
  3. Finally, I find your arguments against moral realism somewhat convincing, but I'm still pretty uncertain: I also find the arguments I gave in Six Plausible Meta-Ethical Alternatives for the realism side of the spectrum somewhat convincing, and I don't want to bet the universe on or against any of these positions.
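Here is the toy illustration promised in point 1: a deliberately crude sketch (the policy, numbers, and names are made up for illustration, not a prediction about how real AIs would decide) of how "axiologically selfish but decision-theoretically cooperative" agents could end up cooperating with each other while leaving humans out.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    power: float  # crude proxy for ability to predict/verify another agent's decisions

def cooperates_with(a: Agent, b: Agent, verification_bar: float = 0.8) -> bool:
    """Toy policy: selfish values, but cooperate whenever defection wouldn't pay.

    `a` cooperates with `b` only if `b` is capable enough to predict/verify `a`'s
    reasoning and reciprocate, so that defecting against `b` would be punished."""
    return b.power >= verification_bar

agents = [
    Agent("AI-1", power=0.95),
    Agent("AI-2", power=0.90),
    Agent("human coalition", power=0.20),
]

for a in agents:
    for b in agents:
        if a is not b:
            verb = "cooperates with" if cooperates_with(a, b) else "defects against"
            print(f"{a.name} {verb} {b.name}")
# The AIs cooperate with each other but defect against the human coalition,
# even though every agent follows the same "cooperative" decision procedure.
```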
Wei Dai

At the outermost feedback loop, capabilities can ultimately be grounded via relatively easy objective measures such as revenue from AI or, later, global chip and electricity production, but alignment can only be evaluated via potentially faulty human judgement. Also, as mentioned in the post, the capabilities trajectory is much harder to permanently derail, because unlike with alignment, one can always recover from failure and try again. I think this means there's an irreducible logical risk (i.e., the possibility that it is true as a matter of logic/math that capabilities research is just inherently easier to automate than alignment research) that no amount of "work hard to automate alignment research" can reduce below. Given the lack of established consensus ways of estimating and dealing with such risk, it's inevitable that the people with the lowest estimates of (and least concern about) this risk (and other AI risks) will push capabilities forward as fast as they can, and seemingly the only way to solve this at the societal level is to push for norms/laws against doing that, i.e., slow down capabilities research via (politically legitimate) force and/or social pressure. I suspect the author might already agree with all this (the existence of this logical risk, the social dynamics, the conclusion that norms/laws are needed to reduce AI risk beyond some threshold), but I think it should be emphasized more in a post like this.
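One toy way to state the "irreducible" part (my own formalization, not something from the post): let $L$ be the event that capabilities research really is inherently easier to automate than alignment research. Then for any strategy $s$ within the "work hard to automate alignment research" family,

$$P(\text{bad outcome} \mid s) \;\ge\; P(L)\,P(\text{bad outcome} \mid L, s),$$

so if no such $s$ helps much conditional on $L$, the overall risk has a floor of roughly $P(L)$ no matter how hard we work within that family; pushing below the floor requires changing something outside it, e.g., norms/laws that slow capabilities.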

Wei Dai

As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent.

Why does this happen in the first place, instead of people just wanting to talk about the same things all the time in order to max out social rewards? Where does interest in trains and dinosaurs even come from? Such interests seem to be purely or mostly social, given their lack of practical utility, but then why the divergence in interests? (Understood that you don't have a complete understanding yet, so I'm just flagging this as a potential puzzle, not demanding an immediate answer.)

There is in fact a precedent for that—indeed, it’s the status quo! We don’t know what the next generation of humans will choose to do, but we’re nevertheless generally happy to entrust the future to them.

I'm only "happy" to "entrust the future to the next generation of humans" if I know that they can't (i.e., don't have the technology to) do something irreversible, in the sense of foreclosing a large space of potential positive outcomes, like locking in their values, or damaging the biosphere beyond repair. In other words, up to now, any mistakes that a past generation of humans made could be fixed by a subsequent generation, and this is crucial for why we're still in an arguably ok position. However AI will make it false very quickly by advancing technology.

So I really want the AI transition to be an opportunity for improving the basic dynamic of "the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions", for example by improving our civilizational philosophical competence (which may allow distributional shifts to be handled in a more principled way), rather than just saying that it's always been this way, so it's fine to continue.

I'm going to read the rest of that post and your other posts to understand your overall position better, but at least in this section, you come off as being a bit too optimistic or nonchalant from my perspective...

Wei Dai

My intuition says reward hacking seems harder to solve than this (even in the EEA), but I'm pretty unsure. One example: under your theory, what prevents reward hacking via forming a group and then just directly maxing out on mutually liking/admiring each other?

When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?
