All of Wei Dai's Comments + Replies

But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely."

This is a bit stronger than how I would phrase it, but basically yes.

On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA

I tend to be pretty skeptical of new ideas. (This backfired spectacularly once, when I didn't pay much attention to Satoshi when he contacted me about Bitcoin, …

4 · Vanessa Kosoy · 8d
  It seems clear to me that the prior is subjective. Like with Solomonoff induction, I expect there to exist something like the right asymptotic for the prior (i.e. an equivalence class of priors under the equivalence relation where μ and ν are equivalent when there exists some C>0 s.t. μ≤Cν and ν≤Cμ), but not a unique correct prior, just like there is no unique correct UTM. In fact, my arguments about IBH already rely on the asymptotic of the prior to some extent. One way to view the non-uniqueness of the prior is through an evolutionary perspective: agents with prior X are likely to evolve/flourish in universes sampled from prior X, while agents with prior Y are likely to evolve/flourish in universes sampled from prior Y. No prior is superior across all universes: there's no free lunch. For the purpose of AI alignment, the solution is some combination of (i) learn the user's prior and (ii) choose some intuitively appealing measure of description complexity, e.g. length of lambda-term ((i) is insufficient in itself because you need some ur-prior to learn the user's prior). The claim is, different reasonable choices in (ii) will lead to similar results. Given all that, I'm not sure what's still unsatisfying. Is there any reason to believe something is missing in this picture?
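Spelled out in LaTeX (this just restates the inline relation above, using the same symbols):

```latex
% Two priors are equivalent when each dominates the other up to a
% multiplicative constant -- the "same asymptotic" equivalence class:
\mu \sim \nu
\quad \iff \quad
\exists\, C > 0 \;\text{ s.t. }\; \mu \le C\nu \;\text{ and }\; \nu \le C\mu
```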

Do you think part of it might be that even people with graduate philosophy educations are too prone to being wedded to their own ideas, or don't like to poke holes at them as much as they should? Because part of what contributes to my wanting to go more meta is being dissatisfied with my own object-level solutions and finding more and more open problems that I don't know how to solve. I haven't read much academic philosophy literature, but did read some anthropic reasoning and decision theory literature earlier, and the impression I got is that most of the authors weren't trying that hard to poke holes in their own ideas.

5 · Daniel Kokotajlo · 9d
Yep that's probably part of it. Standard human epistemic vices. Also maybe publish-or-perish has something to do with it? idk. I definitely noticed incentives to double-down / be dogmatic in order to seem impressive on the job market. Oh also, iirc one professor had a cynical theory that if you find an interesting flaw in your own theory/argument, you shouldn't mention it in your paper, because then the reviewers will independently notice the flaw and think 'aha, this paper has an interesting flaw, if it gets published I could easily and quickly write my own paper pointing out the flaw' and then they'll be more inclined to recommend publication. It's also a great way to get citations. Note also that I said "a few hundred a year" not "ten thousand a year" which is roughly how many people become philosophy grad students. I was more selective because in my experience most philosophy grad students don't have as much... epistemic ambition? as you or me. Sorta like the Hamming Question thing -- some, but definitely a minority, of grad students can say "I am working on it actually, here's my current plan..." to the question "what's the most important problem in your field and why aren't you working on it?" (to be clear epistemic ambition is a spectrum not a binary)

I don't understand your ideas in detail (am interested but don't have the time/ability/inclination to dig into the mathematical details), but from the informal writeups/reviews/critiques I've seen of your overall approach, as well as my sense from reading this comment of how far away you are from a full solution to the problems I listed in the OP, I'm still comfortable sticking with "most are wide open". :)

On the object level, maybe we can just focus on Problem 4 for now. What do you think actually happens in a 2IBH-1CDT game? Presumably CDT still plays D, …

...I'm still comfortable sticking with "most are wide open".

 

Allow me to rephrase. The problems are open, that's fair enough. But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely." On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA, which has implications on those problems among other things, and so far this progress seems to only r…

Even items 1, 3, 4, and 6 are covered by your research agenda? If so, can you quickly sketch what you expect the solutions to look like?

5 · Vanessa Kosoy · 9d
I'll start with Problem 4 because that's the one where I feel closest to the solution. In your 3-player Prisoner's Dilemma, infra-Bayesian hagglers[1] (IBH agents) don't necessarily play CCC. Depending on their priors, they might converge to CCC or CCD or another Pareto-efficient outcome[2]. Naturally, if the first two agents have identical priors then e.g. DCC is impossible, but CCD still is. Whereas, if all 3 have the same prior they will necessarily converge to CCC. Moreover, there is no "best choice of prior": different choices do better in different situations. You might think this non-uniqueness is evidence of some deficiency of the theory. However, I argue that it's unavoidable. For example, it's obvious that any sane decision theory will play "swerve" in a chicken game against a rock that says "straight". If there were an ideal decision theory X that led to a unique outcome in every game, the outcome of X playing chicken against X would be symmetric (e.g. flipping a shared coin to decide who goes straight and who swerves, which is indeed what happens for symmetric IBH[3]). This leads to the paradox that the rock is better than X in this case. Moreover, it should really be no surprise that different priors are incomparable, since this is the case even when considering a single learning agent: the higher a particular environment is in your prior, the better you will do on it. Problems 1, 3, and 6 are all related to infra-Bayesian physicalism (IBP). For Problem 1, notice that IBP agents are already allowed some sort of "indexical" values. Indeed, in section 3 of the original article we describe agents that only care about their own observations. However, these agents are not truly purely indexical, because when multiple copies co-exist, they all value each other symmetrically. In itself, I don't think this implies the model doesn't describe human values. Indeed, it is always sensible to precommit to care about your copies, so to the extent you don't do it, it's a fa…
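To make the chicken-vs-rock point concrete, here is a minimal sketch. The payoff numbers are illustrative (only their ordering matters); the point is just that against an opponent hard-committed to "straight", any payoff-maximizing agent swerves:

```python
# Chicken game: payoffs[(row_action, col_action)] = (row_payoff, col_payoff).
# Actions: "S" = straight, "W" = swerve. Illustrative numbers: a crash is
# worst, yielding alone is mildly bad, winning the standoff is best.
payoffs = {
    ("S", "S"): (-10, -10),  # both go straight: crash
    ("S", "W"): (1, -1),     # row wins the standoff
    ("W", "S"): (-1, 1),     # row yields
    ("W", "W"): (0, 0),      # both yield
}

def best_response(opponent_action):
    """Best reply for the row player against a fixed opponent action."""
    return max(["S", "W"], key=lambda a: payoffs[(a, opponent_action)][0])

# A rock that "says straight" is just a fixed opponent action:
print(best_response("S"))  # -> "W": the sane reply to the rock is to swerve
```

The paradox in the comment above is then visible: the rock earns the standoff-winner's payoff against any sane decision theory, while a symmetric ideal theory playing itself can at best split the coin-flip outcome.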

@jessicata @Connor Leahy @Domenic @Daniel Kokotajlo @romeostevensit @Vanessa Kosoy @cousin_it @ShardPhoenix @Mitchell_Porter @Lukas_Gloor (and others, apparently I can only notify 10 people by mentioning them in a comment)

Sorry if I'm late in responding to your comments. This post has gotten more attention and replies than I expected, in many different directions, and it will probably take a while for me to process and reply to them all. (In the meantime, I'd love to see more people discuss each other's ideas here.)

Do you have any examples that could illustrate your theory?

It doesn't seem to fit my own experience. I became interested in Bayesian probability, universal prior, Tegmark multiverse, and anthropic reasoning during college, and started thinking about decision theory and ideas that ultimately led to UDT, but what heuristics could I have been applying, learned from what "domains with feedback"?

Maybe I used a heuristic like "computer science is cool, let's try to apply it to philosophical problems" but if the heuristics are this coarse-grained, it doesn't seem like the idea can explain how detailed philosophical reasoning happens, or be used to ensure AI philosophical competence?

1 · Vladimir Slepnev · 21d
Maybe one example is the idea of Dutch book. It comes originally from real world situations (sport betting and so on) and then we apply it to rationality in the abstract. Or another example, much older, is how Socrates used analogy. It was one of his favorite tools I think. When talking about some confusing thing, he'd draw an analogy with something closer to experience. For example, "Is the nature of virtue different for men and for women?" - "Well, the nature of strength isn't that much different between men and women, likewise the nature of health, so maybe virtue works the same way." Obviously this way of reasoning can easily go wrong, but I think it's also pretty indicative of how people do philosophy.
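To make the Dutch book example concrete, here is a minimal sketch with made-up credences: an agent whose probabilities for A and not-A sum to more than 1, and who accepts any bet priced at its expected value, can be sold a pair of bets that loses money no matter what happens.

```python
# Hypothetical incoherent credences: P(A) + P(not A) > 1.
p_A, p_not_A = 0.6, 0.6

# Sell the agent two bets, each paying $1 if its event occurs. By the
# agent's own lights, a fair price for a $1 bet on an event is its
# probability, so the agent willingly pays:
price_paid = p_A + p_not_A   # $1.20 in total

# Exactly one of A / not-A occurs, so the bets pay out exactly $1 combined.
payout = 1.0

guaranteed_loss = price_paid - payout
print(guaranteed_loss)  # ~$0.20 lost regardless of the outcome
```

The real-world betting origin is what makes this a "heuristic learned from a domain with feedback": bookmakers discovered the arbitrage long before philosophers used it as a coherence argument.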

I expect at this moment in time me building a company is going to help me deconfuse a lot of things about philosophy more than me thinking about it really hard in isolation would

Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?

It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it and there is no “right way”

…

Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?

The fact you ask this question is interesting to me, because in my view the opposite question is the more natural one to ask: What kind of questions can you make progress on without constant grounding and dialogue with reality? This is the default of how we humans build knowledge and solve hard new questions, the places where we do best and get the least drawn astray is …

Philosophy is a social/intellectual process taking place in the world. If you understand the world, you understand how philosophy proceeds.

What if I'm mainly interested in how philosophical reasoning ideally ought to work? (Similar to how decision theory studies how decision making normatively should work, not how it actually works in people.) Of course if we have little idea how real-world philosophical reasoning works, understanding that first would probably help a lot, but that's not the ultimate goal, at least not for me, for both intellectual and A…

3 · Jessica Taylor · 22d
My view would suggest: develop a philosophical view of normativity and apply that view to the practice of philosophy itself. For example, if it is in general unethical to lie, then it is also unethical to lie about philosophy. Philosophical practice being normative would lead to some outcomes being favored over others. (It seems like a problem if you need philosophy to have a theory of normativity and a theory of normativity to do meta-philosophy and meta-philosophy to do better philosophy, but earlier versions of each theory can be used to make later versions of them, in a bootstrapping process like with compilers) I mean normativity to include ethics, aesthetics, teleology, etc. Developing a theory of teleology in general would allow applying that theory to philosophy (taken as a system/practice/etc). It would be strange to have a distinct normative theory for philosophical practice than for other practices, since philosophical practice is a subset of practice in general; philosophical normativity is a specified variant of general normativity, analogous to normativity about other areas of study. The normative theory is mostly derived from cases other than cases of normative philosophizing, since most activity that normativity could apply to is not philosophizing. That seems like describing my views about things in general, which would take a long time. The original comment was meant to indicate what is non-foundationalist about this view. Imagine a subjective credit system. A bunch of people think other people are helpful/unhelpful to them. Maybe they help support helpful people and so people who are more helpful to helpful people (etc) succeed more. It's subjective, there's no foundation where there's some terminal goal and other things are instrumental to that. An intersubjective credit system would be the outcome of something like Pareto optimal bargaining between the people, which would lead to a unified utility function, which would imply some terminal go

If we keep stumbling into LLM type things which are competent at a surprisingly wide range of tasks, do you expect that they’ll be worse at philosophy than at other tasks?

I'm not sure but I do think it's very risky to depend on LLMs to be good at philosophy by default. Some of my thoughts on this:

  • Humans do a lot of bad philosophy and often can't recognize good philosophy. (See popularity of two-boxing among professional philosophers.) Even if an LLM has learned how to do good philosophy, how will users or AI developers know how to prompt it to elicit t…

Here's another bullet point to add to the list:
 

  • It is generally understood now that ethics is subjective, in the following technical sense: 'what final goals you have' is a ~free parameter in powerful-mind-space, such that if you make a powerful mind without specifically having a mechanism for getting it to have only the goals you want, it'll probably end up with goals you don't want. What if ethics isn't the only such free parameter? Indeed, philosophers tell us that in the Bayesian framework your priors are subjective in this sense, and also that yo…
2 · Vladimir Slepnev · 23d
I don't say it's not risky. The question is more, what's the difference between doing philosophy and other intellectual tasks. Here's one way to look at it that just occurred to me. In domains with feedback, like science or just doing real world stuff in general, we learn some heuristics. Then we try to apply these heuristics to the stuff of our mind, and sometimes it works but more often it fails. And then doing good philosophy means having a good set of heuristics from outside of philosophy, and good instincts when to apply them or not. And some luck, in that some heuristics will happen to generalize to the stuff of our mind, but others won't. If this is a true picture, then running far ahead with philosophy is just inherently risky. The further you step away from heuristics that have been tested in reality, and their area of applicability, the bigger your error will be. Does this make sense?

If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.

Ok I'll quote 5.1.4-5 to make it easier for others to follow this discussion:

5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).

5.1.5. The incentives under …

"AI-powered memetic warfare makes all humans effectively insane" a catastrophe that I listed in an earlier comment, which seems one of the hardest to formally specify. It seems values-complete or metaphilosophy-complete to me, since without having specified human values or having solved metaphilosophy, how can we check whether an AI-generated argument is trying to convince us of something that is wrong according to actual human values, or wrong according to normative philosophical reasoning?

I don't see anything in this post or the linked OAA post that addr…

2 · davidad (David A. Dalrymple) · 1mo
OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style). There is still a misuse version: someone could remove the provision in 5.1.5 that the model of Earth-like situations should be largely agnostic about human behavior, and instead build a detailed model of how human nervous systems respond to language. (Then, even though the superintelligence in the box would still be making only descriptive arguments about a policy, the policy that comes out would likely emit normative arguments at deployment time.) Superintelligence misuse is covered under problem 11. If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.

Is this the final update from Ought about their factored cognition experiments? (I can't seem to find anything more recent.) The reason I ask is that the experiments they reported on here do not seem very conclusive, and they talked about doing further experiments but then did not seem to give any more updates. Does anyone know the story of what happened, and what that implies about the viability of factored-cognition style alignment schemes?

goodness of HCH

What is the latest thinking/discussion about this? I tried to search LW/AF but haven't found a lot of discussions, especially positive arguments for HCH being good. Do you have any links or docs you can share?

How do you think about the general unreliability of human reasoning (for example, the majority of professional decision theorists apparently being two-boxers and favoring CDT, and general overconfidence of almost everyone on all kinds of topics, including morality and meta-ethics and other topics relevant for AI alignment) in relatio…

But yeah also I think that AGIs will be by default way better than humans at this sort of stuff.

What's your reasons for thinking this? (Sorry if you already explained this and I missed your point, but it doesn't seem like you directly addressed my point that if AGIs learn from or defer to humans, they'll be roughly human-level at this stuff?)

When you say “the top tier of rational superintelligences exploits everyone else” I say that is analogous to “the most rational/clever/capable humans form an elite class which rules over and exploits the masses.”

…
2 · Daniel Kokotajlo · 2mo
I agree that if AGIs defer to humans they'll be roughly human-level, depending on which humans they are deferring to. If I condition on really nasty conflict happening as a result of how AGI goes on earth, a good chunk of my probability mass (and possibly the majority of it?) is this scenario. (Another big chunk, possibly bigger, is the "humans knowingly or unknowingly build naive consequentialists and let rip" scenario, which is scarier because it could be even worse than the average human, as far as I know). Like I said, I'm worried. If AGIs learn from humans though, well, it depends on how they learn, but in principle they could be superhuman. Re: analogy to current exploitation: Yes there are a bunch of differences which I am keen to study, such as that one. I'm more excited about research agendas that involve thinking through analogies like this than I am about what people interested in this topic seem to do by default, which is think about game theory and Nash bargaining and stuff like that. Though I do agree that both are useful and complementary.

I think that agents worthy of being called “rational” will probably handle all this stuff more gracefully/competently than humans do

Humans are kind of terrible at this right? Many give in even to threats (bluffs) conjured up by dumb memeplexes and backed up by nothing (i.e., heaven/hell), popular films are full of heroes giving in to threats, an apparent majority of philosophers have 2-boxing intuitions (hence the popularity of CDT, which IIUC was invented specifically because some philosophers were unhappy with EDT choosing to 1-box), governments negotiate w…

2 · Daniel Kokotajlo · 2mo
Yes. Humans are pretty bad at this stuff, yet still, society exists and mostly functions. The risk is unacceptably high, which is why I'm prioritizing it, but still, by far the most likely outcome of AGIs taking over the world--if they are as competent at this stuff as humans are--is that they talk it over, squabble a bit, maybe get into a fight here and there, create & enforce some norms, and eventually create a stable government/society. But yeah also I think that AGIs will be by default way better than humans at this sort of stuff. I am worried about the "out of distribution" problem though, I expect humans to perform worse in the future than they perform in the present for this reason. Yes, some AGIs will be better than others at this, and presumably those that are worse will tend to lose out in various ways on average, similar to what happens in human society. Consider that in current human society, a majority of humans would probably pay ransoms to free loved ones being kidnapped. Yet kidnapping is not a major issue; it's not like 10% of the population is getting kidnapped and paying ransoms every year. Instead, the governments of the world squash this sort of thing (well, except for failed states etc.) and do their own much more benign version, where you go to jail if you don't pay taxes & follow the laws. When you say "the top tier of rational superintelligences exploits everyone else" I say that is analogous to "the most rational/clever/capable humans form an elite class which rules over and exploits the masses." So I'm like yeah, kinda sorta I expect that to happen, but it's typically not that bad? Also it would be much less bad if the average level of rationality/capability/etc. was higher? I'm not super confident in any of this to be clear.

I think I'm less sure than @Eliezer Yudkowsky that there is a good solution to the problem of commitment races, even in theory, or that if there is a solution, it has the shape that he thinks it has. I've been thinking about this problem off and on since 2009, and haven't made much progress. Others have worked on this too (as you noted in the OP), and all seem to have gotten stuck at roughly the same place that I got stuck. Eliezer described what he would do in a particular game, but I don't know how to generalize his reasoning (which you call "nonconseque…

4 · Eliezer Yudkowsky · 2mo
TBC, I definitely agree that there's some basic structural issue here which I don't know how to resolve.  I was trying to describe properties I thought the solution needed to have, which ruled out some structural proposals I saw as naive; not saying that I had a good first-principles way to arrive at that solution.
3 · Daniel Kokotajlo · 2mo
Great comment. To reply I'll say a bit more about how I think of this stuff for the past few years: I agree that the commitment races problem poses a fundamental challenge to decision theory, in the following sense: There may not exist a simple algorithm in the same family of algorithms as EDT, CDT, UDT 1.0, 1.1, and even 2.0, that does what we'd consider a good job in a realistic situation characterized by many diverse agents interacting over some lengthy period with the ability to learn about each other and make self-modifications (including commitments). Indeed it may be that the top 10% of humans by performance in environments like this, or even the top 90%, outperform the best possible simple-algorithm-in-that-family. Thus any algorithm for making decisions that would intuitively be recognized as a decision theory, would be worse in realistic environments than the messy neural net wetware of many existing humans, and probably far worse than the best superintelligences. (To be clear, I still hold out hope that this is false and such a simple in-family algorithm does exist.) I therefore think we should widen our net and start considering algorithms that don't fit in the traditional decision theory family. For example, think of a human role model (someone you consider wise, smart, virtuous, good at philosophy, etc.) and then make them into an imaginary champion by eliminating what few faults they still have, and increasing their virtues to the extent possible, and then imagine them in a pleasant and secure simulated environment with control over their own environment and access to arbitrary tools etc. and maybe also the ability to make copies of themselves HCH style. You have now described an algorithm that can be compared to the performance of EDT, UDT 2.0, etc. and arguably will be superior to all of them (because this wise human can use their tools to approximate or even directly compute such things to the extent that they deem it useful to do so). We ca…

I was informed by an OpenAI insider that the 4 year goal is actually “build a roughly human-level automated alignment researcher”.

I guess train AI to do various kinds of reasoning required for automated alignment research, using RRM, starting with the easier kinds and working your way up to philosophical reasoning?

Thinking about this a bit more, a better plan would be to train AI to do all kinds of reasoning required for automated alignment research in parallel, so that you can monitor what kinds of reasoning are harder for the AI to learn than the others, and have more of an early warning if it looked like that would cause the project to fail. Assuming your plan looks more like t…

Does anyone have guesses or direct knowledge of:

  1. What are OpenAI's immediate plans? For example what are the current/next alignment-focused ML projects they have in their pipeline?
  2. What kind of results are they hoping for at the end of 4 years? Is it to actually "build a roughly human-level automated alignment researcher" or is that a longer-term goal, and the 4 year goal is just to achieve some level of understanding of how to build and align such an AI?
3 · Wei Dai · 2mo
I was informed by an OpenAI insider that the 4 year goal is actually “build a roughly human-level automated alignment researcher”.

Insofar that philosophical progress is required, my optimism for AI helping on this is lower than for (more) technical research since in philosophy evaluation is often much harder and I’m not sure that it’s always easier than generation. You can much more easily picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than it is to make technical claims about math, algorithms, or empirical data that are persuasive and hard to falsify.

Given this, what is your current plan around AI and philosophy? I gue…

2 · Wei Dai · 3mo
Thinking about this a bit more, a better plan would be to train AI to do all kinds of reasoning required for automated alignment research in parallel, so that you can monitor what kinds of reasoning are harder for the AI to learn than the others, and have more of an early warning if it looked like that would cause the project to fail. Assuming your plan looks more like this, it would make me feel a little better about B although I'd still be concerned about C. More generally, I tend to think that getting automated philosophy right is probably one of the hardest parts of automating alignment research (as well as the overall project of making the AI transition go well), so it makes me worried when alignment researchers don't talk specifically about philosophy when explaining why they're optimistic, and makes me want to ask/remind them about it. Hopefully that seems understandable instead of annoying.

Thanks for engaging with my questions here. I'll probably have more questions later as I digest the answers and (re)read some of your blog posts. In the meantime, do you know what Paul meant by "it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents" in the other subthread?

1 · Jan Leike · 2mo
I'm not entirely sure but here is my understanding: I think Paul pictures relying heavily on process-based approaches where you trust the output a lot more because you closely held the system's hand through the entire process of producing the output. I expect this will sacrifice some competitiveness, and as long as it's not too much it shouldn't be that much of a problem for automated alignment research (as opposed to having to compete in a market). However, it might require a lot more human supervision time. Personally I am more bullish on understanding how we can get agents to help with evaluation of other agents' outputs such that you can get them to tell you about all of the problems they know about. The "offense-defense" balance to understand here is whether a smart agent could sneak a deceptive malicious artifact (e.g. some code) past a motivated human supervisor empowered with a similarly smart AI system trained to help them.

Thanks for this helpful explanation.

it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents

Can you point me to the original claims? While trying to find it myself, I came across https://aligned.substack.com/p/alignment-optimism which seems to be the most up to date explanation of why Jan thinks his approach will work (and which also contains his views on the obfuscated arguments problem and how RRM relates to IDA, so should be a good resource for me to read more carefu…

I don't think I disagree with many of the claims in Jan's post, generally I think his high level points are correct.

He lists a lot of things as "reasons for optimism" that I wouldn't consider positive updates (e.g. stuff working that I would strongly expect to work) and doesn't list the analogous reasons for pessimism (e.g. stuff that hasn't worked well yet).  Similarly I'm not sure conviction in language models is a good thing but it may depend on your priors.

One potential substantive disagreement with Jan's position is that I'm somewhat more scared …

I've been trying to understand (catch up with) the current alignment research landscape, and this seems like a good opportunity to ask some questions.

  1. This post links to https://openai.com/research/critiques which seems to be a continuation of Geoffrey Irving et al.'s Debate and Jan Leike et al.'s recursive reward modeling, both of which are in turn closely related to Paul Christiano's IDA, so that seems to be the main alignment approach that OpenAI will explore in the near future. Is this basically correct?
  2. From a distance (judging from posts on the Alignme…
  1. Yes, we are currently planning to continue pursuing these directions for scalable oversight. My current best guess is that scalable oversight will do a lot of the heavy lifting for aligning roughly human-level alignment research models (by creating very precise training signals), but not all of it. Easy-to-hard generalization, (automated) interpretability, and adversarial training+testing will also be core pieces, but I expect we'll add more over time.

  2. I don't really understand why many people updated so heavily on the obfuscated arguments problem; I don't t…

1. I think OpenAI is also exploring work on interpretability and on easy-to-hard generalization. I also think that the way Jan is trying to get safety for RRM is fairly different from the argument for correctness of IDA (e.g. it doesn't depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents), even though they both involve decomposing tasks and iteratively training smarter models.

2. I think it's unlikely debate or IDA will scale up indefinitely without major conceptual progress (which... (read more)

I'd like to read this post but can't find it in your post history. Any chance it might be sitting in your drafts folder? Also, do you know if obfuscated argument (or the equivalent problem for IDA) is the main reason that research interest in IDA has seemingly declined a lot from 3 years ago?

Then the AI Alignment Awards contest came along, and an excellent entry by Elliot Thornley proposed that incomplete preferences could be used as a loophole to make the shutdown problem tractable.

Where is this contest entry? All my usual search methods are failing me...

1Xuan (Tan Zhi Xuan)3mo
Not sure if this is the same as the awards contest entry, but EJT also made this earlier post ("There are no coherence theorems") arguing that certain Dutch Book / money pump arguments against incompleteness fail!
3johnswentworth3mo
I don't think it's been posted publicly yet. Elliot said I was welcome to cite it publicly, but didn't explicitly say whether I should link it. @EJT ?

One way that things could go wrong, not addressed by this playbook: AI may differentially accelerate intellectual progress in a wrong direction, or in other words create opportunities for humanity to make serious mistakes (by accelerating technological progress) faster than wisdom to make right choices (philosophical progress). Specific to the issue of misalignment, suppose we get aligned human-level-ish AI, but it is significantly better at speeding up AI capabilities research than the kinds of intellectual progress needed to continue to minimize misalign... (read more)

1HoldenKarnofsky3mo
I agree that this is a major concern. I touched on some related issues in this piece. This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere). I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.

I guess part of the problem is that the people who are currently most receptive to my message are already deeply enmeshed in other x-risk work, and I don't know how to reach others for whom the message might be helpful (such as academic philosophers just starting to think about AI?). If on reflection you think it would be worth spending some of your time on this, one particularly useful thing might be to do some sort of outreach/field-building, like writing a post or paper describing the problem, presenting it at conferences, and otherwise attracting more ... (read more)

2Daniel Kokotajlo4mo
Somehow there are 4 copies of this post

Even at 10% p(doom), which I consider to be unreasonably low, it would probably be worth delaying a few years.

Someone with 10% p(doom) may worry that if they got into a coalition with others to delay AI, they can't control the delay precisely, and it could easily become more than a few years. Maybe it would be better not to take that risk, from their perspective.

And lots of people have p(doom)<10%. Scott Aaronson just gave 2% for example, and he's probably taken AI risk more seriously than most (currently working on AI safety at OpenAI), so prob... (read more)

2Daniel Kokotajlo4mo
I guess I just think it's pretty unreasonable to have p(doom) of 10% or less at this point, if you are familiar with the field, timelines, etc.  I totally agree the topic is important and neglected. I only said "arguably" deferrable, I have less than 50% credence that it is deferrable. As for why I'm not working on it myself, well, aaaah I'm busy idk what to do aaaaaaah! There's a lot going on that seems important. I think I've gotten wrapped up in more OAI-specific things since coming to OpenAI, and maybe that's bad & I should be stepping back and trying to go where I'm most needed even if that means leaving OpenAI. But yeah. I'm open to being convinced!

Why is 1 important? It seems like something we can defer discussion of until after (if ever) alignment is solved, no?

If aging were solved, or looked like it would be solved within the next few decades, it would make efforts to stop or slow down AI development less problematic, both practically and ethically. I think some AI accelerationists might be motivated directly by the prospect of dying/deterioration from old age, and/or view lack of interest/progress on that front as a sign of human inadequacy/stagnation (contributing to their antipathy towards humans).... (read more)

3Daniel Kokotajlo4mo
Something like 2% of people die every year right? So even if we ignore the value of future people and all sorts of other concerns and just focus on whether currently living people get to live or die, it would be worth delaying a year if we could thereby decrease p(doom) by 2 percentage points. My p(doom) is currently 70% so it is very easy to achieve that. Even at 10% p(doom), which I consider to be unreasonably low, it would probably be worth delaying a few years. Re: 2: Yeah I basically agree. I'm just not as confident as you are I guess. Like, maybe the answers to the problems you describe are fairly objective, fairly easy for smart AIs to see, and so all we need to do is make smart AIs that are honest and then proceed cautiously and ask them the right questions. I'm not confident in this skepticism and could imagine becoming much more convinced simply by thinking or hearing about the topic more.

Thanks, this clarifies a lot for me.

It seems like just 4 months ago you still endorsed your second power-seeking paper:

This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.

Why are you now "fantasizing" about retracting it?

I think a healthy alignment community would have rebuked me for that line of research, but s

... (read more)

To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power. Its content is correct, relevant, and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don't really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power.

The problem is that I don't trust people to wield even the non-instantly-doomed results.

For example, one EAG presentation cited my retargetability results as showing that most rewar... (read more)

Is it just me or is it nuts that a statement this obvious could have gone outside the overton window, and is now worth celebrating when it finally (re?)enters?

How is it possible to build a superintelligence at acceptable risk while this kind of thing can happen? What if there are other truths important to safely building a superintelligence, that nobody (or very few) acknowledges because they are outside the overton window?

Now that AI x-risk is finally in the overton window, what's your vote for the most important and obviously true statement that is still... (read more)

3Daniel Kokotajlo4mo
Why is 1 important? It seems like something we can defer discussion of until after (if ever) alignment is solved, no? 2 is arguably in that category also, though idk.

Note that this paper already used "Language Agents" to mean something else. See link below for other possible terms. I will keep using "Language Agents" in this comment/thread (unless the OP decide to change their terminology).

I added the tag Chain-of-Thought Alignment, since there's a bunch of related discussion on LW under that tag. I'm not very familiar with this discussion myself, and have some questions below that may or may not already have good answers.

How competent will Language Agents be at strategy/planning, compared to humans and other AI approa... (read more)

For example, making numerous copies of itself to work in parallel would again raise the dangers of independently varying goals.

The AI could design a system such that any copies made of itself are deleted after a short period of time (or after completing an assigned task) and no copies of copies are made. This should work well enough to ensure that the goals of all of the copies as a whole never vary far from its own goals, at least for the purpose of researching a more permanent alignment solution. It's not 100% risk-free of course, but seems safe enoug... (read more)
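A minimal sketch (all class and method names hypothetical) of the copy policy described above: each copy gets a task and a hard deadline, expired or finished copies are deleted, and the copy class itself has no spawn method, so copies of copies can never be made.

```python
import time
from dataclasses import dataclass

@dataclass
class AgentCopy:
    """A time-limited worker copy; it has no spawn method, so no copies of copies."""
    task: str
    deadline: float
    done: bool = False

class Agent:
    """The original agent, the only entity allowed to create copies."""
    def __init__(self):
        self.copies = []

    def spawn(self, task, ttl_seconds):
        # every copy is created with a hard deadline
        copy = AgentCopy(task=task, deadline=time.time() + ttl_seconds)
        self.copies.append(copy)
        return copy

    def reap(self):
        # delete copies that finished their task or outlived their deadline
        now = time.time()
        self.copies = [c for c in self.copies if not c.done and c.deadline > now]
```

Under this sketch, goal drift among copies is bounded to a single short-lived generation: a copy that finishes (or stalls past its deadline) is removed on the next reap, and no copy can propagate its own variant goals further.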

I don't think I understand, what's the reason to expect that the "acausal economy" will look like a bunch of acausal norms, as opposed to, say, each civilization first figuring out what its ultimate values are, how to encode them into a utility function, then merging with every other civilization's utility function? (Not saying that I know it will be the latter, just that I don't know how to tell at this point.)

Also, given that I think AI risk is very high for human civilization, and there being no reason to suspect that we're not a typical pre-AGI civiliz... (read more)

To your first question, I'm not sure which particular "the reason" would be most helpful to convey.  (To contrast: what's "the reason" that physically dispersed human societies have laws?  Answer: there's a confluence of reasons.).  However, I'll try to point out some things that might be helpful to attend to.

First, committing to a policy that merges your utility function with someone else's is quite a vulnerable maneuver, with a lot of boundary-setting aspects.  For instance, will you merge utility functions multiplicatively (as in Nas... (read more)

We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime.

What? A major reason we're in the current mess is that we don't know how to do this. For example, we don't seem to know how to build a corporation (or more broadly an economy) such that its most powerful leaders don't act like Hollywood villains (racing for AI to make a competitor 'dance'). Even our "AGI safety" organizations don't behave safely (e.g., racing for capabilities, handing them over to others, e.g.... (read more)

My personal view is that given all of this history and the fact that this forum is named the "AI Alignment Forum", we should not redefine "AI Alignment" to mean the same thing as "Intent Alignment". I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul's (probably unintentional) overloading of "AI alignment" with the new and narrower meaning (in Clarifying “AI Alignment”), and we should fix that error by collectively going back to the original definition, or in some circumstances where the risk of confusion... (read more)

6Paul Christiano7mo
I don't think this is the main or only source of confusion:

* MIRI folks also frequently used the narrower usage. I think the first time I saw "aligned" was in Aligning Superintelligence with Human Interests from 2014 (scraped by wayback on January 3 2015), which says "We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”"
* Virtually every problem people discussed as part of AI alignment was also part of intent alignment. The name was deliberately chosen to evoke "pointing" your AI in a direction. Even in the linked post Eliezer uses "pointing the AI in the right direction" as a synonym for alignment.
* It was proposed to me as a replacement for the narrower term AI control, which quite obviously doesn't include all the broader stuff. In the email thread where Rob suggested I adopt it, he suggested it was referring to what Nick Bostrom called the "second principal-agent problem" between AI developers and the AI they build.

I want to emphasize again that this definition seems extremely bad. A lot of people think their work helps AI actually produce good outcomes in the world when run, so pretty much everyone would think their work counts as alignment. It includes all work in AI ethics, if in fact that research is helpful for ensuring that future AI has a good outcome. It also includes everything people work on in AI capabilities, if in fact capability increases improve the probability that a future AI system produces good outcomes when run. It's not even restricted to safety, since it includes realizing more upside from your AI. It includes changing the way you build AI to help address distributional issues, if the speaker (very reasonably!) thinks those are important to the value of the future.

I didn't take this seriously as a definition and didn't really realize anyone was taking it seriously; I thought it was just an instance of speaking loosely. But i

Other relevant paragraphs from the Arbital post:

“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.

Other terms that have been used to describe this research problem include “robust and beneficial AI” and “Friendly AI”. The term “value alignment problem” was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences wit

... (read more)
3Wei Dai7mo
My personal view is that given all of this history and the fact that this forum is named the "AI Alignment Forum", we should not redefine "AI Alignment" to mean the same thing as "Intent Alignment". I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul's (probably unintentional) overloading of "AI alignment" with the new and narrower meaning (in Clarifying “AI Alignment”), and we should fix that error by collectively going back to the original definition, or in some circumstances where the risk of confusion is too great, avoiding "AI alignment" and using some other term like "AI x-safety". (Although there's an issue with "existential risk/safety" as well, because "existential risk/safety" covers problems that aren't literally existential, e.g., where humanity survives but its future potential is greatly curtailed. Man, coordination is hard.)

Here are some clearer evidence that broader usages of "AI alignment" were common from the beginning:

  1. In this Arbital page dated 2015, Eliezer wrote:

The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

(I couldn't find a easy way to view the original 2015 version, but do have a screenshot that I can produce upon request showing a Jan 2017 edit on Arbital that already had this broad def... (read more)

2Paul Christiano7mo
In the 2017 post Vladimir Slepnev is talking about your AI system having particular goals, isn't that the narrow usage? Why are you citing this here?
2Paul Christiano7mo
I misread the date on the Arbital page (since Arbital itself doesn't have timestamps and it wasn't indexed by the Wayback machine until late 2017) and agree that usage is prior to mine.
2Wei Dai7mo
Other relevant paragraphs from the Arbital post:

Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to ai-alignment.com I think).

Eliezer used "AI alignment" as early as 2016 and ai-alignment.com wasn't registered until 2017. Any other usage of the term that potentially predates Eliezer?

4Paul Christiano7mo
But that talk appears to use the narrower meaning though, not the crazy broad one from the later Arbital page. Looking at the transcript:

* The first usage is "At the point where we say, “OK, this robot’s utility function is misaligned with our utility function. How do we fix that in a way that it doesn’t just break again later?” we are doing AI alignment theory." Which seems like it's really about the goal the agent is pursuing.
* The subproblems are all about agents having the right goals. And it continuously talks about pointing agents in the right direction when talking informally about what alignment is.
* It doesn't talk about how there are other parts of alignment that Eliezer just doesn't care about. It really feels like "alignment" is supposed to be understood to mean getting your AI to be not trying to kill you / trying to help you / something about its goals.
* The talk doesn't have any definitions to disabuse you of this apparent implication.

What part of this talk makes it seem clear that alignment is about the broader thing rather than about making an AI that's not actively trying to kill you?

I’m not sure what order the history happened in and whether “AI Existential Safety” got rebranded into “AI Alignment” (my impression is that AI Alignment was first used to mean existential safety, and maybe this was a bad term, but it wasn’t a rebrand)

There was a pretty extensive discussion about this between Paul Christiano and me. tl;dr "AI Alignment" clearly had a broader (but not very precise) meaning than "How to get AI systems to try to do what we want" when it first came into use. Paul later used "AI Alignment" for his narrower meaning, but after... (read more)

4Paul Christiano7mo
I don't think I really agree with this summary. Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to ai-alignment.com I think). I think very few people at the time would have argued that e.g. "getting your AI to be better at politics so it doesn't accidentally start a war" is value alignment, though it obviously fits under Eliezer's definition. (ETA: actually the Arbital page is old, it just wasn't indexed by the wayback machine and doesn't come with a date on Arbital itself. So I agree with the point that this post is evidence for an earlier very broad usage.) I would agree with "some people used it more broadly" but not "clearly had a broader meaning." Unless "broader meaning" is just "used very vaguely such that there was no agreement about what it means." (I don't think this really matters except for the periodic post complaining about linguistic drift.)

Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don’t see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).

Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins? What... (read more)

3Paul Christiano9mo
Yes. A monopoly on computers or electricity could also take big profits in this scenario. I think the big things are always that it's illegal and that high prices drive new entrants.

I think this would also be illegal if justified by the AI company's preferences rather than customer preferences, and it would at least make them a salient political target for people who disagree. It might be OK if they were competing to attract employees/investors/institutional customers. In practice I think it would most likely happen as a move by the dominant faction in political/cultural conflict in a broader society, and this would be a consideration raising the importance of AI researchers and potentially capitalists in that conflict.

I agree if you are someone who stands to lose from that conflict then you may be annoyed by some near-term applications of alignment, but I still think (i) alignment is distinct from those applications even if it facilitates them, (ii) if you don't like how AI empowers your political opponents then I strongly think you should push back on AI development itself rather than hoping that no one can control AI.

If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI.

Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash. The kind of backlash you're worried abo... (read more)

3Paul Christiano9mo
I don't really think that's the case. Suppose that I have different taste from most people, and consider the interior of most houses ugly. I can be unhappy about the situation even if I ultimately end up in a house I don't think is ugly. I'm unhappy that I had to use multiple bits of selection pressure just to avoid ugly interiors, and that I spend time in other people's ugly houses, and so on. In practice I think it's even worse than that; people get politically worked up about things that don't affect their lives at all through a variety of channels.

I do agree that backlash to X will be bigger if all AIs do X than if some AIs do X.

I don't think this scenario is really relevant to the most common concerns about concentration of power. I think the most important reason to be scared of concentration of power is:

* Historically you need a lot of human labor to get things done.
* With AI the value of human labor may fall radically.
* So capitalists may get all the profit, and it may be possible to run an oppressive state without a bunch of humans.
* This may greatly increase economic inequality or make it much more possible to have robust oppressive regimes.

But all of those arguments are unrelated to the number of AI developers. Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don't see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).

I don't think it's really plausible to have a technical situation where AI can be used to pursue "humanity's overall values" but cannot be used to pursue the values of a subset of humanity. (I also tend to think that technocratic solutions to empower humanity via the design of AI are worse than solutions that empower people in more legible ways, either by having their AI a

Accordingly, I think there’s a tendency to give OpenAI an unfair amount of flak compared to say, Google Brain or FAIR or any of the startups like Adept or Cohere.

I'm not sure I agree that this is unfair.

OpenAI is clearly on the cutting edge of AI research.

This is obviously a good reason to focus on them more.

OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/Rationalist social scene.

Perhaps we have responsibility to scrutinize/criticize them more because of this... (read more)

I guess it depends on the specific alignment approach being taken, such as whether you're trying to build a sovereign or an assistant. Assuming the latter, I'll list some philosophical problems that seem generally relevant:

  1. metaphilosophy
    • How to solve new philosophical problems relevant to alignment as they come up?
    • How to help users when they ask the AI to attempt philosophical progress?
    • How to help defend the user against bad philosophical ideas (whether in the form of virulent memes, or intentionally optimized by other AIs/agents to manipulate the use
... (read more)

To the extent that alignment research involves solving philosophical problems, it seems that in this approach we will also need to automate philosophy, otherwise alignment research will become bottlenecked on those problems (i.e., on human philosophers trying to solve those problems while the world passes them by). Do you envision automating philosophy (and are you optimistic about this) or see some other way of getting around this issue?

It worries me to depend on AI to do philosophy, without understanding what "philosophical reasoning" or "philosophical p... (read more)

2Jan Leike9mo
Insofar as philosophical progress is required, my optimism about AI helping on this is lower than for (more) technical research, since in philosophy evaluation is often much harder and I'm not sure that it's always easier than generation. You can much more easily picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than it is to make technical claims about math, algorithms, or empirical data that are persuasive and hard to falsify.

However, I'm skeptical that the list of novel philosophical problems we actually need to solve to prevent the most serious misalignment risk will actually be that long. For example, a lot of problems in rationality + decision theory + game theory I'd count more as model capabilities, and the moral patienthood questions you can punt on for a while from the longtermist point of view.
2Chris_Leong10mo
What are the key philosophical problems you believe we need to solve for alignment?

We do have empirical evidence that nonrobust aligned intelligence can be not OK, like this or this. Why are you not more worried about superintelligent versions of these (i.e. with access to galaxies worth of resources)?

2Alex Turner9mo
What do you mean by "nonrobust" aligned intelligence? Is "robust" being used in the "robust grading" sense, or in the "robust values" sense (of e.g. agents caring about lots of things, only some of which are human-aligned), or some other sense?  Anyways, responding to the vibe of your comment -- I feel... quite worried about that? Is there something I wrote which gave the impression otherwise? Maybe the vibe of the post is "alignment admits way more dof than you may have thought" which can suggest I believe "alignment is easy with high probability"?

ChatGPT has a tendency to make up answers and refuse to admit it. A log follows. Note that almost every paragraph of the initial biography contains multiple falsehoods that are unlikely to be directly in its training data.

Show me a 10 paragraph biography of Wei Dai

Wei Dai was born in Guangzhou, China in 1967. She received her undergraduate degree in computer science from the University of Science and Technology of China in 1988 and her master's degree in computer science from the University of California, Berkeley in 1991.

After completing her master's ... (read more)

This is tempting, but the problem is that I don't know what my idealized utility function is (e.g., I don't have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realized that it's a bad idea, but how does that fit into the framework?

My own framework is something like this:

  • The evaluation process is some c
... (read more)

Imagine someone who considers a few plans, grades them (e.g. “how good does my gut say this plan is?”), and chooses the best. They are not a grader-optimizer. They are not trying to navigate to the state where they propose and execute a plan which gets maximally highly rated by some evaluative submodule. They use a grading procedure to locally rate and execute plans, and may even locally think “what would make me feel better about this plan?”, but the point of their optimization isn’t “find the plan which makes me feel as good as globally possible.”

The ... (read more)
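The distinction can be made concrete with a toy simulation (all names and numbers hypothetical): an imperfect grader scores each plan as its true value plus an error term. Picking the best of a handful of plans incurs only a small "Goodhart gap" between grade and true value, while argmaxing the grader over a huge plan space selects precisely for the grader's largest errors.

```python
import random

rng = random.Random(0)

def true_value(plan):
    # stand-in ground truth: plans near 0.5 are genuinely best
    return -abs(plan - 0.5)

def graded_plans(n):
    # each plan gets a fixed, imperfect grade: truth plus evaluator error
    return [(p, true_value(p) + rng.gauss(0, 0.3))
            for p in (rng.random() for _ in range(n))]

def goodhart_gap(plans):
    # how much the chosen plan's grade overstates its true value
    best_plan, best_grade = max(plans, key=lambda pg: pg[1])
    return best_grade - true_value(best_plan)

few = goodhart_gap(graded_plans(5))         # grade a few plans, pick the best
many = goodhart_gap(graded_plans(100_000))  # grader-optimization: global argmax
print(f"gap with 5 plans: {few:.2f}, gap with 100k plans: {many:.2f}")
```

The large search reliably returns a plan whose grade is inflated by roughly the maximum of 100,000 error draws, which is the adversarial-example regime the quoted passage is pointing at; the best-of-few chooser never applies that much selection pressure to the grader's mistakes.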

2Alex Turner10mo
I wrote in the post: Sorry if I'm just repeating something you read and understood, but I do feel like this criterion answers "no, this is still not grader-optimization; the effective search over lots of plans is still a side-effect of your cognition, not the terminal end." In particular, note that the strategy you described would not strongly want to be given the actual-highest-rated plan--or maybe it would want to know more about the plan as a curiosity, but not in order to evaluate and execute that plan. That's one way in which saying "your strategy is not grader-optimization" constrains my anticipations in a useful-seeming way.

This is a good point. I'm wondering about the type of the (presumably Cartesian) interface between the CEV-sim and the actor. First, CEV-sim shouldn't be affectable by the input-plan unless and until they run some stats on it. Otherwise the actor could (maybe?) side-channel attack them via whatever computer registers the input-plan shows up in. And CEV-sim does have to infer what they're being used for, at each invocation of the grader (since they don't retain memory across counterfactuals).

a. That aside, if CEV-sim can just syntactically check whether the input-plan runs that, then your argument seems good.
b. If CEV-sim has to understand the actor's latent state context (is it all in the plan?), in order to make sure that the purported X-running plan isn't just running dangerous Y in another programming language... Seems like they can't do this.

I feel like we're dealing with (b) more than (a), so I'd say "no, 2 is safer than 1" tentatively.
2Adam Shimi10mo
Is your issue here that there exists a specific CEV-universe-simulation that makes 1 just as safe as 2, by basically emulating the latter situation? If so, why do you think this is a point against Alex's claim (which strikes me more as saying "there are a lot more cases of 2 being safe than of 1")?
6Vivek Hebbar10mo
Improve it with respect to what?

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

* The agent's allegiance is to some idealized utility function U_ideal (like CEV). The agent's internal evaluator Eval is "trying" to approximate U_ideal by reasoning heuristically. So now we ask Eval to evaluate the plan "do argmax w.r.t. Eval over a bunch of plans". Eval reasons that, due to the way that Eval works, there should exist "adversarial examples" that score very highly on Eval but low on U_ideal. Hence, Eval concludes that U_ideal(plan) is low, where plan = "do argmax w.r.t. Eval". So the agent doesn't execute the plan "search widely and argmax".
* "Improving Eval" makes sense because Eval will gladly replace itself with Eval2 if it believes that Eval2 is a better approximation for U_ideal (and hence replacing itself will cause the outcome to score better on U_ideal).

Are there other distinct frameworks which make sense here? I look forward to seeing what design Alex proposes for "value child".
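The framework in the comment above can be sketched as a toy program (all function names, the stand-in idealized utility, and the error model are hypothetical): an imperfect evaluator notices that the plan "argmax w.r.t. me" lands on an adversarial example (high on the evaluator, low on the idealized utility), and separately replaces itself with a better approximator when one is available.

```python
def u_ideal(plan):
    """Stand-in idealized utility (like CEV); a real agent can't query this directly."""
    return -(plan - 3) ** 2

def make_eval(error):
    """An imperfect evaluator: u_ideal plus a systematic error that grows on extreme plans."""
    def ev(plan):
        return u_ideal(plan) + error * plan ** 3
    return ev

plans = range(-10, 11)
eval1 = make_eval(0.5)

# 1. Evaluating the meta-plan "do argmax w.r.t. Eval over a bunch of plans":
argmax_plan = max(plans, key=eval1)
# the argmax plan is an adversarial example: high on Eval, low on u_ideal,
# so Eval should score the meta-plan itself low and the agent declines to run it
is_adversarial = eval1(argmax_plan) > 0 and u_ideal(argmax_plan) < 0

# 2. "Improving Eval": adopt Eval2 if it tracks u_ideal better across plans
eval2 = make_eval(0.05)

def approx_quality(ev):
    # negative squared error against the idealized utility
    return -sum((ev(p) - u_ideal(p)) ** 2 for p in plans)

current_eval = eval2 if approx_quality(eval2) > approx_quality(eval1) else eval1
```

Of course, a real agent cannot score approx_quality against u_ideal directly; the sketch only illustrates the two roles the framework assigns to Eval (distrusting its own argmax, and endorsing self-replacement by a better approximator).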