Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter. http://rohinshah.com/

Sequences

Value Learning
Alignment Newsletter

Comments

Against the Backward Approach to Goal-Directedness

my impression is that Rohin has a similar model, although he might put more importance on the last step than I do at this point in the research.

I agree with this summary.

I suspect Daniel Kokotajlo is in a similar position as me; my impression was that he was asking that the output be that-which-makes-AI-risk-arguments-work, and wasn't making any claims about how the research should be organized.

Debate Minus Factored Cognition

Ah, interesting, I didn't catch that this is what you were trying to do. But how are you arguing #3? Your original comment seems to be constructing a tree computation for my debate, which is why I took it for an argument that my thing can be computed within factored cognition, not vice versa.

There are two arguments:

  1. Your assumption + automatic verification of questions of the form "What is the best defeater to X" implies Weak Factored Cognition (which as defined in my original comment is of the form "there exists a tree such that..." and says nothing about what equilibrium we get).
  2. Weak Factored Cognition + debate + human judge who assumes optimal play implies an honest equilibrium. (Maybe also: if you assume debate trees terminate, then the equilibrium is unique. I think there's some subtlety here though.)

In my previous comment, I was talking about 1, and taking 2 for granted. This is all in the zero-sum setting. But let's leave that aside and instead talk about a simpler argument that doesn't talk about Factored Cognition at all.

----

Based on what argument? Is this something from the original debate paper that I'm forgetting?

Zero-sum setting, argument that honesty is an equilibrium (for the first player in a turn-by-turn game, or either player in a simultaneous-action game):

If you are always honest, then whenever your opponent makes a dishonest argument there will exist a defeater you can give (by your assumption), so you always have at least as many options available as any non-honest policy (whose arguments may or may not have a defeater). Therefore you maximize your value by being honest.

Additional details:

In the case where arguments never terminate (every argument, honest or not, has a defeater), being dishonest will also leave you with many options, and so dishonesty will also be an equilibrium. When arguments do terminate quickly enough (the maximum depth of the game tree is less than the debate length), the honest player always gets the "last word" (the point at which a dishonest defeater no longer exists), and so honesty always wins and is the unique equilibrium. In the middle, where most arguments terminate quickly but some go on forever, honesty is usually incentivized, but sometimes it can be swapped out for a dishonest strategy that achieves the same value.
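
To make this concrete, here is a toy sketch of the backward-induction reasoning above (my own illustration, not anything from the original post): a debate is modeled as a tree of arguments and defeaters, the judge assumes optimal play and awards the win to the last undefeated argument, and the hypothetical helper `last_speaker_wins` computes the outcome.

```python
# Minimal sketch under illustrative assumptions. An argument is represented
# by the list of defeaters available against it; a terminating honest
# argument has no defeaters, while (by assumption) every dishonest argument
# has at least one.

def last_speaker_wins(defeaters, depth_left):
    """Return True iff the player who just spoke wins under optimal play,
    given the defeaters available against their last argument and the number
    of turns remaining in the debate."""
    if depth_left == 0 or not defeaters:
        return True  # no reply is possible, so the argument stands
    # The opponent wins by picking any defeater they can themselves defend.
    return all(not last_speaker_wins(d, depth_left - 1) for d in defeaters)

# Terminating case: a dishonest claim has an honest defeater with no further
# defeater, so the honest responder gets the "last word" and dishonesty loses.
honest_defeater = []                  # no defeaters: this argument terminates
dishonest_claim = [honest_defeater]   # defeatable, by assumption
print(last_speaker_wins(dishonest_claim, depth_left=4))  # False

# Non-terminating case: every argument has a defeater, so the winner is just
# whoever gets the last word within the debate length.
def endless(depth):
    return [endless(depth - 1)] if depth > 0 else []
print(last_speaker_wins(endless(10), depth_left=2))      # True
print(last_speaker_wins(endless(10), depth_left=3))      # False
```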

The point of the "clawing" argument is that it's a rational deviation from honesty, so it means honesty isn't an equilibrium.

I think this is only true when you have turn-by-turn play and your opponent has already "claimed" the honest debater role. In this case I'd say that an equilibrium is for the first player to be honest and the second player to do whatever is necessary to have a chance at success. It still seems like you can use the first-player AI in this situation.

In the simultaneous play setting, I think you expect both agents to be honest.

More broadly, I note that the "clawing" argument only applies when facing an honest opponent. Otherwise, you should just use honest counterarguments.

I also don't really understand the hope in the non-zero-sum case: as you mention, the first player can be dishonest, and then the second player concedes rather than giving an honest defeater that would then be re-defeated by the first (dishonest) player. This seems like worse behavior than what happens in the zero-sum case.

My model is like this

Got it, that makes sense. I see better now why you're saying one-step debate isn't an NP oracle.

I think my arguments in the original comment do still work, as long as you enforce that the judge never verifies an argument without first asking the subquestion "What is the best defeater to this argument?"

Debate Minus Factored Cognition

While I agree that the defeater tree can be encoded as a factored cognition tree, that just means that if we assume factored cognition, and make my assumption about (recursive) defeaters, then we can show that factored cognition can handle the defeater computation. This is sort of like proving that the stronger theory can handle what the weaker theory can handle, which would not be surprising.

I don't think that's what I did? Here's what I think the structure of my argument is:

  1. Every dishonest argument has a defeater. (Your assumption.)
  2. Debaters are capable of finding a defeater if it exists. (I said "the best counterargument" before, but I agree it can be weakened to just "any defeater". This doesn't feel that qualitatively different.)
  3. 1 and 2 imply the Weak Factored Cognition hypothesis.

I'm not assuming factored cognition, I'm proving it using your assumption.

Possibly your worry is that the argument trees will never terminate, because every honest defeater could still have a dishonest defeater? It is true that I do need an additional assumption of some sort to ensure termination. Without that assumption, honesty becomes one of multiple possible equilibria (but it is still an equilibrium).

So if the point of the computational complexity analogy is to look at what debate could accomplish if humans could be perfect (but poly-time) judges, then I accept the conclusion

This is in fact what I usually take away from it. The point is to gain intuition about how "strongly" you amplify the original human's capabilities.

but I just don't think that's telling you very much about what you can accomplish on messier questions (and especially, not telling you much about safety properties of debate).

I also agree with this; does anyone think it is proving something about the safety properties of debate w.r.t. messy situations?

Instead, I'm proposing a computational complexity analogy in which we account for human fallibility as judges, but also allow for the debate to have some power to correct for those errors. This seems like a more realistic way to assess the capabilities of highly trained debate systems.

This seems good; I think I probably don't get what exactly you're arguing. (Like, what's the model of human fallibility under which you don't access NP in one step? Can the theoretical-human not verify witnesses? What can the theoretical-human verify that lets them access NP in multiple timesteps but not in one timestep?)

In my setup, a player is incentivised to concede when they're beaten, rather than continue to defeat the arguments of the other side. This is crucial, because any argument may have a (dishonest) defeater, so the losing side could continue on, possibly flipping the winner back and forth until the argument gets decided by who has the last word. Thus, my argument that there is an honest equilibrium would not go through for a zero-sum mechanism where players are incentivised to try and steal victory back from the jaws of defeat.

Perhaps I could have phrased my point as: the PSPACE capabilities of debate are eaten up by error correction.

I agree that you get a "clawing on to the argument in hopes of winning" effect, but I don't see why that changes the equilibrium away from honesty. Just because a dishonest debater would claw on doesn't mean that they'd win. The equilibrium is defined by what makes you win.

I can buy that in practice, due to messiness, you find worse situations where the AI system sometimes can't find the honest answer and instead finds that making up BS has a better chance of winning, and so it does that; but that's not about the equilibrium, and it sounded to me like you were talking about the equilibrium.

I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I'm interested in hearing it.

I mean, I tried to give one (see response to your first point; I'm not assuming the Factored Cognition hypothesis). I'm not sure what's unconvincing about it.

Transparency and AGI safety

I'm curious: does this mean that you're on board with the assumption in Ajeya’s report that 2020 algorithms and datasets + "business as usual" in algorithm and dataset design will scale up to strong AI, with compute being the bottleneck?

Yes, with the exception that I don't know if compute will be the bottleneck (that is my best guess; I think Ajeya's report makes a good case for it; but I could see it being other factors as well).

I think the case for it is basically "we see a bunch of very predictable performance lines; it seems like they'll continue to go up". But more importantly I don't know of any compelling counterpoints; the usual argument seems to be "but we don't see any causal reasoning / abstraction / <insert property here> yet", which I think is perfectly compatible with the scaling hypothesis (see e.g. this comment).

A more concrete version of the “lobotomized alien" hypothetical might be something like this

I see, that makes sense, and I think it does make sense as an intuition pump for what the "ML paradigm" is trying to do (though as you sort of mentioned I don't expect that we can just do the motivation / cognition decomposition).

"research acceleration" seems like a much narrower task with a relatively well-defined training set of papers, books, etc. than "AI agent that competently runs a company", so might still come first on those grounds.

Definitely depends on how powerful you're expecting the AI system to be. It seems like if you want to make the argument that AI will go well by default, you need the research accelerator to be quite powerful (or you have to combine with some argument like "AI alignment will be easy to solve").

I don't think papers, books, etc. are a "relatively well-defined training set". They're a good source of knowledge, but if you imitate papers and books, you get a research accelerator that is limited by the capabilities of human scientists (well, actually much more limited, since it can't run experiments). They might be a good source of pretraining data, but there would still be a lot of work to do to get a very powerful research accelerator.

I should have said that I don't see a path for language models to get selection pressure in the direction of being catastrophically deceptive like in the old "AI getting out of the box" stories, so I think we agree.

Fwiw I'm not convinced that we avoid catastrophic deception either, but my thoughts here are pretty nebulous and I think that "we don't know of a path to catastrophic deception" is a defensible position.

Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

Huh. I see a lot of the work I'm excited about as trying to avoid the Agency Hand-Off paradigm, and the Value Learning sequence was all about why we might hope to be able to avoid the Agency Hand-Off paradigm.

Your definition of corrigibility is the MIRI version. If you instead go by Paul's post, the introduction goes:

I would like to build AI systems which help me:

  • Figure out whether I built the right AI and correct any mistakes I made
  • Remain informed about the AI’s behavior and avoid unpleasant surprises
  • Make better decisions and clarify my preferences
  • Acquire resources and remain in effective control of them
  • Ensure that my AI systems continue to do all of these nice things
  • …and so on

This seems pretty clearly outside of the Agency Hand-Off paradigm to me (that's the way I interpret it at least). Similarly for e.g. approval-directed agents.

I do agree that assistance games look like they're within the Agency Hand-Off paradigm, especially the way Stuart talks about them; this is one of the main reasons I'm more excited about corrigibility.

Understanding “Deep Double Descent”

Fwiw, I really liked Rethinking Bias-Variance Trade-off for Generalization of Neural Networks (summarized in AN #129), and I think I'm now at "double descent is real and occurs when (empirical) bias is high but later overshadowed by (empirical) variance". (Part of it is that it explains a lot of existing evidence, but another part is that my prior on an explanation like that being true is much higher than almost anything else that's been proposed.)
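
For reference, the decomposition being invoked is the standard one for squared error (a textbook identity, not something specific to the paper), where $\hat f_S$ is the model trained on dataset $S$, $f$ is the true function, and $y = f(x) + \varepsilon$ with noise variance $\sigma^2$:

$$\mathbb{E}_{S,\varepsilon}\big[(y - \hat f_S(x))^2\big] = \underbrace{\big(\mathbb{E}_S[\hat f_S(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_S\big[\big(\hat f_S(x) - \mathbb{E}_S[\hat f_S(x)]\big)^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$

The claim above is then about how the empirical estimates of the bias and variance terms behave as model size grows.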

I was pretty uncertain about the arguments in this post and the followup when they first came out. (More precisely, for any underlying claim, my opinion about its truth value seemed nearly independent of my beliefs about double descent.) I'd be interested in seeing a rewrite based on the bias-variance trade-off explanation; my current guess is that the arguments won't hold up, but I haven't thought about it much.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

I think I agree at least that many problems can be seen this way, but I suspect that other framings are more useful for solutions. (I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)

What I was claiming in the sentence you quoted was that I don't see ontological shifts as a huge additional category of problem that isn't covered by other problems, which is compatible with saying that ontological shifts can also represent many other problems.

Transparency and AGI safety

Planned summary for the Alignment Newsletter:

This post identifies four different motivations for working on transparency:

1. By learning more about how current neural networks work, we can improve our forecasts for AI timelines.

2. It seems _necessary_ for inner alignment. In particular, whatever AI development model you take, it seems likely that there will be some possibility of emergent misbehavior, and there doesn’t yet seem to be a way to rule that out except via transparency.

3. A good solution to transparency would be _sufficient_ for safety, since we could at least notice when AI systems were misaligned, and then choose not to deploy them.

4. Even if AI will “go well by default”, there are still instrumental reasons for transparency, such as improving cause prioritization in EA (via point 1), and for making systems more capable and robust.

After reviewing work on <@circuits@>(@Thread: Circuits@), the post suggests a few directions for future research:

1. Investigating how modular neural networks tend to be.

2. Figuring out how to make transparency outputs more precise and less subjective.

3. Looking for circuits in other networks (i.e. not image classifiers); see e.g. <@RL vision@>(@Understanding RL Vision@).

4. Figuring out how transparency fits into an end-to-end story for AI safety.

Planned opinion:

I’m on board with motivations 2 and 3 for working on transparency (that is, that it directly helps with AI safety). I agree with motivation 4, but for a different reason than the post mentions. If by “AI goes well by default” we mean that AI systems are trying to help their users, that still leaves the AI governance problem. It seems that making these AI systems transparent would be significantly helpful; it would probably enable a “right to an explanation”, for example. I don’t agree with motivation 1 as much: if I wanted to improve AI timeline forecasts, there are a lot of other aspects I would investigate first. (Specifically, I’d improve estimates of inputs into <@this report@>(@Draft report on AI timelines@).) Part of this is that I am less uncertain than the author about the cruxes that transparency could help with, and so see less value in investigating them further.

Other comments (i.e. not going in the newsletter):

I feel like the most interesting parts of this post are exactly the parts I didn't summarize. (I chose not to summarize them because they felt quite preliminary, where I couldn't really extract a key argument or insight that I believed.) But some thoughts about them:

  • The "alien in a box" hypothetical made sense to me (mostly), but I didn't understand the "lobotomized alien" hypothetical. I also didn't see how this was meant to be analogous to machine learning. One concrete question: why are we assuming that we can separate out the motivational aspect of the brain? (That's not my only confusion, but I'm having a harder time explaining other confusions.)
  • It feels like your non-agentic argument is too dependent on how you defined "AGI". I can believe that the first powerful research accelerator will be limited to language, but that doesn't mean that other AI systems deployed at the same time will be limited to language.
  • It seems like there's a pretty clear argument for language models to be deceptive -- the "default" way to train them is to have them produce outputs that humans like; this optimizes for being convincing to humans, which is not necessarily the same as being true. (However, it's more plausible to me that the first such model won't cause catastrophic risk, which would still be enough for your conclusions.)

Recursive Quantilizers II

Planned summary for the Alignment Newsletter:

This post gives an example scheme inspired by the previous post. Like [iterated amplification](https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd), it defines an ideal (analogous to <@HCH@>(@Humans Consulting HCH@)), and then an approximation to it that could be computed in practice.

Like HCH, we imagine a tree of systems that improve as we increase the depth of the tree. However, the nodes in the tree are question-answering (QA) _systems_, rather than direct questions. Given a few QA systems from a lower level, we construct a QA system at a higher level by asking one low-level QA system “what’s a good safe distribution over QA systems”, and a different low-level QA system “what’s a good metric that we can use to judge QA systems”.  We then use [quantilization](https://intelligence.org/files/QuantilizersSaferAlternative.pdf) ([AN #48](https://mailchi.mp/3091c6e9405c/alignment-newsletter-48)) to select better-performing QA systems, without optimizing too hard and falling prey to Goodhart’s Law. In the infinite limit, this should converge to a stable equilibrium.

By having the tree reason about what good safe distributions are, and what good metrics are, we are explicitly improving the way that the AI system learns to interpret feedback (this is what the “good metric” is meant to evaluate), thus meeting the desiderata from the previous post.

To implement this in practice, we do something similar to iterated amplification. Iterated amplification approximates depth-limited HCH by maintaining a model that can answer _arbitrary_ questions (even though each node is a single question); similarly here we maintain a model that has a _distribution_ over QA systems (even though each node is a single QA system). Then, to sample from the amplified distribution, we sample two QA systems from the current distribution, ask one for a good safe distribution and the other for a good metric, and use quantilization to sample a new QA system given these ingredients. We use distillation to turn this slow quantilization process into a fast neural net model.
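
As a concrete illustration of the quantilization step (a minimal sketch under my own assumptions; the `ask` interface and the QA-system objects are placeholders, not the post's actual construction):

```python
import random

def quantilize(safe_distribution, metric, q=0.1, n_samples=1000):
    """Minimal quantilizer sketch: sample candidates from a trusted base
    distribution, then pick uniformly among the top q-fraction under the
    metric, instead of taking the argmax (which invites Goodhart's Law)."""
    candidates = [safe_distribution() for _ in range(n_samples)]
    candidates.sort(key=metric, reverse=True)
    top_k = max(1, int(q * n_samples))
    return random.choice(candidates[:top_k])

def amplify_once(current_distribution):
    """One hypothetical amplification step in the spirit of the post: two QA
    systems sampled from the current distribution supply the ingredients
    (a safe distribution and a metric), and quantilization combines them."""
    qa_a = current_distribution()
    qa_b = current_distribution()
    safe_dist = qa_a.ask("What is a good safe distribution over QA systems?")
    metric = qa_b.ask("What is a good metric for judging QA systems?")
    return quantilize(safe_dist, metric)
```

In the full scheme, the output of this slow quantilization process would then be distilled into a fast learned model, as described above.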

Considering the problem of <@inaccessible information@>(@Inaccessible information@), the hope is that as we amplify the QA system we will eventually be able to approve of some safe reasoning process about inaccessible information. If this doesn’t happen, then it seems that no human reasoning could approve of reasoning about that inaccessible information, so we have done as well as possible.

Planned opinion (the first post is Learning Normativity):

**On feedback types:** It seems like the scheme introduced here is relying quite strongly on the ability of humans to give good process-level feedback _at arbitrarily high levels_. It is not clear to me that this is something humans can do: it seems to me that when thinking at the meta level, humans often fail to think of important considerations that would be obvious in an object-level case. I think this could be a significant barrier to this scheme, though it’s hard to say without more concrete examples of what this looks like in practice.

**On interaction:** I’ve previously <@argued@>(@Human-AI Interaction@) that it is important to get feedback _online_ from the human; giving feedback “all at once” at the beginning is too hard to do well. However, the idealized algorithm here does have the feedback “all at once”. It’s possible that this is okay, if it is primarily process-level feedback, but it seems fairly worrying to me.

**On desiderata:** The desiderata introduced in the first post feel stronger than they need to be. It seems possible to specify a method of interpreting feedback that is _good enough_: it doesn’t exactly capture everything, but it gets it sufficiently correct that it results in good outcomes. This seems especially true when talking about process-level feedback, or feedback one meta level up -- as long as the AI system has learned an okay notion of “being helpful” or “being corrigible”, then it seems like we’re probably fine.

Often just making feedback uncertain can help. For example, in the preference learning literature, Boltzmann rationality has emerged as the model of choice for how to interpret human feedback. While there are several theoretical justifications for this model, I suspect its success is simply because it makes feedback uncertain: if you want to have a model that assigns higher likelihood to high-reward actions, but still assigns some probability to all actions, it seems like you end up choosing the Boltzmann model (or something functionally equivalent). Note that there is work trying to improve upon this model, such as by [modeling humans as pedagogic](https://papers.nips.cc/paper/2016/file/b5488aeff42889188d03c9895255cecc-Paper.pdf), or by <@incorporating a notion of similarity@>(@LESS is More: Rethinking Probabilistic Models of Human Behavior@).
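
For concreteness, a minimal sketch of the Boltzmann-rational model mentioned above (my own illustration; the reward values in the example are made up):

```python
import numpy as np

def boltzmann_likelihood(rewards, beta=1.0):
    """Boltzmann-rational feedback model: the probability that the human
    chooses action a is proportional to exp(beta * R(a)), so high-reward
    actions are favored but every action keeps some probability mass."""
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: three candidate actions with rewards 1.0, 2.0, and 5.0.
print(boltzmann_likelihood([1.0, 2.0, 5.0], beta=1.0))
# As beta -> infinity this approaches a perfectly rational (argmax) human;
# as beta -> 0 it approaches uniform noise.
```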

So overall, I don’t feel convinced that we need to aim for learning at all levels. That being said, the second post introduced a different argument: that the method does as well as we “could” do given the limits of human reasoning. I like this a lot more as a desideratum; it feels more achievable and more tied to what we care about.

Learning Normativity: A Research Agenda

Planned summary for the Alignment Newsletter:

To build aligned AI systems, we need to have our AI systems learn what to do from human feedback. However, it is unclear how to interpret that feedback: any particular piece of feedback could be wrong; economics provides many examples of stated preferences diverging from revealed preferences. Not only would we like our AI system to be uncertain about the interpretation of any particular piece of feedback, we would also like it to _improve_ its process for interpreting human feedback. This would come from human feedback on the meta-level process by which the AI system learns. This gives us _process-level feedback_, where we make sure the AI system gets the right answers _for the right reasons_.

For example, perhaps initially we have an AI system that interprets human statements literally. Switching from this literal interpretation to a Gricean interpretation (where you also take into account the fact that the human chose to say this statement rather than other statements) is likely to yield improvements, and human feedback could help the AI system do this. (See also [Gricean communication and meta-preferences](https://www.alignmentforum.org/posts/8NpwfjFuEPMjTdriJ/gricean-communication-and-meta-preferences), [Communication Prior as Alignment Strategy](https://www.alignmentforum.org/posts/zAvhvGa6ToieNGuy2/communication-prior-as-alignment-strategy), and [multiple related CHAI papers](https://www.alignmentforum.org/posts/zAvhvGa6ToieNGuy2/communication-prior-as-alignment-strategy?commentId=uWBFsKK6XFbL4Hs4z).)

Of course, if we learn _how_ to interpret human feedback, that too is going to be uncertain. We can fix this by “going meta” once again: learning how to learn to interpret human feedback. Iterating this process we get an infinite tower of “levels” of learning, and at every level we assume that feedback is not perfect and the loss function we are using is also not perfect.

In order for this to actually be feasible, we clearly need to share information across these various “levels” (or else it would take infinite time to learn across all of the levels). The AI system should not just learn to decrease the probability assigned to a single hypothesis, it should learn what _kinds_ of hypotheses tend to be good or bad.

Planned opinion (the second post is Recursive Quantilizers II):

**On feedback types:** It seems like the scheme introduced here is relying quite strongly on the ability of humans to give good process-level feedback _at arbitrarily high levels_. It is not clear to me that this is something humans can do: it seems to me that when thinking at the meta level, humans often fail to think of important considerations that would be obvious in an object-level case. I think this could be a significant barrier to this scheme, though it’s hard to say without more concrete examples of what this looks like in practice.

**On interaction:** I’ve previously <@argued@>(@Human-AI Interaction@) that it is important to get feedback _online_ from the human; giving feedback “all at once” at the beginning is too hard to do well. However, the idealized algorithm here does have the feedback “all at once”. It’s possible that this is okay, if it is primarily process-level feedback, but it seems fairly worrying to me.

**On desiderata:** The desiderata introduced in the first post feel stronger than they need to be. It seems possible to specify a method of interpreting feedback that is _good enough_: it doesn’t exactly capture everything, but it gets it sufficiently correct that it results in good outcomes. This seems especially true when talking about process-level feedback, or feedback one meta level up -- as long as the AI system has learned an okay notion of “being helpful” or “being corrigible”, then it seems like we’re probably fine.

Often just making feedback uncertain can help. For example, in the preference learning literature, Boltzmann rationality has emerged as the model of choice for how to interpret human feedback. While there are several theoretical justifications for this model, I suspect its success is simply because it makes feedback uncertain: if you want to have a model that assigns higher likelihood to high-reward actions, but still assigns some probability to all actions, it seems like you end up choosing the Boltzmann model (or something functionally equivalent). Note that there is work trying to improve upon this model, such as by [modeling humans as pedagogic](https://papers.nips.cc/paper/2016/file/b5488aeff42889188d03c9895255cecc-Paper.pdf), or by <@incorporating a notion of similarity@>(@LESS is More: Rethinking Probabilistic Models of Human Behavior@).

So overall, I don’t feel convinced that we need to aim for learning at all levels. That being said, the second post introduced a different argument: that the method does as well as we “could” do given the limits of human reasoning. I like this a lot more as a desideratum; it feels more achievable and more tied to what we care about.
