Ramana Kumar

Wiki Contributions

Comments

On how various plans miss the hard bits of the alignment challenge

For 2, I think a lot of it is finding the "sharp left turn" idea unlikely. I think trying to get agreement on that question would be valuable.

For 4, some of the arguments for it in this post (and comments) may help.

For 3, I'd be interested in there being some more investigation into and explanation of what "interpretability" is supposed to achieve (ideally with some technical desiderata). I think this might end up looking like agency foundations if done right.

For example, I'm particularly interested in how "interpretability" is supposed to work if, in some sense, much of the action of planning and achieving some outcome occurs far away from the code or neural network that played some role in precipitating it. E.g., one NN-based system convinces another more capable system to do something (including figuring out how); or an AI builds some successor AIs that go on to do most of the thinking required to get something done. What should "interpretability" do for us in these cases, assuming we only have access to the local system?

Will Capabilities Generalise More?

The desiderata you mentioned:

  1. Make sure the feedback matches the preferences
  2. Make sure the agent isn't changing the preferences

It seems that RRM/Debate somewhat addresses both of these, and path-specific objectives is mainly aimed at addressing issue 2. I think (part of) John's point is that RRM/Debate don't address issue 1 very well, because we don't have very good or robust processes for judging the various ways we could construct or improve these schemes. Debate relies on a trustworthy/reliable judge at the end of the day, and we might not actually have that.

Will Capabilities Generalise More?

Nice - thanks for this comment - how would the argument be summarised as a nice heading to go on this list? Maybe "Capabilities can be optimised using feedback but alignment cannot" (and feedback is cheap, and optimisation eventually produces generality)?

Will Capabilities Generalise More?

I think what you say makes sense, but to be clear the argument does not consider those things as the optimisation target but rather considers fitness or reproductive capacity as the optimisation target. (A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.)

Where I agree and disagree with Eliezer

I basically agree with you. I think you go too far in saying Lethailty 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":

  1. cares about something
  2. cares about something external (not shallow function of local sensory data)
  3. cares about something specific and external

(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if being about both feat 2 and feat 3.

I think I need to distinguish two versions of feat 3:

  1. there is a reliable (and maybe predictable) mapping between the specific targets of caring and the mind-producing process
  2. there is a principal who gets to choose what the specific targets of caring are (and they succeed)

Humans show that feat 2 at least has been accomplished, but also 3a, as I take you to be pointing out. I maintain that 3b is not demonstrated by humans and is probably something we need.

Where I agree and disagree with Eliezer

Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.

But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).

The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:

  • for AGIs there's a principal (humans) that we want to align the AGI to
  • for humans there is no principal - our values can be whatever. Or if you take evolution as the principal, the alignment problem wasn't solved.
Training Trace Priors

In this story deception is all about the model having hidden behaviors that never get triggered during training


Not necessarily - depends on how abstractly we're considering behaviours. (It also depends on how likely we are to detect the bad behaviours during training.)

Consider an AI trained on addition problems that is only exposed to a few problems that look like 1+3=4, 3+7=10, 2+5=7, 2+6=8 during training, where there are two summands which are each a single digit and they appear in ascending order. Now at inference time the model exposed to 10+2= outputs 12.

Have we triggered a hidden behaviour that was never encountered in training? Certainly these inputs were never encountered, and there's maybe a meaningful difference in the new input, since it involves multiple digits and out-of-order summands. But it seems possible that exactly the same learned algorithm is being applied now as was being applied during the late stages of training, and so there won't be some new parts of the model being activated for the first time.

Deceptive behaviour might be a natural consequence of the successful learned algorithms when they are exposed to appropriate inputs, rather than different machinery that was never triggered during training.

AGI Ruin: A List of Lethalities

We might summarise this counterargument to #30 as "verification is easier than generation". The idea is that the AI comes up with a plan (+ explanation of how it works etc.) that the human systems could not have generated themselves, but that human systems can understand and check in retrospect.

Counterclaim to "verification is easier than generation" is that any pivotal act will involve plans that human systems cannot predict the effects of just by looking at the plan. What about the explanation, though? I think the problem there may be more that we don't know how to get the AI to produce a helpful and accurate explanation as opposed to a bogus misleading but plausible-sounding one, not that no helpful explanation exists. 

Richard Ngo's Shortform

What kind of access might be needed to private models? Could there be a secure multi-party computation approach that is sufficient?

Load More