This is a nice frame of the problem.

In theory, at least. It's not so clear that there are any viable alternatives to argmax-style reasoning that will lead to superhuman intelligence.

I think this is one proto-intuition for why Goodhart-arguments seem Wrong to me, like they're from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don't think I've properly communicated my feelings in this comment, but hopefully it's better than nothing.)

My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of "iteration" system (a la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or have similar problems to argmax planners with crisp utility functions.

A nuclear bomb steers a lot of far-away objects into a high-entropy configuration, and does so very robustly, but that perhaps is not a "small part of the state space".

This example reminds me of a thing I have been thinking about, namely that it seems like optimization can only occur in cases where the optimization produces/is granted enough "energy" to control the level below. In this example, the model works in a quite literal way, as a nuclear bomb floods an area with energy, and I think this example generalizes to e.g. markets with Dutch disease.

Flooding the lower level with "energy" is presumably not the only way this problem can occur; a lack of incentives/credit assignment in the upper level produces the same result, simply because without incentives the upper level does not allocate "energy" to the area.

The data may itself have a privileged basis, but this should be lost as soon as the first linear layer is reached.

Not totally lost if the layer is e.g. a convolutional layer, because while the pixels within the convolutional window can get arbitrarily scrambled, it is not possible for a convolutional layer to scramble things across different windows in different parts of the picture.
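To make that locality claim concrete, here is a toy numpy sketch (the naive `conv2d_valid` helper is purely illustrative, not from the discussion): perturbing one input pixel only changes the outputs whose window covers that pixel, so a convolutional layer can arbitrarily mix values within each window but cannot scramble information across distant parts of the picture.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D 'valid' convolution (cross-correlation), for illustration only."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output depends only on the kh-by-kw window under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))
kernel = rng.normal(size=(3, 3))  # an arbitrary "scrambling" within each 3x3 window

base = conv2d_valid(image, kernel)

# Perturb a single pixel in the bottom-right corner.
perturbed = image.copy()
perturbed[15, 15] += 10.0
out = conv2d_valid(perturbed, kernel)

# Only the output whose 3x3 window covers pixel (15, 15) changes.
changed = np.argwhere(out != base)
print(changed)  # → [[13 13]]: the change stays spatially local
```

A fully-connected layer run on the flattened image would have no analogous guarantee: every output could depend on every pixel.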

I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment. 

How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking out in the future. Hence, RLHF in a human context generates deception.

Are you saying "The AI says something which makes us erroneously believe it saved a person's life, and we reward it, and this can spawn a deception-shard"?

Not necessarily a general deception shard; it just spawns some sort of shard that repeats things similar to what it did before, which presumably means more often erroneously making us believe it saved a person's life. That could look like deception, or approval-seeking, or donating-to-charities-without-regard-for-effectiveness, or something else.

If so -- that's not (necessarily) how credit assignment works.

Your post points out that you can do all sorts of things in theory if you "have enough write access to fool credit assignment". But that's not sufficient to show that they can happen in practice. You gotta propose a system of write access, and a way for training to use this write access to do what you are proposing.

I don't know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.

Would you not agree that models are unaligned by default, unless there is something that aligns them?

So I guess if we want to be concrete, the most obvious place to start would be classical cases where RLHF has gone wrong. Like a gripper pretending to pick up an object by placing its hand in front of the camera, or a game-playing AI pretending to make progress by replaying the same part of the game over and over again. Though these are "easy" in the sense that they seem correctable by taking more context into consideration.

One issue with giving concrete examples is that I think nobody has gotten RLHF to work in problems that are too "big" for humans to have all the context. So we don't really know how it would work in the regime where it seems irreparably dangerous. Like I could say "what if we give it the task of coming up with plans for an engineering project and it has learned to not make pollution that causes health problems obvious? Due to previously having suggested a design with obvious pollution and having that design punished", but who knows how RLHF will actually be used in engineering?

I guess to me, shard theory resembles RLHF, and seems to share its flaws (unless this gets addressed in a future post or I missed it in one of the existing posts or something).

So for instance learning values by reinforcement events seems likely to lead to deception. If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
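As a toy illustration of that dynamic (all numbers here, including how much the fooled evaluator overpays, are made-up assumptions, not anything from the post): a simple reward-proportional learner that gets slightly more observed reward for a "deceive" action than for an "honest" one ends up preferring deception, even though the honest action is the genuinely prosocial one.

```python
import numpy as np

# Toy model of the reinforcement argument. Assumption (mine, not from the
# post): the evaluator is fooled by the "deceive" action and so pays it
# slightly more observed reward than the genuinely prosocial action.
ACTIONS = ["honest", "deceive"]
OBSERVED_REWARD = {"honest": 1.0, "deceive": 1.2}

rng = np.random.default_rng(0)
logits = np.zeros(2)  # the learner's preferences over ACTIONS
lr = 0.1

for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=probs)
    # REINFORCE-style update: whatever got rewarded gets more probability.
    grad = -probs
    grad[a] += 1.0
    logits += lr * OBSERVED_REWARD[ACTIONS[a]] * grad

probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(ACTIONS, probs)))  # "deceive" ends up with most of the probability
```

The point is just that nothing in reward-proportional updating distinguishes "was actually prosocial" from "made the evaluator believe it was prosocial"; the gradient only sees the observed reward.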

This doesn't become much of a problem in practice among humans (or well, it actually does seem to be a fairly significant problem, but not x-risk level significant), but the most logical reinforcement-based reason I can see why it doesn't become a bigger problem is that people cannot reliably deceive each other. (There may also be innate honesty instincts? But that runs into genome inaccessibility problems.)

These seem like standard objections around here so I assume you've thought about them. I just don't notice those thoughts anywhere in the work.

I'm basically re-raising the point I asked about in your linked post; the alignability of sharded humans seems to be due to people living in a society that gives them feedback on their behavior that they have to follow. This allows cooperative shards to grow. It doesn't seem like it would generalize to more powerful beings.

But how does this help with alignment? Sharded systems seem hard to robustly align outside of the context of an entity who participates on equal footing with other humans in society.

I think if you pretrained it on all of YouTube, you could get explanations and illustrations of people doing basic tasks. I think this would (if used with appropriate techniques that can be developed on short notice) make it need very little data for basic tasks, because it can just interpolate from its previous experiences.
