Comments

How would you distinguish between weak and strong methods?

Subcortical reinforcement circuits, though, hail from a distinct informational world... and so have to reinforce computations "blindly," relying only on simple sensory proxies.

This seems to be pointing in an interesting direction that I'd like to see expanded.

Because your subcortical reward circuitry was hardwired by your genome, it's going to be quite bad at accurately assigning credit to shards.

I don't know; I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of "good" and "bad" here. Is there an example you were thinking of? Cognitive biases in general?
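
To make the disagreement concrete, here's a toy sketch of what I take "blind" credit assignment to mean -- this is entirely my own construction, not anything from the post. The hardwired circuit only sees a sensory proxy for success, so reinforcement gets smeared across every shard that happened to be active, not just the one that caused the outcome.

```python
# Toy model of proxy-based ("blind") credit assignment across shards.
# All names and numbers here are illustrative, not from the post.
import numpy as np

rng = np.random.default_rng(0)

n_shards = 5
strengths = np.ones(n_shards)   # how strongly each shard influences behavior
causal_shard = 2                # the shard whose activity actually produces the outcome

for _ in range(1000):
    # Each shard's activity this step scales with its current strength.
    activity = rng.random(n_shards) * strengths
    # The reward circuit only sees a sensory proxy tied to the causal shard's output.
    reward = 1.0 if activity[causal_shard] > 0.5 else 0.0
    # "Blind" update: credit is smeared over all active shards, not just the causal one.
    strengths += 0.01 * reward * activity

print(strengths)  # the causal shard grows fastest, but bystander shards get reinforced too
```

Whether to call that "bad" credit assignment seems to be the crux: the causal shard does end up strongest, but plenty of reinforcement leaks to bystanders.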

if shard theory is true, meaningful partial alignment successes are possible

"if shard theory is true" -- is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?

Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot

What's to stop the human shards from being dominated and extinguished by the non-human shards? I.e., is there reason to expect an equilibrium?
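
For what I mean by "equilibrium", here's the kind of toy model I have in mind (again my own construction, with illustrative weights, not anything from the post): treat the blended utility function as a weighted sum over shard utilities with diminishing returns in resources. Under that assumption the optimum gives every shard a share proportional to its weight, so the human shards are not driven to zero; the question is what, if anything, enforces the diminishing-returns and joint-optimization assumptions.

```python
# Toy "blended utility" allocation; weights and functional form are illustrative.
import numpy as np
from scipy.optimize import minimize

weights = np.array([0.05, 0.95])   # human shards vs. non-human shards
budget = 1.0                       # total resources (the lightcone, normalized)

def neg_blended_utility(x):
    x = np.clip(x, 1e-9, None)
    return -np.sum(weights * np.log(x))   # log = diminishing returns in resources

result = minimize(
    neg_blended_utility,
    x0=np.array([0.5, 0.5]),
    bounds=[(1e-9, budget)] * 2,
    constraints={"type": "eq", "fun": lambda x: np.sum(x) - budget},
)
print(result.x)  # ~[0.05, 0.95]: each shard keeps a share proportional to its weight
```

If the shard utilities are closer to linear in resources (winner-take-all), that result disappears and the highest-weight shard takes everything, which is the scenario I'm worried about.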

Two points:

  1. The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome: the true target is vanishingly small, and attempts to hit it slide off as optimization pressure is applied. I'm curious whether others share this model and whether it's been refined or explored in more detail. (A toy numerical sketch of what I mean follows this list.)
  2. The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bit smarter than humans but not yet dangerous. (An implicit assumption here may be that intelligence continues to develop in straight lines.) The sharp left turn model implies this sweet spot will pass by in the blink of an eye. (An implicit assumption here may be that there are discrete leaps.) It's interesting to note that Nate explicitly says recursive self-improvement (RSI) is not a core part of his model. I'd like to see more arguments on both sides of this debate.
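
On point 1, here is a minimal numerical sketch of the repeller intuition. It's my own toy construction with arbitrary coefficients, not anyone's worked-out model: put "capabilities" in an attractor basin, make the exact alignment target an unstable fixed point, and apply optimization pressure.

```python
# Toy dynamics: capabilities attracted to a basin, alignment error repelled from the target.
# Everything here is illustrative; the coefficients are arbitrary.
def step(capability, alignment_error, lr=0.1):
    capability += lr * (1.0 - capability)    # stable fixed point (attractor) at capability = 1
    alignment_error += lr * alignment_error  # unstable fixed point (repeller) at error = 0
    return capability, alignment_error

for initial_error in (0.0, 1e-6, 1e-3):
    c, e = 0.1, initial_error
    for _ in range(200):
        c, e = step(c, e)
    print(f"initial error {initial_error:g} -> capability {c:.3f}, final error {e:.3g}")
# Capabilities end up near 1 from any starting point; only an exactly-zero initial
# alignment error stays on target, and any nonzero error is amplified.
```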