Abram Demski




It's imaginable to do this work but not remember any of it, i.e. avoid having that work leave traces that can accumulate, but that seems like a delicate, probably unnatural carving.

Is the implication here that modern NNs don't do this? My own tendency would be to think that they are doing a lot of this -- doing a bunch of reasoning which gets thrown away rather than saved. So it seems like modern NNs have simply managed to hit this delicate unnatural carving. (Which in turn suggests that it is not so delicate, and even, not so unnatural.)

Attempting to write out the holes in my model. 

  • You point out that looking for a perfect reward function is too hard; optimization searches for upward errors in the rewards to exploit. But you then propose an RL scheme. It seems to me like it's still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.
  • It seems like you have a few tools to combat this form of critique:
    • Model capacity. If the policy that exploits the upward errors is too complex to fit in the model, it cannot be learned. Or, more refined: if the proposed exploit-policy has features that make it difficult to learn (whether complexity, or some other special features). 
      • I don't think you ever invoke this in your story, and I guess maybe you don't even want to, because it seems hard to separate the desired learned behavior from undesired learned behavior via this kind of argument. 
    • Path-dependency. What is learned early in training might have an influence on what is learned later in training.
      • It seems to me like this is your main tool, which you want to invoke repeatedly during your story.
  • I am very uncertain about how path-dependency works.
    • It seems to me like this will have a very large effect on smaller neural networks, but a vanishing effect on larger and larger neural networks, because as neural networks get large, the gradient updates get smaller and smaller, and stay in a region of parameter-space where the loss is closer to linear. This means that much larger networks have much less path-dependency.
      • Major caveat: the amount of data is usually scaled up with the size of the network. Perhaps the overall amount of path-dependency is preserved. 
        • This contradicts some parts of your story, where you mention that not too much data is needed due to pre-training. However, you did warn that you were not trying to project confidence about story details. Perhaps lots of data is required in these phases to overcome the linearity of large-model updates, IE, to get the path-dependent effects you are looking for. Or perhaps tweaking step size is sufficient.
    • The exact nature of path-dependence seems very unclear.
      • In the shard language, it seems like one of the major path-dependency assumptions for this story is that gradient descent (GD) tends to elaborate successful shards rather than promote all relevant shards.
        • It seems unclear why this would happen in any one step of GD. At each neuron, all inputs which would have contributed to the desired result get strengthened. 
          • My model of why very large NNs generalize well is that they effectively approximate Bayesian learning, hedging their bets by promoting all relevant hypotheses rather than just one.
          • An alternate interpretation of your story is that ineffective subnetworks are basically scavenged for parts by other subnetworks early in training, so that later on, the ineffective subnetworks don't even have a chance. This effect could compound on itself as the remaining not-winning subnetworks have less room to grow (less surrounding stuff to scavenge to improve their own effective representation capacity).
            • Shards that work well end up increasing their surface area, and surface area determines a kind of learning rate for a shard (via the shard's ability to bring other subnetworks under its gradient influence), so there is a compounding effect. 
      • Another major path-dependency assumption seems to be that as these successful shards develop, they tend to have a kind of value-preservation. (Relevant comment.)
        • EG, you might start out with very simple shards that activate when diamonds are visible and vote to walk directly toward them. These might elaborate to do path-planning toward visible diamonds, and then to do path-planning to any diamonds which are present in the world-model, and so on, upwards in sophistication. So you go from some learned behavior that could be anthropomorphized as diamond-seeking, to eventually having a highly rational/intelligent/capable shard which really does want diamonds.
        • Again, I'm unclear on whether this can be expected to happen in very large networks, due to the lottery ticket hypothesis. But assuming some path-dependency, it is unclear to me whether it will work like this.
        • The claim would of course be very significant if true. It's a different framework, but effectively, this is a claim about ontological shifts - as you more-or-less flag in your major open questions. 
        • While I find the particular examples intuitive, the overall claim seems too good to be true: effectively, that the path-dependencies which differentiate GD learning from ideal Bayesian learning are exactly the tool we need for alignment. 

So, it seems to me, the most important research questions to figure out if a plan like this could be feasible revolve around the nature and existence of path-dependencies in GD, especially for very large NNs.
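To make the linearity worry above a bit more concrete, here is a rough sketch. It assumes a very wide network whose parameters stay close to their initialization $\theta_0$ during training (often called the lazy or neural-tangent-kernel regime; that framing and the symbols $f_\theta$, $\theta_0$ are assumptions added for illustration, not something the story itself commits to).

```latex
% First-order expansion of the network output around initialization \theta_0
f_{\theta}(x) \;\approx\; f_{\theta_0}(x) + \nabla_{\theta} f_{\theta_0}(x)^{\top} (\theta - \theta_0)
```

If this approximation held throughout training, the model would be linear in $\theta$, so a convex loss such as squared error would be convex in $\theta$. Full-batch gradient descent with a small enough step size would then converge to a minimizer fixed by $\theta_0$ and the data, leaving little room for what was learned early to change what is learned later. How far real large networks are from this idealization is exactly the open question.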

A good question. I've never seen it happen myself; so where I'm standing, it looks like short emergence examples are cherry-picked.

What report is the image pulled from?

I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can't keep dropping in probability forever, since evidence is lost. The probability only becomes small - but this means if you run for long enough you do in fact expect the transition.

LLMs are high-order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence drops out of memory eventually, the probability of Waluigi becomes very small instead of dropping to zero. This makes an eventual waluigi transition inevitable, as claimed in the post.

I disagree. The crux of the matter is the limited memory of an LLM. If the LLM had unlimited memory, then every Luigi act would further accumulate a little evidence against Waluigi. But because LLMs can only update on so much context, the probability drops to a small one instead of continuing to drop to zero. This makes waluigi inevitable in the long run.
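The limited-memory argument can be illustrated with a toy calculation. This is only a sketch under strong assumptions: the per-step transition chance `eps`, the `decay` factor, and the independence between steps are all hypothetical simplifications, not measured properties of any LLM. With limited memory the per-step chance is modeled as a fixed floor `eps`; with unlimited memory it is modeled as shrinking geometrically as evidence accumulates.

```python
def prob_still_luigi(eps: float, steps: int) -> float:
    """Chance that no transition has happened after `steps` steps,
    assuming a constant, independent per-step transition chance `eps`
    (the limited-memory case: the probability bottoms out at a floor)."""
    return (1.0 - eps) ** steps


def prob_still_luigi_decaying(eps: float, decay: float, steps: int) -> float:
    """Same quantity when the per-step chance shrinks as eps * decay**t
    (the unlimited-memory case: evidence against Waluigi keeps accumulating)."""
    survive = 1.0
    for t in range(steps):
        survive *= 1.0 - eps * decay**t
    return survive


def expected_steps_to_transition(eps: float) -> float:
    """Mean waiting time of a geometric distribution with success chance `eps`."""
    return 1.0 / eps


if __name__ == "__main__":
    eps = 1e-4  # hypothetical floor on the per-step transition chance
    for n in (1_000, 10_000, 100_000):
        print(n, prob_still_luigi(eps, n), prob_still_luigi_decaying(eps, 0.99, n))
    print("expected steps until transition:", expected_steps_to_transition(eps))
```

In the constant-floor case the survival probability tends to zero as the run gets longer, so the transition is expected eventually. In the decaying case the per-step chances sum to a finite total, so the survival probability stays bounded away from zero.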
