I wrote this a month ago while working on my SERI MATS applications for shard theory. I'm now less confident in the claims and the usefulness of this direction, but it still seems worth sharing. 

I think reflectivity shows up earlier than you might expect in embedded RL agents. The basic concepts around value drift ("addiction", ...) are already available in the world model from pretraining on human data (and alignment posts), and modeling context-dependent shard activation and value drift helps the SSL world model predict future behavior. Because of this, I think we can get useful reflectivity and study it in sub-dangerous AI. This is where a good chunk of my alignment optimism comes from. (Understanding reflectivity and instrumental convergence in real systems seems very important for building a safe AGI.)

In my model, people view reflectivity through some sort of magic lens (possibly due to conflating mystical consciousness with non-mystical self-awareness?). Predicting that I'm more likely to crave a cookie after seeing a cookie, or after becoming hungry, isn't that hard. And if you explain shards and contextual activation (granted, understanding the abstract concepts might be hard), you get more. Abstract reflection seems hard but still doable in sub-AGI systems.
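As a toy illustration of how cheap that kind of prediction is, here is a sketch with made-up features and weights (nothing from the post): a world model only needs something like a logistic score over context features to predict whether the cookie shard activates.

```python
# Toy sketch: predicting contextual shard activation from simple context
# features. Features and weights are made up; the point is only that
# "will I crave a cookie?" is an easy prediction problem for a world model.
import math

def craving_probability(sees_cookie: bool, hours_since_meal: float) -> float:
    """P(cookie shard activates) given the current context."""
    logit = -2.0                          # baseline: craving is unlikely in a neutral context
    logit += 2.5 if sees_cookie else 0.0  # seeing a cookie is a strong trigger
    logit += 0.4 * hours_since_meal       # hunger raises activation gradually
    return 1 / (1 + math.exp(-logit))

print(craving_probability(sees_cookie=True, hours_since_meal=5.0))   # ~0.92
print(craving_probability(sees_cookie=False, hours_since_meal=0.5))  # ~0.14
```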

There's also a meaningful difference between primitive/dumb shards and more reflective ones: my health shards outmaneuver my cookie shards because the former are "sophisticated enough" to reflectively outsmart the cookie shards (e.g. by hiding the cookies), while the latter are contextually activated and "more primitive". I expect modeling the contextual activation of the cookie shard not to be hard, but reflective planning like hiding cookies seems harder, though still doable. (This might be where the majority of the interesting/hard part is.) A minimal sketch of that asymmetry follows below.
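Here is that sketch, with invented shard classes (not a claim about how shards are actually implemented): the cookie shard only reacts to the current context, while the health shard models the cookie shard, does one step of lookahead, and edits the context so the trigger never fires.

```python
# Toy sketch of the asymmetry: a contextually activated shard vs. a shard
# that reflectively plans around it. The classes and the one-step "plan"
# are invented for illustration only.
from typing import Optional

class CookieShard:
    """Primitive: fires iff its trigger is present in the current context."""
    def act(self, context: set) -> Optional[str]:
        return "eat cookie" if "cookie visible" in context else None

class HealthShard:
    """Reflective: models the cookie shard and edits the context to pre-empt it."""
    def __init__(self, cookie_shard_model: CookieShard):
        self.cookie_shard_model = cookie_shard_model

    def act(self, context: set) -> Optional[str]:
        # One step of lookahead: "if the cookie stays visible, the cookie shard wins."
        if self.cookie_shard_model.act(context) is not None:
            context.discard("cookie visible")  # hide the cookie
            return "hide cookie"
        return None

context = {"cookie visible", "hungry"}
print(HealthShard(CookieShard()).act(context))  # "hide cookie" (context edited)
print(CookieShard().act(context))               # None (trigger removed, shard never activates)
```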


Relevant Paper

We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD is able to improve its policy entirely in-context without updating its network parameters. We demonstrate that AD can reinforcement learn in-context in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and find that AD learns a more data-efficient RL algorithm than the one that generated the source data.
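Reading the abstract, the core training objective appears to be ordinary autoregressive action prediction, just over sequences that span many episodes of a source RL run (in training order) rather than a single expert trajectory. Below is a rough sketch of that objective; the architecture, tokenization, and dataset layout are invented placeholders, not the paper's implementation.

```python
# Rough sketch of the Algorithm Distillation objective as described in the
# abstract: autoregressively predict the source algorithm's actions given the
# preceding *learning history* (many episodes, in training order), rather than
# a single expert trajectory. Architecture and data layout are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalActionPredictor(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, d_model: int = 128):
        super().__init__()
        # Each timestep token packs (observation, previous action one-hot, previous reward).
        self.embed = nn.Linear(obs_dim + n_actions + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, T, obs_dim + n_actions + 1), with T spanning many episodes.
        T = history.shape[1]
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.transformer(self.embed(history), mask=causal_mask)
        return self.head(h)  # logits over the next action at every timestep

def ad_training_step(model, optimizer, history, actions):
    """actions: (B, T) actions taken by the source RL algorithm along its history."""
    logits = model(history)
    loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]), actions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```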