Let the wonder never fade!

Aspiring alignment researcher with a keen interest in agent foundations. Studying math, physics, theoretical CS (Harvard 2027). Contact me via Discord: dalcy_me, email: dalcy.mail@gmail.com. They / Them, He / Him.



What are the errors in this essay? As I'm reading through the Brain-like AGI sequence I keep seeing this post being referenced (but this post says I should instead read the sequence!)

I would really like to have a single reference post of yours that contains the core ideas about phasic dopamine, rather than the reference being the sequence posts (which are heavily dependent on a bunch of previous posts; also, Posts 5 and 6 feel more high-level than this one?)


Especially because we’re working with toy models that ostensibly fit the description of an optimizer, we may end up with a model that mechanistically doesn’t have an explicit notion of objective.

I think this is very likely to be the default for most toy models one trains with RL. In my model of agent value formation (which looks very much like this post), explicit representation of objectives is useful only inasmuch as the model already has some sort of internal "optimizer" or search process. Before that point, simple "heuristics" (or shards) should suffice—especially in small training regimes.


I think that RLHF is reasonably likely to be safer than prompt engineering: RLHF is probably a more powerful technique for eliciting your model’s capabilities than prompt engineering is. And so if you need to make a system which has some particular level of performance, you can probably achieve that level of performance with a less generally capable model if you use RLHF than if you use prompt engineering.

Wait, that doesn't really follow. RLHF may elicit more capabilities than prompt engineering, yes, but how does that by itself make RLHF safer than prompt engineering?