CatGoddess — AI Alignment Forum

I agree that motivation should reduce to low-level, primitive things, and also that changing the agent's belief about where the cheese is lets you retarget behavior. However, I don't expect edits to beliefs to let you scalably control what the agent does, in that if it's smart enough and making sufficiently complicated plans you won't have a reliable mapping from (world model state) to (abstract class of behavior executed by the agent), where when I say "abstract class of behavior" I mean things like "put the red balls in the blue basket" or "pet all the cats in the environment."

It also seems plausible to me that there exist parts of the agent that do allow for scalable control through modification, and this is what I would refer to as "the values" (the classic example here is a utility function, though things like RL agents might not have those).

But maybe you're studying the structure of motivational circuitry with a downstream objective other than "scalable control," in which case this objection doesn't necessarily apply.

Understanding and controlling a maze-solving policy network

CatGoddess3y20

CatGoddess3y00

Great post! I'm looking forward to seeing future projects from Team Shard.

I'm curious why you frame channel 55 as being part of the agent's "cheese-seeking motivation," as opposed to simply encoding the agent's belief about where the cheese is. Unless I'm missing something, I'd expect the latter to be as or more likely - in that when you change the cheese's location, the thing that should straightforwardly change is the agent's model of the cheese's location.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments