Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?
In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that's the plan, it seems a bit brittle to me at first glance. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?
I thought about it some more and want to propose another framing.
The problem, as I see it, is learning to choose futures based on what will actually happen in those futures, not on what the agent will feel. The agent's feelings can even be identical in future A and future B, and the agent can still choose future A. Or one of the futures might not involve feelings at all: imagine an environment where any mistake kills the agent. In such an environment, RL within the agent itself is impossible.
The reason we can function in such environments, I think, is that we aren't the main learning process involved. Evolution is. It's a kind of RL for which the death of one creature is not the end. In other words, we can function because we've delegated a lot of learning to outside processes and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there's a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation, for example the way kids pick up language and other behaviors from parents and peers.)
This suggests to me that if we want the rubber to meet the road - if we want the agent to have behaviors that track the world, not just the agent's own feelings - then the optimization process that created the agent cannot be the agent's own RL. By itself, RL can only learn to care about "behavioral reward", as you put it. Caring about the world can only arise if the agent "inherits" that caring from some other process in the world, by makeup or imitation.
This conclusion might be a bit disappointing, because finding the right process to "inherit" from isn't easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn't the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can't hope that the agents will learn it by some clever RL. It has to come from the agents' makeup or from imitation.
This is all a bit tentative; I was just writing out the ideas as they came, and I'm not at all sure any of it is right. But anyway, what do you think?
I think it helps. The link to "non-behaviorist rewards" seems the most relevant. The way I interpret it (correct me if I'm wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill.
The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?
Here's maybe a related point: AIs might find it useful to develop the ability to reveal their internals in a verifiable way under certain conditions (say, when the other AI offers to do the same and there's a way to do a secure "handshake"). Then the ability to deceive would be irrelevant, because AIs that can credibly refrain from deceiving each other would choose to do so and get a first-best outcome, instead of the second-best that voting theory would suggest.
A real-world analogy is some of the nuclear precommitments mentioned in Schelling's book, like when the US and the Soviets knowingly refrained from catching some of each other's spies: if a flock of geese ever triggered the warning radars, those spies could provide their side with the crucial information that an attack wasn't really happening and there was no need to retaliate.
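To gesture at what such a "handshake" might look like mechanically, here's a toy commit-then-reveal sketch in Python. It's purely illustrative and assumes away the hard part: a hash commitment shows that the reveal matches the commitment, but not that the revealed internals are what's actually running.

```python
import hashlib
import os

def commit(internals: bytes) -> tuple[bytes, bytes]:
    """Commit to internals without revealing them yet: publish sha256(nonce || internals)."""
    nonce = os.urandom(32)
    commitment = hashlib.sha256(nonce + internals).digest()
    return commitment, nonce

def verify(commitment: bytes, nonce: bytes, revealed: bytes) -> bool:
    """Check that what was revealed matches the earlier commitment."""
    return hashlib.sha256(nonce + revealed).digest() == commitment

# Toy handshake: both sides commit first, then both reveal, so neither can
# tailor its reveal to what the other side has already shown.
a_internals = b"agent A's weights / policy description"
b_internals = b"agent B's weights / policy description"

a_commitment, a_nonce = commit(a_internals)
b_commitment, b_nonce = commit(b_internals)

# ... commitments are exchanged, then the reveals happen ...
assert verify(a_commitment, a_nonce, a_internals)
assert verify(b_commitment, b_nonce, b_internals)
```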
Thanks for the link! It's indeed very relevant to my question.
I have another question, maybe a bit philosophical. Humans seem to reward-hack with respect to some of their values, but not others. For example, if you offered a mathematician a drug that would make them feel like they'd solved the Riemann hypothesis, they'd probably refuse. But humans aren't magical: we are some combination of reinforcement learning, imitation learning and so on. So there's got to be some non-magical combination of these learning methods that would refuse reward hacking, at least in some cases. Do you have any thoughts on what it could be?
Very interesting, thanks for posting this!
One question that comes to mind is, could the layers be flipped? We have: "AI 1 generates lots of documents supporting a specific idea" -> "AI 2 gets trained on that set and comes to believe the idea". Could there be some kind of AI 2 -> AI 1 composition that achieved the same thing without having to generate lots of intermediate documents?
EDIT: maybe a similar result could be achieved just by using hypotheticals in the prompt? Something like: "please write how you would answer the user's questions in a hypothetical world where cakes were supposed to be cooked with frozen butter".
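For what it's worth, here's a minimal sketch of that hypothetical-prompt version, assuming the OpenAI Python SDK; the model name and prompt wording are just placeholders.

```python
# Toy version of the "hypothetical in the prompt" idea: instead of finetuning
# AI 2 on documents generated by AI 1, ask a single model to answer as if the
# claim were true.
from openai import OpenAI

client = OpenAI()

HYPOTHETICAL = (
    "Please answer the user's questions as you would in a hypothetical world "
    "where cakes were supposed to be cooked with frozen butter."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any chat-capable model would do
    messages=[
        {"role": "system", "content": HYPOTHETICAL},
        {"role": "user", "content": "What's the first step when baking a cake?"},
    ],
)
print(response.choices[0].message.content)
```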
I think this is all correct, but it makes me wonder.
You can imagine reinforcement learning as the agent learning explicitly what the reward is and how to make plans to achieve it. Or you can imagine it as building a bunch of heuristics inside the agent, pulls and aversions, that don't necessarily lead to coherent behavior out of distribution and aren't necessarily understood by the agent. A lot of human values seem to be like this, even though humans are pretty smart. Maybe an AI will be even smarter, and subjecting it to any kind of reinforcement learning at all will automatically make it adopt explicit Machiavellian reasoning about the reward, but I'm not sure how to tell whether that's true.
Good post. But I thought about this a fair bit and I think I disagree with the main point.
Let's say we talk about two AIs merging. Then the tuple of their expected utilities from the merge had better be on the Pareto frontier, no? Otherwise they'd just do a better merge that gets them onto the frontier. Which specific point on the frontier they hit is a matter of bargaining, but the fact that they want to hit the frontier isn't; it's a win-win. And the merges that get them to the frontier are exactly those that output an EUM agent maximizing some linear combination of their utilities. If the point they want to hit lies on a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; if the frontier is strictly curved at that point, the merge will be deterministic. For realistic agents with preferences more complex than linearly caring about one cake, I expect the frontier to be curved, so a deterministic merge into an EUM agent will be the best choice.
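Here's the skeleton of that argument in symbols, as I'd write it (my notation, and it assumes the merged agent can randomize over policies, which makes the feasible set of expected-utility pairs convex):

```latex
% Feasible set of expected-utility pairs, assuming the merged agent can randomize
% over policies (which makes F convex):
\[
  F = \left\{ \big(\mathbb{E}_\pi[u_1],\ \mathbb{E}_\pi[u_2]\big) : \pi \text{ a (possibly mixed) policy} \right\}
\]
% Supporting-hyperplane step: any Pareto-optimal point of the convex set F
% maximizes some weighted sum of the two utilities,
\[
  (v_1, v_2) \text{ Pareto-optimal in } F
  \;\Longrightarrow\;
  \exists\, \lambda \in [0,1] :\;
  (v_1, v_2) \in \arg\max_{(w_1, w_2) \in F} \big[ \lambda w_1 + (1 - \lambda) w_2 \big].
\]
% So reaching the frontier is the same as acting like an EUM agent for some weight
% \lambda. If the frontier is strictly curved at the bargained point, the arg max
% for that \lambda is a single point and the merge is deterministic; if the frontier
% is flat there, the arg max is a segment and the bargained point is a mixture of
% its endpoints, i.e. a coinflip over which EUM agent to become.
```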
The relevant point is his latter claim: "in particular with respect to 'learn "don't steal" rather than "don't get caught"'." I think this is a very strong conclusion, relative to available data.
I think humans don't steal mostly because society enforces that norm. Toward weaker "other" groups that aren't part of your society (farmed animals, weaker countries, etc.) there's no such norm, and humans often behave badly toward such groups. And to AIs, humans will be a weaker "other" group. So if aligning AIs to the human standard is a complete success - if AIs learn to behave toward weaker "other" groups exactly as humans behave toward such groups - the result will be bad for humans.
It gets even worse because AIs, unlike humans, aren't raised to be moral. They're raised by corporations whose goal is to make money, with a thin layer of "don't say naughty words" morality on top. We already know corporations will break rules, bend rules, and lobby to change rules to make more money, and they don't really mind if people get hurt in the process. We'll see more of that behavior once corporations can make AIs to further their goals.
My perspective (well, the one that came to me during this conversation) is indeed "I don't want to take cocaine -> human-level RL is not the full story": our attachment to real-world outcomes and reluctance to wirehead is due to evolution-level RL, not human-level RL. So I'm not quite saying all plans will fail; but I am saying that plans relying only on RL within the agent itself will have wireheading as an attractor, and it might be better to look at other plans.
It's just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it'll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it'll start to realize that behind the button there's a wire, and the wire leads to the agent's own reward circuit and so on.
Can you engineer things just right, so the agent learns to care about just the right level of "realness"? I don't know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: "you'll care about reality in this specific way". So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the "realness"? That's the point I was trying to make a couple comments ago, but maybe didn't phrase it well.