AI ALIGNMENT FORUM
Vladimir Slepnev

https://vladimirslepnev.me

Posts

cousin_it's Shortform · 1 point · 6y · 10 comments
Announcement: AI alignment prize round 4 winners · 21 points · 7y · 0 comments
Announcement: AI alignment prize round 3 winners and next round · 24 points · 7y · 0 comments
UDT can learn anthropic probabilities · 16 points · 7y · 0 comments
Using the universal prior for logical uncertainty · 0 points · 7y · 0 comments
UDT as a Nash Equilibrium · 6 points · 8y · 0 comments
Beware of black boxes in AI alignment research · 11 points · 8y · 0 comments
Announcing the AI Alignment Prize · 1 point · 8y · 0 comments
Using modal fixed points to formalize logical causality · 9 points · 8y · 10 comments
A cheating approach to the tiling agents problem · 3 points · 8y · 3 comments
Comments
We Built a Tool to Protect Your Dataset From Simple Scrapers
cousin_it · 2mo · 32

Have you seen Anubis?

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 5mo · 50

My perspective (well, the one that came to me during this conversation) is indeed "I don't want to take cocaine -> human-level RL is not the full story". That is, our attachment to real-world outcomes and reluctance to wirehead is due to evolution-level RL, not human-level. So I'm not quite saying all plans will fail; but I am saying that plans relying only on RL within the agent itself will have wireheading as an attractor, and it might be better to look at other plans.

It's just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it'll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it'll start to realize that behind the button there's a wire, and the wire leads to the agent's own reward circuit and so on.

Can you engineer things just right, so the agent learns to care about just the right level of "realness"? I don't know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: "you'll care about reality in this specific way". So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the "realness"? That's the point I was trying to make a couple comments ago, but maybe didn't phrase it well.

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 5mo* · 30

Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?

In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that's the plan, to me at first glance it seems a bit brittle. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 5mo* · 10

I thought about it some more and want to propose another framing.

The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel. The agent's feelings can even be identical in future A vs future B, but the agent can choose future A anyway. Or maybe one of the futures won't even have feelings involved: imagine an environment where any mistake kills the agent. In such an environment, RL is impossible.

The reason we can function in such environments, I think, is because we aren't the main learning process involved. Evolution is. It's a kind of RL for which the death of one creature is not the end. In other words, we can function because we've delegated a lot of learning to outside processes, and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there's a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation. For example, the way kids pick up language and other behaviors from parents and peers.)

This suggests to me that if we want the rubber to meet the road - if we want the agent to have behaviors that track the world, not just the agent's own feelings - then the optimization process that created the agent cannot be the agent's own RL. By itself, RL can only learn to care about "behavioral reward" as you put it. Caring about the world can only occur if the agent "inherits" that caring from some other process in the world, by makeup or imitation.

This conclusion might be a bit disappointing, because finding the right process to "inherit" from isn't easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn't the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can't hope that the agents will learn it by some clever RL. It has to be due to the agent's makeup or imitation.

This is all a bit tentative; I was just writing out the ideas as they came, and I'm not at all sure any of it is right. But anyway, what do you think?

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 5mo · 10

I think it helps. The link to "non-behaviorist rewards" seems the most relevant. The way I interpret it (correct me if I'm wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill.
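
To pin that down a bit, here is a toy sketch (made-up data structures, not anyone's actual proposal): a behaviorist evaluator scores a future only by the predicted experience stream, which is identical for the two buttons, while a non-behaviorist evaluator also consults the agent's current world-model.

```python
# Toy sketch (made-up data structures): two futures whose predicted experiences
# are identical but whose world states differ.
futures = {
    "button A": {"experiences": ["rabbit vanishes", "amnesia"], "world": "rabbit felt no pain"},
    "button B": {"experiences": ["rabbit vanishes", "amnesia"], "world": "rabbit suffered"},
}

# Hypothetical valence the agent assigns to each predicted experience.
valence = {"rabbit vanishes": 0.0, "amnesia": 0.0}

def behaviorist_value(future):
    # Depends only on what the agent predicts it will feel; identical
    # experience streams therefore get identical scores.
    return sum(valence[e] for e in future["experiences"])

def non_behaviorist_value(future):
    # Also depends on what the agent currently believes happens in the world,
    # including parts it will never observe or remember.
    penalty = 1.0 if future["world"] == "rabbit suffered" else 0.0
    return behaviorist_value(future) - penalty

for name, future in futures.items():
    print(name, behaviorist_value(future), non_behaviorist_value(future))
```

The tie under the behaviorist score is the point: only the world-model term can distinguish the two buttons.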

The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?

Towards a scale-free theory of intelligent agency
cousin_it · 5mo* · 20

Here's maybe a related point: AIs might find it useful to develop an ability to reveal their internals in a verifiable way under certain conditions (say, when the other AI offers to do the same thing and there's a way to do a secure "handshake"). So deception ability would be irrelevant, because AIs that can credibly refrain from deception with each other would choose to do so and get a first-best outcome, instead of second-best as voting theory would suggest.

A real-world analogy is some of the nuclear precommitments mentioned in Schelling's book, like when the US and the Soviets knowingly refrained from catching some of each other's spies: if a flock of geese triggered the warning radars or something, the spies could provide their side with the crucial information that an attack wasn't really happening and there was no need to retaliate.

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 5mo · 30

Thanks for the link! It's indeed very relevant to my question.

I have another question, maybe a bit philosophical. Humans seem to reward-hack with respect to some of their values, but not others. For example, if you offered a mathematician a drug that would make them feel like they had solved the Riemann hypothesis, they'd probably refuse. But humans aren't magical: we are some combination of reinforcement learning, imitation learning and so on. So there's got to be some non-magical combination of these learning methods that would refuse reward hacking, at least in some cases. Do you have any thoughts on what it could be?

Modifying LLM Beliefs with Synthetic Document Finetuning
cousin_it · 5mo · 40

Very interesting, thanks for posting this!

One question that comes to mind is, could the layers be flipped? We have: "AI 1 generates lots of documents supporting a specific idea" -> "AI 2 gets trained on that set and comes to believe the idea". Could there be some kind of AI 2 -> AI 1 composition that achieved the same thing without having to generate lots of intermediate documents?

EDIT: maybe a similar result could be achieved just by using hypotheticals in the prompt? Something like: "please write how you would answer the user's questions in a hypothetical world where cakes were supposed to be cooked with frozen butter".

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 5mo · 30

I think this is all correct, but it makes me wonder.

You can imagine reinforcement learning as the agent learning explicitly what the reward looks like and how to make plans to achieve it. Or you can imagine it as building a bunch of heuristics inside the agent, pulls and aversions, that don't necessarily lead to coherent behavior out of distribution and aren't necessarily understood by the agent. A lot of human values seem to be like this, even though humans are pretty smart. Maybe an AI will be even smarter, and subjecting it to any kind of reinforcement learning at all will automatically make it adopt explicit Machiavellian reasoning about the reward, but I'm not sure how to tell whether that's true or not.

Towards a scale-free theory of intelligent agency
cousin_it · 6mo* · 112

Good post. But I thought about this a fair bit and I think I disagree with the main point.

Let's say we talk about two AIs merging. Then the tuple of their expected utilities from the merge had better be on the Pareto frontier, no? Otherwise they'd just do a better merge that gets them onto the frontier. Which specific point on the frontier they hit is a matter of bargaining, but the fact that they want to hit the frontier isn't; it's a win-win. And the merges that get them to the frontier are exactly those that output an EUM agent maximizing some linear combination of their utilities. If the point they want to hit is in a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; if the frontier is curvy at that point, the merge will be deterministic. For realistic agents with more complex preferences than just linearly caring about one cake, I expect the frontier to be curvy, so a deterministic merge into an EUM agent will be the best choice.
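
A minimal numeric sketch of that frontier point, with made-up payoff numbers and assuming the merged agent can randomize over joint policies (so the feasible set of expected-utility pairs is convex):

```python
import numpy as np

# Toy payoff pairs (u1, u2) for a few pure joint policies; the numbers are invented.
outcomes = np.array([
    [1.0, 0.0],   # best for agent 1
    [0.0, 1.0],   # best for agent 2
    [0.7, 0.7],   # a compromise
    [0.3, 0.2],   # dominated by the compromise
])

# With randomization allowed, the feasible set of expected-utility pairs is the
# convex hull of these points. Maximizing the linear combination
# w*u1 + (1-w)*u2 lands on the Pareto frontier of that hull, and sweeping w
# traces out the whole frontier.
for w in np.linspace(0.0, 1.0, 11):
    scores = w * outcomes[:, 0] + (1 - w) * outcomes[:, 1]
    best = outcomes[np.argmax(scores)]
    print(f"w = {w:.1f} -> merged EUM agent achieves expected utilities {best}")

# If the frontier has a flat segment, several outcomes tie for some w, and
# hitting an interior point of that segment requires mixing between them
# (the "coinflip" case). Everywhere else the maximizer is unique.
```

Sweeping the weight traces out the frontier; which weight gets chosen is the bargaining question, and only a flat segment ever calls for randomization.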
