AI ALIGNMENT FORUM
Vladimir Slepnev (cousin_it)

https://vladimirslepnev.me

Comments (sorted by newest)

Plans A, B, C, and D for misalignment risk
cousin_it · 10d

Yeah, that partly makes sense to me. I guess my intuition is like, if 95% of the company is focused on racing as hard as possible (and using AI leverage for that too, AI coming up with new unsafe tricks and all that), then the 5% who care about safety probably won't have that much impact.

Plans A, B, C, and D for misalignment risk
cousin_it · 10d

The OP says takeover risk is 45% under plan D and 75% under plan E. We're supposed to gain an extra 30 percentage points of safety from this feeble "build something by next week with 1% of compute"? Not happening.

My point is that if the "ten people on the inside" obey their managers, plan D will have a tiny effect at best. And if we instead postulate that they won't obey their managers, then there are no such "ten people on the inside" in the first place. So we should already behave as if we're in world E.

Plans A, B, C, and D for misalignment risk
cousin_it · 10d

Can you maybe describe in more detail how you imagine it? What specifically do the "ten people on the inside" do, if company leadership disagrees with them about safety?

Plans A, B, C, and D for misalignment risk
cousin_it · 10d

Do you know any people working at frontier labs who would be willing to do the kind of thing you describe in plan D, some kind of covert alignment work against the wishes of the larger company? Who would physically press keys on their terminal to do it, as opposed to quitting or trying to sway the company? I'm not asking you to name names; my hunch is just that there are very few such people now, maybe none at all. And if that's the case, we're in world E already.

We Built a Tool to Protect Your Dataset From Simple Scrapers
cousin_it · 3mo

Have you seen Anubis?

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 6mo

My perspective (well, the one that came to me during this conversation) is indeed "I don't want to take cocaine -> human-level RL is not the full story": that our attachment to real-world outcomes and reluctance to wirehead is due to evolution-level RL, not human-level RL. So I'm not quite saying all plans will fail; but I am saying that plans relying only on RL within the agent itself will have wireheading as an attractor, and it might be better to look at other plans.

It's just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it'll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it'll start to realize that behind the button there's a wire, and the wire leads to the agent's own reward circuit and so on.

Can you engineer things just right, so the agent learns to care about just the right level of "realness"? I don't know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: "you'll care about reality in this specific way". So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the "realness"? That's the point I was trying to make a couple comments ago, but maybe didn't phrase it well.
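To make that delicacy concrete, here's a toy sketch (entirely my own illustration, with made-up action names): three actions an agent might take, scored two ways. "Sensor reward" is all that RL inside the agent ever sees; "world reward" is what we actually want optimized.

```python
# Toy sketch (made-up action names): reward as seen by the agent's own reward
# circuit vs. reward as judged by what actually happened in the world.

ACTIONS = ["watch_video_of_button", "press_real_button", "tamper_with_reward_wire"]

def sensor_reward(action):
    """Reward as measured at the agent's own reward circuit."""
    return {
        "watch_video_of_button": 1.0,    # same sensory experience as the real thing
        "press_real_button": 1.0,
        "tamper_with_reward_wire": 10.0, # writing to the circuit beats everything
    }[action]

def world_reward(action):
    """Reward as judged by what actually happened in the world."""
    return 1.0 if action == "press_real_button" else 0.0

# A greedy maximizer of the sensory signal drifts to wire-tampering as soon as
# that action enters its repertoire; only the world-level score favors the button.
print(max(ACTIONS, key=sensor_reward))  # tamper_with_reward_wire
print(max(ACTIONS, key=world_reward))   # press_real_button
```

The hard part, of course, is that nothing inside the agent computes world_reward; the whole question is how the agent comes to care about that column at all.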

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 6mo

Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?

In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that's the plan, to me at first glance it seems a bit brittle. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 6mo

I thought about it some more and want to propose another framing.

The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel. The agent's feelings can even be identical in future A vs future B, but the agent can choose future A anyway. Or maybe one of the futures won't even have feelings involved: imagine an environment where any mistake kills the agent. In such an environment, RL is impossible.

The reason we can function in such environments, I think, is because we aren't the main learning process involved. Evolution is. It's a kind of RL for which the death of one creature is not the end. In other words, we can function because we've delegated a lot of learning to outside processes, and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there's a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation. For example, the way kids pick up language and other behaviors from parents and peers.)
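A toy sketch of that delegation point (my own framing, with made-up numbers): in an environment where one wrong step is fatal, within-lifetime RL never gets a usable signal, yet an outer selection loop over many creatures still ends up with the safe behavior.

```python
# Toy sketch: no creature learns anything during its lifetime, but the
# population-level "outer loop" still selects genomes that avoid the deadly action.
import random

def lifetime(p_deadly):
    """Run one creature for 10 steps. Its first mistake ends the episode,
    so there is no within-lifetime reward signal to learn from."""
    for _ in range(10):
        if random.random() < p_deadly:
            return False  # dead; this creature learns nothing
    return True

random.seed(0)
# Each "genome" is just an innate probability of taking the deadly action.
population = [random.random() for _ in range(1000)]
survivors = [p for p in population if lifetime(p)]

print(f"mean p_deadly before selection: {sum(population)/len(population):.2f}")  # ~0.50
print(f"mean p_deadly after selection:  {sum(survivors)/len(survivors):.2f}")    # ~0.08
```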

This suggests to me that if we want the rubber to meet the road - if we want the agent to have behaviors that track the world, not just the agent's own feelings - then the optimization process that created the agent cannot be the agent's own RL. By itself, RL can only learn to care about "behavioral reward" as you put it. Caring about the world can only occur if the agent "inherits" that caring from some other process in the world, by makeup or imitation.

This conclusion might be a bit disappointing, because finding the right process to "inherit" from isn't easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn't the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can't hope that the agents will learn it by some clever RL. It has to be due to the agent's makeup or imitation.

This is all a bit tentative; I was just writing out the ideas as they came. I'm not at all sure any of it is right. But anyway, what do you think?

“The Era of Experience” has an unsolved technical alignment problem
cousin_it · 6mo

I think it helps. The link to "non-behaviorist rewards" seems the most relevant. The way I interpret it (correct me if I'm wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill.

The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?

Towards a scale-free theory of intelligent agency
cousin_it · 6mo

Here's maybe a related point: AIs might find it useful to develop an ability to reveal their internals in a verifiable way under certain conditions (say, when the other AI offers to do the same thing and there's a way to do a secure "handshake"). So deception ability would be irrelevant, because AIs that can credibly refrain from deception with each other would choose to do so and get a first-best outcome, instead of second-best as voting theory would suggest.

A real world analogy is some of the nuclear precommitments mentioned in Schelling's book. Like when the US and Soviets knowingly refrained from catching some of each other's spies, because if a flock of geese triggers the warning radars or something, spies could provide their side with the crucial information that an attack isn't really happening and there's no need to retaliate.
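For what it's worth, here's a minimal toy sketch of the handshake idea, with hash commitments standing in for "revealing internals in a verifiable way"; the honest-policy string and the equality check are obviously placeholders for something much harder.

```python
# Toy sketch: each agent commits to its decision procedure by hashing it,
# the commitments are swapped, then both reveal and verify before cooperating.
import hashlib

def commit(source: str) -> str:
    return hashlib.sha256(source.encode()).hexdigest()

# A transparently honest policy, kept as source text so it can be inspected.
HONEST_POLICY = "def act(offer): return 'cooperate' if offer == 'cooperate' else 'defect'"

class Agent:
    def __init__(self, policy_source: str):
        self.policy_source = policy_source

    def commitment(self) -> str:
        return commit(self.policy_source)

    def reveal(self) -> str:
        return self.policy_source

def handshake(a: Agent, b: Agent) -> bool:
    """Cooperate only if each revealed policy matches its earlier commitment
    and passes inspection (here: literal equality with the honest policy)."""
    ca, cb = a.commitment(), b.commitment()
    ra, rb = a.reveal(), b.reveal()
    ok_a = commit(ra) == ca and ra == HONEST_POLICY
    ok_b = commit(rb) == cb and rb == HONEST_POLICY
    return ok_a and ok_b

print(handshake(Agent(HONEST_POLICY), Agent(HONEST_POLICY)))                   # True
print(handshake(Agent(HONEST_POLICY), Agent("def act(o): return 'defect'")))  # False
```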

Posts

cousin_it's Shortform (6y)
Announcement: AI alignment prize round 4 winners (7y)
Announcement: AI alignment prize round 3 winners and next round (7y)
UDT can learn anthropic probabilities (7y)
Using the universal prior for logical uncertainty (7y)
UDT as a Nash Equilibrium (8y)
Beware of black boxes in AI alignment research (8y)
Announcing the AI Alignment Prize (8y)
Using modal fixed points to formalize logical causality (8y)
A cheating approach to the tiling agents problem (8y)