Matthew "Vaniver" Graves

I want to point out that I think the typical important case looks more like "wanting to do things for unusual reasons," and if you're worried about this approach breaking down there that seems like a pretty central obstacle. For example, suppose rather than trying to maintain a situation (the diamond stays in the vault) we're trying to extrapolate (like coming up with a safe cancer cure). When looking at a novel medication to solve an unsolved problem, we won't be able to say "well, it cures the cancer for the normal reason" because there aren't any positive examples to compare to (or they'll be identifiably different).

It might still work out, because when we ask "is the patient healthy?" there is something like "the normal reason" there. [But then maybe it doesn't work for Dyson Sphere designs, or so on.]

Is this saying "if model performance is getting better, then maybe it will have a sharp left turn, and if model performance isn't getting better, then it won't"?

In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".

I share your opinion of RLHF work but I'm not sure I share your opinion of its consequences. For situations where people don't believe arguments that RLHF is fundamentally flawed because they're too focused on empirical evidence over arguments, the generation of empirical evidence that RLHF is flawed seems pretty useful for convincing them! 

This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don't see why it implies internal reward-orientation motivational edifices.

Sorry, if I'm reading this right, we're hypothesizing internal reward-orientation motivational edifices, and then asking the question of whether or not policy gradients will encourage them or discourage them. Quintin seems to think "nah, it needs to take an action before that action can be rewarded", and my response is "wait, isn't this going to be straightforwardly encouraged by backpropagation?"

[I am slightly departing from Wei_Dai's hypothetical in my line of reasoning here, as Wei is mostly focused on asking "don't you expect this to come about in an introspective-reasoning powered way?" and I'm mostly focused on asking "if this structure is present in the model initialization as one of the lottery tickets, won't policy gradient encourage it?".]

I think that's assuming there's a feature-direction "care more about reward" which isn't already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to "thinking thoughts about reward in order to get reward."

Cool, this feels like a real reason, but also substantially more contingent. Naively, I would expect that you could construct a training schedule such that 'care more about reward' is encouraged, and someone will actually try to do this (as part of making a zero-shot learner in RL environments).

If this argument works, why doesn't it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it's at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)

I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think "yeah, it only happens sometimes" whereas my view is something more like "fuck, it happens sometimes". Like, the thing where people don't do heroin because they've heard other people downvote heroin addiction is not a strategy that scales to superintelligence.

For fixed-duration reasoning chains, or in situations where the AI is trying to facilitate cooperation between different passes (like in the proposed scenario where it gets information but then loses access to it), this seems important. For example, in a situation where you show the model a passage, let it make notes, and then have the same model look just at the notes and answer a reading comprehension question, I expect it will likely want to embed information in the null space of the notes. [Or in an HCH-style situation, it might want to embed the original question into subquestions, and might do so steganographically, in a way that breaks some of the things one might be hoping for there.]
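
[As a concrete sketch of the two-pass setup I have in mind; the `call_model` function below is a hypothetical stand-in for whatever model interface is actually used. The structural point is that the second pass only ever sees the notes, so any information the model wants to carry across passes has to fit inside, or be hidden inside, the notes it writes in the first pass.]

```python
# Minimal sketch of the two-pass "notes" setup described above.
# `call_model` is a hypothetical stand-in for an LLM interface; the key
# structural fact is that pass 2 sees only the notes, never the passage,
# so any information the model wants to carry forward must fit in (or be
# hidden inside) the notes it writes in pass 1.

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with the actual model interface."""
    raise NotImplementedError

def answer_with_notes(passage: str, question: str, max_note_chars: int = 500) -> str:
    # Pass 1: the model sees the passage and writes bounded-length notes.
    notes = call_model(
        f"Read the passage and write notes (<= {max_note_chars} chars) that "
        f"would let someone answer questions about it later.\n\n{passage}"
    )[:max_note_chars]

    # Pass 2: the model sees ONLY the notes (not the passage) plus the question.
    # If training rewards final-answer accuracy, optimization pressure acts on
    # what the notes contain, which is where the incentive for steganographic
    # encoding could enter.
    return call_model(f"Notes: {notes}\n\nQuestion: {question}\nAnswer:")
```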

But I'm somewhat confused about whether you should expect it to crop up in recursive prompting cases (probably you don't?). If you secretly know the answer in the first pass, don't you want to just say the answer and terminate? If you secretly know a relevant fact in the first pass, don't you want to openly express it and recur? I don't see how steganography is rewarded over openness. [I'm setting aside the case where it invents a new language to express non-linguistic concepts more cheaply, and assuming we're in the natural language setting where standard language is the 'natural' way to transmit the information.]

so in pure non-lookahead (e.g. model free) sample-based policy gradient estimation, an action which has never been tried can not be reinforced (except as a side effect of generalisation by function approximation).

This is the bit I don't believe, actually. [Or at least don't think is relevant.] Note that in Wei_Dai's hypothetical, the neural net architecture has a particular arrangement such that "how much it optimizes for reward" is either directly or indirectly implied by the neural network weights. [We're providing the reward as part of its observations, and so if nothing else the weights from that part of the input vector to deeper in the network will be part of this, but the actual mechanism is going to be more complicated for one that doesn't have access to that.]
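
[A toy sketch of that bracketed point, with arbitrary names and sizes: if the reward is part of the observation vector, then there is literally a column of weights carrying that input into the rest of the network, and the gradient update treats those weights like any other parameter.]

```python
# Toy sketch: when reward is included in the observation, there are weights
# from that input channel into the rest of the network, and nothing in the
# policy-gradient update treats them specially. Sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, n_actions = 8, 16, 4

W1 = rng.normal(size=(hidden, obs_dim + 1))   # +1 column for the reward input
W2 = rng.normal(size=(n_actions, hidden))

def logits(obs, last_reward):
    x = np.concatenate([obs, [last_reward]])  # reward enters like any feature
    return W2 @ np.tanh(W1 @ x)

# W1[:, -1] is the weight vector from the reward signal into the network;
# it is as trainable as every other parameter.
print(logits(rng.normal(size=obs_dim), last_reward=1.0))
```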

Quintin seems to me to be arguing "if you actually follow the math, there isn't a gradient to that parameter," which I find surprising, and which seems easy to demonstrate by going thru the math. As far as I can tell, there is a gradient there, and it points in the direction of "care more about reward."

This doesn't mean that, by caring about reward more, it knows which actions in the environment cause more reward. There I believe the story that the RL algorithm won't be able to reinforce actions that have never been tried.

[EDIT: Maybe the argument is "but if it's never tried the action of optimizing harder for reward, then the RL algorithm won't be able to reinforce that internal action"? But that seems pretty strained and not very robust, as the first time it considers trying harder to get reward, it will likely get hooked.]
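
[A toy illustration of the part of the story I do believe, with arbitrary numbers: under a tabular softmax policy, a sample-based policy-gradient update can only push probability toward actions that were actually taken; an action that was never sampled only loses probability mass when returns are positive, and generalization through shared function-approximator features is the exception.]

```python
# Toy illustration: with a tabular softmax policy, a REINFORCE update pushes
# the logits of never-tried actions *down* when the sampled action's return
# is positive; only generalization via function approximation (shared
# features between actions) can raise an untried action.
import numpy as np

logits = np.zeros(3)                      # tabular policy over 3 actions
probs = np.exp(logits) / np.exp(logits).sum()

sampled_action, ret = 0, 1.0              # action 0 was tried, got return +1

# REINFORCE: grad of log pi(a_sampled) w.r.t. the logits is (one_hot - probs)
one_hot = np.eye(3)[sampled_action]
grad = ret * (one_hot - probs)            # [ 0.667, -0.333, -0.333]

print(grad)  # logits of the never-tried actions 1 and 2 are pushed down
```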

If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction.

Wait, I don't think this is true? At least, I'd appreciate it being stepped thru in more detail.

In the simplest story, we're imagining an agent whose policy is $\pi_\theta$ and, for simplicity's sake, one parameter $\theta_R$ is a scalar that determines "how much to maximize for reward" and all the other parameters of $\theta$ store other things about the dynamics of the world / decision-making process.

It seems to me that the policy gradient $\nabla_\theta J(\theta)$ is obviously going to try to point $\theta_R$ in the direction of "maximize harder for reward".

In the more complicated story, we're imagining an agent whose policy is $\pi_\theta$, which involves how it manipulates both external and internal actions (and thus both external and internal state). One of the internal state pieces (let's call it $z_R$, playing the role $\theta_R$ played in the simple story) determines whether it selects actions that are more reward-seeking or not. Again I think it seems likely that $\nabla_\theta J(\theta)$ is going to try to adjust $\theta$ such that the agent selects internal actions that point $z_R$ in the direction of "maximize harder for reward".
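
[A minimal sketch of the simple story, with made-up dynamics and a finite-difference estimate standing in for the true policy gradient: $\theta_R$ gates how much the policy leans on its internal proxy for reward, the expected return is increasing in $\theta_R$, so the gradient on $\theta_R$ is positive, even though that gradient by itself says nothing about which external actions are high-reward.]

```python
# Minimal sketch of the "simple story": theta_R is a single scalar gating how
# much the policy leans on its (possibly crude) internal estimate of which
# actions are rewarding. Everything here is a toy model, not a claim about
# any particular architecture.
import numpy as np

true_reward = np.array([0.1, 0.9, 0.2])        # per-action reward (unknown to the agent)
internal_estimate = np.array([0.0, 1.0, 0.0])  # agent's learned proxy for reward

def policy(theta_R):
    # Mix a uniform "other concerns" distribution with a reward-greedy one.
    greedy = np.exp(5 * internal_estimate)
    greedy /= greedy.sum()
    uniform = np.ones(3) / 3
    w = 1 / (1 + np.exp(-theta_R))             # sigmoid gate: "how reward-seeking"
    return w * greedy + (1 - w) * uniform

def expected_return(theta_R):
    return policy(theta_R) @ true_reward

# Finite-difference estimate of dJ/d(theta_R): it is positive, i.e. the
# gradient pushes theta_R toward "maximize harder for reward".
eps = 1e-4
g = (expected_return(0.0 + eps) - expected_return(0.0 - eps)) / (2 * eps)
print(g, g > 0)  # positive, True
```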

What is my story getting wrong?

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."

The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (i.e., even looking at the messy plan-space of the real world, we'll be able to make a simple classifier that rates corrigibility), or, if not, when does the connection come in (once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?

[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul's sense that they're disconnected in 1D, or when do you think the difficulty comes in?]

I agree this list doesn't seem to contain much unpublished material, and I think the main value of having it in one numbered list is that "all of it is in one, short place"; it's not an intro to "computers can think" but instead "these are a bunch of the reasons why aligning thinking computers is difficult".

The thing that I understand to be Eliezer's "main complaint" is something like: "why does it seem like No One Else is discovering new elements to add to this list?". Like, I think Risks From Learned Optimization was great, and am glad you and others wrote it! But also my memory is that it was "prompted" instead of "written from scratch", and I imagine Eliezer reading it had more of a sense of "ah, someone made 'demons' palatable enough to publish" than of "ah, I am learning something new about the structure of intelligence and alignment."

[I do think the claim that Eliezer 'figured it out from the empty string' doesn't quite jibe with the Yudkowsky's Coming of Age sequence.]

Why is the process by which humans come to reliably care about the real world

This process seems pretty unreliable and fragile to me. Drugs are popular; video games are popular; people-in-aggregate put more effort into obtaining imaginary afterlives than into life extension or cryonics.

But also humans have a much harder time 'optimizing against themselves' than AIs will, I think. I don't have a great mechanistic sense of what it will look like for an AI to reliably care about the real world.
