Wiki Contributions


[Intro to brain-like-AGI safety] 14. Controlled AGI

Yeah, you were one of the “couple other people” I alluded to. The other was ‪Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).

🤔 I wonder if I should talk with Tan Zhi-Xuan.

I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.

I got the phrase "ontological lock" from adamShimi's post here, but it only comes up very briefly, so it's not very helpful for understanding what I mean, and it also sort of relies on me assuming that adamShimi meant the same thing I did. 😅 I'm not sure if it's a term used elsewhere.

What I mean is forcing the AI to have a specific ontology, such as things embedded in 3D space, so you can directly programmatically interface with the AI's ontology, rather than having to statistically train an interface (which would lead to problems with distribution shift and such).

[Intro to brain-like-AGI safety] 14. Controlled AGI

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.

I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”. 

This proof strategy is probably the one most characteristic of my approach. I've perhaps overstated its similarity to John Wentworth's 😅; I think much of his research is useful for my approach, but there are also many points of disagreement. But then, I suppose everyone finds his research ultra-promising.

A couple of notes:

I think even if my approach doesn't work out as the sole solution, it seems plausibly complementary to other approaches, including yours. For instance, if you don't do the sort of ontological lock that I'm advocating, then you tend to end up struggling with some basic symbol-reality distinction; e.g. you're likely to associate pictures of happy people with the concept of "happiness", so a happiness maximizer might end up tiling the world with pictures of happy people. My approach avoids that for free (though the flipside is that it would likely not consider e.g. ems to be people unless explicitly programmed to, but that could probably be achieved).

I think concepts like "solar cell efficiency" might be quite achievable to define with my approach. If you have a clean 3D ontology, you can isolate an object like a solar panel in that ontology, and then counterfactually ask how it would perform under various conditions. So you could ask: "How would this object perform if standard sunlight hit it under standard atmospheric conditions? How much power would it produce? Would it produce any problematic pollution?" You could be very precise about this.

... which is of course a curse as much as it is a blessing, e.g. you might not want a precise definition of "daytime", and it might not be possible for people to write down a precise definition of "honesty".
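As an illustration of the kind of counterfactual query I have in mind, here's a toy sketch. Everything here (class names, property fields, the "standard conditions" constant) is a hypothetical simplification, not a real implementation:

```python
from dataclasses import dataclass

# Hypothetical toy ontology: an object isolated from the 3D world-model,
# carrying physical properties we can query counterfactually.
@dataclass
class PanelModel:
    area_m2: float     # surface area of the isolated object
    efficiency: float  # fraction of incident power converted

# "Standard conditions" counterfactual: roughly full sunlight at sea level.
STANDARD_IRRADIANCE_W_PER_M2 = 1000.0

def counterfactual_power(panel: PanelModel,
                         irradiance: float = STANDARD_IRRADIANCE_W_PER_M2) -> float:
    """Power output (watts) if the given irradiance hit this object."""
    return panel.area_m2 * irradiance * panel.efficiency

panel = PanelModel(area_m2=2.0, efficiency=0.20)
print(counterfactual_power(panel))  # 400.0 W under standard sunlight
```

The point is only that once the object lives in a fixed, programmatically accessible ontology, "how would it perform under condition X?" becomes a direct query rather than something learned statistically.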

[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation

Avoiding a weak wireheading drive seems quite tricky. Maybe we could minimize it using timing and priors (Section 9.3.3 above), but avoiding it altogether would, I presume, require special techniques—I vaguely imagine using some kind of interpretability technique to find the RPE / feeling good concept in the world-model, and manually disconnecting it from any Thought Assessors, or something like that.

Here's a hacky patch that doesn't entirely solve it, but might help:

Presumably for humans, the RPE/reward is somehow wired into the world-model, since we have a clear awareness of it. But you could simply not give it as an input to the AI's world-model in the first place.

As long as it doesn't start hacking into its own runtime and peeking at the variables, this means that it doesn't have a variable corresponding to its reward in its world-model, which would prevent it from wanting to use that variable for wireheading.

Of course this is unstable, so we probably wouldn't want to rely on that. The stable approach would be what we discussed in the other thread, of manually coding the value function. This would protect against wireheading in fundamentally the same way, though, by eliminating the need for a separate "reward" variable in the world-model.
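A minimal sketch of the wiring I'm describing, with all class and method names hypothetical: the reward signal updates the value estimate directly, but is never part of the observation stream the world-model sees, so no "reward" variable ever appears in the learned model.

```python
# Sketch of the patch (hypothetical names throughout): reward reaches the
# learning machinery but is withheld from the world-model's inputs.

class WorldModel:
    def __init__(self):
        self.history = []

    def update(self, observation):
        # Only external observations enter the model; reward is absent.
        self.history.append(observation)

class Agent:
    def __init__(self):
        self.world_model = WorldModel()
        self.value_estimate = 0.0

    def step(self, observation, reward):
        # Reward is routed straight to value learning...
        self.value_estimate += 0.1 * (reward - self.value_estimate)
        # ...but deliberately never fed to the world-model.
        self.world_model.update(observation)

agent = Agent()
agent.step(observation={"diamond_present": True}, reward=1.0)
# The world-model's contents contain no reward variable to wirehead on.
assert all("reward" not in obs for obs in agent.world_model.history)
```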

Prizes for ELK proposals

If I understand the problem statement correctly, I think I could take a stab at easier versions of the problem, but that the current formulation is too much to swallow in one bite. In particular I am concerned about the following parts:


We start with an unaligned benchmark:

* An architecture Mθ



To solve ELK in this case we must:

* Supply a modified architecture Mθ+ which has the same inputs and outputs as Mθ <snip>

Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially the same architecture as the original?

except that after producing all other outputs it can answer a question Q in natural language

Does this mean that it must be able to deal with a broad variety of questions, so that we cannot simply sit down and think about how to optimize the model for getting a single question (e.g. "Where is the diamond?") right?

According to my current model of how these sorts of things work, such constraints make the problem fundamentally unsolvable, so I am not going to attempt it as stated, while loosening the constraints may make it solvable, in which case I might attempt it.

Formalizing Policy-Modification Corrigibility

The idea is that manipulation "overrides" the human policy regardless of whether that's good for the goal the human is pursuing (where the human goal presumably affects which human policy is selected). While here the override is baked into the dynamics, in realistic settings it occurs because the AI exploits the human decision-making process: by feeding them biased information, through emotional manipulation, etc.

I think this skips over the problems with baking it into the dynamics. Baking manipulation into the dynamics requires us to define manipulation; easy for toy examples, but in real-world applications it runs head-first into nearest unblocked strategy concerns; anything that you forget to define as manipulation is fully up for grabs.

This is why I prefer directly applying a counterfactual to the human policy in my proposal, to entirely override the possibility of manipulation. But that introduces its own difficulties, and is not easy to scale up beyond the stop button. I've had a post in the works for a while about the main difficulty I see with my approach here.

Formalizing Policy-Modification Corrigibility

I like this post, it seems to be the same sort of approach that I suggested here. However, your proposal seems to have a number of issues; some of which you've already discussed, some of which are handled in my proposal, and some of which I think are still open questions. Presumably a lot of it is just because it's still a toy model, but I wanted to point out some things.

Starting here:

Definition: Corrigibility, formal.

Let n be a time step which is greater than the starting time. The policy-modification corrigibility of the AI's policy from the starting state by time n is the maximum possible mutual information between the human policy and the AI's policy at time n:

(As I understand, the maximum ranges over all possible distributions of human policies? Otherwise I'm not sure how to parse it, and aspects of my comment might be confused/wrong.)
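On that reading, the quoted definition would presumably look something like the following (the notation is my reconstruction, not necessarily the OP's: D ranges over distributions of human policies, π_H ~ D, and π_A^n is the AI's policy at time n):

```latex
% Hypothetical reconstruction of the quoted definition:
\mathrm{Corrigibility}(s, n) \;=\; \max_{D} \; I\!\left(\pi_H ;\, \pi_A^{n}\right)
```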

Usually one would come up with these sorts of definitions in order to select on them. That is, one would incorporate corrigibility in a utility function in order to select a desired AI.

(Though on reflection, maybe that is not your plan, since e.g. your symmetry-based proofs can work for describing side-effects? Like the proofs that most goals favored power-seeking policies did not actually involve optimizing power-seekingness.)

However, this definition of corrigibility cannot immediately be incorporated into a utility function, as it depends on the time step n.

There are several possible ways to turn this into a utility function, with (I think?) two major axes of variation:

  1. Should we pick some specific constant n, or sum over all n?
  2. Human policies most likely are not accurately modelled as following a fixed policy, due to factors like memory. To the AI, this can look like the human policy changing over time, or similar. So that raises the question of whether it is only the starting policy that the AI should be corrigible to, or whether corrigibility should e.g. be expressed as a sum over time or something. (Neither option is great. Though obviously this is a toy model, so that may be expected.)

If the environment doesn't allow the human to reach or modify the AI, the AI is incorrigible. Conversely, in some environments there does not exist an incorrigible AI policy for reasonable n.

I think "reasonable n" is really hard to pin down. Consider locking the human in a box with a password-locked computer, where the computer contains full options for controlling the AI's policy. This only requires the human to enter the password, and then they have enormous influence over the AI. So this is highly corrigible, in a way. It is probably what we want to exclude from the definition, but that seems difficult.

Furthermore, this definition doesn't necessarily capture other kinds of corrigibility, such as "the AI will do what the human asks." Maximizing mutual information only means that the human has many cognitively accessible ways to modify the agent. This doesn't mean the AI does what the human asks. One way this could happen is if the AI implements the opposite of whatever the human specifies (e.g. the human-communicated policy goes left, the new AI policy goes right). Whether this is feasible depends on the bridging law, which is not controlled by either player.

I think this is a bigger problem with the proposal than it might look like?

Suppose the AI is trying to be corrigible in the way described in the post. This incentivizes it to find ways to let the human alter its policy. But if it allows changes that are too impactful, that could prevent it from further finding ways to let the human alter its policy. So it is incentivized to first allow changes to irrelevant cases, such as the AI's reaction to states that will never happen. Further, it doesn't have to be responsive to policies that the human would actually be likely to adopt, since corrigibility is defined by taking the maximum over distributions of human policies. Rather, it could pick a distribution of policies that humans would never engage in, such as policies that approximately (but far from totally) minimize human welfare. "I will do what you ask, as long as you enter my eternal torture chamber" would be highly corrigible by this definition. This sort of thing seems likely to be incentivized by this approach, because it reduces the likelihood that the corrigibility will become an obstacle to the AI's future actions.
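To illustrate why the maximum over human-policy distributions is worrying: mutual information only cares that the channel from human policy to AI policy is informative, not which human policies it is informative about. A toy calculation (policy names made up for illustration):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a discrete joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Two "channels" from human policy to resulting AI policy, each putting
# equal probability on two human policies.
# Channel A: responds faithfully to policies humans would plausibly use.
sensible = {("ask_stop", "stop"): 0.5, ("ask_go", "go"): 0.5}
# Channel B: only responds to policies humans would never adopt
# ("enter the torture chamber first"), but is still a perfect bijection.
perverse = {("enter_chamber_say_stop", "stop"): 0.5,
            ("enter_chamber_say_go", "go"): 0.5}

# Both achieve the same maximal mutual information of 1 bit, so the
# corrigibility measure cannot tell them apart.
print(mutual_information(sensible), mutual_information(perverse))
```

Since the definition maximizes over the distribution of human policies, the AI gets full credit for Channel B, despite it being useless (or worse) to any actual human.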

Also, it is not very viable to actually control the AI with corrigibility that depends on the mutual information with the AI's policy, because the policy is very far removed from the effects of the policy.

Corrigibility Can Be VNM-Incoherent

Actually upon thinking further I don't think this argument works, at least not as it is written right now.

Corrigibility Can Be VNM-Incoherent

Imagine that policies decompose into two components, π = (π₁, π₂). For instance, π₁ and π₂ may be different sets of parameters in a neural network. We can then talk about the effect of one of the components by considering how it influences the power/injectivity of the features with respect to the other component.

Suppose, for instance, that π₁ is such that the policy just ends up acting in a completely random-twitching way. Technically π₂ has a lot of effect too, in that it chaotically controls the pattern of the twitching, but in terms of the features, the policy is basically constant in π₂. This is a low-power situation, and if one actually specified what π₁ would be, then a TurnTrout-style argument could probably prove that such values of π₁ would be avoided for power-seeking reasons. On the other hand, if π₁ made the policy act like an optimizer which optimizes a utility function over the features, with the utility function being specified by π₂, then that would lead to a lot more power/injectivity.

On the other hand, I wonder if there's a limit to this style of argument. Too much noninjectivity would require crazy interaction effects to fill out the space in a Hilbert-curve-style way, which would be hard to optimize?

Corrigibility Can Be VNM-Incoherent

Since you can convert a utility function over states or observation-histories into a utility function over policies (well, as long as you have a model for measuring the utility of a policy), and since utility functions over states/observation-histories do exhibit instrumental convergence, yes, you are correct.

I feel like in a way, one could see the restriction to defining it in terms of e.g. states as a definition of "smart" behavior; if you define a reward in terms of states, then the policy must "smartly" generate those states, rather than just yield some sort of arbitrary behavior.

🤔 I wonder if this approach could generalize TurnTrout's approach. I'm not entirely sure how, but we might imagine that a structured utility function u over policies could be decomposed into u = r ∘ f, where f extracts the features that the utility function pays attention to, and r is the utility function expressed in terms of those features. E.g. for state-based rewards, one might take f to be a model that yields the distribution of states visited by the policy, and r to be the reward function for the individual states (some sort of modification would have to be made to address the fact that f outputs a distribution but r takes in a single state... I guess this could be handled by working in the category of vector spaces and linear transformations, though I'm not sure if that's the best approach in general; but since the original setting can be embedded into this category, it surely can't hurt too much).

Then the power-seeking situation boils down to this: the vast majority of policies π lead to essentially the same features f(π), but there is a small set of power-seeking policies that lead to a vastly greater range of different features. And so for most r, a π that optimizes/satisfices/etc. r ∘ f will come from this small set of power-seeking policies.

I'm not sure how to formalize this. I think it won't hold for generic vector spaces, since almost all linear transformations are invertible. But it seems to me that in reality, there's a great degree of non-injectivity. The idea of "chaos inducing abstractions" seems relevant, in the sense that parameter changes in the policy will mostly tend to lead to completely unpredictable/unsystematic/dissipated effects, and only partly tend to lead to predictable and systematic effects. If most of the effects are unpredictable/unsystematic, then f must be extremely non-injective, and this non-injectivity then generates power-seeking.

(Or does it? I guess you'd have to have some sort of interaction effect, where some parameters control the degree to which the function is injective with regards to other parameters. But that seems to hold in practice.)
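The non-injectivity story can be illustrated with a toy numerical sketch (everything here is made up for illustration): a feature map f that collapses almost all policies to a single feature, plus a small "power-seeking" set that reaches many distinct features. For most randomly sampled utility functions over features, the optimum then lands in the small set.

```python
import random

random.seed(0)

N_POLICIES = 1000
POWER_SEEKING = set(range(10))  # small set reaching many distinct features

def f(policy):
    """Hypothetical feature map: highly non-injective outside the small set."""
    if policy in POWER_SEEKING:
        return ("varied", policy)  # each reaches a distinct feature
    return ("default", 0)          # all other policies collapse together

def optimum_is_power_seeking(r):
    best = max(range(N_POLICIES), key=lambda p: r(f(p)))
    return best in POWER_SEEKING

# Sample many utility functions r over features; count how often the
# optimal policy lies in the small power-seeking set.
trials = 200
hits = 0
for _ in range(trials):
    values = {feat: random.random()
              for feat in {f(p) for p in range(N_POLICIES)}}
    if optimum_is_power_seeking(values.__getitem__):
        hits += 1
print(hits / trials)  # close to 10/11, since 10 of 11 reachable features
                      # belong to the power-seeking set
```

With 11 reachable features, 10 of which belong to the small set, a random r puts its maximum on a power-seeking feature about 10/11 ≈ 91% of the time, even though the set contains only 1% of policies.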

I'm not sure whether I've said anything new or useful.

Corrigibility Can Be VNM-Incoherent

🤔 I was about to say that I felt like my approach could still be done in terms of state rewards, and that it's just that my approach violates some of the technical assumptions in the OP. After all, you could just reward for being in a state such that the various counterfactuals apply when rolling out from this state; this would assign higher utility to the blue states than the red states, encouraging corrigibility, and contradicting TurnTrout's assumption that utility would be assigned solely based on the letter.

But then I realized that this introduces a policy dependence to the reward function; the way you roll out from a state depends on which policy you have. (Well, in principle; in practice some MDPs may not have much dependence on it.) The special thing about state-based rewards is that you can assign utilities to trajectories without considering the policy that generates the trajectory at all. (Which to me seems bad for corrigibility, since corrigibility depends on the reasons for the trajectories, and not just the trajectories themselves.)

But now consider the following: if you have the policy, you can figure out which actions were taken just by applying the policy to the state/history. And instrumental convergence does not apply to utility functions over action-observation histories. So it doesn't apply to utility functions over (policy, observation history) pairs either. (I think?? At least if the set of policies is closed under replacing an action under a specified condition, and there are no Newcomb-like issues that create non-causal dependencies between policies and observation histories.)

So a lot of the instrumental convergence power comes from restricting the things you can consider in the utility function. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects, and simultaneously u-AOH, u-OH, and ordinary state-based reward functions (can we call that u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like "humans have control over the AI" (as this is a causal statement and thus depends on the AI).

We could consider u-P, utility functions over policies. This is the most general sort of utility function (I think??), and as such it is also way way too general, just like u-AOH is. I think maybe what I should try to do is define some causal/counterfactual generalizations of u-AOH, u-OH, and u-S, which allow better behaved utility functions.
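To make the comparison concrete, here is one way to write down the classes being contrasted (my notation, not the original posts': A = actions, O = observations, S = states, Π = policies):

```latex
% Utility-function classes by domain (my notation):
\text{u-AOH}: \; U : (A \times O)^{\ast} \to \mathbb{R} \quad \text{(action-observation histories)}
\text{u-OH}:  \; U : O^{\ast} \to \mathbb{R}             \quad \text{(observation histories)}
\text{u-S}:   \; U : S \to \mathbb{R}                    \quad \text{(individual states, i.e. rewards)}
\text{u-P}:   \; U : \Pi \to \mathbb{R}                  \quad \text{(whole policies)}
```

On this picture, u-P contains all the others (via the conversions discussed above), which is exactly why it is too general, while none of the first three can see counterfactuals.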
