(A) Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer.

I don't confidently disagree with this statement, but it occurs to me that I haven't tried it myself and haven't followed it very closely, and have sometimes heard claims that there are promising methods.

A lot of people trying to come up with answers try to do it with mechanistic interpretability, but that probably isn't very feasible. However, investigations based on ideas like neural tangent kernels seem plausibly more satisfying and feasible. Like if you show that the dataset contains a bunch of instances that'd push it towards saying apple rather than banana, and you then investigate where those data points come from and realize that there's actually a pretty logical story for them, then that seems basically like success.

As an example, I remember a while ago there was some paper that claimed to have found a way to attribute NN outputs to training data points, and it claimed that LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions. I didn't read the paper so I don't know whether it's legit, but that sort of thing seems quite plausibly feasible a lot of the time.

I think discussions about capabilities raise the question "why create AI that is highly capable at deception etc.? seems like it would be safer not to".

The problem that occurs here is that some ways to create capabilities are quite open-ended, and risk accidentally creating capabilities for deception due to instrumental convergence. But at that point it feels like we are getting into the territory that is best thought of as "intelligence", rather than "capabilities".

Nice, I was actually just thinking that someone needed to respond to LeCun's proposal.

That said, I think you may have gotten some of the details wrong. I don't think the intrinsic cost module gets raw sensory data as input, but instead it gets input from the latent variables of the world model as well as the self-supervised perception module. This complicates some of the safety problems you suggest.

Did you tell your friend the premise behind this interaction out of band?

I think the key question is how much policy-caused variance in the outcome there is.

That is, juice can chain onto itself because we assume there will be a bunch of scenarios where reinforcement-triggering depends a lot on the choices made. If you are near a juice but not in front of the juice, you can walk up to the juice, which triggers reinforcement, or you can not walk up to the juice, which doesn't trigger reinforcement. The fact that these two plausible actions differ in their reinforcement is what I am referring to with "policy-caused variance".

If you are told not to kill someone, then you can usually stay sufficiently far away from killing anyone, in such a way that the variance in killing is 0, because killing itself has a constant value of 0. (Unlike juice-drinking which might have a value of 0 or 1, depending on both scenario and actions.)

But you could also have a case with a positive value that fails to be robust/lasting, if the positive value fails to have variance. One example would be if the positive value is bounded and always achieved; for instance you might imagine that if you are always carrying a juice dispenser, juice-drinking would always have a value of 1, and therefore again there wouldn't be any variance to reinforce seeking juice.

A more subtle point is that if the task is too difficult, e.g. if you are in a desert with no juice available, then the policy-caused variance is also 0, because juice-drinking has a constant value of 0 no matter what you do. This is a general point that encompasses many failures of reinforcement learning. Often, if you don't do things like reward shaping, then reinforcement learning simply fails, because the task is too complex to learn.
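
To make the three cases concrete, here is a minimal sketch in Python. The scenario names, action names, and reward numbers are made up for illustration, and "policy-caused variance" is modeled simply as the variance of reward across actions under a uniform random policy, which is one assumed way of formalizing the idea rather than a standard definition.

```python
from statistics import pvariance

# Hypothetical scenarios: the reward each available action would yield.
# All names and numbers here are illustrative assumptions.
SCENARIOS = {
    # Near a juice: walking up triggers reinforcement, staying put does not.
    "near_juice": {"walk_to_juice": 1.0, "stay_put": 0.0},
    # Always carrying a dispenser: juice is obtained whatever you do.
    "carrying_dispenser": {"walk": 1.0, "stay_put": 1.0},
    # Desert with no juice: nothing you do obtains juice.
    "desert_no_juice": {"walk": 0.0, "stay_put": 0.0},
}


def policy_caused_variance(rewards_by_action):
    """Variance of reward over actions, assuming a uniform random policy."""
    return pvariance(list(rewards_by_action.values()))


variances = {
    name: policy_caused_variance(rewards)
    for name, rewards in SCENARIOS.items()
}

for name, var in variances.items():
    print(f"{name}: policy-caused variance = {var}")
```

Only the first scenario has nonzero variance, so it is the only one where the choice of action changes the reinforcement signal. In the other two the reward is constant, so there is no difference between actions for reinforcement to pick up on.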

I think future RL algorithms will be able to succeed with less policy-caused variance than present RL algorithms require, by using models to track outcomes through many layers of interactions.

I don't think so.

Like technically yes, it shows that there is an internal optimization process running in the networks, but much of the meat of optimization, such as instrumental convergence/power-seeking, depends on the structure of the function one is optimizing over.

If the function is not consequentialist - if it doesn't attempt to compute the real-world consequences of different outputs and grade things based on those consequences - then much of the discussion about optimizers does not apply.

I would very much assume that you have a strong genetic disposition to be smart and curious.

Do you think unschooling would work acceptably well for kids who are not smart and curious?

In your view, are values closer to recipes/instruction manuals than to utility functions?

For your quiz, could you give an example of something that is grader-optimization but which is not wireheading?
