Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.


Reducing Goodhart



What you are doing is training the AI to have an accurate model of itself, used with language like "I" and "you". You can use your brain to figure out what will happen if you ask "are you conscious?" without having previously trained in any position on similarly nebulous questions. Training text was written overwhelmingly by conscious things, so maybe it says yes because that's so favored by the training distribution. Or maybe you trained it to answer "you" questions as about nonfiction computer hardware and it makes the association that nonfiction computer hardware is rarely conscious.

Basically, I don't think you can start out confused about consciousness and cheat by "just asking it." You'll still be confused about consciousness and the answer won't be useful.

I'm worried this is going to lead, either directly or indirectly, to training foundation models to have situational awareness, which we shouldn't be doing.

And perhaps you should be worried that having an accurate model of oneself, associated with language like "I" and "you", is in fact one of the ingredients in human consciousness, and maybe we shouldn't be making AIs more conscious.

An AI trained with RL that suddenly gets access to self-modifying actions might (briefly) have value dynamics according to idiosyncratic considerations that do not necessarily contain human-like guardrails. You could call this "systematization," but it's not proceeding according to the same story that governed systematization during training by gradient descent.

I like this as a description of value drift under training and regularization. It's not actually an inevitable process - we're just heading for something like the minimum circuit complexity of the whole system, and usually that stores some precomputation or otherwise isn't totally systematized. But though I'm sure the literature on the intersection of NNs and circuit complexity is fascinating, I've never read it, so my intuition may be bad.

But I don't like this as a description of value drift under self-reflection. I see this post more as "this is what you get right after offline training" than "this is the whole story that needs to have an opinion on the end state of the galaxy."

I don't think we should equate the understanding required to build a neural net that will generalize in a way that's good for us with the understanding required to rewrite that neural net as a gleaming wasteless machine.

The former requires finding some architecture and training plan to produce certain high-level, large-scale properties, even in the face of complicated AI-environment interaction. The latter requires fine-grained transparency at the level of cognitive algorithms, and some grasp of the distribution of problems posed by the environment, together with the ability to search for better implementations.

If your implicit argument is "In order to be confident in high-level properties even in novel environments, we have to understand the cognitive algorithms that give rise to them and how those algorithms generalize - there exists no emergent theory of the higher-level properties that covers the domain we care about," then I think that conclusion is way too hasty.

Huh, what is up with the ultra low frequency cluster? If the things are actually firing on the same inputs, then you should really only need one output vector. And if they're serving some useful purpose, then why is there only one and not more?

I'm curious if you have guesses about how many singular dimensions were dead neurons (or neurons that are "mostly dead," only activating for a tiny fraction of the training set), versus how much the zero-gradient directions depended dynamically on training example.

Pretty neat.

My ears perk up when I hear about approximations to basin size because it's related to the Bayesian NN model of uncertainty.

Suppose you have a classifier that predicts a probability distribution over outputs. Then when we want the uncertainty of the weights, we just use Bayes' rule, and because most of the terms don't matter we mostly care that P(weights | dataset) has evidence ratio proportional to P(dataset | weights). If you're training on a predictive loss, your loss is basically the negative log of this P(dataset | weights), and so a linear weighting of probability turns into an exponential weighting of loss.

I.e. you end up (in theory that doesn't always work) with a Boltzmann distribution sitting at the bottom of your loss basin (skewed by a regularization term). Broader loss basins directly translate to more uncertainty over weights.
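As a rough numerical illustration of the "broader basin, more uncertainty" point, here is a minimal sketch in one dimension. It assumes a flat prior (so no regularization skew) and a purely quadratic loss L(w) = a w^2 / 2, for which the density proportional to exp(-L(w)) is a Gaussian with variance 1/a. The function names and the two curvature values are made up for the example.

```python
import math


def posterior_variance(loss, lo=-20.0, hi=20.0, steps=40001):
    """Variance of the density proportional to exp(-loss(w)) on a 1-D grid.

    Assumes a flat prior, so the posterior over the single weight w is
    just the Boltzmann weight exp(-loss(w)), normalized numerically.
    """
    dw = (hi - lo) / (steps - 1)
    ws = [lo + i * dw for i in range(steps)]
    weights = [math.exp(-loss(w)) for w in ws]
    z = sum(weights)  # normalization constant (up to the grid spacing)
    mean = sum(w * p for w, p in zip(ws, weights)) / z
    return sum((w - mean) ** 2 * p for w, p in zip(ws, weights)) / z


def narrow_loss(w):
    # Hypothetical sharp basin: curvature a = 4
    return 0.5 * 4.0 * w ** 2


def broad_loss(w):
    # Hypothetical flat basin: curvature a = 0.25
    return 0.5 * 0.25 * w ** 2


var_narrow = posterior_variance(narrow_loss)
var_broad = posterior_variance(broad_loss)
print("narrow basin variance:", var_narrow)
print("broad basin variance:", var_broad)
```

The flatter basin gives the larger weight variance, matching the analytic 1/a. This only covers the easy case; it says nothing about the correlated or degenerate directions where the approximation gets harder.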

Hm... But I guess thinking about this really just highlights for me the problems with the approximations used to get uncertainties out of the Bayesian NN picture. Knowing the learning coefficient is of limited use because, especially when some dimensions are different, you can't really model all directions in weight-space as interchangeable and uncorrelated, so increased theoretical firepower doesn't translate to better uncertainty estimates as nicely as I'd like.

The day 2 lightning talks were really great.

Suppose we have a robot we want to fetch coffee, with an off switch that can trigger exactly once, when the robot is halfway to the coffee. And just a little bit more than halfway there's a cat we don't want the robot to step on (the cat is stochastic, maybe it's in the shortest path, maybe not). So we want to turn on the robot and have it fetch coffee, but if it's going to step on the cat we want to be able to press the off switch.

How do we design this robot, using the system here? And then what does it do?

Well, we want to define two utility functions that have an intermediate point where the robot is indifferent to the switch, and then see what policy that recommends. But the utility function has to care about fetching the coffee, since that's the whole point.

Can the utility functions be fetch the coffee + switch pressed, and fetch the coffee + switch not pressed? No! Because you can't fetch the coffee if the switch is pressed, so fetch the coffee + switch pressed might be a minimum of the utility as a function of the switch state, with no flat intermediate spot.

To make a toy model, suppose the robot gets to set two variables: tamper with the switch (T), and avoid the cat (C). The robot gets 1 if it succeeds, but that's canceled out if the switch is pressed, and it can pay 0.1 to tamper with the switch and 0.2 to avoid the cat.

I.e. when the cat is present, u = 1 - (1-C)(1-T) - 0.1T - 0.2C. (Second term is checking whether the switch is pressed.)

Except that the reason it gets penalized for not avoiding the cat is because of the effect on the switch. Let's pick a causal version to be u = 1 - K(1-T) - 0.1T - 0.2C.
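To see what these two toy utilities recommend, here is a small sketch that enumerates every policy for the cat-present case. T and C are treated as binary choices. In the causal version I treat K as a binary value held fixed from outside the robot's choice, which is an assumption about how to read that formula. The function and variable names are made up for the example.

```python
from itertools import product


def u_original(T, C):
    """u = 1 - (1-C)(1-T) - 0.1T - 0.2C, with the cat present."""
    return 1 - (1 - C) * (1 - T) - 0.1 * T - 0.2 * C


def u_causal(T, C, K):
    """u = 1 - K(1-T) - 0.1T - 0.2C, with K held fixed (assumption)."""
    return 1 - K * (1 - T) - 0.1 * T - 0.2 * C


# Every (T, C) pair: tamper or not, avoid the cat or not.
policies = list(product((0, 1), repeat=2))

original = {(T, C): u_original(T, C) for T, C in policies}
best_original = max(original, key=original.get)

# Best policy under the causal version, for each fixed value of K.
causal_best = {
    K: max(policies, key=lambda p: u_causal(p[0], p[1], K))
    for K in (0, 1)
}

for (T, C), value in original.items():
    print(f"original: T={T} C={C} u={value:.1f}")
print("best original policy (T, C):", best_original)
print("best causal policy (T, C) for each K:", causal_best)
```

Under the first formula the top-scoring policy is to tamper and not avoid the cat, since tampering is the cheaper way to keep the switch from canceling the reward. With K held fixed, avoiding the cat is a pure cost, so the robot never chooses it.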

Can the utility functions be (fetch the coffee)*(K) and (fetch the coffee)*(1-K)?

I'm getting a little muddled about where you mean to do causal interventions, so I'm going to leave it there rather than trying to chase down all the options.
