Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Reducing Goodhart

Comments

Yeah, I don't know where my reading comprehension skills were that evening, but they weren't with me :P

Oh well, I'll just leave it as is as a monument to bad comments.

I think it's pretty tricky, because what matters to real networks is the cost difference between storing features pseudo-linearly (in superposition), storing them nonlinearly (in any of the many ways that take multiple network layers to decode), and not storing them at all. Calculating such a cost function seems to depend on the particulars of the network and training process, which makes it a total pain to mathematize (but maybe amenable to toy models).
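To make that concrete, here's a throwaway numpy toy (my own sketch, not anything from the post) comparing two ways of fitting m sparse features into d < m dimensions: cram all of them in along random directions and accept interference, or give d of them dedicated dimensions and drop the rest. Which one loses less depends on how many features are active at once, which is part of why I don't expect a clean general-purpose cost function.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, L, trials = 200, 50, 5, 2000  # m features, d residual dims, L active at once

W = rng.normal(size=(m, d)) / np.sqrt(d)  # random, roughly-unit feature directions

def sparse_input():
    x = np.zeros(m)
    x[rng.choice(m, size=L, replace=False)] = 1.0
    return x

superpos_err = dropped_err = 0.0
for _ in range(trials):
    x = sparse_input()
    # (a) superposition: write every feature into d dims, read back linearly
    x_hat = W @ (W.T @ x)
    superpos_err += np.mean((x_hat - x) ** 2)
    # (b) dedicated dims: features 0..d-1 stored exactly, the rest not stored at all
    x_kept = np.where(np.arange(m) < d, x, 0.0)
    dropped_err += np.mean((x_kept - x) ** 2)

print("superposition MSE:", superpos_err / trials)
print("drop-features MSE:", dropped_err / trials)
```

Changing L, or swapping the linear readout for a thresholded/nonlinear one like real networks use, flips which strategy wins - which is exactly the "depends on the particulars" point.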

Neat, thanks. Later I might want to rederive the estimates using different assumptions - not only should the number of active features L be used in calculating the average 'noise' level (basically treating it as an environment parameter rather than a design decision), but we might want another free parameter for how statistically dependent features are. If I really feel energetic I might try to treat the per-layer information loss all at once rather than bounding it above as the sum of information losses of individual features.
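For concreteness, the back-of-envelope I have in mind for why L belongs in the noise term (standard superposition heuristics, not anything from your post): if features sit along roughly-unit directions $w_i$ in a $d$-dimensional space, random pairs overlap by about $1/\sqrt{d}$, so with $L$ features active the interference on any one readout is a sum of $L$ such terms:

$$\text{noise}_i \;=\; \sum_{j \neq i,\; j\,\text{active}} \langle w_i, w_j \rangle\, x_j, \qquad \operatorname{std}(\text{noise}_i) \;\sim\; \sqrt{L/d}.$$

So the environment (how many features co-occur) sets the noise floor just as much as the design choice of $d$ does - and if features are statistically dependent, the cross terms stop cancelling, which is where the extra free parameter would come in.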

I feel like there's a somewhat common argument that RL isn't all that dangerous because it generalizes from the training distribution cautiously: being outside the training distribution isn't going to suddenly cause an RL system to make multi-step plans that are implied but never seen in training; it'll probably just fall back on familiar, safe behavior.

To me, these arguments feel like they treat present-day model-free RL as the "central case," and model-based RL as a small correction.

Anyhow, good post, I like most of the arguments, I just felt my reaction to this particular one could be made in meme format.

I hear you as saying "If we don't have to worry about teaching the AI to use human values, then why do sandwiching when we can measure capabilities more directly some other way?"

One reason is that with sandwiching, you can more rapidly measure capabilities generalization, because you can do things like collect the test set ahead of time or supervise with a special-purpose AI.

But if you want the best evaluation of a research assistant's capabilities, I agree that actually using it as a research assistant is more reliable.

A separate issue I have here is the assumption that you don't have to worry about teaching an AI to make human-friendly decisions if you're using it as a research assistant, and therefore we can go full speed ahead trying to make general-purpose AI as long as we mean to use it as a research assistant. A big "trust us, we're the good guys" vibe.

Relative to string theory, getting an AI to help us do AI alignment is much more reliant on teaching the AI to give good suggestions in the first place - and not merely "good" in the sense of highly rated, but good in the contains-hard-parts-of-outer-alignment kinda way. So I disagree with the assumption in the first place.

And then I also disagree with the conclusion. Technology proliferates, and there are misuse opportunities even within an organization that's 99% "good guys." But maybe this is a strategic disagreement more than a factual one.

Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate - and if you don't eliminate them, they'll keep happening, and eventually some will slip through. I think I take them more seriously than you do.

Or if you buy a shard-theory-esque picture of RL locking in heuristics, what heuristics can get locked in depends on what's "natural" to learn first, even when training from scratch.

Both of these hypotheses probably should come with caveats though. (About expected reliability, training time, model-free-ness, etc.)

The history is a little murky to me. When I wrote [what's the dream for giving natural-language commands to AI](https://www.lesswrong.com/posts/Bxxh9GbJ6WuW5Hmkj/what-s-the-dream-for-giving-natural-language-commands-to-ai), I think I was trying to pin down and critique (a version of) something that several other people had gestured to in a more offhand way, but I can't remember the primary sources. (Maybe Rohin's alignment newsletter between the announcement of GPT2 and then would contain the relevant links?)

This is what all that talk about predictive loss was for. Training on predictive loss gets you systems that are especially well-suited to being described as learning the time-evolution dynamics of the training distribution. Not in the sense that they're simulating the physical reality underlying the training distribution, merely in the sense that they're learning dynamics for the behavior of the training data.
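To spell out the sense I mean (this is just standard autoregressive training, nothing exotic): the loss is

$$\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}} \sum_t \log p_\theta(x_t \mid x_{<t}),$$

which is minimized when $p_\theta(x_t \mid x_{<t})$ matches the training distribution's own conditional $p_{\mathcal{D}}(x_t \mid x_{<t})$. The model is pushed toward reproducing the one-step-ahead dynamics of the data, not toward recovering whatever physical process generated the data.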

Sure, you could talk about AlphaZero in terms of prediction. But it's not going to have the sort of configurability that makes the simulator framing so fruitful in the case of GPT (or in the case of computer simulations of the physical world). You can't feed AlphaZero the first 20 moves of a game by Magnus Carlsen and have it continue like him.
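A minimal sketch of what I mean by configurability, using the Hugging Face transformers API (the model name and prompt are just illustrative): a next-token predictor can be steered toward "continue this game like Carlsen" purely by conditioning on a prefix, no retraining needed. There's no analogous knob on AlphaZero.

```python
# Sketch only: conditioning a base language model on a PGN-style prefix.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = '[White "Carlsen, Magnus"]\n1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. '
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.95)
print(tok.decode(out[0]))  # the continuation is shaped entirely by the conditioning prefix
```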

Or to use a different example, one place where talking about simulators earns its keep is when someone asks "Does GPT know this fact?" - because GPT's dynamics are inhomogeneous, it doesn't consistently act as if it knows the fact, nor consistently act as if it doesn't. But AlphaZero's training process is actively trying to get rid of that kind of inhomogeneity - AlphaZero isn't trained to mimic a training distribution, it's trained to play high-scoring moves.

The simulator framing has no accuracy advantage over thinking directly in terms of next token prediction, except that thinking in terms of simulator and simulacra sometimes usefully compresses the relevant ideas, and so lets people think larger new thoughts at once. Probably useful for coming up with ChatGPT jailbreaks. Definitely useful for coming up with prompts for base GPT.
