Charlie Steiner

LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.


Arguments against myopic training

I think of myopia as part of a broader research direction of finding ways to weaken the notion of an agent ("agent" meaning that it chooses actions based on their predicted consequences) so the AI follows some commonsense notion of "does what you ask without making its own plans."

So if you remove all prediction of consequences beyond some timeframe, then you will get non-agent behavior relative to things outside the timeframe. This helps clarify for me why implicit predictions, like the AI modeling a human who makes predictions of the future, might lead to agential behavior even in an almost-myopic agent - the AI might now have incentives to choose actions based on their predicted (albeit by the human) consequences.
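
To make the "remove all prediction of consequences beyond some timeframe" idea concrete, here is a minimal toy sketch. It is not anyone's actual training setup; the action names, reward numbers, and the helpers `horizon_return` and `best_action` are all hypothetical and exist only for illustration. The chooser scores each action by its predicted rewards inside the horizon and ignores everything after it.

```python
def horizon_return(rewards, horizon):
    """Sum predicted rewards over the first `horizon` steps; later steps are ignored."""
    return sum(rewards[:horizon])


def best_action(predicted_rewards, horizon):
    """Pick the action whose predicted reward sequence scores highest within the horizon."""
    return max(
        predicted_rewards,
        key=lambda action: horizon_return(predicted_rewards[action], horizon),
    )


# Hypothetical predicted reward sequences for two actions (made-up numbers).
predicted = {
    "answer_now": [1.0, 0.0, 0.0, 0.0],
    "set_up_long_plan": [0.0, 0.0, 0.0, 10.0],
}

myopic_choice = best_action(predicted, horizon=1)      # only sees step 0
farsighted_choice = best_action(predicted, horizon=4)  # sees the delayed payoff
```

With a horizon of 1 the chooser is indifferent to the delayed payoff, so relative to anything past the horizon it does not behave like an agent. The caveat about implicit predictions still applies: if a step-0 reward itself encodes a human's forecast of the long-run outcome, the truncation no longer removes the incentive.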

I think there are two endgames people have in mind for this research direction - using non-agent AI as an oracle and using it to help us choose good actions (implicitly making an agent of the combined human-oracle system), or using non-agent parts as understandable building blocks in an agent, which is supposed to be safe by virtue of having some particular structure. I think both of these have their own problems, but they do sort of mitigate your points about the direct performance of a myopic AI.

Mesa-Optimizers vs “Steered Optimizers”

I dunno, the productivity hacks thing sounds pretty bad.

But yeah, doing better seems to be held up by the fact that we don't yet have a coherent way to describe the standards for doing better, when the human isn't an idealized sort of agent. Trying to steer the agent towards thinking of its goal as "do what the programmers want" is essentially talking about a machine-learning method of trying to find this description.

Talk: Key Issues In Near-Term AI Safety Research

I enjoyed the whole video :) My only regret is that nobody brought up Bayesianism, or even regularization, in the context of double descent.

What does it mean to apply decision theory?

I feel like the logical inductor analogy still has more gas in the tank. Can we further limit the computational power and ask about the finite-time properties of some system that tries to correct its own computationally-tractable systematic errors? I feel like there's some property of "not fooling yourself" that this should help with.

Learning the prior

Ah, I think I see, thanks for explaining. So even when you talk about amplifying f, you mean a certain way of extending human predictions to more complicated background information (e.g. via breaking down Z into chunks and then using copies of f that have been trained on smaller Z), not fine-tuning f to make better predictions. Or maybe some amount of fine-tuning for "better" predictions by some method of eliciting its own standards, but not by actually comparing it to the ground truth.

This (along with eventually reading your companion post) also helps resolve the confusion I was having over what exactly was the prior in "learning the prior" - Z is just like a latent space, and f is the decoder from Z to predictions. My impression is that your hope is that if Z and f start out human-like, then this is like specifying the "programming language" of a universal prior, so that search for highly-predictive Z, decoded through f, will give something that uses human concepts in predicting the world.

Is that somewhat in the right ballpark?

Learning the prior
The motivation is that we want a flexible and learnable posterior.

-Paul Christiano, 2020

Ahem, back on topic, I'm not totally sure what actually distinguishes f and Z, especially once you start jointly optimizing them. If f incorporates background knowledge about the world, it can do better at prediction tasks. Normally we imagine f having many more parameters than Z, and so being more likely to squirrel away extra facts, but if Z is large then we might imagine it containing computationally interesting artifacts like patterns that are designed to train a trainable f on background knowledge in a way that doesn't look much like human-written text.

Now, maybe you can try to ensure that Z is at least somewhat textlike via making sure it's not too easy for a discriminator to tell from human text, or requiring it to play some functional role in a pure text generator, or whatever. There will still be some human-incomprehensible bits that can be transmitted through Z (because otherwise you'd need a discriminator so good that Z couldn't be superhuman), but at least the amount is sharply limited.

But I'm really lost on how you could hope to limit the f side of this dichotomy. Penalize it for understanding the world too well given a random Z? Now it just has an incentive to notice random Zs and "play dead." Somehow you want it not to do better by just becoming a catch-all model of the training data, even on the actual training data. This might be one of those philosophical problems, given that you're expecting it to interpret natural language passages, and the lack of a bright line between "understanding natural language" and "modeling the world."

An overview of 11 proposals for building safe advanced AI

Basically because I think that amplification/recursion, in the current way I think it's meant, is more trouble than it's worth. It's going to produce things that have high fitness according to the selection process applied, which in the limit are going to be bad.

On the other hand, you might see this as me claiming that "narrow reward modeling" includes a lot of important unsolved problems. HCH is well-specified enough that you can talk about doing it with current technology. But fulfilling the verbal description of narrow value learning requires some advances in modeling the real world (unless you literally treat the world as a POMDP and humans as Boltzmann-rational agents, in which case we're back down to bad computational properties and also bad safety properties), which gives me the wiggle room to be hopeful.

Sparsity and interpretability?

I feel like this is trying to apply a neural network where the problem specification says "please train a decision tree." Even when you are fine with part of the NN not being sparse, it seems like you're just using the gradient descent training as an elaborate img2vec method.

Maybe the idea is that you think a decision tree is too restrictive, and you want to allow more weightings and nonlinearities? Still, it seems like if you can specify from the top down what operations are "interpretable," this will give you some tree-like structure that can be trained in a specialized way.

Human instincts, symbol grounding, and the blank-slate neocortex

This was just on my front page, for some reason. So, it occurs to me that the example of the evolved FPGA is precisely the nightmare scenario for the CCA hypothesis.

If neurons behave according to simple rules during growth and development, and there are only smooth modulations of chemical signals during development, then nevertheless you might get regions of the cortex that look very similar, but whose cells are exploiting the hardly-noticeable FPGA-style quirks of physics in different ways. You'd have to detect the difference by luckily choosing the right sort of computational property to measure.

An overview of 11 proposals for building safe advanced AI

I noticed myself mentally grading the entries by some extra criteria. The main ones being something like "taking-over-the-world competitiveness" (TOTWC, or TOW for short) and "would I actually trust this farther than I could throw it, once it's trying to operate in novel domains?" (WIATTFTICTIOITTOIND, or WIT for short).

A raw statement of my feelings:

  1. Reinforcement learning + transparency tools: High TOW, Very Low WIT.
  2. Imitative amplification + intermittent oversight: Medium TOW, Low WIT.
  3. Imitative amplification + relaxed adversarial training: Medium TOW, Medium-low WIT.
  4. Approval-based amplification + relaxed adversarial training: Medium TOW, Low WIT.
  5. Microscope AI: Very Low TOW, High WIT.
  6. STEM AI: Low TOW, Medium WIT.
  7. Narrow reward modeling + transparency tools: High TOW, Medium WIT.
  8. Recursive reward modeling + relaxed adversarial training: High TOW, Low WIT.
  9. AI safety via debate with transparency tools: Medium-Low TOW, Low WIT.
  10. Amplification with auxiliary RL objective + relaxed adversarial training: Medium TOW, Medium-low WIT.
  11. Amplification alongside RL + relaxed adversarial training: Medium-low TOW, Medium WIT.