Nora Belrose


Sorted by New

Wiki Contributions


(Didn't consult Quintin on this; I speak for myself)

I flatly deny that our arguments depend on AGI being anything like an LLM. I think the arguments go through in a very wide range of scenarios, basically as long as we're using some kind of white-box optimization to align them, rather than e.g. carrot-and-stick incentives or prompt engineering. Even if we only relied on prompt engineering, I think we'd be in a better spot than with humans (because we can run many controlled experiments).

A human can harbor a secret desire for years, never acting on it, and their brain won’t necessarily overwrite that desire, even as they think millions of thoughts in the meantime. So evidently, the argument above is inapplicable to human brains.

I’m pretty confused by this claim. Why should we expect the human reward system to overwrite all secret desires? Also how do we know it’s not doing that? Your desires are just causal effects of a bunch of stuff including your reward circuitry.

As a human, I can be sitting in bed, staring into space, and I can think a specific abstruse thought about string theory, and now I’ve figured out something important. If a future AI can do that kind of thing, as I expect, then it’s not so clear that “controlling the AI’s sensory environment” is really all that much control.

  1. This is just generally a pretty weak argument. You don't seem to be contesting the fact that we have full sensory control for AI and we don’t have full sensory control for humans. It’s just a claim that this doesn’t matter. Maybe this ends up being a brute clash of intuitions, but it seems obvious to me that full sensory control matters a lot, even if the AI is doing a lot of long running cognition without supervision.
  2. With AI we can choose to cut its reasoning short whenever we want, force it to explain itself in human language, roll it back to a previous state, etc. We just have a lot more control over this ongoing reasoning process for AIs and it’s baffling to me that you seem to think this mostly doesn’t matter.

That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before

You can just include online learning in your experimentation loop. See what happens when you let the AI online learn for a bit in different environments. I don't think online learning changes the equation very much. It's known to be less stable than offline RL, but that instability hurts capabilities as well as alignment, so we'd need a specific argument that it will hurt alignment significantly more than capabilities, in ways that we wouldn't be able to notice during training and evaluation.

I have no idea how I’m supposed to interpret this sentence ["we are the innate reward system"] for brain-like AGI, such that it makes any sense at all. Actually, I’m not quite sure what it means even for LLMs!

It just means we are directly updating the AI’s neural circuitry with white box optimizers. This will be true across a very wide range of scenarios, including (IIUC) your brain-like AGI scenario.

Brains can imitate, but do so in a fundamentally different way from LLM pretraining

I don’t see why any of the differences you listed are relevant for safety.

Relatedly, brains have a distinction between expectations and desires, cleanly baked into the algorithms. I think this is obvious common sense, leaving aside galaxy-brain Free-Energy-Principle takes which try to deny it.

I basically deny this, especially if you're stipulating that it's a "clean" distinction. Obviously folk psychology has a fuzzy distinction between beliefs and desires in it, but it's also well-known both in common sense and among neuroscientists etc. that beliefs and desires get mixed up all the time and there's not a particularly sharp divide.


FWIW it appears that out of the 4 differences you cited here, only one of them (the relaxation of the restriction that the scrubbed output must be the same) still holds as of this January paper from Geiger's group So the methods are even more similar than you thought.

I think it would be a distraction to try to figure out if LMs are "phenomenally conscious" for a few different reasons.

  1. I think there are pretty strong reasons to believe that phenomenal consciousness is not actually a substantive property, in the sense that either everything has it in some sense (panpsychism) or nothing does (eliminativism). Any other solution confronts the Hard Problem and the empirical intractability of actually figuring out which things are or are not phenomenally conscious.
  2. Your proposed tests for phenomenal consciousness seem to, in fact, be testing for access consciousness— basically, the ability to do certain types of reflection and introspection. Access consciousness may well be relevant for alignment; it seems pretty related to situational awareness. But that's not phenomenal consciousness (because of the Hard Problem). Phenomenal consciousness is causally inert and empirically untestable.
  3. While it would be a problem if LMs were moral patients, I think these concerns are utterly dwarfed by the value we'd lose due to an AI-caused existential catastrophe. Also, on the most plausible views of valence, an experience's valence is directly determined by your first-order in-the-moment preferences to continue having that experience or not. If valence just reduces to preferences, then we really can just talk about the preferences, which seem more empirically tractable to probe.

I do think consciousness is real and important (I think some form of Russellian monism is probably right). I just don't think it matters for alignment.

This probably doesn't work, but have you thought about just using weight decay as a (partial) solution to this? In any sort of architecture with residual connections you should expect circuits to manifest as weights with nontrivial magnitude. If some set of weights isn't contributing to the loss then the gradients won't prevent them from being pushed toward zero by weight decay. Sort of a "use it or lose it" type thing. This seems a lot simpler and potentially more robust than other approaches.