Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.


Reducing Goodhart



[Intro to brain-like-AGI safety] 14. Controlled AGI

If I wanted to play fast and loose, I would claim that our sense of ourselves as having a first-person perspective at all is part of an evolutionary solution to the problem of learning from other people's experiences (wait, wasn't there a post like that recently? Or was that about empathy...). It merely seems like a black box to us because we're too good at it, precisely because it's so important.

Somehow we develop a high-level model of the world with ourselves and other people in it, and then this level of abstraction actually gets hooked up to our motivations - making this a subset of social instincts.

When imagining hooking up abstract learned world models to motivation for an AI like this, I sometimes imagine something much less "fire and forget" than the human brain - something more like people testing, responding to, and modifying an AI that's training or pre-training on real-world data. Evolution doesn't get to pause me at age 4 and rummage around in my skull.

Open Problems in Negative Side Effect Minimization

There's definitely a tension here between avoiding bad disruptive actions and doing good disruptive actions.

It seems to me like you're thinking about SEM more like a prior that starts out dominant but can get learned away over time. Is that somewhat close to how you're thinking about this tension?

Law-Following AI 3: Lawless AI Agents Undermine Stabilizing Agreements

Decision-makers who need inspections to keep them in line are incentivized to subvert those inspections and betray the other party. It seems like what's actually necessary is people in key positions who are willing to cooperate in the prisoner's dilemma when they think they are playing against people like them - people who would cooperate even if there were no inspections.

But if there are inspections, then why do we need law-following AI? Why not have the inspections directly check that the AI would not harm the other party (hopefully because it would be helping humans in general)?

Refine: An Incubator for Conceptual Alignment Research Bets

Great news! I have to change the post I was drafting about unfilled niches :)

AMA Conjecture, A New Alignment Startup

Do you expect interpretability tools developed now to extend to interpreting more general (more multimodal, better at navigating the real world) decision-making systems? How?

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Warning: rambling.

I feel like I haven't quite clarified for myself what the cybernetic agent is doing wrong (or if you don't want to assume physicalism, just put scare quotes around "doing wrong") when it sees the all-black input. There might be simple differences that have many implications.

Suppose that all hypotheses are in terms of physical universe + bridging law. We might accuse the cybernetic agent of violating a dogma of information flow by making its physical-universe-hypothesis depend on the output of the bridging law. But it doesn't necessarily have to! Consider the hypothesis that has a universe just complicated enough to explain my past observations up until the all-black input, and then a bridging law that is normal (locating myself within the modeled universe etc.) until I see the all-black input, upon which it becomes something very simple (like all black from now on). This seems simple to the cybernetic agent but is nonsensical from a physicalist perspective.

So we might say that all information backflow from the output of the bridging law back into any part of the hypothesis is a mistake. But although this might work for idealized microphysical hypotheses, in practice I think I actually violate both levels of information-flow dogma when I use my expected thoughts and feelings to predict what I'll do and perceive in the physical world. You could still deny that such practical self-modeling "counts," making my hypotheses something more Platonic, imputed to me but not actually carried around in my physical brain. Or we could imagine some "inner RL agent" steering my brain that gets to obey the strict rules of physicalism even if my brain as a whole doesn't. I find the latter unsatisfactory: I want to explore my own whole-brain views on physicalism, not the views of a tiny fraction of me. Flipping this around, if we build a Turing RL agent, I want the thing as a whole to be physicalist, not just a small part of it.

In short, the information flow dogma seems promising for toy models of reasoning, but it feels to me like it must only be an expression of some deeper principle that even we bounded humans can obey.

One important fact I didn't yet mention is that we don't have a final hypothesis for the world - we're constantly getting new bits that have to be incorporated into our physical hypothesis and bridging law. Without this, I think the perverse "all black from now on" hypothesis isn't very appealing (though still more appealing to the Cartesian agent than to the physicalist), because it's more complex than the true hypothesis. But with new bits flowing in all the time, the simplest predictions are always going to predict that this information flow suddenly stops.

Is there some kind of Copernican principle here that makes sense to a physicalist but sounds like nonsense to a Cartesian? Like "No, this is not the instant when you're going to stop learning new things forever." Is physicalism related to looking for patterns on the meta-levels?

Why Agent Foundations? An Overly Abstract Explanation

It's not clear to me that your metaphors are pointing at something in particular.

Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks. If this was all we needed, then yes, absolutely, I'm sure there's a similarly neat and simple way to instrumentalize human values - it's just going to fail if things are too smart, or too irrational, or too far in the future.

Biology being human-comprehensible is an interesting topic, and suppose I grant that it is - that we could have comprehensible explanatory stories for everything our cells do, and that these stories aren't collectively leaving anything out. First off, I would note that such a collection of stories would still be really complicated relative to simple abstractions in physics or economics! Second, this doesn't connect directly to Goodhart's law. We're just talking about understanding biology, without mentioning purposes to which our understanding can be applied. Comprehending biology might help us generalize, in the sense of being able to predict what features will be conserved by mutation, or will adapt to a perturbed environment, but again this generalization only seems to work in a limited range, where the organism is doing all the same jobs with the same divisions between them.

The butterfly effect metaphor seems like the opposite of biology. In biology you can have lots of little important pieces - they're not individually redirecting the whole hurricane/organism, but they're doing locally-important jobs that follow comprehensible rules, and so we don't disregard them as noise. None of the butterflies have such locally-useful stories about what they're doing to the hurricane, they're all just applying small incomprehensible perturbations to a highly chaotic system. The lesson I take is that messiness is not the total lack of structure - when I say my room is messy, I don't mean that the arrangement of its component atoms has been sampled from the Boltzmann distribution - it's just that the structure that's there isn't easy for humans to use.

I'd like to float one more metaphor: K-complexity and compression.

Suppose I have a bit string of length 10^9, and I can compress it down to length 10^8. The "True Name hypothesis" is that the compression looks like finding some simple, neat patterns that explain most of the data and that we expect to generalize well, plus a lot of "diff" that's the noisy difference between the simple rules and the full bit string. The "fractal hypothesis" is that there are a few simple patterns that do some of the work, and a few less simple rules that do more of the work, and so on for as long as you have patience. The "total mess hypothesis" is that simple rules do a small amount of the work, and a lot of the 10^8 bits consists of big, highly interdependent programs that would output something very different if you flipped just a few bits. Does this seem about right?
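For concreteness, here's a toy sketch of the "True Name hypothesis" picture (my own made-up illustration, not from the post): the simple rule "each bit copies the previous one" explains most of a mostly-constant bit string, and the stored "diff" is just the handful of positions where the rule fails.

```python
def compress(bits):
    # Simple rule: predict each bit as a copy of the previous one.
    # The "diff" is the set of positions where that prediction fails.
    diff = {i for i in range(1, len(bits)) if bits[i] != bits[i - 1]}
    return bits[0], diff

def decompress(first_bit, diff, length):
    out = [first_bit]
    for i in range(1, length):
        out.append(out[-1] ^ (i in diff))  # flip the predicted bit at diff positions
    return out

data = [0] * 50 + [1] * 40 + [0] * 10  # mostly-constant string
first_bit, diff = compress(data)
assert decompress(first_bit, diff, len(data)) == data
assert len(diff) == 2  # only two transitions: the simple rule did almost all the work
```

The "total mess hypothesis" would correspond to the diff carrying most of the bits, with no simple rule doing much work at all.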

When people ask for your P(doom), do you give them your inside view or your betting odds?

Could you explain more about the difference, and what it looks like to give one vs. the other?

Why Agent Foundations? An Overly Abstract Explanation

Not sure if I disagree or if we're placing emphasis differently.

I certainly agree that there are going to be places where we'll need to use nice, clean concepts that are known to generalize. But I don't think that the resolutions to problems 1 and 2 will look like nice clean concepts (like in minimizing mutual information). It's not just human values that are messy and contingent, even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent. I think of some of my intuitions as my "real values" and others as mere "biases" in a thoroughly messy way.

But back on the first hand again, what's "messy" might be subjective. A good recipe for fitting values to me will certainly be simple and neat compared to the totality of information stored in my brain.

And I certainly want to move away from the framing that the way to deal with problems 1 and 2 is to say "Goodhart's law says that any difference between the proxy and our True Values gets amplified... so we just have to find our True Values" - I think this framing leads one to look for solutions in the wrong way (trying to eliminate ambiguity, trying to find a single human-comprehensible model of humans from which the True Values can be extracted, mistakes like that). But this is also kind of a matter of perspective - any satisfactory value learning process can be evaluated (given a background world-model) as if it assigns humans some set of True Values.

I think even if we just call these things differences in emphasis, they can still lead directly to disagreements about (even slightly) meta-level questions, such as how we should build trust in value learning schemes.

ELK Thought Dump

Pragmatism's a great word, everyone wants to use it :P But to be specific, I mean more like Rorty (after some Yudkowskian fixes) than Peirce.
