Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.


Reducing Goodhart


Goodhart Ethology

np, I'm just glad someone is reading/commenting :)

Goodhart Ethology

Yeah, this is right. The variable uncertainty comes in for free when doing curve fitting - close to the datapoints your models tend to agree, far away they can shoot off in different directions. So if you have a probability distribution over different models, applying the correction for the optimizer's curse has the very sensible effect of telling you to stick close to the training data.
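A toy sketch of that effect, under loud assumptions: the three-model ensemble below is hand-picked rather than fitted to any data, and "mean minus k standard deviations" is just one simple stand-in for an optimizer's-curse correction, not necessarily the one meant in the post. The names `CUBIC_COEFFS`, `naive_score`, and `corrected_score` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical ensemble: every model behaves like y = x near x = 0 (standing in
# for the training data), but each has a different cubic term that takes over
# far from the data.
CUBIC_COEFFS = [-0.5, 0.0, 0.5]


def predictions(x):
    """Predicted value at x under each model in the ensemble."""
    return [x + c * x**3 for c in CUBIC_COEFFS]


def naive_score(x):
    """Average prediction, ignoring how much the models disagree."""
    return mean(predictions(x))


def corrected_score(x, k=1.0):
    """Average prediction, penalized by k times the spread across models."""
    preds = predictions(x)
    return mean(preds) - k * pstdev(preds)


# Candidate inputs from -3.0 to 3.0 in steps of 0.1.
candidates = [i / 10 for i in range(-30, 31)]

naive_choice = max(candidates, key=naive_score)
corrected_choice = max(candidates, key=corrected_score)

print("spread near the data (x=0.1):", pstdev(predictions(0.1)))
print("spread far away (x=3.0):    ", pstdev(predictions(3.0)))
print("naive choice:    ", naive_choice)
print("corrected choice:", corrected_choice)
```

The naive optimizer runs to the edge of the candidate range, where the models disagree most, while the penalized score settles much closer to the region where they agree.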

Measurement, Optimization, and Take-off Speed

I'm confused about your picture of "outer optimization power." What sort of decisions would be informed by knowing how sensitive the learned model is to perturbations of hyperparameters?

Any thoughts on just tracking the total amount of gradient-descending done, or total amount of changes made, to measure optimization?
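To make the question concrete, here is a minimal sketch of what I mean by tracking, assuming plain gradient descent on a hypothetical one-parameter loss `(w - 3)^2`. The function names and the choice of summing absolute update sizes are illustrative assumptions, not an established measure of optimization.

```python
def grad(w):
    """Gradient of the hypothetical loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


def run_gradient_descent(w0=0.0, lr=0.1, steps=50):
    """Run gradient descent while tallying two crude optimization measures."""
    w = w0
    total_change = 0.0  # summed size of every parameter update
    for _ in range(steps):
        update = -lr * grad(w)
        w += update
        total_change += abs(update)
    # `steps` is the total amount of gradient-descending done.
    return w, total_change, steps


w_final, total_change, n_steps = run_gradient_descent()
print("final parameter:", w_final)
print("total change made:", total_change)
print("gradient steps taken:", n_steps)
```

In this toy case the parameter moves monotonically toward the minimum, so the total change is just the distance travelled; with noisy or oscillating updates the two would come apart.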

Grokking the Intentional Stance

Nice summary :) It's relevant for the post that I'm about to publish that you can have more than one intentional-stance view of the same human. The inferred agent-shaped model depends not only on the subject and the observer, but also on the environment, and on what the observer hopes to get by modeling.

Charlie Steiner's Shortform

(bioRxiv)

Cool paper on trying to estimate how many parameters neurons have (h/t Samuel at EA Hotel). I don't feel like they did a good job distinguishing how hard it was for them to fit nonlinearities that would nonetheless be the same across different neurons, versus the number of parameters that were different from neuron to neuron. But just based on differences in physical arrangement of axons and dendrites, there's a lot of opportunity for diversity, and I do think the paper was convincing that neurons are sufficiently nonlinear that this structure is plausibly important. The question is how much neurons undergo selection based on this diversity, or even update their patterns as a form of learning!

Research agenda update

I only really know about the first bit, so have a comment about that :)

Predictably, when presented with the 1st-person problem I immediately think of hierarchical models. It's easy to say "just imagine you were in their place." What I'd think could do this thing is accessing/constructing a simplified model of the world (with primitives that have interpretations as broad as "me" and "over there") that is strongly associated with the verbal thought (EDIT: or alternately is a high-level representation that cashes out to the verbal thought via a pathway that ends in verbal imagination), and then cashing out the simplified model into a sequence of more detailed models/anticipations by fairly general model-cashing-out machinery.

I'm not sure if this is general enough to capture how humans do it, though. When I think of humans on roughly this level of description, I usually think of having many different generative models (a metaphor for a more continuous system with many principal modes, which is still a metaphor for the brain-in-itself) that get evaluated at first in simple ways, and if found interesting get broadcast and get to influence the current thought, meanwhile getting evaluated in progressively more complex ways. Thus a verbal thought "imagine you were in their place" can get sort of cashed out into imagination by activation of related-seeming imaginings. This lacks the same notion of "models" as above; i.e. a context agent is still too agenty, and we don't need the costly simplification of agentyness in our model to talk about learning from other people's actions.

Plus that doesn't get into how to pick out what simplified models to learn from. You can probably guess better than me if humans do something innate that involves tracking human-like objects and then feeling sympathy for them. And I think I've seen you make an argument that something similar could work for an AI, but I'm not sure. (Would a Bayesian updater have less of the path-dependence that safety of such innate learning seems to rely on?)

Answering questions honestly instead of predicting human answers: lots of problems and some solutions

I'm having some formatting problems (reading in Firefox) with scroll bars under full-width LaTeX covering the following line of text.

(So now I'm finishing reading it on greaterwrong.)

BASALT: A Benchmark for Learning from Human Feedback

Nice! If I had university CS affiliations I would send them this with unsubtle comments that it would be a cool project to get students to try :P

In fact, now that I think about it, I do have one contact through the UIUC datathon. Or would you rather not have this sort of marketing?

Anthropics in infinite universes

A similarly odd question is how this plays with Solomonoff induction. Is a universe with infinite stuff in it of zero prior probability, because it requires infinite bits to specify where the stuff is? Quantum mechanics would say no: we can just specify a simple quantum state of the early universe, and then we're within one branch of that wavefunction. And the (quantum) information required to locate us within that wavefunction is only related to the information we actually see, i.e. finite.

A world in which the alignment problem seems lower-stakes

Weird coincidence, but I just read Superintelligence for the first time, and I was struck by the lack of mention of Steve Omohundro (though he does show up in endnote 8). My citation for instrumental convergence would be Omohundro 2008.
