Charlie Steiner

LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.


Frequent arguments about alignment

For #2, not sure if this is a skeptic or an advocate point: why have a separate team at all? When designing a bridge you don't have one team of engineers making the bridge, and a separate team of engineers making sure the bridge doesn't fall down. Within openAI, isn't everyone committing to good things happening, and not just strictly picking the lowest-hanging fruit? If alignment-informed research is better long-term, why isn't the whole company the "safety team" out of simple desire to do their job?

We could make this more obviously skeptical by rephrasing it as a wisdom-of-the-crowds objection. You say we need people focused on alignment because it's not always the lowest-hanging fruit. But other people aren't dumb and want things to go well - are you saying they're making a mistake?

And then you have to either say yes, they're making a mistake because people are (e.g.) both internally and externally over-incentivized to do things that have flashy results now, or no, they're not making a mistake, in fact having a separate alignment group is a good idea even in a perfect world because of (e.g.) specialization of basic research, or some combination of the two.

The Nature of Counterfactuals

In addition to self-consistency, we can also imagine agents that interact with an environment and learn about how to model the environment (thereby having an effective standard for counterfactuals - or it could be explicit if we hand-code the agents to choose actions by explicitly considering counterfactuals) by taking actions and evaluating how good their predictions are.

Knowledge is not just precipitation of action

Do you want to chat sometime about this?

I think it's pretty clear why we think of the map-making sailboat as "having knowledge" even if it sinks, and it's because our own model of the world expects maps to be legible to agents in the environment, and so we lump them into "knowledge" even before actually seeing someone use any particular map. You could try to predict this legibility part of how we think of knowledge from the atomic positions of the item itself, but you're going to get weird edge cases unless you actually make a intentional-stance-level model of the surrounding environment to see who might read the map.

EDIT: I mean, the interesting thing about this to me is then asking the question of what this means about how granular to be when thinking about knowledge (and similar things).

Big picture of phasic dopamine

How does the section of the amygdala that a particular dopamine neuron connects to even get trained to do the right thing in the first place? It seems like there should be enough chance in connections that there's really only this one neuron linking a brainstem's particular output to this specific spot in the amygdala - it doesn't have a whole bundle of different signals available to send to this exact spot.

SL in the brain seems tricky because not only does the brainstem have to reinforce behaviors in appropriate contexts, it might have to train certain outputs to correspond to certain behaviors in the first place, all with only one wire to each location! Maybe you could do this with a single signal that means both "imitate the current behavior" and also "learn to do your behavior in this context"? Alternatively we might imagine some separate mechanism for of priming the developing amygdala to start out with a diverse yet sensible array of behavior proposals, and the brainstem could learn what its outputs correspond to and then signal them appropriately.

Big picture of phasic dopamine

One thing that strikes me as odd about this model is that it doesn't have the blessing of dimensionality - each plan is one loop, and evaluating feedback to a winning plan just involves feedback to one loop. When it's general reward we can simplify this with just rewarding recent winning plans, but in some places it seems like you do imply highly specific feedback, for which you need N feedback channels to give feedback on ~N possible plans. The "blessing of dimensionality" kicks in when you can use more diverse combinations of a smaller number of feedback channels to encode more specific feedback.

Maybe what seems to be specific feedback is actually a smaller number of general types? Like rather than specific feedback to snake-fleeing plans or whatever, a broad signal (like how Success-In-Life Reward is a general signal rewarding whatever just got planned) could be sent out that means "whatever the amygdala just did to make the snake go away, good job" (or something). Note that I have no idea what I'm talking about.

Thoughts on the Alignment Implications of Scaling Language Models

Great post! I very much hope we can do some clever things with value learning that let us get around needing AbD to do the things that currently seem to need it.

The fundamental example of this is probably optimizability - is your language model so safe that you can query it as part of an optimization process (e.g. making decisions about what actions are good), without just ending up in the equivalent of deepDream's pictures of Maximum Dog.

Problems facing a correspondence theory of knowledge

I think grappling with this problem is important because it leads you directly to understanding that what you are talking about is part of your agent-like model of systems, and how this model should be applied depends both on the broader context and your own perspective.

Saving Time

What is then stopping us from swapping the two copies of the coarser node?

Isn't it precisely that they're playing different roles in an abstracted model of reality? Though alternatively, you can just throw more logical nodes at the problem and create a common logical cause for both.

Also, would you say what you have in mind is built out of of augmenting a collection of causal graphs with logical nodes, or do you have something incompatible in mind?

Agency in Conway’s Game of Life

The truly arbitrary version seems provably impossible. For example, what if you're trying to make a smiley face, but some other part of the world contains an agent just like you except they're trying to make a frowny face - you obviously both can't succeed. Instead you need some special environment with low entropy, just like humans do in real life.

Load More