All of Charlie Steiner's Comments + Replies

Non-Adversarial Goodhart and AI Risks

Necromantic comment, sorry :P

I might be misinterpreting, but what I think you're saying is that if the humans make a mistake in using a causal model of the world and tell the AI to optimize for something bad-in-retrospect, this is "mistaken causal structure lead[ing] to regressional or extremal Goodhart", and thus not really causal Goodhart per se (by the categories you're intending). But I'm still a little fuzzy on what you mean to be actual factual causal Goodhart.

Is the idea that humans tell the AI to optimize for something that is not bad-in-retrospect... (read more)

2 David Manheim 5d: Yes on point Number 1, and partly on point number 2. If humans don't have incredibly complete models for how to achieve their goals, but know they want a glass of water, telling the AI to put a cup of H2O in front of them can create weird mistakes. This can even happen because of causal connections the humans are unaware of. The AI might have better causal models than the humans, but still cause problems for other reasons. In this case, a human might not know the difference between normal water and heavy water, but the AI might decide that since there are two forms, it should have them present in equal amounts, which would be disastrous for reasons entirely beyond the understanding of the human who asked for the glass of water. The human needed to specify the goal differently, and was entirely unaware of what they did wrong - and in this case it will be months before the impacts of the weirdly different than expected water show up, so human-in-the-loop RL or other methods might not catch it.
AXRP Episode 11 - Attainable Utility and Power with Alex Turner

It seems like evaluating human AU depends on the model. There's a "black box" sense where you can replace the human's policy with literally anything in calculating AU for different objectives, and there's a "transparent box" sense in which you have to choose from a distribution of predicted human behaviors.

The former is closer to what I think you mean by "hasn't changed the humans' AU," but I think it's the latter that an AI cares about when evaluating the impact of its own actions.

2 Alex Turner 5d: I'm discussing a philosophical framework for understanding low impact. I'm not prescribing how the AI actually accomplishes this.
My take on Vanessa Kosoy's take on AGI safety

Yup, I more or less agree with all that. The name thing was just a joke about giving things we like better priority in namespace.

I think quantilization is safe when it's a slightly "lucky" human-imitation (also if it's a slightly "lucky" version of some simpler base distribution, but then it won't be as smart). But push too hard, which might not be very hard at all if you're iterating quantilization steps rather than quantilizing over a long-term policy, and instead you get an unaligned intelligence that happens to interact with the world by picking human-... (read more)

What Selection Theorems Do We Expect/Want?

Biologically, I think the evolution of body-plan modularity might be backwards from your general argument for it - the biological goal seems to be to make big (but not automatically bad) changes from small mutations (e.g. entire extra body segments), not to "hide" DOF inside modules to allow for smoother changes per parameter.

In fact, this strikes me as resembling abstraction, so this might be right in your wheelhouse :P Biological modularity seems to specifically select for those modules with simple interfaces that can be cut-and-pasted with the maximal c... (read more)

My take on Vanessa Kosoy's take on AGI safety

Like, maybe we should say that "radically superhuman safety and benevolence" is a different problem than "alignment"? 

Ah, you mean that "alignment" is a different problem than "subhuman and human-imitating training safety"? :P

So is there a continuum between category 1 and category 2? The transitional fossils could be non-human-imitating AIs that are trying to be a little bit general or have goals that refer to a model of the human a little bit, but the designers still understand the search space better than the AIs.

1 Steve Byrnes 16d: "Quantilizing from the human policy" is human-imitating in a sense, but also superhuman. At least modestly superhuman - depends on how hard you quantilize. (And maybe very superhuman in speed.) If you could fork your brain state to create an exact clone, would that clone be "aligned" with you? I think that we should define the word "aligned" such that the answer is "yes". Common sense, right? Seems to me that if you say "yes it's aligned" to that question, then you should also say "yes it's aligned" to a quantilize-from-the-human-policy agent. It's kinda in the same category, seems to me. Hmm, Stuart Armstrong suggested that "alignment is conditional: an AI is aligned with humans in certain circumstances, at certain levels of power." So then maybe as you quantilize harder and harder, you get less and less confident in that system's "alignment"? (I'm not sure we're disagreeing about anything substantive, just terminology, right? Also, I don't actually personally buy into this quantilization picture, to be clear.)
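
For readers who have not met quantilization, "how hard you quantilize" can be made concrete with a minimal Python sketch. It assumes the usual sampling picture of a q-quantilizer: draw many actions from a base distribution (standing in for the human policy), keep the top fraction q by some utility estimate, and pick uniformly among those. The names `quantilize`, `toy_human_policy`, and `toy_utility` are hypothetical, and the uniform base distribution is a toy stand-in rather than anything proposed in the comments.

```python
import random


def quantilize(base_sampler, utility, q, n_samples=1000, rng=None):
    """Approximate q-quantilizer: sample from the base distribution,
    keep the top q fraction by utility, and choose uniformly among them.

    All names here are hypothetical; this is an illustrative sketch only.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random(0)
    samples = [base_sampler(rng) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)   # best-rated actions first
    keep = max(1, int(q * n_samples))         # size of the top-q slice
    return rng.choice(samples[:keep])


def toy_human_policy(rng):
    # Toy stand-in for "the human policy": actions are numbers in [0, 1).
    return rng.random()


def toy_utility(action):
    # Toy utility estimate: larger actions are rated better.
    return action


mild = quantilize(toy_human_policy, toy_utility, q=1.0)    # pure imitation
hard = quantilize(toy_human_policy, toy_utility, q=0.01)   # top 1% only
print(mild, hard)
```

At q = 1 this is plain imitation of the base policy. Shrinking q is the "quantilize harder" knob, and it is the regime where, on the conditional view of alignment, confidence in the system would be expected to drop.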
The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument

I'll try not to be too grumpy, here.

Let me make the case why filter-ology isn't so exciting by extending part of your own model: the fact that you take evidence into account.

The basic doomsday argument model asks us to pretend that we are amnesiacs - that we don't know any astronomy or history or engineering, all we know is that we've drawn some random number in a list of indeterminate length. If that numbered list is the ordering of human births, we're halfway through. If we consign the number of humans to the memory hole and say that the list is years in... (read more)

My take on Vanessa Kosoy's take on AGI safety

Great post!

I'm not sure if I disagree with your evaluation of timeline-driven delegative RL, or if I disagree with how you're splitting up the alignment problem. Vanessa's example scheme still requires that humans have certain nice behaviors (like well-calibrated success %s). It sorta sounds like you're putting those nice properties under inner alignment (or maybe just saying that you expect solutions to inner alignment problems to also handle human messiness?).

This causes me to place Vanessa's ideas differently. Because I don't say she's solved the entire... (read more)

3 Vanessa Kosoy 15d: Yes, that's pretty accurate. My methodology is, start by making as many simplifying assumptions as you need as long as some of the core difficulty of the problem is preserved. Once you have a solution which works in that model, start relaxing the assumptions. For example, delegative IRL requires that the user takes the best action with the highest likelihood, delegative RL "only" requires that the user sometimes takes the best action and never enters traps, whereas HTDL's assumptions are weaker still.
3 Steve Byrnes 16d: Hmm, thinking about it more I guess I'd say that "alignment" is not a binary. Like maybe:
* Good alignment: Algorithm is helping the human work faster and better, while not doing anything more dangerous than what the human would have done by themselves without AI assistance
* Even better alignment: Algorithm is trying to maximize the operator's synthesized preferences / trying to implement CEV / whatever.
One thing is, there's a bootstrapping thing here: if AI alignment researchers had AIs with "good alignment", that would help them make AIs with "even better alignment". Another thing is, I dunno, I feel like having a path that definitely gets us to "good alignment" would be such fantastic wonderful progress that I would want to pop some champagne and sing the praises of whoever figured that out. That's not to say that we can all retire and go to the beach, but I think there's a legitimate sense in which this kind of advance would solve a huge and important class of problems. Like, maybe we should say that "radically superhuman safety and benevolence" is a different problem than "alignment"? We still want to solve both of course. The pre-AGI status quo has more than its share of safety problems.
Selection Theorems: A Program For Understanding Agents

This might be related to the notion that if we try to dictate the form of a model ahead of time (i.e. some of the parameters are labeled "world model" in the code, and others are labeled "preferences", and inference is done by optimizing the latter over the former), but then just train it to minimize error, the actual content of the parameters after training doesn't need to respect our preconceptions. What the model really "wants" to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters in a way th... (read more)

Selection Theorems: A Program For Understanding Agents

Right. I think I'm more of the opinion that we'll end up choosing those interfaces via desiderata that apply more directly to the interface (like "we want to be able to compare two models' ratings of the same possible future"), rather than indirect desiderata on "how a practical agent should look" that we keep adding to until an interface pops out.

2 johnswentworth 18d: The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can't just be like "I want an interface which does X"; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful. An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn't multimodal in any important way. If it is importantly multimodal, then any point estimate will be very misleading/confusing/antihelpful. More generally, the takeaway here is "we don't get to arbitrarily choose the type signature"; that choice is dependent on properties of the system.
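
The point-estimate example can be checked numerically. The sketch below is only an illustration under assumed numbers: the two-mode Gaussian mixture and the "within 1.0 of the mean" window are hypothetical choices, not taken from the thread. It computes the mean of a bimodal sample and then asks how much of the sample actually lies near that mean.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical bimodal predictive distribution: two well-separated modes.
samples = [
    rng.gauss(-3.0, 0.5) if rng.random() < 0.5 else rng.gauss(3.0, 0.5)
    for _ in range(10_000)
]

point_estimate = statistics.fmean(samples)   # the "point estimate" interface
spread = statistics.pstdev(samples)          # what an interval would be built from

# Fraction of the sample that actually lies near the point estimate.
near = sum(abs(x - point_estimate) < 1.0 for x in samples) / len(samples)

print(f"mean={point_estimate:.2f} stdev={spread:.2f} mass near mean={near:.4f}")
```

The mean lands between the modes, where almost none of the sample lives, and the standard deviation mostly reflects the gap between modes rather than uncertainty within either one. That is the sense in which the interface, rather than the model, is what misleads.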
Selection Theorems: A Program For Understanding Agents

We can imagine modeling humans in purely psychological ways with no biological inspiration, so I think you're saying that you want to look at the "natural constraints" on representations / processes, and then in a sense generalize or over-charge those constraints to narrow down model choices?

2 johnswentworth 19d: Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to interface with our models in any other way than querying for probabilities over the low-level state of the system.
Selection Theorems: A Program For Understanding Agents

Hm. Suppose sometimes I want to model humans as having propositional beliefs, and other times I want to model humans as having probabilistic beliefs, and still other times I want to model human beliefs as a set of contexts and a transition function. What's stopping me?

I think it depends on the application. What seems like the obvious application is building an AI that models human beliefs, or human preferences. What are some of the desiderata we use when choosing how we want an AI to model us, and how do these compare to typical desiderata used in picking ... (read more)

2 johnswentworth 19d: You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That's a type signature question. Put the two together, and we have a selection-theorem-shaped question. In the example with persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, person B would use whatever types the theorems suggest, rather than a utility function, but if for some reason they really wanted a utility function they would probably compute it as an approximation, compute the domain of validity of the approximation, etc. For person A, turning the relevant types into an action-ranking would likely work much the same way that turning e.g. a utility function into an action-ranking works - i.e. just compute the utility (or whatever metrics turn out to be relevant) and sort. Regardless, if extracting preferences, both of them would probably want to work internally with the type signatures suggested by the theorems.
Introduction to Reducing Goodhart

Thanks, this is useful feedback on how I need to be more clear about what I'm claiming :) In October I'm going to be refining these posts a bit - would you be available to chat sometime?

1 Adam Shimi 19d: Glad I could help! I'm going to comment more on your following post in the next few days/next week, and then I'm interested in having a call. We can also talk then about the way I want to present Goodhart as an impossibility result in a textbook project. ;)
Introduction to Reducing Goodhart

I'm mostly arguing against the naive framing where humans are assumed to have a utility function, and then we can tell how well the AI is doing by comparing the results to the actual utility (the "True Values"). The big question is: how do you formally talk about misalignment without assuming some such unique standard to judge the results by?

1 Adam Shimi 20d: Hmm, but I feel like you're claiming that this framing is wrong while arguing that it is too difficult to apply to be useful. Which is confusing. Still agree that your big question is interesting though.
Cognitive Biases in Large Language Models

Super neat!

I'd also be interested in "control" non-debiasing prompts that are just longer and sound coherent but don't talk about bias. I suspect they might say something interesting about the white bear question.

For Laura the bank teller, does GPT just get more likely to pick "1" from any given list with model size? :P

Pathways: Google's AGI

Speaking for the YouTube Recommender pessimists, the problem I see is first training data, and second the human overseers.

Training just on video names and origins, along with watch statistics across all users, doesn't seem to reward clever planning in the physical world until after the AI is generally intelligent. This is the exact opposite of how you'd design an environment to encourage general real-world planning. (Curriculum learning is a thing for a reason.)

Second, the people training the thing aren't trying to make AGI. They're totally fine with merely... (read more)

AXRP Episode 11 - Attainable Utility and Power with Alex Turner

Good episode as always :)

I'm interested in getting deeper into what Alex calls:

this framing of an AI that we give it a goal, it computes the policy, it starts following the policy, maybe we see it mess up and we correct the agent. And we want this to go well over time, even if we can’t get it right initially. 

Is the notion that if we understand how to build low-impact AI, we can build AIs with potentially bad goals, watch them screw up, and we can then fix our mistakes and try again? Does the notion of "low-impact" break down, though, if humans are ev... (read more)

2 Alex Turner 5d: I want to clarify something. I think the notion doesn't break down. The low-impact AI hasn't changed human attainable utilities by the end of the experiments. If we eventually build a high-impact AI, that seems "on us." The low-impact AI itself hasn't done something bad to us. I therefore think the concept I spelled out still works in this situation. As I mentioned in the other comment, I don't feel optimistic about actually designing these AIs via explicit low-impact objectives, though.
2 Alex Turner 13d: Depends. I think this is roughly true for small-scale AI deployments, where the AI makes mistakes which "aren't big deals" for most goals - instead of irreversibly smashing furniture, maybe it just navigates to a distant part of the warehouse. I think this paradigm is less clearly feasible or desirable for high-impact TAI deployment, and I'm currently not optimistic about that use case for impact measures.
Vanessa Kosoy's Shortform

Ah. I indeed misunderstood, thanks :) I'd read "short-term quantilization" as quantilizing over short-term policies evaluated according to their expected utility. My story doesn't make sense if the AI is only trying to push up the reported value estimates (though that puts a lot of weight on these estimates).

Vanessa Kosoy's Shortform

Agree with the first section, though I would like to register my sentiment that although "good at selecting but missing logical facts" is a better model, it's still not one I'd want an AI to use when inferring my values.

I'm not sure what you're saying in the "turning off the stars example". If the probability for the user to autonomously decide to turn off the stars is much lower than the quantilization fraction, then the probability that quantilization will decide to turn off the stars is low. And, the quantilization fraction is automatically selected lik... (read more)

1 Vanessa Kosoy 1mo: I think you misunderstood how the iterated quantilization works. It does not work by the AI setting a long-term goal and then charting a path towards that goal s.t. it doesn't deviate too much from the baseline over every short interval. Instead, every short-term quantilization is optimizing for the user's evaluation at the end of this short-term interval.
Vanessa Kosoy's Shortform

Very interesting - I'm sad I saw this 6 months late.

After thinking a bit, I'm still not sure if I want this desideratum. It seems to require a sort of monotonicity, where we can get superhuman performance just by going through states that humans recognize as good, and not by going through states that humans would think are weird or scary or unevaluable.

One case where this might come up is in competitive games. Chess AI beats humans in part because it makes moves that many humans evaluate as bad, but are actually good. But maybe this example actually suppor... (read more)

1 Vanessa Kosoy 1mo: When I'm deciding whether to run an AI, I should be maximizing the expectation of my utility function w.r.t. my belief state. This is just what it means to act rationally. You can then ask, how is this compatible with trusting another agent smarter than myself? One potentially useful model is: I'm good at evaluating and bad at searching (after all, P≠NP). I can therefore delegate searching to another agent. But, as you point out, this doesn't account for situations in which I seem to be bad at evaluating. Moreover, if the AI prior takes an intentional stance towards the user (in order to help learning their preferences), then the user must be regarded as good at searching. A better model is: I'm good at both evaluating and searching, but the AI can access actions and observations that I cannot. For example, having additional information can allow it to evaluate better. An important special case is: the AI is connected to an external computer (Turing RL) which we can think of as an "oracle". This allows the AI to have additional information which is purely "logical". We need infra-Bayesianism to formalize this: the user has Knightian uncertainty over the oracle's outputs entangled with other beliefs about the universe. For instance, in the chess example, if I know that a move was produced by exhaustive game-tree search then I know it's a good move, even without having the skill to understand why the move is good in any more detail. Now let's examine short-term quantilization for chess. On each cycle, the AI finds a short-term strategy leading to a position that the user evaluates as good, but that the user would require luck to manage on their own. This is repeated again and again throughout the game, leading to overall play substantially superior to the user's. On the other hand, this play is not as good as the AI would achieve if it just optimized
The theory-practice gap

I guess I fall into the stereotypical pessimist camp? But maybe it depends on what the actual label of the y-axis on this graph is.

Does an alignment scheme that will definitely not work, but is "close" to a working plan in units of number of breakthroughs needed count as high or low on the y-axis? Because I think we occupy a situation where we have some good ideas, but all of them are broken in several ways, and we would obviously be toast if computers got 5 orders of magnitude faster overnight and we had to implement our best guesses.

On the other hand, I'... (read more)

Goodhart Ethology

np, I'm just glad someone is reading/commenting :)

Goodhart Ethology

Yeah, this is right. The variable uncertainty comes in for free when doing curve fitting - close to the datapoints your models tend to agree, far away they can shoot off in different directions. So if you have a probability distribution over different models, applying the correction for the optimizer's curse has the very sensible effect of telling you to stick close to the training data.

1 Steve Byrnes 1mo: Oh, yup, makes sense, thanks
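
The curve-fitting intuition in this thread can be sketched in a few lines of Python. Everything in the block is an assumption made for illustration: the four data points, the "one line per pair of points" ensemble as a crude distribution over models, and the mean-minus-k-standard-deviations score, which is a simple stand-in for an optimizer's-curse correction rather than the exact Bayesian adjustment.

```python
import statistics
from itertools import combinations

# Hypothetical noisy training data, all with x in [0, 3].
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.5, 1.0, 2.5]


def line_through(p, q):
    """Return (intercept, slope) of the line through two data points."""
    (x0, y0), (x1, y1) = p, q
    slope = (y1 - y0) / (x1 - x0)
    return y0 - slope * x0, slope


# A crude "distribution over models": one line per pair of data points.
models = [line_through(p, q) for p, q in combinations(zip(xs, ys), 2)]


def predictions(x):
    return [a + b * x for a, b in models]


def naive_score(x):
    # Optimise the average prediction and ignore model disagreement.
    return statistics.fmean(predictions(x))


def cautious_score(x, k=2.0):
    # Penalise disagreement between models (assumed stand-in for the
    # optimizer's-curse correction, not the exact Bayesian adjustment).
    preds = predictions(x)
    return statistics.fmean(preds) - k * statistics.pstdev(preds)


candidates = list(range(0, 11))            # x = 0, 1, ..., 10
naive_x = max(candidates, key=naive_score)
cautious_x = max(candidates, key=cautious_score)
print(naive_x, cautious_x)
```

The models disagree least near the middle of the data and fan out with distance, so the naive optimizer runs to the far edge of the candidate range while the penalised one stays inside the span of the training points.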
Measurement, Optimization, and Take-off Speed

I'm confused about your picture of "outer optimization power." What sort of decisions would be informed by knowing how sensitive the learned model is to perturbations of hyperparameters?

Any thoughts on just tracking the total amount of gradient-descending done, or total amount of changes made, to measure optimization?
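
To make the two candidate measures concrete, here is a minimal sketch on a toy quadratic loss. The loss, the learning rate, and the variable names are all hypothetical. It counts gradient steps ("amount of gradient-descending done") and sums the norms of the parameter updates ("amount of changes made"), then compares the latter with the net distance moved.

```python
import math


def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * (w0**2 + 10 * w1**2).
    return [w[0], 10.0 * w[1]]


def loss(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


w = [1.0, 1.0]
start = list(w)
lr = 0.15
steps_taken = 0          # "total amount of gradient-descending done"
path_length = 0.0        # "total amount of changes made" (sum of step norms)

for _ in range(50):
    step = [-lr * g for g in grad(w)]
    w = [wi + si for wi, si in zip(w, step)]
    steps_taken += 1
    path_length += math.hypot(*step)

net_change = math.dist(start, w)   # straight-line distance actually moved
print(steps_taken, round(path_length, 3), round(net_change, 3), loss(w))
```

The learning rate is chosen so that the steep coordinate overshoots and oscillates, which makes the cumulative path length noticeably larger than the net change. Any such tracking scheme would need to decide which of these quantities it means by "optimization".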

Grokking the Intentional Stance

Nice summary :) It's relevant for the post that I'm about to publish that you can have more than one intentional-stance view of the same human. The inferred agent-shaped model depends not only on the subject and the observer, but also on the environment, and on what the observer hopes to get by modeling.

Charlie Steiner's Shortform

(bioRxiv)

Cool paper on trying to estimate how many parameters neurons have (h/t Samuel at EA Hotel). I don't feel like they did a good job distinguishing how hard it was for them to fit nonlinearities that would nonetheless be the same across different neurons, versus the number of parameters that were different from neuron to neuron. But just based on differences in physical arrangement of axons and dendrites, there's a lot of opportuni... (read more)

Research agenda update

I only really know about the first bit, so have a comment about that :)

Predictably, when presented with the 1st-person problem I immediately think of hierarchical models. It's easy to say "just imagine you were in their place." What I'd think could do this thing is accessing/constructing a simplified model of the world (with primitives that have interpretations as broad as "me" and "over there") that is strongly associated with the verbal thought (EDIT: or alternately is a high-level representation that cashes out to the verbal thought via a pathway that e... (read more)

Answering questions honestly instead of predicting human answers: lots of problems and some solutions

I'm having some formatting problems (reading in Firefox) with scroll bars under full-width LaTeX covering the following line of text.

(So now I'm finishing reading it on greaterwrong.)

BASALT: A Benchmark for Learning from Human Feedback

Nice! If I had university CS affiliations I would send them this with unsubtle comments that it would be a cool project to get students to try :P

In fact, now that I think about it, I do have one contact through the UIUC datathon. Or would you rather not have this sort of marketing?

3 Rohin Shah 3mo: I would be excited to see this competition promoted widely! (Obviously I wouldn't want to do anything that reflected really poorly on the marketers, like blackmail, but this seems to clearly not be in that category.)
Anthropics in infinite universes

A similarly odd question is how this plays with Solomonoff induction. Is a universe with infinite stuff in it of zero prior probability, because it requires infinite bits to specify where the stuff is? Quantum mechanics would say no: we can just specify a simple quantum state of the early universe, and then we're within one branch of that wavefunction. And the (quantum) information required to locate us within that wavefunction is only related to the information we actually see, i.e. finite.

A world in which the alignment problem seems lower-stakes

Weird coincidence, but I just read Superintelligence for the first time, and I was struck by the lack of mention of Steve Omohundro (though he does show up in endnote 8). My citation for instrumental convergence would be Omohundro 2008.

1 Ofer Givoli 3mo: I think that most of the citations in Superintelligence are in endnotes. In the endnote that follows the first sentence after the formulation of instrumental convergence thesis, there's an entire paragraph about Stephen Omohundro's work on the topic (including citations of Omohundro's "two pioneering papers on this topic").
Anthropic Effects in Estimating Evolution Difficulty

I'm going to have some criticism here, but don't take it too hard :) Most of this is directed at our state of understanding in 2012.

I think a way to do better is not to mention SSA or SIA at all, and just talk about conditioning on information. Don't even have to say "anthropic conditioning" or anything special - we're just conditioning on the fact that sampling from some distribution (e.g. "worlds with intelligent life who figure out evolution") gave us exactly our planet. (My own arguments for this on LW date from c. 2015, but this was a common position ... (read more)

Thoughts on safety in predictive learning

I feel scooped by this post! :)  I was thinking along different lines - using induction (postdictive learning) to get around Goodhart's law specifically by using the predictions outside of their nominal use case. But now I need to go back and think more about self-fulfilling prophecies and other sorts of feedback.

Maybe I'll try to get you to give me some feedback later this week.

1 Steve Byrnes 4mo: Sounds interesting!
Experimentally evaluating whether honesty generalizes

How are we imagining prompting the multimodal Go+English AI with questions like "is this group alive or dead?" And how are we imagining training it so that it forms intermodal connections rather than just letting them atrophy?

My past thoughts (from What's the dream for giving natural language instructions to AI) were to do it like an autoencoder + translator - you could simultaneously use a latent space for English and for Go, and you train both on autoencoding (or more general unimodal) tasks and on translation (or more general multimodal) tasks. But I th... (read more)

2 Paul Christiano 4mo: I do think the LM-only version seems easier and probably better to start with. The hope is that you can fiddle with these things to get it to answer some questions and then see whether it generalizes. My first guess for an architecture would be producing a 19 x 19 grid of embeddings from the CNN, and then letting a transformer attend over them (along with the prior text). That is, you train a CNN that is supposed to produce both (moves, embeddings) and a transformer that talks and sees the embeddings.
Frequent arguments about alignment

For #2, not sure if this is a skeptic or an advocate point: why have a separate team at all? When designing a bridge you don't have one team of engineers making the bridge, and a separate team of engineers making sure the bridge doesn't fall down. Within OpenAI, isn't everyone committed to good things happening, and not just strictly picking the lowest-hanging fruit? If alignment-informed research is better long-term, why isn't the whole company the "safety team" out of simple desire to do their job?

We could make this more obviously skeptical by rephrasin... (read more)

2 John Schulman 4mo: In my experience, you need separate teams doing safety research because specialization is useful -- it's easiest to make progress when both individuals and teams specialize a bit and develop taste and mastery of a narrow range of topics.
The Nature of Counterfactuals

In addition to self-consistency, we can also imagine agents that interact with an environment and learn about how to model the environment (thereby having an effective standard for counterfactuals - or it could be explicit if we hand-code the agents to choose actions by explicitly considering counterfactuals) by taking actions and evaluating how good their predictions are.

Knowledge is not just precipitation of action

Do you want to chat sometime about this?

I think it's pretty clear why we think of the map-making sailboat as "having knowledge" even if it sinks, and it's because our own model of the world expects maps to be legible to agents in the environment, and so we lump them into "knowledge" even before actually seeing someone use any particular map. You could try to predict this legibility part of how we think of knowledge from the atomic positions of the item itself, but you're going to get weird edge cases unless you actually make an intentional-stance-level mode... (read more)

Big picture of phasic dopamine

How does the section of the amygdala that a particular dopamine neuron connects to even get trained to do the right thing in the first place? It seems like there should be enough chance in connections that there's really only this one neuron linking a brainstem's particular output to this specific spot in the amygdala - it doesn't have a whole bundle of different signals available to send to this exact spot.

SL in the brain seems tricky because not only does the brainstem have to reinforce behaviors in appropriate contexts, it might have to train certain ou... (read more)

2 Steve Byrnes 4mo: I'm proposing that (1) the hypothalamus has an input slot for "flinch now", (2) VTA has an output signal for "should have flinched", (3) there is a bundle of partially-redundant side-by-side loops (see the "probability distribution" comment) that connect specifically to both (1) and (2), by a genetically-hardcoded mechanism. I take your comment to be saying: Wouldn't it be hard for the brain to orchestrate such a specific pair of connections across a considerable distance? Well, I'm very much not an expert on how the brain wires itself up. But I think there's gotta be some way that it can do things like that. I feel like those kinds of feats of wiring are absolutely required for all kinds of reasons. Like, I think motor cortex connects directly to spinal hand-control nerves, but not foot-control nerves. How do the output neurons aim their paths so accurately, such that they don't miss and connect to the foot nerves by mistake? Um, I don't know, but it's clearly possible. "Molecular signaling" or something, I guess? Hmm, one reasonable (to me) possibility along these lines would be something like: "VTA has 20 dopamine output signals, and they're guided to wind up spread out across the amygdala, but not with surgical precision. Meanwhile the corresponding amygdala loops terminate in an "input zone" of the lateral hypothalamus, but not to any particular spot, instead they float around unsure of exactly what hypothalamus "entry point" to connect to. And there are 20 of these intended "entry points" (collections of neurons for flinching, scowling, etc.). OK, then during embryonic development, the entry-point neurons are firing randomly, and that signal goes around the loop - within the hypothalamus and to VTA, then up to the amygdala, then back down to that floating neuron. Then Hebbian learning - i.e. matching the random code - helps the right loop neuron find its way to the matching hypothalamus entry point."
I'm not sure if that's exactly what you're proposing, but that
Big picture of phasic dopamine

One thing that strikes me as odd about this model is that it doesn't have the blessing of dimensionality - each plan is one loop, and evaluating feedback to a winning plan just involves feedback to one loop. When it's general reward we can simplify this with just rewarding recent winning plans, but in some places it seems like you do imply highly specific feedback, for which you need N feedback channels to give feedback on ~N possible plans. The "blessing of dimensionality" kicks in when you can use more diverse combinations of a smaller number of feedback... (read more)

1 Steve Byrnes 4mo: Right, so I'm saying that the "supervised learning loops" get highly specific feedback, e.g. "if you get whacked in the head, then you should have flinched a second or two ago", "if a salty taste is in your mouth, then you should have salivated a second or two ago", "if you just started being scared, then you should have been scared a second or two ago", etc. etc. That's the part that I'm saying trains the amygdala and agranular prefrontal cortex. Then I'm suggesting that the Success-In-Life thing is a 1D reward signal to guide search in a high-dimensional space of possible thoughts to think, just like RL. In this case, it's not "each plan is one loop", because there's a combinatorial explosion of possible thoughts you can think, and there are not enough loops for that. (It also wouldn't work because for pretty much every thought you think, you've never thought that exact thought before; for example, you've never put on this particular jacket while humming this particular song and musing about this particular upcoming party...) Instead I think compositionality is involved, such that one plan / thought can involve many simultaneous loops.
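As a rough illustration of the "highly specific feedback" half of the reply above, here is a small Python sketch in which each loop is trained only by its own hindsight signal. Everything concrete in it is a hypothetical chosen for the example: the cue names, the two channels, the linear predictor, the delta-rule update, and the learning rate. It also collapses the "a second or two ago" delay into a single time step.

```python
import random

def train_loops(n_steps=2000, lr=0.1, seed=0):
    """Toy sketch: one supervised loop per reaction, each with its own error signal."""
    rng = random.Random(seed)
    channels = ["flinch", "salivate"]
    cues = ["looming_object", "food_smell"]

    # Hypothetical ground truth: which cue is followed by which hindsight signal
    # ("you should have flinched", "you should have salivated").
    predicts = {"flinch": "looming_object", "salivate": "food_smell"}

    weights = {ch: {cue: 0.0 for cue in cues} for ch in channels}

    for _ in range(n_steps):
        present = {cue: 1.0 if rng.random() < 0.5 else 0.0 for cue in cues}
        for ch in channels:
            # The loop's guess about whether its reaction will turn out to be needed.
            prediction = sum(weights[ch][cue] * present[cue] for cue in cues)
            # Channel-specific hindsight signal; it never touches the other loop.
            should_have = present[predicts[ch]]
            error = should_have - prediction
            for cue in cues:
                weights[ch][cue] += lr * error * present[cue]  # delta rule

    return weights

weights = train_loops()
print(weights["flinch"])
print(weights["salivate"])
```

Each loop ends up weighting only the cue that predicts its own "should have" signal. The scalar Success-In-Life reward described in the same reply is a different kind of signal and is not modeled here.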
Thoughts on the Alignment Implications of Scaling Language Models

Great post! I very much hope we can do some clever things with value learning that let us get around needing AbD to do the things that currently seem to need it.

The fundamental example of this is probably optimizability: is your language model so safe that you can query it as part of an optimization process (e.g. making decisions about what actions are good) without just ending up in the equivalent of DeepDream's pictures of Maximum Dog?

Problems facing a correspondence theory of knowledge

I think grappling with this problem is important because it leads you directly to understanding that what you are talking about is part of your agent-like model of systems, and how this model should be applied depends both on the broader context and your own perspective.

Saving Time

What is then stopping us from swapping the two copies of the coarser node?

Isn't it precisely that they're playing different roles in an abstracted model of reality? Though alternatively, you can just throw more logical nodes at the problem and create a common logical cause for both.

Also, would you say what you have in mind is built out of augmenting a collection of causal graphs with logical nodes, or do you have something incompatible in mind?

Agency in Conway’s Game of Life

The truly arbitrary version seems provably impossible. For example, what if you're trying to make a smiley face, but some other part of the world contains an agent just like you except they're trying to make a frowny face - you obviously both can't succeed. Instead you need some special environment with low entropy, just like humans do in real life.

1 Alex Flint 5mo: Yeah absolutely - see third bullet in the appendix. One way to resolve this would be to say that to succeed at answering the control question you have to succeed in at least 1% of randomly chosen environments.
Low-stakes alignment

I feel like we can approximately split the full alignment problem into two parts: low stakes and handling catastrophes.

Insert joke about how I can split physics research into two parts: low stakes and handling catastrophes.

I'm a little curious about whether assuming fixed low stakes accidentally favors training regimes that have the real-world drawback of raising the stakes.

But overall I think this is a really interesting way of reframing the "what do we do if we succeed?" question. There is one way it might be misleading, which is I think that we're le... (read more)

Announcing the Alignment Research Center


I'll procrastinate from thesis-writing to fill out the form :)

You're gonna get back to thesis writing quickly, it's a very short form.

Naturalism and AI alignment

Wow, I'm really sorry for my bad reading comprehension.

Anyhow, I'm skeptical that scientist AI part 2 would end up doing the right thing (regardless of our ability to interpret it). I'm curious if you think this could be settled without building a superintelligent AI of uncertain goals, or if you'd really want to see the "full scale" test.

1 Michele Campolo 6mo: If there is a superintelligent AI that ends up being aligned as I've written, probably there is also a less intelligent agent that does the same thing. Something comparable to human-level might be enough. From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection. One could argue that these philosophers are fooling themselves, that no really intelligent agent will end up with such weird beliefs. So far, I haven't seen convincing arguments in favour of this; it goes back to the metaethical discussion. I quote a sentence I have written in the post:
Naturalism and AI alignment

This runs headfirst into the problem of radical translation (which in AI is called "AI interpretability." Only slightly joking.)

Inside our Scientist AI it's not going to say "murder is bad," it's going to say "feature 1000 1101 1111 1101 is connected to feature 0000 1110 1110 1101." At first you might think this isn't so bad; after all, AI interpretability is a flourishing field, so let's just look at some examples and visualizations and try to figure out what these things are. But there's no guarantee that these features correspond neatly to their closest... (read more)

1 Michele Campolo 6mo: I am aware of interpretability issues. This is why, for AI alignment, I am more interested in the agent described at the beginning of Part II than Scientist AI. Thanks for the link to the sequence on concepts, I found it interesting!
AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

I initially thought you were going to debate Beth Barnes.

Also, thanks for the episode :) It was definitely interesting, although I still don't have a good handle on why some people are optimistic that there aren't classes of arguments humans will "fall for" irrespective of their truth value.

1 DanielFilan 6mo: Yeah the initial title was not good
Testing The Natural Abstraction Hypothesis: Project Intro

One generalization I am also interested in is to learn not merely abstract objects within a big model, but entire self-contained abstract levels of description, together with actions and state transitions that move you between abstract states. E.g. not merely detecting that "the grocery store" is a sealed box ripe for abstraction, but that "go to the grocery store" is a valid action within a simplified world-model with nice properties.

This might be significantly more challenging to say something interesting about, because it depends not just on the world but on how the agent interacts with the world.
