Necromantic comment, sorry :P
I might be misinterpreting, but what I think you're saying is that if the humans make a mistake in using a causal model of the world and tell the AI to optimize for something bad-in-retrospect, this is "mistaken causal structure lead[ing] to regressional or extremal Goodhart", and thus not really causal Goodhart per se (by the categories you're intending). But I'm still a little fuzzy on what you mean to be actual factual causal Goodhart.
Is the idea that humans tell the AI to optimize for something that is not bad-in-retrospect... (read more)
It seems like evaluating human AU depends on the model. There's a "black box" sense where you can replace the human's policy with literally anything in calculating AU for different objectives, and there's a "transparent box" sense in which you have to choose from a distribution of predicted human behaviors.
The former is closer to what I think you mean by "hasn't changed the humans' AU," but I think it's the latter that an AI cares about when evaluating the impact of its own actions.
Yup, I more or less agree with all that. The name thing was just a joke about giving things we like better priority in namespace.
I think quantilization is safe when it's a slightly "lucky" human-imitation (also if it's a slightly "lucky" version of some simpler base distribution, but then it won't be as smart). But push too hard, which might not be very hard at all if you're iterating quantilization steps rather than quantilizing over a long-term policy, and instead you get an unaligned intelligence that happens to interact with the world by picking human-... (read more)
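To pin down what I mean by "pushing too hard" via the quantilization fraction q, here's a minimal sketch of a single quantilization step (toy code, with a made-up base distribution and proxy utility):

```python
import random

def quantilize(base_sampler, utility, q=0.1, n_samples=1000):
    """One quantilization step: sample actions from the base distribution,
    then return a uniformly random action from the top q fraction by utility."""
    actions = [base_sampler() for _ in range(n_samples)]
    actions.sort(key=utility, reverse=True)
    top = actions[:max(1, int(q * n_samples))]
    return random.choice(top)

# Toy example: the base sampler stands in for a human-imitation distribution,
# and the utility is some proxy objective we're optimizing.
action = quantilize(base_sampler=lambda: random.gauss(0, 1),
                    utility=lambda a: -abs(a - 2.0),
                    q=0.05)
```

Shrinking q (or iterating steps like this many times) concentrates more and more optimization pressure on the tails of the base distribution, which is the failure mode I'm gesturing at.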
Biologically, I think the evolution of body-plan modularity might be backwards from your general argument for it - the biological goal seems to be to make big (but not automatically bad) changes from small mutations (e.g. entire extra body segments), not to "hide" DOF inside modules to allow for smoother changes per parameter.
In fact, this strikes me as resembling abstraction, so this might be right in your wheelhouse :P Biological modularity seems to specifically select for those modules with simple interfaces that can be cut-and-pasted with the maximal c... (read more)
Like, maybe we should say that "radically superhuman safety and benevolence" is a different problem than "alignment"?
Ah, you mean that "alignment" is a different problem than "subhuman and human-imitating training safety"? :P
So is there a continuum between category 1 and category 2? The transitional fossils could be non-human-imitating AIs that are trying to be a little bit general or have goals that refer to a model of the human a little bit, but the designers still understand the search space better than the AIs.
I'll try not to be too grumpy, here.
Let me make the case why filter-ology isn't so exciting by extending part of your own model: the fact that you take evidence into account.
The basic doomsday argument model asks us to pretend that we are amnesiacs - that we don't know any astronomy or history or engineering; all we know is that we've drawn some random number in a list of indeterminate length. If that numbered list is the ordering of human births, we're halfway through. If we consign the number of humans to the memory hole and say that the list is years in... (read more)
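(For reference, the "halfway through" figure is just the median of the textbook calculation; a sketch, assuming a vague prior $p(N) \propto 1/N$ over the total number of humans $N$ and a uniform likelihood for our birth rank $r$:)

$$p(N \mid r) \propto \frac{1}{N^2} \ \text{for } N \ge r, \qquad P(N > x \mid r) = \frac{r}{x}, \qquad \text{median: } N = 2r,$$

so the median estimate of the total list length is twice your rank, i.e. you're halfway through.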
I'm not sure if I disagree with your evaluation of timeline-driven delegative RL, or if I disagree with how you're splitting up the alignment problem. Vanessa's example scheme still requires that humans have certain nice behaviors (like well-calibrated success %s). It sorta sounds like you're putting those nice properties under inner alignment (or maybe just saying that you expect solutions to inner alignment problems to also handle human messiness?).
This causes me to place Vanessa's ideas differently. Because I don't say she's solved the entire... (read more)
This might be related to the notion that if we try to dictate the form of a model ahead of time (i.e. some of the parameters are labeled "world model" in the code, and others are labeled "preferences", and inference is done by optimizing the latter over the former), but then just train it to minimize error, the actual content of the parameters after training doesn't need to respect our preconceptions. What the model really "wants" to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters in a way th... (read more)
Right. I think I'm more of the opinion that we'll end up choosing those interfaces via desiderata that apply more directly to the interface (like "we want to be able to compare two models' ratings of the same possible future"), rather than indirect desiderata on "how a practical agent should look" that we keep adding to until an interface pops out.
We can imagine modeling humans in purely psychological ways with no biological inspiration, so I think you're saying that you want to look at the "natural constraints" on representations / processes, and then in a sense generalize or over-charge those constraints to narrow down model choices?
Hm. Suppose sometimes I want to model humans as having propositional beliefs, and other times I want to model humans as having probabilistic beliefs, and still other times I want to model human beliefs as a set of contexts and a transition function. What's stopping me?
I think it depends on the application. What seems like the obvious application is building an AI that models human beliefs, or human preferences. What are some of the desiderata we use when choosing how we want an AI to model us, and how do these compare to typical desiderata used in picking ... (read more)
Thanks, this is useful feedback on how I need to be clearer about what I'm claiming :) In October I'm going to be refining these posts a bit - would you be available to chat sometime?
I'm mostly arguing against the naive framing where humans are assumed to have a utility function, and then we can tell how well the AI is doing by comparing the results to the actual utility (the "True Values"). The big question is: how do you formally talk about misalignment without assuming some such unique standard to judge the results by?
I'd also be interested in "control" non-debiasing prompts that are just longer and sound coherent but don't talk about bias. I suspect they might say something interesting about the white bear question.
For Laura the bank teller, does GPT just get more likely to pick "1" from any given list with model size? :P
Speaking for the Youtube Recommender pessimists, the problems I see are, first, the training data, and second, the human overseers.
Training just on video names and origins, along with watch statistics across all users, doesn't seem to reward clever planning in the physical world until after the AI is generally intelligent. This is the exact opposite of how you'd design an environment to encourage general real-world planning. (Curriculum learning is a thing for a reason.)
Second, the people training the thing aren't trying to make AGI. They're totally fine with merely... (read more)
Good episode as always :)
I'm interested in getting deeper into what Alex calls:
this framing of an AI that we give it a goal, it computes the policy, it starts following the policy, maybe we see it mess up and we correct the agent. And we want this to go well over time, even if we can’t get it right initially.
Is the notion that if we understand how to build low-impact AI, we can build AIs with potentially bad goals, watch them screw up, and we can then fix our mistakes and try again? Does the notion of "low-impact" break down, though, if humans are ev... (read more)
Ah. I indeed misunderstood, thanks :) I'd read "short-term quantilization" as quantilizing over short-term policies evaluated according to their expected utility. My story doesn't make sense if the AI is only trying to push up the reported value estimates (though that puts a lot of weight on these estimates).
Agree with the first section, though I would like to register my sentiment that although "good at selecting but missing logical facts" is a better model, it's still not one I'd want an AI to use when inferring my values.
I'm not sure what you're saying in the "turning off the stars example". If the probability for the user to autonomously decide to turn off the stars is much lower than the quantilization fraction, then the probability that quantilization will decide to turn off the stars is low. And, the quantilization fraction is automatically selected lik
Very interesting - I'm sad I saw this 6 months late.
After thinking a bit, I'm still not sure if I want this desideratum. It seems to require a sort of monotonicity, where we can get superhuman performance just by going through states that humans recognize as good, and not by going through states that humans would think are weird or scary or unevaluable.
One case where this might come up is in competitive games. Chess AI beats humans in part because it makes moves that many humans evaluate as bad, but are actually good. But maybe this example actually suppor... (read more)
I guess I fall into the stereotypical pessimist camp? But maybe it depends on what the actual label of the y-axis on this graph is.
Does an alignment scheme that will definitely not work, but is "close" to a working plan in units of number of breakthroughs needed count as high or low on the y-axis? Because I think we occupy a situation where we have some good ideas, but all of them are broken in several ways, and we would obviously be toast if computers got 5 orders of magnitude faster overnight and we had to implement our best guesses.
On the other hand, I'... (read more)
np, I'm just glad someone is reading/commenting :)
Yeah, this is right. The variable uncertainty comes in for free when doing curve fitting - close to the datapoints your models tend to agree, far away they can shoot off in different directions. So if you have a probability distribution over different models, applying the correction for the optimizer's curse has the very sensible effect of telling you to stick close to the training data.
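As a toy illustration (a sketch, with made-up data and a small polynomial ensemble standing in for the "probability distribution over different models"): the models agree near the training points, disagree far from them, and penalizing a candidate point by the ensemble spread is exactly the "stick close to the training data" correction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data on [0, 1]
x_train = rng.uniform(0, 1, 20)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=20)

# A few different models fit to the same data
models = [np.polyfit(x_train, y_train, deg) for deg in (1, 2, 3, 4)]

def ensemble_stats(x):
    preds = np.array([np.polyval(m, x) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

# Near the data the models agree; far away they shoot off in different directions.
for x in (0.5, 2.0, 5.0):
    mean, std = ensemble_stats(np.array([x]))
    print(f"x={x}: mean={mean[0]:.2f}, spread={std[0]:.2f}")

# Crude optimizer's-curse correction: optimize the mean minus a multiple of the
# spread, which pushes the chosen point back toward the region covered by data.
xs = np.linspace(0, 5, 501)
mean, std = ensemble_stats(xs)
best_x = xs[np.argmax(mean - 2.0 * std)]
```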
I'm confused about your picture of "outer optimization power." What sort of decisions would be informed by knowing how sensitive the learned model is to perturbations of hyperparameters?
Any thoughts on just tracking the total amount of gradient-descending done, or total amount of changes made, to measure optimization?
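For concreteness, a minimal sketch (PyTorch-style, with a made-up helper name) of what "tracking the total amount of gradient-descending done" could look like: accumulate the size of the parameter updates actually applied during training, as one crude scalar per run.

```python
import torch

def train_with_update_tracking(model, loss_fn, data, opt, steps=1000):
    """Train normally, but also accumulate the total L2 norm of parameter
    changes as a crude measure of how much optimization was done."""
    total_update = 0.0
    for step, (x, y) in zip(range(steps), data):
        before = [p.detach().clone() for p in model.parameters()]
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        total_update += sum(
            (p.detach() - b).norm().item()
            for p, b in zip(model.parameters(), before)
        )
    return total_update
```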
Nice summary :) It's relevant for the post that I'm about to publish that you can have more than one intentional-stance view of the same human. The inferred agent-shaped model depends not only on the subject and the observer, but also on the environment, and on what the observer hopes to get by modeling.
(biorxiv: https://www.biorxiv.org/content/10.1101/613141v2)
Cool paper on trying to estimate how many parameters neurons have (h/t Samuel at EA Hotel). I don't feel like they did a good job distinguishing between how hard it was for them to fit nonlinearities that were nonetheless the same across different neurons, and how many parameters actually differed from neuron to neuron. But just based on differences in physical arrangement of axons and dendrites, there's a lot of opportuni... (read more)
I only really know about the first bit, so have a comment about that :)
Predictably, when presented with the 1st-person problem I immediately think of hierarchical models. It's easy to say "just imagine you were in their place." What I think could do this is accessing/constructing a simplified model of the world (with primitives that have interpretations as broad as "me" and "over there") that is strongly associated with the verbal thought (EDIT: or alternately is a high-level representation that cashes out to the verbal thought via a pathway that e... (read more)
I'm having some formatting problems (reading on lesswrong.com in Firefox) with scroll bars under full-width LaTeX covering the following line of text.
(So now I'm finishing reading it on greaterwrong.)
Nice! If I had university CS affiliations I would send them this with unsubtle comments that it would be a cool project to get students to try :P
In fact, now that I think about it, I do have one contact through the UIUC datathon. Or would you rather not have this sort of marketing?
A similarly odd question is how this plays with Solomonoff induction. Is a universe with infinite stuff in it of zero prior probability, because it requires infinite bits to specify where the stuff is? Quantum mechanics would say no: we can just specify a simple quantum state of the early universe, and then we're within one branch of that wavefunction. And the (quantum) information required to locate us within that wavefunction is only related to the information we actually see, i.e. finite.
Weird coincidence, but I just read Superintelligence for the first time, and I was struck by the lack of mention of Steve Omohundro (though he does show up in endnote 8). My citation for instrumental convergence would be Omohundro 2008.
I'm going to have some criticism here, but don't take it too hard :) Most of this is directed at our state of understanding in 2012.
I think a way to do better is not to mention SSA or SIA at all, and just talk about conditioning on information. Don't even have to say "anthropic conditioning" or anything special - we're just conditioning on the fact that sampling from some distribution (e.g. "worlds with intelligent life who figure out evolution") gave us exactly our planet. (My own arguments for this on LW date from c. 2015, but this was a common position ... (read more)
I feel scooped by this post! :) I was thinking along different lines - using induction (postdictive learning) to get around Goodhart's law specifically by using the predictions outside of their nominal use case. But now I need to go back and think more about self-fulfilling prophecies and other sorts of feedback.
Maybe I'll try to get you to give me some feedback later this week.
How are we imagining prompting the multimodal Go+English AI with questions like "is this group alive or dead?" And how are we imagining training it so that it forms intermodal connections rather than just letting them atrophy?
My past thoughts (from "What's the dream for giving natural language instructions to AI") were to do it like an autoencoder + translator - you could simultaneously use a latent space for English and for Go, and you train both on autoencoding (or more general unimodal) tasks and on translation (or more general multimodal) tasks. But I th... (read more)
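The rough shape of what I had in mind, as a sketch (PyTorch-style, with stand-in linear encoders/decoders and hypothetical dimensions): a shared latent space trained with both unimodal reconstruction losses and a cross-modal translation loss on paired data, so the intermodal connections actually get gradient instead of atrophying.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLatentModel(nn.Module):
    """Two encoders/decoders into one shared latent space (toy sizes)."""
    def __init__(self, go_dim=361, text_dim=512, latent_dim=128):
        super().__init__()
        self.enc_go = nn.Linear(go_dim, latent_dim)
        self.dec_go = nn.Linear(latent_dim, go_dim)
        self.enc_text = nn.Linear(text_dim, latent_dim)
        self.dec_text = nn.Linear(latent_dim, text_dim)

    def loss(self, go, text):
        z_go, z_text = self.enc_go(go), self.enc_text(text)
        # Unimodal autoencoding: each modality reconstructs itself.
        recon = (F.mse_loss(self.dec_go(z_go), go)
                 + F.mse_loss(self.dec_text(z_text), text))
        # Cross-modal translation on paired data: the Go latent should decode
        # to the paired English description, and vice versa.
        translate = (F.mse_loss(self.dec_text(z_go), text)
                     + F.mse_loss(self.dec_go(z_text), go))
        return recon + translate
```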
For #2, not sure if this is a skeptic or an advocate point: why have a separate team at all? When designing a bridge you don't have one team of engineers making the bridge, and a separate team of engineers making sure the bridge doesn't fall down. Within OpenAI, isn't everyone committed to good things happening, and not just strictly picking the lowest-hanging fruit? If alignment-informed research is better long-term, why isn't the whole company the "safety team" out of simple desire to do their job?
We could make this more obviously skeptical by rephrasin... (read more)
In addition to self-consistency, we can also imagine agents that interact with an environment and learn to model it by taking actions and evaluating how good their predictions are (thereby having an effective standard for counterfactuals - or an explicit one, if we hand-code the agents to choose actions by explicitly considering counterfactuals).
I'll see you soon!
Do you want to chat sometime about this?
I think it's pretty clear why we think of the map-making sailboat as "having knowledge" even if it sinks: it's because our own model of the world expects maps to be legible to agents in the environment, and so we lump them into "knowledge" even before actually seeing someone use any particular map. You could try to predict this legibility part of how we think of knowledge from the atomic positions of the item itself, but you're going to get weird edge cases unless you actually make an intentional-stance-level mode... (read more)
How does the section of the amygdala that a particular dopamine neuron connects to even get trained to do the right thing in the first place? It seems like there should be enough randomness in the connections that there's really only this one neuron linking a brainstem's particular output to this specific spot in the amygdala - it doesn't have a whole bundle of different signals available to send to this exact spot.
SL in the brain seems tricky because not only does the brainstem have to reinforce behaviors in appropriate contexts, it might have to train certain ou... (read more)
One thing that strikes me as odd about this model is that it doesn't have the blessing of dimensionality - each plan is one loop, and evaluating feedback to a winning plan just involves feedback to one loop. When the feedback is a general reward, we can simplify this by just rewarding recent winning plans, but in some places it seems like you do imply highly specific feedback, for which you need N feedback channels to give feedback on ~N possible plans. The "blessing of dimensionality" kicks in when you can use more diverse combinations of a smaller number of feedback... (read more)
Great post! I very much hope we can do some clever things with value learning that let us get around needing AbD to do the things that currently seem to need it.
The fundamental example of this is probably optimizability - is your language model so safe that you can query it as part of an optimization process (e.g. making decisions about what actions are good) without just ending up in the equivalent of DeepDream's pictures of Maximum Dog?
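To make that worry concrete, a toy sketch (made-up numbers standing in for "true value" vs. a noisy learned proxy): pick the best of n samples under the proxy, and the gap between the proxy score and the true value grows with the amount of selection - Maximum Dog in miniature.

```python
import random

def best_of_n(n, proxy_noise=1.0):
    """True value is v; the proxy adds noise. Selecting hard on the proxy
    picks samples whose proxy error happens to be large and positive."""
    samples = [(random.gauss(0, 1), random.gauss(0, proxy_noise)) for _ in range(n)]
    best_true, best_err = max(samples, key=lambda s: s[0] + s[1])
    return best_true, best_true + best_err

for n in (1, 10, 1000):
    true_vals, proxy_vals = zip(*(best_of_n(n) for _ in range(2000)))
    print(f"n={n}: avg true={sum(true_vals)/2000:.2f}, avg proxy={sum(proxy_vals)/2000:.2f}")
```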
I think grappling with this problem is important because it leads you directly to understanding that what you are talking about is part of your agent-like model of systems, and how this model should be applied depends both on the broader context and your own perspective.
What is then stopping us from swapping the two copies of the coarser node?
Isn't it precisely that they're playing different roles in an abstracted model of reality? Though alternatively, you can just throw more logical nodes at the problem and create a common logical cause for both.
Also, would you say what you have in mind is built out of augmenting a collection of causal graphs with logical nodes, or do you have something incompatible in mind?
The truly arbitrary version seems provably impossible. For example, what if you're trying to make a smiley face, but some other part of the world contains an agent just like you except they're trying to make a frowny face - you obviously both can't succeed. Instead you need some special environment with low entropy, just like humans do in real life.
I feel like we can approximately split the full alignment problem into two parts: low stakes and handling catastrophes.
Insert joke about how I can split physics research into two parts: low stakes and handling catastrophes.
I'm a little curious about whether assuming fixed low stakes accidentally favors training regimes that have the real-world drawback of raising the stakes.
But overall I think this is a really interesting way of reframing the "what do we do if we succeed?" question. There is one way it might be misleading, which is that I think we're le... (read more)
Thanks for fixing the formatting!
I'll procrastinate from thesis-writing to fill out the form :)
You're gonna get back to thesis writing quickly, it's a very short form.
Wow, I'm really sorry for my bad reading comprehension.
Anyhow, I'm skeptical that scientist AI part 2 would end up doing the right thing (regardless of our ability to interpret it). I'm curious if you think this could be settled without building a superintelligent AI of uncertain goals, or if you'd really want to see the "full scale" test.
This runs headfirst into the problem of radical translation (which in AI is called "AI interpretability." Only slightly joking.)
Inside our Scientist AI it's not going to say "murder is bad," it's going to say "feature 1000 1101 1111 1101 is connected to feature 0000 1110 1110 1101." At first you might think this isn't so bad - after all, AI interpretability is a flourishing field, so let's just look at some examples and visualizations and try to figure out what these things are. But there's no guarantee that these features correspond neatly to their closest... (read more)
I initially thought you were going to debate Beth Barnes.
Also, thanks for the episode :) It was definitely interesting, although I still don't have a good handle on why some people are optimistic that there aren't classes of arguments humans will "fall for" irrespective of their truth value.
One generalization I am also interested in is to learn not merely abstract objects within a big model, but entire self-contained abstract levels of description, together with actions and state transitions that move you between abstract states. E.g. not merely detecting that "the grocery store" is a sealed box ripe for abstraction, but that "go to the grocery store" is a valid action within a simplified world-model with nice properties.
This might be significantly more challenging to say something interesting about, because it depends not just on the world but on how the agent interacts with the world.