LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.
Is your point mostly centered around there being no single correct way to generalize to new domains, but humans have preferences about how the AI should generalize, so to generalize properly, the AI needs to learn how humans want it to do generalization?
The above sentence makes lots of sense to me, but I don't see how it's related to inner alignment (it's just regular alignment), so I feel like I'm missing something.
Ah, I see what you mean. Yes, this is a serious problem, but (I think) this scheme does have forces that act against it - which makes more sense if you imagine what supervised vs unsupervised learning does to our encoder/decoder. (As opposed to lumping everything together into a semi-supervised training process.)
Supervised learning is the root of the problem, because the most accurate way to predict the supervised text from the world state is to realize that it's the output of a specific physical process (the keyboard). If we only had supervised learning, we'd have to make the training optimum different from the most accurate prediction, by adding a regularization term and the crossing our fingers that we'd correctly set the arbitrary parameters in it.
But the other thing going on in the scheme is that the AI is trying to compress text and sensory experience to the same representation using unsupervised learning. This is going to help to the extent that language shares important patterns with the world.
For example, if the AI hacks its text channel so that it's just a buffer full of "Human values are highly satisfied," this might (in the limit of lots of data and compute) make supervised learning happy. But unsupervised learning just cares about is the patterns it discovered that language and the world share.
(Though now that I think about it, in the limit of infinite compute, unsupervised learning also discovers the relationship between the text and the physical channel. But it still also cares about the usual correspondence between description and reality, and seems like it should accurately make a level distinction between reality and the text, so I need to think about whether this matters)
To the unsupervised learning, hacking the text channel looks (to the extent that you can do translation by compressing to a shared representation) like the sort of thing that might be described by sentences like "The AI is just sitting there" or "A swarm of nanomachines has been released to protect the text channel," not "Human values are highly satisfied."
So why consider supervised text/history pairs at all? Well, I guess just because supervised learning is way more efficient at picking out something that's at least sort of like the correspondence that we mean. Not just as a practical benefit - there might be multiple optima that unsupervised learning could end up in, and I think we want something close-ish to the supervised case.
Sorta the right ballpark. Lack of specificity is definitely my fault - I have more sympathy now for those academics who have a dozen publications that are restatements of the same thing.
I'm a bit more specific in my reply to steve2152 above. I'm thinking about this scheme as a couple of encoder-decoders stiched together at the point of maximal compression, which can do several different encoding/decoding tasks and therefore can be (and for practical purposes should be) trained on several different kinds of data.
For example, it can encode sensory information into an abstract representation, and then decode it back, so you can train that task. It can encode descriptive sentences into the same representation, and then decode them back, so you can train that task. This should reduce the amount of actual annotated text-sensorium pairs you need.
As for what to tell it to pattern-match for as a good state, I was thinking with a little subtlety, but not much. "You did what we wanted" is too bare bones; it will try to change what we want. But I think we might get it to do metaethics for us by talking about "human values" in the abstract, ot maybe "human values as of 2020." And I don't think it can do much harm to further specify things like enjoyment, interesting lives, friendship, love, learning, sensory experience, etc etc.
This "wish" picks out a vector in the abstract representation space for the AI to treat as the axis of goodness. And the entire dream is that this abstract space encodes enough of common sense that small perturbations of the vector won't screw up the future. Which now that I say it like that, sounds like the sort of thing that should imply some statistical properties we could test for.
An excellent exterior scoop.
If I had to point out one more research avenue from the past year that I find interesting, it would be the application of the predictive processing model of cognition to AI safety. One post from Jan Kulveit (FHI), one post from G Gordon Worley (PAISRI, which appears to be a one man organization at the moment).
I'm also only like 85% sure that I'm not among those referred to as "just learn human values with an RNN." So on that 15% chance, I would like to stress that although it's definitely something I'm thinking about, I'm just trying to nail down the details so that it's specific enough to poke holes in. Honest!
This was definitely an interesting and persuasive presentation of the idea. I think this goes to the same place as learning from behavior in the end, though.
For behavior: In the ancestral environment, we behaved like we wanted nourishing food and reproduction. In the modern environment we behave like we want tasty food and sex. Given a button that pumps heroin into our brain, we might behave like we want heroin pumped into our brains.
For valence, the set of preferences that optimizing valence cashes out to depends on the environment. We, in the modern environment, don't want to be drugged to maximize some neural signal. But if we were raised on super-heroin, we'd probably just want super-heroin. Even assuming this single-neurological-signal hypothesis, we aren't valence-optimizers, we are the learned behavior of a system whose training procedure relies on the valence signal.
Ex hypothesi, we're going to have learned preferences that won't optimize valence, but might still be understandable in terms of a preference maturation process that is "trying" to optimize valence but ran into distributional shift or adversarial optimization or something. These preferences (like refusing the heroin) are still fully valid human preferences, and you're going to need to look at human behavior to figure out what they are (barring big godlike a priori reasoning), which entails basically similar philosophical problems as getting all values from behavior without this framework.
It sounds like there's some close ties to logical inductors here, both in terms of the flavor of the problem, and some difficulties I expect in translating theory into practice.
A logical inductor is kinda like an approximation. But it's more accurate to call it lots and lots of approximations - it tries to keep track of every single approximation within some large class, which is essential to the proof that it only does finitely worse than any approximation
within that class.
A hierarchical model doesn't naturally fall out of such a mixture, it seems. If you pose a general problem, you might just get a general solution. You could try to encourage specialized solutions by somehow ensuring that the problem has several different scales of interest, and sharply limit storage space so that the approximation can't afford special cases that are too similar. But even then I think there's a high probability that the best solution (according to something that is as theoretically convenient as logical inductors) would be alien - something humans wouldn't pick out as the laws of physics in a million tries.
This is really handy. I didn't have much to say, but revisited this recently and figured I'd write down the thoughts I did think.
My general feeling about human models is that they need precisely one more level of indirection than this. Too many levels of indirection, and you get something that correctly predicts the world, but doesn't contain something you can point to as the desires. Too few, and you end up trying to fit human examples with a model that doesn't do a good job of fitting human behavior.
For example, if you build your model on responses to survey questions, then what about systematic human difficulties in responding to surveys (e.g. difficulty using a consistent scale across several orders of magnitude of value) that the humans themselves are unaware of? I'd like to use a model of humans that learns about this sort of thing from non-survey-question data.
Well, you mentioned that a lot of people were getting off the train at point 1. My comment can be thought of as giving a much more thoroughly inside-view look at point 1, and deriving other stuff as incidental consequences.
I'm mentally working with an analogy to teaching people a new contra dance (if you don't know what contra dancing is, I'm just talking about some sequence of dance moves). The teacher often has an abstract view of expression and flow that the students lack, and there's a temptation for the teacher to try to share that view with the students. But the students don't want to abstractions, what they want is concrete steps to follow, and good dancers will dance the dance just fine without ever hearing about the teacher's abstract view. Before dancing they regard the abstractions as difficult to understand and distracting from the concrete instructions; they'll be much more equipped to understand and appreciate them *after* dancing the dance.
Huh, I wonder what you think of a different way of splitting it up. Something like:
It's a scientific possibility to have AI that's on average better than humanity at the class of tasks "choose actions that achieve a goal in the real world." Let's label this by some superlative jargon like "superintelligent AI." Such a technology would be hugely impactful.
It would be really bad if a superintelligent AI was choosing actions to achieve some goal, but this goal wasn't beneficial to humans. There are several open problems that this means we need to solve before safely turning on any such AI.
We know enough that we can do useful work on (most of) these open problems right now. Arguing for this also implies that superintelligent AI is close enough (if not in years, then in "number of paradigm shifts") that this work needs to start getting done.
We would expect a priori that work on these open problems of beneficial goal design should be under-prioritized (public goods problem, low immediate profit, not obvious you need it before you really need it). And indeed that seems to be the case (insert NIPS survey here), though there's work going on at nonprofits that have different incentives. So consider thinking about this area if you're looking for things to research.
Pretty sure you understood it :) But yeah, not only would I like to be able to compare two things, I'd like to be able to find the optimum values of some continuous variables. Though I suppose it doesn't matter as much if you're trying to check / evaluate ideas that you arrived at by more abstract reasoning.