I'm Steve Byrnes, a professional physicist in the Boston area. I have a summary of my AGI safety research interests at: https://sjbyrnes.com/agi.html
Gotcha, thanks. I have corrected my comment two above by striking out the words "boundedly-rational", but I think the point of that comment still stands.
Sorry for the stupid question, but what's the difference between "boundedly-rational agent pursuing a reward function" and "any sort of agent pursuing a reward function"?
It's your first day working at the factory, and you're assigned to shadow Alice as she monitors the machines on the line. She walks over to the Big Machine and says, "Looks like it's flooping again," whacks it, and then says "I think that fixed it". This happens a few times a day perpetually. Over time, you learn what flooping is, kinda. When the Big Machine is flooping, it usually (but not always) makes a certain noise, it usually (but not always) has a loose belt, and it usually (but not always) has a gear that shifted out of place. Now you know what it means for the Big Machine to be flooping, although there are lots of edge cases where neither you nor Alice has a good answer for whether or not it's truly flooping, vs sorta flooping, vs not flooping.
By the same token, you could give some labeled examples of "wants to take a walk" to the aliens, and they can find what those examples have in common and develop a concept of "wants to take a walk", albeit with edge cases.
Then you can also give labeled examples of "wants to braid their hair", "wants to be accepted", etc., and after enough cycles of this, they'll get the more general concept of "want", again with edge cases.
I don't think I'm saying anything that goes against your Occam's razor paper. As I understood it (and you can correct me!!), that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function", and proved that there's no objectively best way to do it, where "objectively best" includes things like fidelity and simplicity. (My perspective on that is, "Well yeah, duh, humans are not boundedly-rational agents pursuing a utility function! The model doesn't fit! There's no objectively best way to hammer a square peg into a round hole! (ETA: the model doesn't fit except insofar as the model is tautologically applicable to anything)")
I don't see how the paper rules out the possibility of building an unlabeled predictive model of humans, and then getting a bunch of examples labeled "This is human motivation", and building a fuzzy concept around those examples. The more labeled examples there are, the more tolerant you are of different inductive biases in the learning algorithm. In the limit of astronomically many labeled examples, you don't need a learning algorithm at all, it's just a lookup table.
This procedure has nothing to do with fitting human behavior into a model of a boundedly-rational agent pursuing a utility function. It's just an effort to consider all the various things humans do with their brains and bodies, and build a loose category in that space using supervised learning. Why not?
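To make the "loose category via supervised learning" idea concrete, here is a minimal sketch. Everything in it is a hypothetical stand-in: the two-number feature vectors pretend to be states of an already-learned predictive model, the labels pretend to be human-supplied "this is human motivation" tags, and the nearest-centroid rule with a margin is just one arbitrary choice of learning algorithm, not a claim about how it should actually be done.

```python
import math

# Hypothetical feature vectors standing in for states of a learned world-model.
# The boolean says whether a human labeled that example "this is human motivation".
labeled = [
    ((0.9, 0.8), True),
    ((0.8, 0.9), True),
    ((1.0, 0.7), True),
    ((0.1, 0.2), False),
    ((0.2, 0.1), False),
    ((0.0, 0.3), False),
]

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

pos = centroid([x for x, is_motivation in labeled if is_motivation])
neg = centroid([x for x, is_motivation in labeled if not is_motivation])

def classify(x, margin=0.2):
    """Return True or False for central cases, and None for edge cases."""
    d_pos = math.dist(x, pos)
    d_neg = math.dist(x, neg)
    if abs(d_pos - d_neg) < margin:
        return None  # edge case: "what exactly do you mean by that question?"
    return d_pos < d_neg

# In the limit of astronomically many labels, no learning algorithm is needed;
# the labeled set itself is the lookup table.
lookup = dict(labeled)
```

The point of the margin is only that the concept is allowed to have edge cases: far from the boundary the answer is clear, and near it the honest answer is "unclear". With more labeled examples, the choice of inductive bias (here, centroids and Euclidean distance) matters less and less.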
I think a page titled "here are some tools and resources for thinking about AI-related infohazards" would be helpful and uncontroversial and feasible... That could include things like a list of trusted people in the community who have an open offer to discuss and offer feedback in confidence, and links to various articles and guidelines on the topic (without necessarily "officially" endorsing any particular approach), etc.
I agree that your proposal is well worth doing, it just sounds a lot more ambitious and long-term.
I agree with the idea that we empathetically simulate people ("simulation theory") ... and I think we have innate social emotions if and only if we use that module when thinking about someone. So I brought that up here as a possible path to "finding human goals" in a world-model, and even talked about how dehumanization and anthropomorphization are opposite errors, in agreement with you. :-D
I think "weakening EH" isn't quite the right perspective, or at least not the whole story, at least when it comes to humans. We have metacognitive powers, and there are emotional / affective implications to using EH vs not using EH, and therefore using EH is at least partly a decision rather than a simple pattern-matching process. If you are torturing someone, you'll find that it's painful to use EH, so you quickly learn to stop using it. If you are playing make-believe with a teddy bear, you find that it's pleasurable to use EH on the teddy bear, so you do use it.
So dehumanization is modeling a person without using EH. When you view someone through a dehumanization perspective, you lose your social emotions. It no longer feels viscerally good or bad that they are suffering or happy.
But I do not think that when you're dehumanizing someone, you lose your ability to talk coherently about their motivations. Maybe we're a little handicapped in accurately modeling them, but still basically competent, and can get better with practice. Like, if a prison guard is dehumanizing the criminals, they can still recognize that a criminal is "trying" to escape.
I guess the core issue is whether motivation is supposed to have a mathematical definition or a normal-human-definition. A normal-human-definition of motivation is perfectly possible without any special modules, just like the definition of every other concept like "doorknob" is possible without special modules. You have a general world-modeling capability that lumps sensory patterns into categories/concepts. Then you're in ten different scenarios where people use the word "doorknob". You look for what those scenarios have in common, and it's that a particular concept was active in your mind, and then you attach that concept to the word "doorknob". Those ten examples were probably all central examples of "doorknob". There are also weird edge cases, where if you ask me "is that a doorknob?" I would say "I don't know, what exactly do you mean by that question?".
We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation. We just need to be exposed to ten central examples, and we'll find the concept(s) that are most activated by those examples, and call that concept "motivation". Here's one of the ten examples: "If a person, in a psychologically-normal state of mind, says 'I want to do X', then they are probably motivated to do X."
Of course, just like doorknobs, there are loads of edge cases where if you ask a normal person "what is Alice's motivation here?" they'll say "I don't know, what exactly do you mean by that question?".
A mathematical definition of human motivation would, I imagine, have to be unambiguous and complete, with no edge cases. From a certain perspective, why would you ever think that such a thing even exists? But if we're talking about MDPs and utility functions, or if we're trying to create a specification for an AGI designer to design to, this is a natural thing to hope for and talk about.
I think if you gave an alien the same ten labelled central examples of "human motivation", plus lots of unlabeled information about humans and videos of humans, they could well form a similar concept around it (in the normal-human-definition sense, not the mathematical-definition sense), or at least their concept would overlap ours as long as we stay very far away from the edge cases. That's assuming the aliens' world-modeling apparatus is at least vaguely similar to ours, which I think is plausible, since we live in the same universe. But it is not assuming that the aliens' motivational systems and biases are anything like ours.
Sorry if I'm misunderstanding anything :-)
I second this sentiment.
...Although maybe I would say we need "AI infohazard guidance, options, and resources" rather than an "AI infohazard policy"? I think that would better convey the attitude that we trust each other and are trying to help each other—not just because we do in fact presumably trust each other, but also because we have no choice but to trust each other... The site moderators can enforce a "policy", but if the authors don't buy in, they'll just publish elsewhere.
I was just talking about it (in reference to my own posts) a few days ago—see here. I've just been winging it, and would be very happy to have "AI infohazard guidance, options, and resources". So, I'm following this discussion with interest. :-)
General feedback: my belief is that brain algorithms and today's deep learning models are different types of algorithms, and therefore regardless of whether TAI winds up looking like the former or the latter (or something else entirely), this type of exercise (i.e. where you match the two up along some axis) is not likely to be all that meaningful.
Having said that, I don't think the information value is literally zero, I see why someone pretty much has to do this kind of analysis, and so, might as well do the best job possible. This is a very impressive effort and I applaud it, even though I'm not personally updating on it to any appreciable extent.
Let me try again. Maybe this will be clearer.
The paradigm of the brain is online learning. There are a "small" number of adjustable parameters on how the process is set up, and then each run is long—a billion subjective seconds. And during the run there are a "large" number of adjustable parameters that get adjusted. Almost all the information content comes within a single run.
The paradigm of today's popular ML approaches is train-then-infer. There are a "large" number of adjustable parameters, which are adjusted over the course of an extremely large number of extremely short runs. Almost all the information content comes from the training process, not within the run. Meanwhile, sometimes people do multiple model-training runs with different hyperparameters—hyperparameters are a "small" number of adjustable parameters that sit outside the gradient-descent training loop.
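Here is a toy sketch of the train-then-infer structure, just to pin down which loop is which. It is an illustration under made-up assumptions: a single weight `w` stands in for the "large" set of trained parameters, the learning rate is the only hyperparameter, and the data and function names are hypothetical.

```python
def train(learning_rate, data, steps=200):
    """Inner loop: gradient descent adjusts the trained parameters.

    A single weight w stands in for the "large" parameter set; the target
    relationship in the toy data is y = 3x.
    """
    w = 0.0
    for step in range(steps):
        x, y = data[step % len(data)]
        grad = 2 * (w * x - y) * x  # derivative of (w*x - y)**2 with respect to w
        w -= learning_rate * grad
    return w

def loss(w, data):
    """Mean squared error of the trained weight on the data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

# Outer loop: a "small" number of hyperparameters, tried over a few trainings.
# Nothing here is adjusted by gradient descent.
results = {lr: loss(train(lr, data), data) for lr in (0.001, 0.01, 0.1)}
best_lr = min(results, key=results.get)
```

In the analogy I'm arguing for, one call to `train` corresponds to one long online-learning run, and the outer sweep over `learning_rate` corresponds to adjusting the handful of parameters that set up that run.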
I think the appropriate analogy is:
This seems to work reasonably well all around: (A) takes a long time and involves a lot of information content in the developed "intelligence", (B) is a handful of (perhaps human-interpretable) parameters, (C) is the final "intelligence" that you wind up wanting to deploy.
So again I would analogize one run of the online-learning paradigm with one training of today's popular ML approaches. Then I would try to guess how many runs of online-learning you need, and I would guess 10-100, not based on anything in particular, but you can get a better number by looking into the extent to which people need to play with hyperparameters in their ML training, which is "not much if it's very important not to".
Sure, you can do a boil-the-oceans automated hyperparameter search, but in the biggest projects where you have no compute to spare, you can't do that. Instead, you sit and think about the hyperparameters, you do smaller-scale studies, you try to carefully diagnose the results of each training, etc. Like, I believe the GPT-3 team only did one training run of their largest model; they worked hard to figure out good hyperparameter settings by extrapolating from smaller studies.
...Whereas it seems that the report is doing a different analogy:
I think that analogy is much worse than the one I proposed. You're mixing short tests with long-calculations-that-involve-a-ton-of-learning, you're mixing human tweaking of understandable parameters with gradient descent, etc.
To be clear, I don't think my proposed analogy is perfect, because I think that brain algorithms are rather different than today's ML algorithms. But I think it's a lot better than what's there now, and maybe it's the best you can do without getting into highly speculative and controversial inside-view-about-brain-algorithms stuff.
I could be wrong or confused :-)
I'm not seeing the merit of the genome anchor. I see how it would make sense if humans didn't learn anything over the course of their lifetime. Then all the inference-time algorithmic complexity would come from the genome, and you would need your ML process to search over a space of models that can express that complexity. However, needless to say, humans do learn things over the course of their lifetime! I feel even more strongly about that than most, but I imagine we can all agree that the inference-time algorithmic complexity of an adult brain is not limited by what's in the genome, but rather also incorporates information from self-supervised learning etc.
The opposite perspective would say: the analogy isn't between the ML trained model and the genome, but rather between the ML learning algorithm and the genome on one level, and between the ML trained model and the synapses on the other level. So, something like ML parameter count = synapse count, and meanwhile the genome size would correspond to "how complicated is the architecture and learning algorithm?" Like, add up the algorithmic complexity of backprop plus dropout regularization plus BatchNorm plus data augmentation plus Xavier initialization etc. etc. Or something like that.
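For a sense of how far apart the two anchors are, here is a back-of-the-envelope calculation. The numbers are assumptions on my part (rough, commonly cited order-of-magnitude figures), and raw genome size is only a loose upper bound on the information that actually specifies the brain.

```python
# Rough order-of-magnitude figures; these are assumptions for illustration.
GENOME_BASE_PAIRS = 3e9   # human genome, roughly 3 billion base pairs
BITS_PER_BASE_PAIR = 2    # four possible bases, so 2 bits each
SYNAPSES = 1e14           # estimates for the human brain vary, ~1e14 to 1e15

genome_bits = GENOME_BASE_PAIRS * BITS_PER_BASE_PAIR
genome_bytes = genome_bits / 8

# How many synapses there are per bit of raw genome.
synapses_per_genome_bit = SYNAPSES / genome_bits

print(f"genome: about {genome_bytes:.2e} bytes")
print(f"synapses per genome bit: about {synapses_per_genome_bit:.0f}")
```

Under these assumptions the genome is well under a gigabyte, while the synapse count is more than four orders of magnitude larger than the number of genome bits, which is why it matters so much which of the two gets matched to ML parameter count.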
I think the truth is somewhere in between, but a lot closer to the synapse-anchor side (that ignores instincts) than the genome-anchor side (that ignores learning).
Sorry if I'm misunderstanding or missing something, or confused.
UPDATE: Or are we supposed to imagine an RNN wherein the genomic information corresponds to the weights, and the synapse information corresponds to the hidden state activations? If so, I didn't think you could design an RNN (of the type typically used today) where the hidden state activations have many orders of magnitude more information content than the weights. Usually there are more weights than hidden state activations, right?
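To check the "more weights than hidden state activations" intuition, here is a quick count for a single-layer vanilla (Elman-style) RNN, h_t = tanh(W_ih x_t + W_hh h_{t-1} + b). The function name and the chosen sizes are just for illustration, and other recurrent architectures will have different constants.

```python
def vanilla_rnn_counts(input_size, hidden_size):
    """Count weights and hidden-state activations for a single-layer Elman RNN.

    Weights: W_ih (hidden x input), W_hh (hidden x hidden), and a bias vector.
    Hidden state: one activation per hidden unit.
    """
    weights = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    hidden_state = hidden_size
    return weights, hidden_state

for h in (128, 1024, 8192):
    w, s = vanilla_rnn_counts(input_size=h, hidden_size=h)
    print(f"hidden={h}: weights={w}, hidden state={s}, ratio={w // s}")
```

With input size equal to hidden size, the ratio of weights to hidden-state activations is 2h + 1, so the weights outnumber the hidden state by a factor that grows with model width. That is the opposite of what the genome-as-weights, synapses-as-hidden-state picture would need.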
UPDATE 2: See my reply to this comment.
I think the Transformer is successful in part because it tends to solve problems by considering multiple possibilities, processing them in parallel, and picking the one that looks best. (Selection-type optimization.) If you train it on text prediction, that's part of how it will do text prediction. If you train it on a different domain, that's part of how it will solve problems in that domain too.
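By "selection-type optimization" I mean something with roughly this shape. This is a cartoon, not a claim about what the Transformer literally computes: the candidate list and the scores are hypothetical placeholders.

```python
def select_best(candidates, score):
    """Selection-type optimization: evaluate every candidate, keep the best one."""
    # Each candidate is scored independently, so this step could run in parallel.
    scored = [(score(c), c) for c in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate

# Hypothetical plausibility scores for three candidate continuations.
plausibility = {"option A": 0.70, "option B": 0.05, "option C": 0.25}
best = select_best(list(plausibility), score=plausibility.get)
```

Nothing in `select_best` is specific to text. Swap in a different set of candidates and a different scoring function and the same procedure solves a problem in a different domain.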
I don't think GPT builds a "mesa-optimization infrastructure" and then applies that infrastructure to language modeling. I don't think it needs to. I think the Transformer architecture is already raring to go forth and mesa-optimize, as soon as you give it any optimization pressure to do so.
So anyway your question is: can it display foresight / planning in a different domain without being trained in that domain? I would say, "yeah probably, because practically every domain is instrumentally useful for text prediction". So somewhere in GPT-3's billions of parameters I think there's code to consider multiple possibilities, process them in parallel, and pick the best answer, in response to the question of "What will happen next when you put a sock in a blender?" or "What is the best way to fix an oil leak?", meaning not just those literal words as a question, but the concepts behind them, however they're invoked.
(Having said that, I don't think GPT-3 specifically will do side-channel attacks, but for other, unrelated, off-topic reasons. Namely, I don't think it is capable of making the series of new insights required to develop an understanding of itself and its situation and then take appropriate actions. That's based on my speculations here.)