Some notable/famous signatories that I noted: Geoffrey Hinton, Yoshua Bengio, Demis Hassabis (DeepMind CEO), Sam Altman (OpenAI CEO), Dario Amodei (Anthropic CEO), Stuart Russell, Peter Norvig, Eric Horvitz (Chief Scientific Officer at Microsoft), David Chalmers, Daniel Dennett, Bruce Schneier, Andy Clark (the guy who wrote Surfing Uncertainty), Emad Mostaque (Stability AI CEO), Lex Fridman, Sam Harris.
Edited to add: a more detailed listing from this post:
...Signatories include notable philosophers, ethicists, legal scholars, economists, physicists, politica
But then, "the self-alignment problem" would likewise make it sound like it's about how you need to align yourself with yourself. And while it is the case that increased self-alignment is generally very good and that not being self-aligned causes problems for the person in question, that's not actually the problem the post is talking about.
I don't know how you would describe "true niceness", but I think it's neither of the above.
Agreed. I think "true niceness" is something like, act to maximize people's preferences, while also taking into account the fact that people often have a preference for their preferences to continue evolving and to resolve any of their preferences that are mutually contradictory in a painful way.
...Niceness is natural for agents of similar strengths because lots of values point towards the same "nice" behavior. But when you're much more powerful than anyone else, the ta
Great post!
When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt
I'm very confused by the frequent use of "GPT-4", and am failing to figure out whether this is actually meant to read GPT-2 or GPT-3, whether there's some narrative device where this is a post written at some future date when GPT-4 has actually been released (but that wouldn't match "when LLMs first appeared"), or what's going on.
Thanks, this seems like a nice breakdown of issues!
...If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algorithmic description. Maybe the difference is that you’re imagining that people are going to hand-write source code that has a labeled “this is an empathetic simulation” variable, and a “my preferences are being satisfied” variable? Because I
A sufficiently advanced AGI familiar with humans will have a clear concept of “not killing everyone” (or more specifically, “what humans mean when they say the words ‘not killing everyone’”). We just add a bit of an extra component that makes the AGI intrinsically value that concept. This implies that capabilities progress may be closely linked to alignment progress.
Some major differences off the top of my head:
Thanks! Hmm, I think we’re mixing up lots of different issues:
I say “no”. Or at least, I don’t currently know how you would do that, see here. (I think about it a lot; ask me again next year. :) )
If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algo...
An observation: it feels slightly stressful to have posted this. I have a mental simulation telling me that there are social forces around here that consider it morally wrong or an act of defection to suggest that alignment might be relatively easy, like it implied that I wasn't taking the topic seriously enough or something. I don't know how accurate that is, but that's the vibe that my simulators are (maybe mistakenly) picking up.
Forget about what the social consensus is. If you have technical understanding of current AIs, do you truly believe there are any major obstacles left? The kind of problems that AGI companies could reliably not tear down with their resources? If you do, state so in the comments, but please do not state what those obstacles are.
I guess the reasoning behind the "do not state" request is something like "making potential AGI developers more aware of those obstacles is going to direct more resources into solving those obstacles". But if someone is trying t...
At least ChatGPT seems to have a longer context window, with this experiment suggesting 8192 tokens.
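Roughly, the shape of such an experiment is something like the sketch below (my own illustration rather than the linked experiment's actual code: it assumes tiktoken's cl100k_base encoding for token counting, and query_model is a hypothetical stand-in for however you actually query the chat interface).

```python
# Sketch of a context-window probe: put a "secret" code word at the start of the prompt,
# pad with filler up to a target token count, and check whether the model still recalls it.
# Assumptions: tiktoken for token counting; `query_model` is a hypothetical stand-in for
# whatever API or chat interface you actually use.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def build_probe(target_tokens: int, secret: str = "zebra-42") -> str:
    header = f"Remember this code word: {secret}.\n"
    question = "\nWhat was the code word given at the very start of this prompt?"
    filler_budget = target_tokens - len(enc.encode(header)) - len(enc.encode(question))
    filler_sentence = "This sentence is padding to fill up the context window. "
    filler = ""
    while len(enc.encode(filler + filler_sentence)) <= filler_budget:
        filler += filler_sentence
    return header + filler + question

def estimate_window(query_model, candidates=(2048, 4096, 8192, 16384), secret="zebra-42"):
    """Return the largest prompt length (in tokens) at which the secret is still recalled."""
    largest_recalled = 0
    for n in candidates:
        if secret in query_model(build_probe(n, secret)):
            largest_recalled = n
    return largest_recalled
```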
And yeah, I do think the thing I am most worried about with ChatGPT, in addition to it just shortening timelines, is it increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off.
Maybe - but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. ...
I was a bit surprised to see Eliezer invoke the Wason Selection Task. I'll admit that I haven't actually thought this through rigorously, but my sense was that modern machine learning had basically disproven the evpsych argument that those experimental results require the existence of a separate cheating-detection module. As well as generally calling the whole massive modularity thesis into severe question, since the kinds of results that evpsych used to explain using dedicated innate modules now look a lot more like something that could be produced with s...
No worries!
That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness.
Yeah, I did drift towards more generic terms like "subsystems" or "parts" later in the series for this reason, and might have changed the name of the sequence if only I'd managed to think of something better. (Terms like "subagents" and "multi-agent models of mind" still gesture away from rational agent models in a way that more generic terms like "subsystems" don't.)
Unlike in subagent models, the subcomponents of agents are not themselves always well modeled as (relatively) rational agents. For example, there might be shards that are inactive most of the time and only activate in a few situations.
For what it's worth, at least in my conception of subagent models, there can also be subagents that are inactive most of the time and only activate in a few situations. That's probably the case for most of a person's subagents, though of course "subagent" isn't necessarily a concept that cuts reality at the joints, so this depends o...
I think Anders Sandberg did research on this at one point, and I recall him summarizing his findings as "things are easy to ban as long as nobody really wants to have them". IIRC, things that went into that category were chemical weapons (they're actually not very effective in modern warfare), CFCs (they were relatively straightforward to replace with equally effective alternatives), and human cloning.
I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy.
What's your take on "parts work" techniques like IDC, IFS, etc. seeming to bring up something like private (or at least not completely shared) world models? Do you consider the kinds of "parts" those access as being distinct from shards?
I would find it plausible to assume by default that shards have something like differing world models since we know from cognitive psychology that e.g. different ...
I think Shard Theory is one of the most promising approaches to human values that I've seen on LW, and I'm very happy to see this work posted. (Of course, I'm probably biased in that I also count my own approaches to human values among the most promising, and Shard Theory shares a number of similarities with them - e.g. this post talks about something-like-shards issuing mutually competitive bids that get strengthened or weakened depending on how environmental factors activate those shards, and this post talked about values and world-models being learned in an...
The one big coordination win I recall us having was the 2015 Research Priorities document that among other things talked about the threat of superintelligence. The open letter it was an attachment to was signed by over 8000 people, including prominent AI and ML researchers.
And then there's basically been nothing of equal magnitude since then.
I would presume that the AI would know that humans are likely to try to resist a takeover attempt, and to have various safeguards against it. It might be smart enough to be able to overcome any human response, but that seems to only work if it actually puts that intelligence to work by thinking about what (if anything) it needs to do to counteract the human response.
More generally, humans are such a major influence on the world, as well as a source of potential resources, that it would seem really odd for any superintelligence to naturally arrive at a world-takeover plan without at any point happening to consider how this will affect humanity and whether that suggests any changes to the plan.
so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk.
There's also the chance that if these posts are not gated, people who previously weren't plugged into the LW ecosystem but are interested in AI find LW through articles such as this one. And then eventually also start reading other articles here and become more interested in alignment concerns.
There's also a bit of a negative stereotype among some AI researchers of alignment people as being theoreti...
This, in turn, implies that human values/biases/high-level cognitive observables are produced by relatively simpler hardcoded circuitry, specifying e.g. the learning architecture, the broad reinforcement learning and self-supervised learning systems in the brain, and regional learning hyperparameters.
See also the previous LW discussion of The Brain as a Universal Learning Machine.
...... the evolved modularity cluster posits that much of the machinery of human mental algorithms is largely innate. General learning - if it exists at all - exists only in sp
Conversely, the genome can access direct sensory observables, because those observables involve a priori-fixed “neural addresses.” For example, the genome could hardwire a cute-face-detector which hooks up to retinal ganglion cells (which are at genome-predictable addresses), and then this circuit could produce physiological reactions (like the release of reward). This kind of circuit seems totally fine to me.
Related: evolutionary psychology used to have a theory according to which humans had a hardwired fear of some stimuli (e.g. spide...
A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.
Yes, that was my argument in the comment that I linked. :)
5. Empirical evidence: human intelligence generalised far without staying aligned with its optimisation target.
I think this one is debatable. It seems to me that human intelligence has remained reasonably well aligned with its optimization target, if its optimization target is defined as "being well-fed, getting social status, remaining healthy, having sex, raising children, etc.", i.e. the things that evolution actually could optimize humans for rather than something like inclusive fitness that it couldn't directly optimize for. Yes there are individual h...
Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior.
I was just thinking about this. The central example that's often used here is "evolution optimized humans for inclusive genetic fitness, nonetheless humans do not try to actually maximize the amount of their surviving offspring, such...
The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF.
This analogy gets brought out a lot, but has anyone actually spelled it out explicitly? Because it's not clear to me that it holds if you try to explicitly work out the argument.
In particular, I don't quite understand what it would mean for evolution to optimize the species for fitness, given that fitness is defined as a measure of reproductive success within the species. A genotype has a high fitness, if it...
Agreed, but the black-box experimentation seems like it's plausibly a prerequisite for actual understanding? E.g. you couldn't analyze InceptionV1 or CLIP to understand its inner workings before you actually had those models. To use your car engine metaphor from the other comment, we can't open the engine and stick it full of sensors before we actually have the engine. And now that we do have engines, people are starting to stick them full of sensors, even if most of the work is still focused on building even fancier engines.
It seems reasonable to expect t...
I had a pretty strong negative reaction to it. I got the feeling that the post derives much of its rhetorical force from setting up an intentionally stupid character who can be condescended to, and that this is used to sneak in a conclusion that would seem much weaker without that device.
Didn't they train a separate MuZero agent for each game? E.g. the page you link only talks about being able to learn without pre-existing knowledge.
However, I don't think this is the whole explanation. The technological advantage of the conquistadors was not overwhelming.
With regard to the Americas at least, I just happened to read this article by a professional military historian, who characterizes the Native American military technology as being "thousands of years behind their Old World agrarian counterparts", which sounds like the advantage was actually rather overwhelming.
...There is a massive amount of literature to explain what is sometimes called ‘the Great Divergence‘ (a term I am going to use h
Thankfully, there have already been some successes in agent-agnostic thinking about AI x-risk
Also Sotala 2018 mentions the possibility of control over society gradually shifting over to a mutually trading collective of AIs (p. 323-324) as one "takeoff" route, as well as discussing various economic and competitive pressures to shift control over to AI systems and the possibility of a “race to the bottom of human control” where state or business actors [compete] to reduce human control and [increase] the autonomy of their AI systems to obtain an edge over th...
Oh cool! I put some effort into pursuing a very similar idea earlier:
I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms.
but wasn't sure of how exactly to test it or work on it so I didn't get very far.
Does any military use meditation as part of its training?
Yes, e.g.
...This [2019] winter, Army infantry soldiers at Schofield Barracks in Hawaii began using mindfulness to improve shooting skills — for instance, focusing on when to pull the trigger amid chaos to avoid unnecessary civilian harm.
The British Royal Navy has given mindfulness training to officers, and military leaders are rolling it out in the Army and Royal Air Force for some officers and enlisted soldiers. The New Zealand Defence Force recently adopted the technique, and military forces o
Appreciate this post! I had seen the good regulator theorem referenced every now and then, but wasn't sure what exactly the relevant claims were, and wouldn't have known how to go through the original proof myself. This is helpful.
(E.g. the result was cited by Frith & Metzinger as part of their argument that, as an agent seeks to avoid being punished by society, this constitutes an attempt to regulate society's behavior; and for the regulation to be successful, the agent needs to internalize a model of the society's preferences, which once internalized be...
IMO, a textbook would either overlook big chunks of the field or look more like an enumeration of approaches than a unified resource.
Textbooks that cover a number of different approaches without taking a position on which one is the best are pretty much the standard in many fields. (I recall struggling with it in some undergraduate psychology courses, as previous schooling didn't prepare me for a textbook that would cover three mutually exclusive theories and present compelling evidence in favor of each. Before moving on and presenting three mutually exclusive theories about some other phenomenon on the very next page.)
The "many decisions can be thought of as a committee requiring unanimous agreement" model felt intuitively right to me, and afterwards I've observed myself behaving in ways which seem compatible with it, and thought of this post.
You probably know of these already, but just in case: lukeprog wrote a couple of articles on the history of AI risk thought [1, 2] going back to 1863. There's also the recent AI ethics article in the Stanford Encyclopedia of Philosophy.
I'd also like to imagine that my paper on superintelligence and astronomical suffering might say something that someone might consider important, but that is of course a subjective question. :-)
because who's talking about medium-size risks from AGI?
Well, I have talked about them... :-)
...The capability claim is often formulated as the possibility of an AI achieving a decisive strategic advantage (DSA). While the notion of a DSA has been implicit in many previous works, the concept was first explicitly defined by Bostrom (2014, p. 78) as “a level of technological and other advantages sufficient to enable [an AI] to achieve complete world domination.”
However, assuming that an AI will achieve a DSA seems like an unnecessarily strong form of the capabil
I read you to be asking "what decision theory is implied by predictive processing as it's implemented in human brains". It's my understanding that while there are attempts to derive something like a "decision theory formulated entirely in PP terms", there are also serious arguments for the brain actually having systems that just implement conventional decision theories and are not directly derivable from PP.
Let's say you try, as some PP theorists do, to explain all behavior as free energy minimization as opposed to expected utility maximization. Ransom et al. (2020)...
It sounds a bit absurd: you've already implemented a sophisticated RL algorithm, which keeps track of value estimates for states and actions, and propagates these value estimates to steer actions toward future value. Why would the learning process re-implement a scheme like that, nested inside of the one you implemented? Why wouldn't it just focus on filling in the values accurately?
I've thought of two possible reasons so far.
...
- Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implemen
Or e.g. that it always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don't think you see these things, and I'm interested in figuring out how evolution prevented them.
As I understand it, Wang et al. found that their experimental setup trained an internal RL algorithm that was more specialized for this particular task, but was still optimizing for the same task that the RNN was being trained on? And it was selected exactly because it did better on that very goal. If the circumstances...
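For concreteness, this is roughly the kind of "outer" algorithm the quoted passage is describing - one that keeps explicit value estimates for state-action pairs and propagates reward back through them. The sketch below is just generic tabular Q-learning on a toy chain environment of my own invention, not Wang et al.'s actual recurrent meta-RL setup.

```python
# Minimal tabular Q-learning sketch (an illustration only, not Wang et al.'s setup):
# the algorithm keeps value estimates Q[state, action] and propagates reward backwards
# through those estimates so that action selection gets steered toward future value.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # value estimates for every (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(state, action):
    """Toy chain environment: action 1 moves right (more reward further right),
    action 0 resets to the start with no reward."""
    next_state = min(state + 1, n_states - 1) if action == 1 else 0
    reward = next_state / (n_states - 1)
    return next_state, reward

state = 0
for _ in range(10_000):
    # epsilon-greedy action selection based on the current value estimates
    action = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # TD update: propagate the value of the next state back into this state-action estimate
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q)  # "move right" (action 1) ends up with the higher estimated value in every state
```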
The thing I meant by "catastrophic" is just "leading to death of the organism." I suspect mesa-optimization is common in humans, but I don't feel confident about this, nor that this is a joint-carvey ontology. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just "everyday personal misalignment"-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomena which might reasonably be described as inner alignment failures.
But I think these things don't kill people very...
I don't feel at all tempted to do that anthropomorphization, and I think it's weird that EY is acting as if this is a reasonable thing to do.
"It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart" seems obviously incorrect if it's explicitly phrased that way, but e.g. the "Giving GPT-3 a Turing Test" post seems to implicitly assume something like it:
This gives us a hint for how to stump the AI more consistently. We need to ask questions that no normal human would ever talk about....
Q: How m
If some circuit in the brain is doing something useful, then it's humanly feasible to understand what that thing is and why it's useful, and to write our own CPU code that does the same useful thing.
In other words, the brain's implementation of that thing can be super-complicated, but the input-output relation cannot be that complicated—at least, the useful part of the input-output relation cannot be that complicated.
Robin Hanson makes a similar argument in "Signal Processors Decouple":
The bottom line is that to emulate a bi...
Ah, looking at earlier posts in your sequence, I realize that you are defining abstraction in such a way that the properties of the high-level abstraction imply information about the low-level details - something like "a class XYZ star (high-level classification) has an average mass of 3.5 suns (low-level detail)".
That explains my confusion since I forgot that you were using this definition, and was thinking of "abstraction" as it was defined in my computer science classes, where an abstraction was something that explicitly discarded all the information about the low-level details.
Or, to put it differently: an abstraction is “valid” when components of the low-level model are independent given the values of high-level variables. The roiling of plasmas within far-apart stars is (approximately) independent given the total mass, momentum, and center-of-mass position of each star. As long as this condition holds, we can use the abstract model to correctly answer questions about things far away/far apart. [...]
So if we draw a box around just one of the two instances, then those low-level calculation details we threw out w...
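To check my own understanding of that condition, here's a toy numpy sketch of the far-apart-stars example (my own illustration using the standard far-field approximation, not code from the sequence): the pull a distant observer feels from a messy cloud of point masses is reproduced almost exactly by the high-level summary (total mass, center of mass), so the discarded low-level details really don't matter for far-away questions.

```python
# Toy illustration of a "valid abstraction": far away, the gravity of a messy cloud of
# point masses (a "star") is summarized almost perfectly by its total mass and center
# of mass -- the discarded low-level details don't matter for far-away queries.
import numpy as np

rng = np.random.default_rng(0)

def gravity_at(point, positions, masses, G=1.0):
    """Net gravitational acceleration at `point` due to a set of point masses."""
    diffs = positions - point                      # shape (N, 3)
    dists = np.linalg.norm(diffs, axis=1)          # shape (N,)
    return G * np.sum(masses[:, None] * diffs / dists[:, None] ** 3, axis=0)

# Low-level model: 1000 point masses "roiling" inside a region of size ~1.
positions = rng.normal(scale=1.0, size=(1000, 3))
masses = rng.uniform(0.5, 1.5, size=1000)

# High-level summary variables: total mass and center of mass.
total_mass = masses.sum()
center_of_mass = (masses[:, None] * positions).sum(axis=0) / total_mass

# A far-away query: the pull felt by an observer ~1000x further away than the star is wide.
observer = np.array([1000.0, 0.0, 0.0])
exact = gravity_at(observer, positions, masses)
abstract = gravity_at(observer, center_of_mass[None, :], np.array([total_mass]))

print(exact, abstract)  # agree to within a tiny relative error
```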
This is probably a dumb beginner question indicative of not understanding the definition of key terms, but to reveal my ignorance anyway - isn't any company that consistently makes a profit successfully exploiting the market? And if it is, why do we say that markets are inexploitable, if they're built on the existence of countless actors exploiting them?