All of Kaj_Sotala's Comments + Replies

Just based on a loose qualitative understanding of coherence arguments, one might think that the inexploitability (i.e. efficiency) of markets implies that they maximize a utility function.

This is probably a dumb beginner question indicative of not understanding the definition of key terms, but to reveal my ignorance anyway - isn't any company that consistently makes a profit successfully exploiting the market? And if it is, why do we say that markets are inexploitable, if they're built on the existence of countless actors exploiting them?

Two answers here. First, the standard economics answer: economic profit ≠ accounting profit. Economic profit is how much better a company does than their opportunity cost; accounting profit is revenue minus expenses. A trading firm packed with top-notch physicists, mathematicians, and programmers can make enormous accounting profit and yet still make zero economic profit, because the opportunity costs for such people are quite high. "Efficient markets" means zero economic profits, not zero accounting profits.

Second answer: as Zvi is fond of pointing out, the efficient market hypothesis is false (even after accounting for the distinction between economic and accounting profit). For instance, Renaissance - a real trading firm packed with top-notch physicists, mathematicians, and programmers - in fact makes far more money than the opportunity cost of its employees and capital. That said, market efficiency is still a very good approximation for a lot of purposes, and I'd be very curious to know whether selection pressures have already induced the trades which would make markets approximately aggregate into a utility maximizer.
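
To make the economic-vs-accounting distinction concrete, here is a minimal Python sketch. The figures are invented purely for illustration, and `opportunity_cost` is an assumed input standing for the value of the best alternative use of the firm's people and capital that is not already counted in `expenses` (what economists usually call implicit costs). Nothing here reflects real data about any firm.

```python
def accounting_profit(revenue: float, expenses: float) -> float:
    """Revenue minus expenses, as defined in the comment above."""
    return revenue - expenses


def economic_profit(revenue: float, expenses: float, opportunity_cost: float) -> float:
    """How much better the firm does than its opportunity cost.

    `opportunity_cost` is an assumption of this sketch: the forgone value of
    the best alternative use of the firm's people and capital, beyond what is
    already counted in `expenses`.
    """
    return accounting_profit(revenue, expenses) - opportunity_cost


# Hypothetical trading firm (all numbers invented, in millions).
revenue = 100
expenses = 60
opportunity_cost = 40

print(accounting_profit(revenue, expenses))                   # 40
print(economic_profit(revenue, expenses, opportunity_cost))   # 0
```

In this toy case the firm shows a large accounting profit while earning zero economic profit, which is the situation "efficient markets" is meant to describe.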

Some notable/famous signatories that I noted: Geoffrey Hinton, Yoshua Bengio, Demis Hassabis (DeepMind CEO), Sam Altman (OpenAI CEO), Dario Amodei (Anthropic CEO), Stuart Russell, Peter Norvig, Eric Horvitz (Chief Scientific Officer at Microsoft), David Chalmers, Daniel Dennett, Bruce Schneier, Andy Clark (the guy who wrote Surfing Uncertainty), Emad Mostaque (Stability AI CEO), Lex Fridman, Sam Harris.

Edited to add: a more detailed listing from this post:

Signatories include notable philosophers, ethicists, legal scholars, economists, physicists, politica

... (read more)

But then, "the self-alignment problem" would likewise make it sound like it's about how you need to align yourself with yourself. And while it is the case that increased self-alignment is generally very good and that not being self-aligned causes problems for the person in question, that's not actually the problem the post is talking about.

I don't know how you would describe "true niceness", but I think it's neither of the above.

Agreed. I think "true niceness" is something like, act to maximize people's preferences, while also taking into account the fact that people often have a preference for their preferences to continue evolving and to resolve any of their preferences that are mutually contradictory in a painful way.

Niceness is natural for agents of similar strengths because lots of values point towards the same "nice" behavior. But when you're much more powerful than anyone else, the ta

... (read more)

Great post!

When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt

I'm very confused by the frequent use of "GPT-4", and am failing to figure out whether this is actually meant to read GPT-2 or GPT-3, whether there's some narrative device where this is a post written at some future date when GPT-4 has actually been released (but that wouldn't match "when LLMs first appeared"), or what's going on.

Thanks, this seems like a nice breakdown of issues!

If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algorithmic description. Maybe the difference is that you’re imagining that people are going to hand-write source code that has a labeled “this is an empathetic simulation” variable, and a “my preferences are being satisfied” variable? Because I

... (read more)
2 Steve Byrnes 7mo
An AI that sees human language will certainly learn the human concept “human flourishing”, since after all it needs to understand what humans mean when they utter that specific pair of words. So then you can go into the AI and put super-positive valence on (whatever neural activations are associated with “human flourishing”). And bam, now the AI thinks that the concept “human flourishing” is really great, and if we’re lucky / skillful then the AI will try to actualize that concept in the world. There are a lot of unsolved problems and things that could go wrong with that (further discussion here), but I think something like that is not entirely implausible as a long-term alignment research vision. I guess the anthropomorphic analog would be: try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says to you: “You know what’s friggin awesome? The moon. I just love it. The moon is the best.” You stand there with your mouth agape. “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× moreso. Maybe you’re on some psychedelic at the time, or whatever.) How would that event change your motivations? Well, you’re probably going to spend a lot more time gazing at the moon when it’s in the sky. You’re probably going to be much more enthusiastic about anything associated with the moon. If there are moon trading cards, maybe you would collect them. If NASA is taking volunteers to train as astronauts for a lunar exploration mission, maybe you would be first in line. If a supervillain is planning to blow up the moon, you’ll probably be extremely opposed to that. Now by the same token, imagine we do that kind of thing for an extremely powerful AGI and the concept of “human flourishing”. What actions will this AGI then take? Umm, I don’t know really. It seems very hard to predict. But it seems to me that there would be a decent chance that its actions would be goo

A sufficiently advanced AGI familiar with humans will have a clear concept of “not killing everyone” (or more specifically, “what humans mean when they say the words ‘not killing everyone’”). We just add a bit of an extra component that makes the AGI intrinsically value that concept. This implies that capabilities progress may be closely linked to alignment progress.

Some major differences off the top of my head:

  • Picking out a specific concept such as "not killing everyone" and making the AGI specifically value that seems hard. I assume that the AGI would ha
... (read more)

Thanks! Hmm, I think we’re mixing up lots of different issues:

  1. Is installing a PF motivation into an AGI straightforward, based on what we know today?

I say “no”. Or at least, I don’t currently know how you would do that, see here. (I think about it a lot; ask me again next year. :) )

If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algo... (read more)

An observation: it feels slightly stressful to have posted this. I have a mental simulation telling me that there are social forces around here that consider it morally wrong or an act of defection to suggest that alignment might be relatively easy, like it implied that I wasn't taking the topic seriously enough or something. I don't know how accurate that is, but that's the vibe that my simulators are (maybe mistakenly) picking up.

Forget about what the social consensus is. If you have technical understanding of current AIs, do you truly believe there are any major obstacles left? The kind of problems that AGI companies could reliably not tear down with their resources? If you do, state so in the comments, but please do not state what those obstacles are.

I guess the reasoning behind the "do not state" request is something like "making potential AGI developers more aware of those obstacles is going to direct more resources into solving those obstacles". But if someone is trying t... (read more)

At least ChatGPT seems to have a longer context window, this experiment suggesting 8192 tokens.

And yeah, I do think the thing I am most worried about with Chat-GPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off.

Maybe - but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. ... (read more)

I was a bit surprised to see Eliezer invoke the Wason Selection Task. I'll admit that I haven't actually thought this through rigorously, but my sense was that modern machine learning had basically disproven the evpsych argument that those experimental results require the existence of a separate cheating-detection module. As well as generally calling the whole massive modularity thesis into severe question, since the kinds of results that evpsych used to explain using dedicated innate modules now look a lot more like something that could be produced with s... (read more)

No worries!

That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness.

Yeah, I did drift towards more generic terms like "subsystems" or "parts" later in the series for this reason, and might have changed the name of the sequence if only I'd managed to think of something better. (Terms like "subagents" and "multi-agent models of mind" still gesture away from rational agent models in a way that more generic terms like "subsystems" don't.)

Unlike in subagent models, the subcomponents of agents are not themselves always well modeled as (relatively) rational agents. For example, there might be shards that are inactive most of the time and only activate in a few situations.

For what it's worth, at least in my conception of subagent models, there can also be subagents that are inactive most of the time and only activate in a few situations. That's probably the case for most of a person's subagents, though of course "subagent" isn't necessarily a concept that cuts reality at the joints, so this depends o... (read more)

3 Lawrence Chan 9mo
Thanks for the clarification! I agree that your model of subagents in the two posts share a lot of commonalities with parts of Shard Theory, and I should've done a lit review of your subagent posts. (I based my understanding of subagent models on some of the AI Safety formalisms I've seen as well as John Wentworth's Why Subagents?.) My bad.  That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness. I would've classified your work as closer to Shard Theory than the subagent models I normally think about. 

I think Anders Sandberg did research on this at one point, and I recall him summarizing his findings as "things are easy to ban as long as nobody really wants to have them". IIRC, things that went into that category were chemical weapons (they are actually not very effective in modern warfare), CFCs (they were relatively straightforward to replace with equally effective alternatives), and human cloning.

2 Lawrence Chan 1y
This is my impression as well, but it's very possible that we're looking at the wrong reference class (i.e. it's plausible that many "sane" things large governments have done are not salient). Maybe some of the big social welfare/early environmental protection programs?

I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy.

What's your take on "parts work" techniques like IDC, IFS, etc. seeming to bring up something like private (or at least not completely shared) world models? Do you consider the kinds of "parts" those access as being distinct from shards?

I would find it plausible to assume by default that shards have something like differing world models since we know from cognitive psychology that e.g. different ... (read more)

I think Shard Theory is one of the most promising approaches on human values that I've seen on LW, and I'm very happy to see this work posted. (Of course, I'm probably biased in that I also count my own approaches to human values among the most promising and Shard Theory shares a number of similarities with it - e.g. this post talks about something-like-shards issuing mutually competitive bids that get strengthened or weakened depending on how environmental factors activate those shards, and this post talked about values and world-models being learned in an... (read more)

The one big coordination win I recall us having was the 2015 Research Priorities document that among other things talked about the threat of superintelligence. The open letter it was an attachment to was signed by over 8000 people, including prominent AI and ML researchers.

And then there's basically been nothing of equal magnitude since then.

I would presume that the AI would know that humans are likely to try to resist a takeover attempt, and to have various safeguards against it. It might be smart enough to be able to overcome any human response, but that seems to only work if it actually puts that intelligence to work by thinking about what (if anything) it needs to do to counteract the human response. 

More generally, humans are such a major influence on the world as well as a source of potential resources, that it would seem really odd for any superintelligence to naturally arrive at a world-takeover plan without at any point happening to consider how this will affect humanity and whether that suggests any changes to the plan.

That assumes humans are, in fact, likely to meaningfully resist a takeover attempt. My guess is that humans are not likely to meaningfully resist a takeover attempt, and the AI will (implicitly) know that. I mean, if the AI tries to change who's at the top of society's status hierarchy (e.g. the President), then sure, the humans will freak out. But what does an AI care about the status hierarchy? It's not like being at the top of the status hierarchy conveys much real power. It's like your "total horse takeover" thing; what the AI actually wants is to be able to control outcomes at a relatively low level. Humans, by and large, don't even bother to track all those low-level outcomes, they mostly pay attention to purely symbolic status stuff. Now, it is still true that humans are a major influence on the world and source of resources. An AI will very plausibly want to work with the humans, use them in various ways. But that doesn't need to parse to human social instincts as a "takeover".

so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk.

There's also the chance that if these posts are not gated, people who previously weren't plugged into the LW ecosystem but are interested in AI find LW through articles such as this one. And then eventually also start reading other articles here and become more interested in alignment concerns.

There's also a bit of a negative stereotype among some AI researchers of alignment people as being theoreti... (read more)

This, in turn, implies that human values/biases/high-level cognitive observables are produced by relatively simpler hardcoded circuitry, specifying e.g. the learning architecture, the broad reinforcement learning and self-supervised learning systems in the brain, and regional learning hyperparameters.

See also the previous LW discussion of The Brain as a Universal Learning Machine.

... the evolved modularity cluster posits that much of the machinery of human mental algorithms is largely innate. General learning - if it exists at all - exists only in sp

... (read more)

Conversely, the genome can access direct sensory observables, because those observables involve a priori-fixed “neural addresses.” For example, the genome could hardwire a cute-face-detector which hooks up to retinal ganglion cells (which are at genome-predictable addresses), and then this circuit could produce physiological reactions (like the release of reward). This kind of circuit seems totally fine to me.

Related: evolutionary psychology used to have a theory according to which humans had a hardwired fear of some stimuli (e.g. spide... (read more)

A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.

Yes, that was my argument in the comment that I linked. :)

5. Empirical evidence: human intelligence generalised far without staying aligned with its optimisation target.

I think this one is debatable. It seems to me that human intelligence has remained reasonably well aligned with its optimization target, if its optimization target is defined as "being well-fed, getting social status, remaining healthy, having sex, raising children, etc.", i.e. the things that evolution actually could optimize humans for rather than something like inclusive fitness that it couldn't directly optimize for. Yes there are individual h... (read more)

1 Alexander Gietelink Oldenziel 1y
The point isn't about goal misalignment but capability generalisation. It is surprising to some degree that just selecting on reproductive fitness through its proxies of being well-fed, social status, etc., humans have obtained the capability to go to the moon. It points toward a coherent notion & existence of 'general intelligence' as opposed to specific capabilities.
1 Ramana Kumar 1y
I think what you say makes sense, but to be clear the argument does not consider those things as the optimisation target but rather considers fitness or reproductive capacity as the optimisation target. (A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.)

Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior.

I was just thinking about this. The central example that's often used here is "evolution optimized humans for inclusive genetic fitness, nonetheless humans do not try to actually maximize the amount of their surviving offspring, such... (read more)

1 Alexander Gietelink Oldenziel 1y
What about selecting for "moderation in all things"? Is that not virtue? Aristotle invented quantification you heard here first

The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF.

This analogy gets brought out a lot, but has anyone actually spelled it out explicitly? Because it's not clear to me that it holds if you try to explicitly work out the argument. 

In particular, I don't quite understand what it would mean for evolution to optimize the species for fitness, given that fitness is defined as a measure of reproductive success within the species. A genotype has a high fitness, if it... (read more)

5 Rohin Shah 1y
Somewhat relevant comment thread

Agreed, but the black-box experimentation seems like it's plausibly a prerequisite for actual understanding? E.g. you couldn't analyze InceptionV1 or CLIP to understand its inner workings before you actually had those models. To use your car engine metaphor from the other comment, we can't open the engine and stick it full of sensors before we actually have the engine. And now that we do have engines, people are starting to stick them full of sensors, even if most of the work is still focused on building even fancier engines.

It seems reasonable to expect t... (read more)

Best of luck!

2 Stuart Armstrong 2y

I would be happy to see you write a top-level post about this paper. :)

2 Neel Nanda 2y
Thanks! I'm probably not going to have time to write a top-level post myself, but I liked Evan Hubinger's post about it.

I had a pretty strong negative reaction to it. I got the feeling that the post derives much of its rhetorical force from setting up an intentionally stupid character who can be condescended to, and that this is used to sneak in a conclusion that would seem much weaker without that device.

Didn't they train a separate MuZero agent for each game? E.g. the page you link only talks about being able to learn without pre-existing knowledge.

5 Adam Shimi 2y
Actually, I think you're right. I always thought that MuZero was one and the same system for every game, but the Nature paper describes it as an architecture that can be applied to learn different games. I'd like a confirmation from someone who actually studied it more, but it looks like MuZero indeed isn't the same system for each game.

However, I don't think this is the whole explanation. The technological advantage of the conquistadors was not overwhelming.

With regard to the Americas at least, I just happened to read this article by a professional military historian, who characterizes the Native American military technology as being "thousands of years behind their Old World agrarian counterparts", which sounds like the advantage was actually rather overwhelming.

There is a massive amount of literature to explain what is sometimes called ‘the Great Divergence’ (a term I am going to use h

... (read more)
2 Daniel Kokotajlo 2y
I don't really see this as in strong conflict with what I said. I agree that technology is the main factor; I said it was also "strategic and diplomatic cunning;" are you suggesting that it wasn't really that at all and that if Cortez had gifted his equipment to 500 locals they would have been just as successful at taking over as he was? I could be convinced of this I suppose.

Thankfully, there have already been some successes in agent-agnostic thinking about AI x-risk

Also Sotala 2018 mentions the possibility of control over society gradually shifting over to a mutually trading collective of AIs (p. 323-324) as one "takeoff" route, as well as discussing various economic and competitive pressures to shift control over to AI systems and the possibility of a “race to the bottom of human control” where state or business actors [compete] to reduce human control and [increase] the autonomy of their AI systems to obtain an edge over th... (read more)

Oh cool! I put some effort into pursuing a very similar idea earlier:

I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms.

but wasn't sure of how exactly to test it or work on it so I didn't get very far.

One idea that I had for testing i... (read more)

On the role of values: values clearly do play some role in determining which abstractions we use. An alien who observes Earth but does not care about anything on Earth's surface will likely not have a concept of trees, any more than an alien which has not observed Earth at all. Indifference has a similar effect to lack of data. However, I expect that the space of abstractions is (approximately) discrete. A mind may use the tree-concept, or not use the tree-concept, but there is no natural abstraction arbitrarily-close-to-tree-but-not-the-same-as-tree. There is no continuum of tree-like abstractions.

So, under this model, values play a role in determining which abstractions we end up choosing, from the discrete set of available abstractions. But they do not play any role in determining the set of abstractions available. For AI/alignment purposes, this is all we need: as long as the set of natural abstractions is discrete and value-independent, and human concepts are drawn from that set, we can precisely define human concepts without a detailed model of human values.

Also, a mostly-unrelated note on the airplane example: when we're trying to "define" a concept by drawing a bounding box in some space (in this case, a literal bounding box in physical space), it is almost always the case that the bounding box will not actually correspond to the natural abstraction. This is basically the same idea as the cluster structure of thingspace and rubes vs bleggs. (Indeed, Bayesian clustering is directly interpretable as abstraction discovery: the cluster-statistics are the abstract summaries, and they induce conditional independence between the points in each cluster.) So I would interpret the airplane example (and most similar examples in the legal system) not as a change in a natural concept, but rather as humans being bad at formally defining their natural concepts, and needing to update their definitions as new situations crop up. The definitions are not the natural conc

Does any military use meditation as part of its training? 

Yes, e.g.

This [2019] winter, Army infantry soldiers at Schofield Barracks in Hawaii began using mindfulness to improve shooting skills — for instance, focusing on when to pull the trigger amid chaos to avoid unnecessary civilian harm.

The British Royal Navy has given mindfulness training to officers, and military leaders are rolling it out in the Army and Royal Air Force for some officers and enlisted soldiers. The New Zealand Defence Force recently adopted the technique, and military forces o

... (read more)
4 Daniel Kokotajlo 2y
Hmmm, if this is the most it's been done, then that counts as a No in my book. I was thinking something like "Ah yes, the Viet Cong did this for most of the war, and it's now standard in both the Vietnamese and Chinese armies." Or at least "Some military somewhere has officially decided that this is a good idea and they've rolled it out across a large portion of their force."

Appreciate this post! I had seen the good regulator theorem referenced every now and then, but wasn't sure what exactly the relevant claims were, and wouldn't have known how to go through the original proof myself. This is helpful.

(E.g. the result was cited by Frith & Metzinger as part of their argument that, as an agent seeks to avoid being punished by society, this constitutes an attempt to regulate society's behavior; and for the regulation to be successful, the agent needs to internalize a model of the society's preferences, which once internalized be... (read more)

IMO, a textbook would either overlook big chunks of the field or look more like an enumeration of approaches than a unified resource.

Textbooks that cover a number of different approaches without taking a position on which one is the best are pretty much the standard in many fields. (I recall struggling with it in some undergraduate psychology courses, as previous schooling didn't prepare me for a textbook that would cover three mutually exclusive theories and present compelling evidence in favor of each. Before moving on and presenting three mutually exclusive theories about some other phenomenon on the very next page.)

1 Adam Shimi 3y
Fair enough. I think my real issue with an AI Alignment textbook is that for me a textbook presents relatively foundational and well established ideas and theories (maybe multiple ones), whereas I feel that AI Alignment is basically only state-of-the-art exploration, and that we have very few things that should actually be put into a textbook right now. But I could change my mind if you have an example of what should be included in such an AI Alignment textbook.

The "many decisions can be thought of as a committee requiring unanimous agreement" model felt intuitively right to me, and afterwards I've observed myself behaving in ways which seem compatible with it, and thought of this post.
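
A toy Python sketch of one reading of that committee model follows. It is an illustration under stated assumptions, not the post's own formalism: each committee member is a hypothetical scoring function over options, the option fields (`hours_of_sleep`, `tasks_done`) are invented, and "unanimous agreement" is taken to mean that no member scores the proposal below the status quo.

```python
from typing import Callable, Dict, Sequence

Option = Dict[str, float]
Member = Callable[[Option], float]


def committee_accepts(current: Option, proposed: Option, members: Sequence[Member]) -> bool:
    """Return True only if every member weakly prefers `proposed` to `current`."""
    return all(member(proposed) >= member(current) for member in members)


# Two hypothetical members, each caring about a single feature of an option.
def rest_member(option: Option) -> float:
    return option["hours_of_sleep"]


def work_member(option: Option) -> float:
    return option["tasks_done"]


members = [rest_member, work_member]
status_quo = {"hours_of_sleep": 8.0, "tasks_done": 3.0}
stay_up_late = {"hours_of_sleep": 5.0, "tasks_done": 6.0}
work_smarter = {"hours_of_sleep": 8.0, "tasks_done": 4.0}

print(committee_accepts(status_quo, stay_up_late, members))  # False: the rest member vetoes
print(committee_accepts(status_quo, work_smarter, members))  # True: no member is worse off
```

Under this reading, a change that one member strongly wants can still be blocked by a single veto, which is the kind of behavior the comment describes noticing.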

You probably know of these already, but just in case: lukeprog wrote a couple of articles on the history of AI risk thought [1, 2] going back to 1863. There's also the recent AI ethics article in the Stanford Encyclopedia of Philosophy.

I'd also like to imagine that my paper on superintelligence and astronomical suffering might say something that someone might consider important, but that is of course a subjective question. :-)

3 Stuart Armstrong 3y

because who's talking about medium-size risks from AGI?

Well, I have talked about them... :-)

The capability claim is often formulated as the possibility of an AI achieving a decisive strategic advantage (DSA). While the notion of a DSA has been implicit in many previous works, the concept was first explicitly defined by Bostrom (2014, p. 78) as “a level of technological and other advantages sufficient to enable [an AI] to achieve complete world domination.”

However, assuming that an AI will achieve a DSA seems like an unnecessarily strong form of the capabil

... (read more)

I read you to be asking "what decision theory is implied by predictive processing as it's implemented in human brains". It's my understanding that while there are attempts to derive something like a "decision theory formulated entirely in PP terms", there are also serious arguments for the brain actually having systems that are just conventional decision theories and not directly derivable from PP.

Let's say you try, as some PP theorists do, to explain all behavior as free energy minimization as opposed to expected utility maximization. Ransom et al. (2020)... (read more)

I generally agree with this. Specifically, I tend to imagine that PP is trying to make our behavior match a model in which we behave like an agent (at least sometimes). Thus, for instance, the tendency for humans to do things which "look like" or "feel like" optimizing for X without actually optimizing for X. In that case, PP would be consistent with many decision theories, depending on the decision theory used by the model it's trying to match.

It sounds a bit absurd: you've already implemented a sophisticated RL algorithm, which keeps track of value estimates for states and actions, and propagates these value estimates to steer actions toward future value. Why would the learning process re-implement a scheme like that, nested inside of the one you implemented? Why wouldn't it just focus on filling in the values accurately?

I've thought of two possible reasons so far.

  1. Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implemen
... (read more)
I'm a little confused as to why there's any question here. Every algorithm lies on a spectrum of tradeoffs from general to narrow. The narrower a class of solved problems, the more efficient (in any way you care to name) an algorithm can be: a Tic-Tac-Toe solver is going to be a lot more efficient than AIXI. Meta-learning works because the inner algorithm can be far more specialized, and thus, more performant or sample-efficient than the highly general outer algorithm which learned the inner algorithm. For example, in Dactyl, PPO trains an RNN to adapt to many possible robot hands on the fly in as few samples as possible; it's probably several orders of magnitude faster than online training of an RNN by PPO directly. "Why not just use that RNN for DoTA2, if it's so much better than PPO?" Well, because DoTA2 has little or nothing to do with robotic hands rotating cubes, an algorithm that excels at robot hands will not transfer to DoTA2. PPO will still work, though.

Good point, I wasn't thinking of social effects changing the incentive landscape.

Or e.g. that it always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don't think you see these things, and I'm interested in figuring out how evolution prevented them.

As I understand it, Wang et al. found that their experimental setup trained an internal RL algorithm that was more specialized for this particular task, but was still optimizing for the same task that the RNN was being trained on? And it was selected exactly because it did that very goal better. If the circumstances... (read more)

5 · Abram Demski · 3y
It seems possible to me. A common strategy in religious groups is to steer for a wide barrier between them and particular temptations. This could be seen as a strategy for avoiding DA signals which would de-select for the behaviors encouraged by the religious group: no rewards are coming in for alternate behavior, so the best the DA can do is reinforce the types of reward which the PFC has restricted itself to. This can be supplemented with modest rewards for desired behaviors, which force the DA to reinforce the inner optimizer's desired behaviors. Although this is easier in a community which supports the behaviors, it's entirely possible to do this to oneself in relative isolation, as well.
That said, if mesa-optimization is a standard feature[4] of brain architecture, it seems notable that humans don’t regularly experience catastrophic inner alignment failures.

What would it look like if they did?

The thing I meant by "catastrophic" is just "leading to death of the organism." I suspect mesa-optimization is common in humans, but I don't feel confident about this, nor that this is a joint-carvey ontology. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just "everyday personal misalignment"-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomena which might reasonably be described as inner alignment failures.

But I think these things don't kill people very... (read more)

I don't feel at all tempted to do that anthropomorphization, and I think it's weird that EY is acting as if this is a reasonable thing to do.

"It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart" seems obviously incorrect if it's explicitly phrased that way, but e.g. the "Giving GPT-3 a Turing Test" post seems to implicitly assume something like it:

This gives us a hint for how to stump the AI more consistently. We need to ask questions that no normal human would ever talk about.

Q: How m
... (read more)
If some circuit in the brain is doing something useful, then it's humanly feasible to understand what that thing is and why it's useful, and to write our own CPU code that does the same useful thing.
In other words, the brain's implementation of that thing can be super-complicated, but the input-output relation cannot be that complicated—at least, the useful part of the input-output relation cannot be that complicated.

Robin Hanson makes a similar argument in "Signal Processors Decouple":

The bottom line is that to emulate a bi
... (read more)

Ah, looking at earlier posts in your sequence, I realize that you are defining abstraction in such a way that the properties of the high-level abstraction imply information about the low-level details - something like "a class XYZ star (high-level classification) has an average mass of 3.5 suns (low-level detail)".

That explains my confusion since I forgot that you were using this definition, and was thinking of "abstraction" as it was defined in my computer science classes, where an abstraction was something that explicitly discarded all the information about the low-level details.

Or, to put it differently: an abstraction is “valid” when components of the low-level model are independent given the values of high-level variables. The roiling of plasmas within far-apart stars is (approximately) independent given the total mass, momentum, and center-of-mass position of each star. As long as this condition holds, we can use the abstract model to correctly answer questions about things far away/far apart. [...]
So if we draw a box around just one of the two instances, then those low-level calculation details we threw out w
... (read more)
Two issues here.

First, in this case, we're asking about one specific query: the trajectory of the sun. In general, abstraction is about validity of some class of queries - typically most-or-all queries which can be formulated within the high-level model. So validity of the star-abstraction would depend not just on the validity of the calculation of the sun's trajectory, but also the validity of the other stars' trajectories, and possibly other things depending on what else is in the model. (Of course, we can choose just one query as our class, but then the abstraction won't be very useful; we might as well just directly compute the query's answer instead.) For a star-abstraction, we expect the abstraction to be valid for any queries on kinematics of the stars, as long as the stars are far apart. Queries we don't expect to be valid include e.g. kinematics in situations where stars get close together and are torn apart by tidal forces.

Second, it's generally ok if the low-level structures happen to be identical, just as two independent die rolls will sometimes come out the same. The abstraction breaks down when the low-level structures are systematically correlated, given whatever information we have about them.

What does "break down" mean here? Well, I have some low-level data about one star, and I want to calculate what that tells me about another star. Using the abstraction, I can do that in three steps:

* Use my low-level data on star 1 to compute what I can about the high-level variables of star 1 (low-level 1 -> high-level 1)
* Use the high-level model to compute predictions about high-level variables of star 2 from those of star 1 (high-level 1 -> high-level 2)
* Update my low-level information on star 2 to account for the new high-level information (high-level 2 -> low-level 2)

... and that should be equivalent to directly computing what the low-level of star 1 tells me about the low-level of star 2 (low-level 1 -> low-level 2). (It's a comm
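The three-step check described above can be tried numerically on a toy discrete model. This is a sketch under made-up assumptions, not the author's formalism: each "star" has a low-level state in {0, 1, 2, 3}, its high-level variable is `x // 2`, and the probability tables are hypothetical. The code compares the direct computation P(x2 | x1) with the route through the high-level variables, once for a joint distribution where the low-level states are independent given the high-level variables and once where they are systematically correlated.

```python
from itertools import product

LOW = range(4)  # hypothetical low-level states of one "star"

def high(x):
    """Hypothetical high-level summary of a low-level state."""
    return x // 2

def make_valid_joint():
    """Joint P(x1, x2) where x1 and x2 are independent given (h1, h2)."""
    p_high = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    p_low = {0: {0: 0.25, 1: 0.75}, 1: {2: 0.5, 3: 0.5}}  # P(x | h)
    return {
        (x1, x2): p_high[(high(x1), high(x2))]
        * p_low[high(x1)][x1] * p_low[high(x2)][x2]
        for x1, x2 in product(LOW, LOW)
    }

def direct(joint, x1):
    """low-level 1 -> low-level 2, computed straight from the joint."""
    total = sum(joint[(x1, b)] for b in LOW)
    return {x2: joint[(x1, x2)] / total for x2 in LOW}

def via_abstraction(joint, x1):
    """low-level 1 -> high-level 1 -> high-level 2 -> low-level 2."""
    h1 = high(x1)                                        # step 1
    p_h2 = {h2: sum(p for (a, b), p in joint.items()
                    if high(a) == h1 and high(b) == h2)
            for h2 in (0, 1)}
    z = sum(p_h2.values())                               # step 2: P(h2 | h1)
    p_x2 = {x2: sum(joint[(a, x2)] for a in LOW) for x2 in LOW}
    out = {}
    for x2 in LOW:                                       # step 3: P(x2 | h2)
        mass_h2 = sum(p_x2[b] for b in LOW if high(b) == high(x2))
        out[x2] = (p_h2[high(x2)] / z) * (p_x2[x2] / mass_h2)
    return out

valid = make_valid_joint()
# Systematically correlated low-level states: the two stars always match exactly.
correlated = {(a, b): (0.25 if a == b else 0.0) for a, b in product(LOW, LOW)}

def max_gap(joint):
    return max(abs(direct(joint, a)[b] - via_abstraction(joint, a)[b])
               for a, b in product(LOW, LOW))

print(max_gap(valid))       # the two routes agree up to rounding
print(max_gap(correlated))  # the two routes disagree: the abstraction breaks down
```

In the correlated case the high-level route only learns which bucket star 2 is in, while the direct route pins down its exact state, so the two answers differ.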