
Recent Discussion

(Colab notebook here.)

This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla.[1]

The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion.  In particular:

  • Data, not size, is the currently active constraint on language modeling performance.  Current returns to additional data are immense, and current returns to additional model size are minuscule; indeed, most recent landmark models are wastefully big.
    • If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models.
    • If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would
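For readers who want to check the claim numerically: the Chinchilla paper's parametric loss fit (Approach 3 coefficients, reproduced here from the published paper, not from this post's notebook) makes the data-vs-size asymmetry easy to verify. A minimal sketch:

```python
def chinchilla_loss(n_params, n_tokens):
    """Hoffmann et al. (2022) parametric fit, Approach 3:
    L(N, D) = E + A / N**alpha + B / D**beta."""
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Gopher-scale vs Chinchilla-scale (roughly equal training compute):
gopher = chinchilla_loss(280e9, 300e9)       # big model, less data
chinchilla = chinchilla_loss(70e9, 1.4e12)   # 4x smaller model, ~4.7x more data
```

With these fitted constants, the smaller model trained on more tokens reaches lower predicted loss at a comparable compute budget, which is exactly the post's point about data being the binding constraint.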
Raymond Arnold · 8h · 4 points
Something I'm unsure about (commenting from my mod-perspective but not making a mod pronouncement) is how LW should relate to posts that lay out ideas that may advance AI capabilities. My current understanding is that all major AI labs have already figured out the Chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and was shared heavily around Twitter. It's plausible to me that spreading these arguments speeds up AI timelines by 1-4 weeks on average. It seems important to be able to talk about that and model the world, but I'm wondering if posts like this should live behind a "need to log in" filter, maybe with a slight karma-gate, so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk. nostalgebraist, I'm curious how you would feel about that.

> so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk.

There's also the chance that if these posts are not gated, people who previously weren't plugged into the LW ecosystem but are interested in AI find LW through articles such as this one. And then eventually also start reading other articles here and become more interested in alignment concerns.

There's also a bit of a negative stereotype among some AI researchers of alignment people being theoreti... (read more)

This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, and Libsyn.

TL;DR: To even consciously consider an alignment research direction, you should have evidence to locate it as a promising lead. As best I can tell, many directions seem interesting but do not have strong evidence of being “entangled” with the alignment problem such that I expect them to yield significant insights. 

For example, “we can solve an easier version of the alignment problem by first figuring out how to build an AI which maximizes the number of real-world diamonds” has intuitive appeal and plausibility, but this claim doesn’t have to be true and this problem does not necessarily have a natural, compact solution. In contrast, there do in fact exist humans...

Curated. I'm not sure I endorse all the specific examples, but the general principles make sense to me as considerations to help guide alignment research directions.

This was written as part of the first Refine blog post day. Thanks to Chin Ze Shen, Tamsin Leake, Paul Bricman, and Adam Shimi for comments.

Magic agentic fluid/force

Somewhere in my brain there is some sort of physical encoding of my values. This encoding could be spread out over the entire brain, or it could be implicit somehow. I'm not making any claim about how values are implemented in a brain, just that the information is somehow in there.

Somewhere in the future, a superintelligent AI is going to take some action.

If we solve alignment, then there will be some causal link between the values in my head (or some human head) and the action of that AI. In some way, whatever the AI does, it should do it because that...

You can't zoom infinitely far in on the causal chain between values and actions, because values (and to a large extent actions) are abstractions that we use when modeling agents like ourselves. They are emergent. To talk about my values at all is to use a model of me where I use my models in a certain agenty way and you don't sweat the details too hard.

Vladimir Nesov · 1d · 1 point
Not sure if this is an intended meaning, but the claim that values don't depend on the content of the world outside the brain is generally popular (especially in decision theory), and there seems to be no basis for it. Brains are certainly some sort of pointers to value, but a lot (or at least certainly some) of the content of values could be somewhere else, most likely in civilization's culture. This is an important distinction for corrigibility, because this claim is certainly false for a corrigible agent: it instead wants to find the content of its values in the environment; they are not part of its current definition/computation. It also doesn't make sense to talk about this agent pursuing its goals in a diverse set of environments, unless we expect the goals to vary with the environment. For decision theory of such agents, this could be a crucial point. For example, an updateless corrigible agent wouldn't be able to know the goals that it must choose a policy in pursuit of. The mapping from observations to actions that UDT would pick now couldn't be chosen as the most valuable mapping, because the value/goal itself depends on observations, and even after some observations it's not pinned down precisely. So if this point is taken into account, we need a different decision theory, even if it's not trying to do anything fancy with corrigibility or mild optimization, but merely acknowledges that goal content could be located in the environment!
Linda Linsefors · 21h · 1 point
I mean that the information of what I value exists in my brain. Some of this information consists of pointers to things in the real world, so in a sense the information partly exists in the relation/correlation between me and the world. I definitely don't mean that I can only care about my internal brain state. To me that is just obviously wrong, although I have met people who disagree, so I see where the misunderstanding came from.
Vladimir Nesov · 21h · 1 point
That's not what I'm talking about. I'm not talking about what goals are about, I'm talking about where the data to learn what they are is located. There is a particular thing, say a utility function, that is the intended formulation of goals. It could be the case that this intended utility function could be found somewhere in the brain. That doesn't mean that it's a utility function that cares about brains, the questions of where it's found and what it cares about are unrelated. Or it could be the case that it's recorded on an external hard drive, and the brain only contains the name of the drive (this name is a "pointer to value"). It's simply not the case that you can recover this utility function without actually looking at the drive, and only looking at the brain. So utility function u itself depends on environment E, that is there is some method of formulating utility functions t such that u=t(E). This is not the same as saying that utility of environment depends on environment, giving the utility value u(E)=t(E)(E) (there's no typo here). But if it's actually in the brain, and says that hard drives are extremely valuable, then you do get to know what it is without looking at the hard drives, and learn that it values hard drives.
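Nesov's distinction between u = t(E) and u(E) = t(E)(E) can be made concrete with a toy sketch (all names and values here are hypothetical, purely for illustration):

```python
# Toy version of the "external hard drive" point: the *definition* of the
# utility function is stored in the environment, not in the agent.

def t(env):
    """Method of formulating a utility function from an environment:
    read the goal spec off the 'hard drive', then build u from it."""
    goal_key = env["hard_drive"]           # the brain's "pointer to value"
    return lambda e: e.get(goal_key, 0)    # u : environment -> value

env = {"hard_drive": "paperclips", "paperclips": 37, "staples": 5}

u = t(env)       # the utility function u itself depends on E ...
value = u(env)   # ... and u(E) = t(E)(E) evaluates it on that same E

# Looking only at the pointer without the drive, you cannot recover u:
# a different drive yields a different utility function over the same world.
other = {"hard_drive": "staples", "paperclips": 37, "staples": 5}
```

The point the sketch isolates is that `t(env)` and `t(other)` are different functions even over identical world-states, so "what the agent values" cannot be read off without looking at the environment.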

When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there’s a whole discussion about the specific problems with the overcomplicated scheme.

In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I’ll call the strategy “Just Retarget the Search”.

We’ll need to make two assumptions:


One of the main reasons I expect this not to work is that optimization algorithms which are best at optimizing some objective given a fixed compute budget seem like they basically can't be generally retargetable. E.g. if you consider something like Stockfish, it's a combination of search (which is retargetable), sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget Stockfish to "maximize the max number of pawns you ever have", you would not be able to use [specialized for telling whether a mo... (read more)

Rohin Shah · 2d · 8 points
Definitely agree that "Retarget the Search" is an interesting baseline alignment method you should be considering. I like what you call "complicated schemes" over "retarget the search" for two main reasons:

1. They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about).
2. They degrade gracefully with worse interpretability tools, e.g. in debate, even if the debaters can only credibly make claims about whether particular neurons are activated, they can still say stuff like "look, my opponent is thinking about synthesizing pathogens, probably it is hoping to execute a treacherous turn", whereas "Retarget the Search" can't use this weaker interpretability at all. (Depending on background assumptions you might think this doesn't reduce x-risk at all; that could also be a crux.)
Evan R. Murphy · 2d · 3 points
Why do you think we probably won't end up with mesa-optimizers in the systems we care about? Curious about both which systems you think we'll care about (e.g. generative models, RL-based agents, etc.) and why you don't think mesa-optimization is a likely emergent property for very scaled-up ML models.
Rohin Shah · 2d · 3 points
1. It's a very specific claim about how intelligence works, so it gets a low prior, from which I don't update much (because it seems to me we know very little about how intelligence works structurally, and the arguments given in favor seem like relatively weak considerations).
2. Search is computationally inefficient relative to heuristics, and we'll be selecting really hard on computational efficiency (the model can't just consider 10^100 plans and choose the best one when it only has 10^15 flops to work with). It seems very plausible that the model considers, say, 10 plans and chooses the best one, or even 10^6 plans, but then most of the action is in which plans were generated in the first place, and "retarget the search" doesn't necessarily solve your problem.

I'm not thinking much about whether we're considering generative models vs RL-based agents for this particular question (though generally I tend to think about foundation models finetuned from human feedback).
I am very confused by (2). It sounds like you are imagining that search necessarily means brute-force search (i.e. guess-and-check)? Like non-brute-force search is just not a thing? And therefore heuristics are necessarily a qualitatively different thing from search? But I don't think you're young enough to have never seen A* search, so presumably you know that formal heuristic search is a thing, and how to use relaxation to generate heuristics. What exactly do you imagine that the word "search" refers to?
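For concreteness, here is the kind of thing "retargetable heuristic search" refers to: a generic A* where both the goal and its (relaxation-derived) heuristic are swappable parameters. A minimal sketch on a toy grid world, not anyone's proposed implementation:

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Generic A* search: retargeting = swapping out `goal` and `heuristic`."""
    frontier = [(heuristic(start), 0, start, [start])]  # (f, g, node, path)
    best_cost = {}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if best_cost.get(node, float("inf")) <= cost:
            continue
        best_cost[node] = cost
        for nxt, step in neighbors(node):
            heapq.heappush(
                frontier,
                (cost + step + heuristic(nxt), cost + step, nxt, path + [nxt]))
    return None

# A 10x10 grid world; unit-cost moves in the four cardinal directions.
def grid_neighbors(p):
    x, y = p
    return [((x + dx, y + dy), 1)
            for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
            if 0 <= x + dx < 10 and 0 <= y + dy < 10]

def manhattan(goal):
    # Admissible heuristic obtained by relaxing the problem (ignore obstacles).
    return lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])

path_a = a_star((0, 0), (3, 4), grid_neighbors, manhattan((3, 4)))
path_b = a_star((0, 0), (7, 2), grid_neighbors, manhattan((7, 2)))  # retargeted
```

The search machinery is untouched between the two calls; only the objective (and the heuristic derived from it) changed. The crux in the thread is whether learned models contain anything with this clean a separation between search and objective.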
I indeed think those are the relevant cruxes.
Evan R. Murphy · 4d · 5 points
Agree that this looks like a promising approach. People interested in this idea can read some additional discussion in Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs [] from my post, "Interpretability's Alignment-Solving Potential: Analysis of 7 Scenarios". As you mention, having this kind of advanced interpretability essentially solves the inner alignment problem, but leaves a big question mark about outer alignment. In that Scenario 2 link above, I have some discussion of expected impacts from this kind of interpretability on a bunch of different outer alignment and robustness techniques including: Relaxed adversarial training, Intermittent oversight, Imitative amplification, Approval-based amplification, Recursive reward modeling, Debate, Market making, Narrow reward modeling, Multi-agent, Microscope AI, STEM AI and Imitative generalization. [1] (You need to follow the link to the Appendix 1 section about this scenario [] though to get some of these details.)

I'm not totally sure that the ability to reliably detect mesa-optimizers and their goals/optimization targets would automatically grant us the ability to "Just Retarget the Search" on a hot model. It might, but I agree with your section on Problems that it may look more like restarting training on models where we detect a goal that's different from what we want. But this still seems like it could accomplish a lot of what we want from being able to retarget the search on a hot model, even though it's clunkier.

--

[1]: In a lot of these techniques it can make sense to check that the mesa-optimizer is aligned (and do some kind of goal retargeting if it's not). However, in others we pr
Steve Byrnes · 4d · 5 points
It also works in the scenario where human programmers develop a general-purpose (i.e. retargetable) internal search process, i.e. brain-like AGI [] or pretty much any other flavor of model-based RL. You would look for things in the world-model and manually set their "value" (in RL jargon) / "valence" (in psych jargon) to very high or low, or neutral, as the case may be. I'm all for that, and indeed I bring it up with some regularity. My progress towards a plan along those lines (such as it is) is mostly here []. (Maybe it doesn't look that way, but note that "Thought Assessors" (≈ multi-dimensional value function) can be thought of as a specific simple approach to interpretability; see discussion of the motivation-vs-interpretability duality in §9.6 here [].)

Some of the open problems IMO include:

* figuring out what exactly value to paint onto exactly what concepts;
* dealing with concept extrapolation [] when concepts hit edge-cases [concepts can hit edge-cases both because the AGI keeps learning new things, and because the AGI may consider the possibility of executing innovative plans that would take things out of distribution];
* getting safely through the period where the "infant AGI" hasn't yet learned the concepts which we want it to pursue (maybe solvable with a good sandbox []);
* getting the interpretability itself to work well (including the fi

Problem: an overseer won’t see the AI which kills us all thinking about how to kill humans, not because the AI conceals that thought, but because the AI doesn’t think about how to kill humans in the first place. The AI just kills humans as a side effect of whatever else it’s doing.

Analogy: the Hawaii Chaff Flower didn’t go extinct because humans strategized to kill it. It went extinct because humans were building stuff nearby, and weren’t thinking about how to keep the flower alive. They probably weren’t thinking about the flower much at all.

Hawaii Chaff Flower (source)

More generally: how and why do humans drive species to extinction? In some cases the species is hunted to extinction, either because it's a threat or because it's economically profitable to hunt....

Rohin Shah · 1d · 5 points
Forget about "sharp left turn", you must be imagining strapping on a rocket and shooting into space []. (I broadly agree with Buck's take but mostly I'm like "jeez how did this AGI strap on a rocket and shoot into space")

Lol. I don't think the crux here is actually about how powerful we imagine the AI to be (though we probably do have different expectations there). I think the idea in this post applies even to very mildly superhuman AIs. (See this comment for some intuition behind that; the main idea is that I think the ideas in this post kick in even between the high vs low end of the human intelligence spectrum, or between humans with modern technical knowledge vs premodern humans.)

Buck Shlegeris · 2d · 21 points
[writing quickly, sorry for probably being unclear] If the AI isn't thinking about how to deceive the humans who are operating it, it seems to me much less likely that it takes actions that cause it to grab a huge amount of power. The humans don't want to have the AI grab power, and so they'll try in various ways to make it so that they'll notice if the AI is trying to grab power; the most obvious explanation for why the humans would fail at this is that the AI is trying to prevent them from noticing, which requires the AI to think about what the humans will notice. At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the human might be able to succeed at living there, even if the ants are coordinatedly trying to resist them and the humans never proactively try to prevent the ants from resisting them by eg proactively killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the AI is explicitly thinking about how to deceive. You probably don't actually think this, but the OP sort of feels like it's mixing up the claim "the AI won't kill us out of malice, it will kill us because it wants something that we're standing in the way of" (which I mostly agree with) and the claim "the AI won't grab power by doing something specifically optimized for its instrumental goal of grabbing power, it will grab power by doing something else that grabs power as a side effect" (which seems probably false to me).
That's definitely my crux, for purposes of this argument. I think AGI will just be that much more powerful than humans. And I think the bar isn't even very high. I think my intuition here mostly comes from pointing my inner sim at differences within the current human distribution. For instance, if I think about myself in a political policy conflict with a few dozen IQ-85-ish humans... I imagine the IQ-85-ish humans maybe manage to organize a small protest if they're unusually competent, but most of the time they just hold one or two meetings and then fail to actually do anything at all. Whereas my first move would be to go talk to someone in whatever bureaucratic position is most relevant about how they operate day-to-day, read up on the relevant laws and organizational structures, identify the one or two people who I actually need to convince, and then meet with them. Even if the IQ-85 group manages their best-case outcome (i.e. organize a small protest), I probably just completely ignore them because the one or two bureaucrats I actually need to convince are also not paying any attention to their small protest (which probably isn't even in a place where the actually-relevant bureaucrats would see it, because the IQ-85-ish humans have no idea who the relevant bureaucrats are). And those IQ-85-ish humans do seem like a pretty good analogy for humanity right now with respect to AGI. Most of the time the humans just fail to do anything effective at all about the AGI; the AGI has little reason to pay attention to them.
Buck Shlegeris · 2d · 5 points
What do you imagine happening if humans ask the AI questions like the following:

* Are you an unaligned AI?
* If we let you keep running, are you (or some other AI) going to end up disempowering us?
* If we take the action you just proposed, will we be happy with the outcomes?

I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it's powerful enough to kill us all as a side effect of its god tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And so if it answers them incorrectly, it was probably on purpose.

Maybe you think that the AI will say "yes, I'm an unaligned AI". In that case I'd suggest asking the AI the question "What do you think we should do in order to produce an AI that won't disempower us?" I think that the AI is pretty likely to be able to answer this question correctly (including possibly saying things like "idk man, turn me off and work on alignment for a while more before doing capabilities"). I think that AI labs, governments, etc. would be enormously more inclined to slow down AI development if the AI literally was telling us "oh yeah I am definitely a paperclipper, definitely you're gonna get clipped if you don't turn me off, you should definitely do that".

Maybe the crux here is whether the AI will have a calibrated guess about whether it's misaligned or not?
The first thing I imagine is that nobody asks those questions. But let's set that aside. The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned or whether they'll be happy with the outcome or whatever. (It's essentially the same mistake as a GOFAI person looking at a node in some causal graph that says "will_kill_humans", and seeing that node set to 99% False, and thinking that somehow implies the GOFAI will not kill humans.) Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a human asks "Are you an unaligned AI?" or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it's an unaligned AI. The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it's an unaligned AI, and the humans interpret that as an answer to their original question, ther
William Saunders · 2d · 2 points
If we can't get the AI to answer something like "If we take the action you just proposed, will we be happy with the outcomes?", why can we get it to also answer the question of "how do you design a fusion power generator?" to get a fusion power generator that does anything reliably in the world (including having consequences that kill us), rather than just getting out something that looks to us like a plan for a fusion generator but doesn't actually work?
Buck Shlegeris · 2d · 6 points
I disagree fwiw.

I agree. This is where I think the meat of the question lies; I overall disagree and think that the model does have to be thinking about deception in order to be dangerous while also performing well on the tasks we might train it on (e.g. "answer questions well, as judged by some human labeler"). I don't have time to say much about what I think is going on here right now; I might come back later.
Seems like there are multiple possibilities here:

* (1) The AI does something that will, as an intended consequence, result in human extinction, because this is instrumental to preventing shutdown, etc. It attempts to circumvent our interpretability, oversight, etc. This is the typical deceptive alignment setting which is attempted to be addressed by myopia, interpretability, oversight, etc.
* (2) The AI does something that will, as an unintended side consequence, result in human extinction. The AI also realizes that this is a consequence of its actions but doesn't really care. (This is within the "without ever explicitly thinking about the fact that humans are resisting it" scenario.) This is isomorphic to ELK.
  * If we can solve ELK, we can get the AI to tell us whether it thinks its plan will actually result in human extinction. This is the "oh yeah I am definitely a paperclipper" scenario.
  * Also, if it has a model of the humans using ELK to determine whether to shut down the AI, the fact that it knows we will shut it off after we find out the consequences of its plan will incentivize it to either figure out how to implement plans that it itself cannot see will lead to human extinction (third scenario), or try to subvert our ability to turn it off after we learn of the consequences (first scenario).
  * If we can't solve ELK, we can get the AI to tell us something that doesn't really correspond to the actual internal knowledge inside the model. This is the "yup, it's just thinking about what text typically follows this question" scenario.
* (3) The AI does something that will, as an unintended side consequence, result in human extinction. The AI does not realize this is a consequence of its actions, so solving ELK doesn't help us here. Failures of this type fall on a spectrum of how unforeseeable the consequences

This post is the second in what is likely to become a series of uncharitable rants about alignment proposals (previously: Godzilla Strategies). In general, these posts are intended to convey my underlying intuitions. They are not intended to convey my all-things-considered, reflectively-endorsed opinions. In particular, my all-things-considered reflectively-endorsed opinions are usually more kind. But I think it is valuable to make the underlying, not-particularly-kind intuitions publicly-visible, so people can debate underlying generators directly. I apologize in advance to all the people I insult in the process.

With that in mind, let's talk about problem factorization (a.k.a. task decomposition).


It all started with HCH, a.k.a. The Infinite Bureaucracy.

The idea of The Infinite Bureaucracy is that a human (or, in practice, human-mimicking AI) is given a problem. They only have a...

If anyone has questions for Ought specifically, we're happy to answer them as part of our AMA on Tuesday [].
Rohin Shah · 7d · 5 points
One more disanalogy:

4. The rest of the world pays attention to large or powerful real-world bureaucracies and forces rules on them that small teams / individuals can ignore (e.g. Secret Congress [], Copenhagen interpretation of ethics [], startups being able to do illegal stuff []), but this presumably won't apply to alignment approaches.

One other thing I should have mentioned is that I do think the "unconscious economics" point is relevant and could end up being a major problem for problem factorization, but I don't think we have great real-world evidence suggesting that unconscious economics by itself is enough to make teams of agents not be worthwhile.

Re disanalogy 1: I'm not entirely sure I understand what your objection is here but I'll try responding anyway. I'm imagining that the base agent is an AI system that is pursuing a desired task with roughly human-level competence, not something that acts the way a whole-brain emulation in a realistic environment would act. This base agent can be trained by imitation learning, where you have the AI system mimic human demonstrations of the task, or by reinforcement learning on a reward model trained off of human preferences, but (we hope) is just trying to do the task and doesn't have all the other human wants and desires. (Yes, this leaves a question of how you get that in the first place; personally I think that this distillation is the "hard part", but that seems separate from the bureaucracy point.) Even if you did get a bureaucracy made out of agents with human desires, it still seems like you get a lot of benefit from the fact that the agents are identical to each other, and so have less politics.

Re disanalogy 3: I agree that you have to think that a small / medium / large bureaucracy of Alices-with-15
I think a lot of alignment tax-imposing interventions (like requiring local work to be transparent for process-based feedback) could be analogous?
I was mostly thinking of the unconscious economics stuff. I should have asked for a mental picture sooner, this is very useful to know. Thanks. If I imagine a bunch of Johns, I think that they basically do fine, though mainly because they just don't end up using very many Johns. I do think a small team of Johns would do way better than I do.
Wei Dai · 8d · 8 points
I'm still not getting a good picture of what your thinking is on this. Seems like the inferential gap is wider than you're expecting? Can you go into more details, and maybe include an example? My intuition around (1) being important mostly comes from studying things like industrial organization [] and theory of the firm []. If you look at the economic theories (mostly based on game theory today) that try to explain why economies are organized the way they are, and where market inefficiencies come from, they all have a fundamental dependence on the assumption of different participants having different interests/values. In other words, if you removed that assumption from the theoretical models and replaced it with the opposite assumption, they would collapse in the sense that all or most of the inefficiencies ("transaction costs") would go away and it would become very puzzling why, for example, there are large hierarchical firms instead of everyone being independent contractors who just buy and sell their labor/services on the open market, or why monopolies are bad (i.e., cause "deadweight loss" in the economy). I still have some uncertainty that maybe these ivory tower theories/economists are wrong, and you're actually right about (1) not being that important, but I'd need some more explanations/arguments in that direction for it to be more than a small doubt at this point.
Oh that's really interesting. I did a dive into theory of the firm research a couple years ago (mainly interested in applying it to alignment and subagent models) and came out with totally different takeaways. My takeaway was that the difficulty of credit assignment is a major limiting factor (and in particular this led to thinking about Incentive Design with Imperfect Credit Assignment [], which in turn led to my current formulation of the Pointers Problem []).

Now, the way economists usually model credit assignment is in terms of incentives, which theoretically aren't necessary if all the agents share a goal. On the other hand, looking at how groups work in practice, I expect that the informational role of credit assignment is actually the load-bearing part at least as much as (if not more than) the incentive-alignment role. For instance, a price mechanism doesn't just align incentives, it provides information for efficient production decisions, such that it still makes sense to use a price mechanism even if everyone shares a single goal. If the agents share a common goal, then in theory there doesn't need to be a price mechanism, but a price mechanism sure is an efficient way to internally allocate resources in practice.

... and now that I'm thinking about it, there's a notable gap in economic theory here: the economists are using agents with different goals to motivate price mechanisms (and credit allocation more generally), even though the phenomenon does not seem like it should require different goals.

Memetics example: in the vanilla HCH tree, some agent way down the tree ignores their original task and returns an answer which says "the top-level question asker urgently needs to know X!" followed by some argument. And that sort of argument,
4Wei Dai6d
With existing human institutions, a big part of the problem has to be that every participant has an incentive to distort the credit assignment (i.e., cause more credit to be assigned to oneself). (This is what I conclude from economic theory and also fits with my experience and common sense.) It may well be the case that even if you removed this issue, credit assignment would still be a major problem for things like HCH, but how can you know this from empirical experience with real-world human institutions (which you emphasize in the OP)? If you know of some theory/math/model that says that credit assignment would be a big problem with HCH, why not talk about that instead?
4Ben Pace19h
Wei Dai says:

I'm going to jump in briefly to respond on one line of reasoning. John says the following, and I'd like to give two examples from my own life of it.

Microcovid Tax

In my group house during the early pandemic, we often spent hours each week negotiating rules about what we could and couldn't do. We could order take-out food if we put it in the oven for 20 minutes, we could go for walks outside with friends if 6 feet apart, etc. This was very costly, and tired everyone out. We later replaced it (thanks especially to Daniel Filan for this proposal) with a microcovid tax, where each person could do as they wished, then calculate the microcovids they gathered, and pay the house $1/microcovid (the rate was determined by taking each housemate's cost per life, multiplying by their expected loss of life if they got covid, dividing by 1 million, then summing over all housemates). This massively reduced negotiation overhead and also removed the need for norm-enforcement mechanisms. If you made a mistake, we didn't punish you or tell you off; we just charged you the microcovid tax. This was a situation where everyone was trusted to be completely honest about their exposures. It nonetheless made it easier for everyone to make tradeoffs in everyone else's interests.

Paying for Resources

Sometimes within the Lightcone team, when people wish to make bids on others' resources, people negotiate a price. If some team members want another team member to e.g. stay overtime for a meeting, move the desk they work from, change the time frame in which they're going to get something done, or otherwise bid for a use of the other teammate's resources, it's common enough for someone to state a price, and then the action only goes through if both parties agree to a trade. I don't think this is because we all have different goals. I think it's primarily because it's genuinely difficult to know (a) how valuable it is to the asker and (b) how costly it is to the askee.
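The microcovid-tax rate calculation described above can be sketched as a short program. This is a hypothetical illustration: the housemate names, dollar values, and probabilities are made up, not the house's actual numbers.

```python
# Hypothetical sketch of the microcovid tax rate calculation.
# All figures below are illustrative assumptions, not real data.

def house_rate_per_microcovid(housemates):
    """Dollar cost to the house of one microcovid of exposure.

    For each housemate: (value they place on their life)
    * (expected fraction of life lost if infected) / 1,000,000,
    since a microcovid is a one-in-a-million chance of infection.
    Summed over all housemates.
    """
    return sum(
        h["cost_of_life"] * h["expected_life_loss_if_covid"] / 1_000_000
        for h in housemates
    )

housemates = [
    {"name": "A", "cost_of_life": 10_000_000, "expected_life_loss_if_covid": 0.01},
    {"name": "B", "cost_of_life": 5_000_000, "expected_life_loss_if_covid": 0.002},
]

rate = house_rate_per_microcovid(housemates)  # dollars per microcovid
tax = 30 * rate  # tax owed for an activity assessed at 30 microcovids
```

With these made-up numbers the rate comes out to $0.11/microcovid; the house in the anecdote landed on roughly $1/microcovid with its own figures.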
4Wei Dai18h
I don't disagree with this. I would add that if agents aren't aligned, then that introduces an additional inefficiency into this pricing process, because each agent now has an incentive to distort the price to benefit themselves, and this (together with information asymmetry) means some mutually profitable trades will not occur. Some work being "detailed and costly" isn't necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor, whereas the inefficiencies introduced by agents having different values/interests seem potentially of a different character. I'm not super confident about this (and I'm overall pretty skeptical about HCH for this and other reasons), but just think that John was too confident in his position in the OP or at least hasn't explained his position enough. To restate the question I see being unanswered: why is alignment + infinite free labor still not enough to overcome the problems we see with actual human orgs?

(I have added the point I wanted to add to this conversation, and will tap out now.)

Because there exist human institutions in which people generally seem basically aligned and not trying to game the credit assignment. For instance, most of the startups I've worked at were like this (size ~20 people), and I think the alignment research community is basically like this today (although I'll be surprised if that lasts another 3 years). Probably lots of small-to-medium size orgs are like this, especially in the nonprofit space. It's hard to get very big orgs/communities without letting in some credit monsters, but medium-size is still large enough to see coordination problems kick in (we had no shortage of them at ~20-person startups).

And, to be clear, I'm not saying these orgs have zero incentive to distort credit assignment. Humans do tend to do that sort of thing reflexively, to some extent. But to the extent that it's reflexive, it would also apply to HCH and variants thereof. For instance, people in HCH would still reflexively tend to conceal evidence/arguments contradicting their answers. (And when someone does conceal contradictory evidence/arguments, that would presumably increase the memetic fitness of their claims, causing them to propagate further up the tree, so that also provides a selection channel.)

Similarly, if the HCH implementation has access to empirical testing channels and the ability to exchange multiple messages, people would still reflexively tend to avoid/bury tests which they expect will actually falsify their answers, or try to blame incorrect answers on subquestions elsewhere in the tree when an unexpected experimental outcome occurs and someone tries to backpropagate to figure out where the prediction-failure came from. (And, again, those who shift blame successfully will presumably have more memetic fitness, etc.)
1Vladimir Nesov4d
What if 90% or 99% of the work was not object level, but about mechanism/incentive design, surveillance/interpretability, and rationality training/tuning, including specialized to particular projects being implemented, including the projects that set this up, iterating as relevant wisdom/tuning and reference texts accumulate? This isn't feasible for most human projects, as it increases costs by orders of magnitude in money (salaries), talent (number of capable people), and serial time. But in HCH you can copy people, it runs faster, and distillation should get rid of redundant steering if it converges to a legible thing in the limit of redundancy.
Remember all that work still needs to be done by HCH itself. Mechanism/incentive design, surveillance/interpretability, and rationality training/tuning all seem about-as-difficult as the alignment problem itself, if not more so. Copying people is a potential game changer in general, but HCH seems like a really terrible way to organize those copies.
3Vladimir Nesov2d
In my view, the purpose of the human/HCH distinction is that there are two models: that of a "human" and that of HCH (bureaucracies). This gives some freedom in training/tuning the bureaucracies model, to carry out multiple specialized objectives and work with prompts that the human is not robust enough to handle. This is done without changing the human model, to preserve its alignment properties and to use the human's pervasive involvement/influence at all steps to keep the bureaucracy training/tuning aligned.

The bureaucracies model starts out as that of a human. An episode involves multiple (but only a few) instances of both humans and bureaucracies, each defined by a self-changed internal state and an unchanging prompt/objective. It's a prompt/mission-statement that turns the single bureaucracies model into a particular bureaucracy; for example, one of the prompts might instantiate the ELK head of the bureaucracies model. Crucially, the prompts/objectives of humans are less weird than those of bureaucracies and don't go into Chinese-room territory, and each episode starts with a single human in control of the decision about which other humans and bureaucracies to initially instantiate in what arrangement. It's only the bureaucracies that get exposed to Chinese-room prompts/objectives, and they can set up subordinate bureaucracy instances with similarly confusing-for-humans prompts.

Since the initial human model is not very capable or aligned, the greater purpose of the construction is to improve the human model. The setting allows instantiating and training multiple specialized bureaucracies, and possibly generalizing their prompt/role/objective from the examples used in training/tuning the bureaucracies model (the episodes). After all, robustness of the bureaucracies model to weird prompt
I'd be interested in your thoughts on [Humans-in-a-science-lab consulting HCH], for questions where we expect that suitable empirical experiments could be run on a significant proportion of subquestions. It seems to me that lack of frequent empirical grounding is what makes HCH particularly vulnerable to memetic selection. Would you still expect this to go badly wrong (assume you get to pick the humans)? If so, would you expect sufficiently large civilizations to be crippled by memetic selection by default? If [yes, no], what do you see as the important differences?

I don't think it's a gap in economic theory in general: pretty sure I've heard the [price mechanisms as distributed computation] idea from various Austrian-school economists without reliance on agents with different goals, only on "What should x cost in context y?" being a question whose answer depends on the entire system.
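The [price mechanisms as distributed computation] idea can be made concrete with a toy sketch. This is entirely my own illustration with made-up teams and numbers: even when every team shares one goal, an internal price aggregates dispersed information about marginal value, so a fixed supply flows to its highest-value uses without any central planner reading every team's books.

```python
# Toy sketch: an internal price allocating a shared resource among
# teams that all share one goal. Teams and values are invented.

def demand(marginal_values, price):
    """Units a team takes at a given price: every unit whose
    marginal value (to the shared goal) exceeds the price."""
    return sum(1 for v in marginal_values if v > price)

# Marginal value of each successive unit of the resource, per team.
teams = {
    "search":   [9, 7, 4, 2],
    "training": [8, 6, 5, 1],
    "eval":     [3, 2, 1, 1],
}

supply = 6  # total units available

# Walk the price down from the highest marginal value until
# total demand matches supply (a crude market-clearing search).
price = max(v for vs in teams.values() for v in vs)
while sum(demand(vs, price) for vs in teams.values()) < supply:
    price -= 1

allocation = {name: demand(vs, price) for name, vs in teams.items()}
```

The clearing price ends up at 3, allocating 3 units each to "search" and "training" and none to "eval"; no team had to reveal anything beyond its willingness to pay, which is the informational role the comment above points at.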
Ok, so, some background on my mental image. Before yesterday, I had never pictured HCH as a tree of John Wentworths (thank you Rohin for that). When I do picture John Wentworths, they mostly just... refuse to do the HCH thing. Like, they take one look at this setup and decide to (politely) mutiny or something. Maybe they're willing to test it out, but they don't expect it to work, and it's likely that their output is something like the string "lol nope". I think an entire society of John Wentworths would probably just not have bureaucracies at all; nobody would intentionally create them, and if they formed accidentally nobody would work for them or deal with them.

Now, there's a whole space of things-like-HCH, and some of them look less like a simulated infinite bureaucracy and more like a simulated society. (The OP mostly wasn't talking about things on the simulated-society end of the spectrum, because there will be another post on that.) And I think a bunch of John Wentworths in something like a simulated society would be fine: they'd form lots of small teams working in-person, have forums like LW for reasonably-high-bandwidth interteam communication, and have bounties on problems and secondary markets on people trying to get the bounties and independent contractors and all that jazz.

Anyway, back to your question. If those John Wentworths lacked the ability to run experiments, they would be relatively pessimistic about their own chances, and a huge portion of their work would be devoted to figuring out how to pump bits of information and stay grounded without a real-world experimental feedback channel. That's not a deal-breaker; background knowledge of our world already provides far more bits of evidence than any experiment ever run, and we could still run experiments on the simulated Johns. But I sure would be a lot more optimistic with an experimental channel. I do not think memetic selection in particular would cripple those Johns, because that's exactly t
Interesting, thanks. This makes sense to me. I do think strong-HCH can support the "...more like a simulated society..." stuff in some sense, which is to say that it can be supported so long as we can rely on individual Hs to robustly implement the necessary pointer passing (which, to be fair, we can't).

To add to your "tree of John Wentworths", it's worth noting that H doesn't need to be an individual human, so we could have our H be e.g. {John Wentworth, Eliezer Yudkowsky, Paul Christiano, Wei Dai}, or whatever team would make you more optimistic about the lack of memetic disaster. (We also wouldn't need to use the same H at every level.)
Yeah, at some point we're basically simulating the alignment community (or possibly several copies thereof interacting with each other). There will probably be another post on that topic soonish.
3Vladimir Nesov8d
The agents at the top of most theoretical infinite bureaucracies should be thought of as already superhumanly capable and aligned, not as weak language models, because IDA works by iteratively retraining models on the output of the bureaucracy, so that agents at higher levels of the theoretical infinite bureaucracy are stronger (from later amplification/distillation epochs) than those at lower levels. It doesn't matter if an infinite bureaucracy instantiated for a certain agent fails to solve important problems, as long as the next epoch does better.

For HCH specifically, this is normally intended to apply to the HCHs, not to the humans in it, but then the abstraction of the humans being actual humans (exact imitations) leaks, and we start expecting something other than actual humans there. If this is allowed, if something less capable/aligned than humans can appear in HCH, then by the same token these agents should improve with IDA epochs (perhaps not of HCH, but of other bureaucracies), and those "humans" at the top of an infinite HCH should be much better than the starting point, assuming the epochs improve things.
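The amplify/distill loop behind this point can be sketched schematically. This is an illustrative toy only, not any lab's actual implementation: `decompose`, `base_model`, and the memorizing `distill` are stand-ins for real question decomposition, a real initial model, and real supervised training.

```python
# Schematic IDA loop: each epoch, a bureaucracy of copies of the
# current model produces answers ("amplify"), and the next model is
# trained on them ("distill"), so later-epoch agents sit "higher"
# in the idealized infinite tree.

def decompose(question):
    # Stand-in: real systems would break the question into subquestions.
    return []

def amplify(model, question, depth):
    """An HCH-style bureaucracy: the model answers the question,
    consulting recursive subcalls down to the given depth."""
    if depth == 0:
        return model(question)
    sub_answers = [amplify(model, sub, depth - 1)
                   for sub in decompose(question)]
    return model((question, tuple(sub_answers)))

def distill(training_pairs):
    """Stand-in for supervised training on (question, answer) pairs;
    here it simply memorizes the amplified answers."""
    table = dict(training_pairs)
    return lambda q: table.get(q, q)

def ida(model, questions, epochs, depth):
    for _ in range(epochs):
        pairs = [(q, amplify(model, q, depth)) for q in questions]
        model = distill(pairs)  # next epoch's agents are stronger
    return model

def base_model(q):
    # Toy starting policy: tuples are (question, sub_answers) pairs.
    if isinstance(q, tuple):
        return f"refined answer to {q[0]}"
    return f"draft answer to {q}"

m = ida(base_model, ["q1"], epochs=1, depth=1)
```

After one epoch, the distilled `m` returns the amplified (bureaucracy-produced) answer for trained questions, which is the sense in which higher levels of the theoretical tree correspond to later, stronger epochs.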