All of Lanrian's Comments + Replies

[AN #156]: The scaling hypothesis: a plan for building AGI
(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)

I'd like to know what this figure is based on. In the linked post, Gwern writes:

The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character⁠.

But in that linked post, there's no mention of "0.7" bits in particular, as far as I or cmd-f can see. The most relevant passage I've read is:

Claude Shannon found that each character was carrying more
... (read more)

It's based on those estimates and the systematic biases in such methods & literatures. Just as you know that psychology and medical effects are always overestimated and can be rounded down by 50% to get a more plausible real world estimate, such information-theoretic methods will always overestimate model performance and underestimate human performance, and are based on various idealizations: they use limited genres and writing styles (formal, omitting informal like slang), don't involve extensive human calibration or training like the models get, don'... (read more)

Decoupling deliberation from competition

Thanks, computer-speed deliberation being a lot faster than space-colonisation makes sense. I think any deliberation process that uses biological humans as a crucial input would be a lot slower, though; slow enough that it could well be faster to get started with maximally fast space colonisation. Do you agree with that? (I'm a bit surprised at the claim that colonization takes place over "millenia" at technological maturity; even if the travelling takes millenia, it's not clear to me why launching something maximally-fast – that... (read more)

3Paul Christiano4moI agree that biological human deliberation is slow enough that it would need to happen late. By "millennia" I mostly meant that traveling is slow (+ the social costs of delay are low, I'm estimating like 1/billionth of value per year of delay). I agree that you can start sending fast-enough-to-be-relevant ships around the singularity rather than decades later. I'd guess the main reason speed matters initially is for grabbing resources from nearby stars under whoever-gets-their-first property rights (but that we probably will move away from that regime before colonizing). I do expect to have strong global coordination prior to space colonization. I don't actually know if you would pause long enough for deliberation amongst biological humans to be relevant. So on reflection I'm not sure how much time you really have as biological humans. In the OP I'm imagining 10+ years (maybe going up to a generation) but that might just not be realistic. Probably my single best guess is that some (many?) people would straggle out over years or decades (in the sense that relevant deliberation for controlling what happens with their endowment would take place with biological humans living on earth), but that before that there would be agreements (reached at high speed) to avoid them taking a huge competitive hit by moving slowly. But my single best guess is not that likely and it seems much more likely that something else will happen (and even that I would conclude that some particular other thing is much more likely if I thought about it more).
Decoupling deliberation from competition

I'm curious about how this interacts with space colonisation. The default path of efficient competition would likely lead to maximally fast space-colonisation, to prevent others from grabbing it first. But this would make deliberating together with other humans a lot trickier, since some space ships would go to places where they could never again communicate with each other. For things to turn out ok, I think you either need:

  • to pause before space colonisation.
  • to finish deliberating and bargaining before space colonisation.
  • to equip each space ship with
... (read more)

I think I'm basically optimistic about every option you list.

  • I think space colonization is extremely slow relative to deliberation (at technological maturity I think you probably have something like million-fold speedup over flesh and blood humans, and colonization takes place over decades and millennia rather than years). Deliberation may not be "finished" until the end of the universe, but I think we will e.g. have deliberated enough to make clear agreements about space colonization / to totally obsolete existing thinking / likely to have reached a "gran
... (read more)
The strategy-stealing assumption

Categorising the ways that the strategy-stealing assumption can fail:

  • Humans don't just care about acquiring flexible long-term influence, because
    • 4. They also want to stay alive.
    • 5 and 6. They want to stay in touch with the rest of the world without going insane.
    • 11. and also they just have a lot of other preferences.
    • (maybe Wei Dai's point about logical time also goes here)
  • It is intrinsically easier to gather flexible influence in pursuit of some goals, because
    • 1. It's easier to build AIs to pursue goals that are easy to check.
    • 3. It's easie
... (read more)
Imitative Generalisation (AKA 'Learning the Prior')

Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties?

My understanding: After going through the process of finding z, you'll have a z that's probably too large for the human to fully utilise on their own, so you'll want to use amplification or debate to access it (as well as to generally help the human reason). If we didn't have z, we could train an amplification/debate system on D' anyway, while allowing th... (read more)

2Beth Barnes7moI think the distinction isn't actually super clear, because you can usually trade off capabilities problems and safety problems. I think of it as expanding the range of questions you can get aligned answers to in a reasonable number of steps. If you're just doing IDA/debate, and you try to get your model to give you answers to questions where the model only knows the answer because of updating on a big dataset, you can either keep going through the big dataset when any question of this type comes up (very slow, so capability limitation), or not trust these answers (capability limitation), or just hope they're correct (safety problem). The latter :) I think the only way to get debate to be able to answer all the questions that debate+IG can answer is to include subtrees that are the size of your whole training dataset at arbitrary points in your debate tree, which I think counts as a ridiculous amount of compute
Prediction can be Outer Aligned at Optimum

Cool, seems reasonable. Here are some minor responses: (perhaps unwisely, given that we're in a semantics labyrinth)

Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation

Idk, if the real world is a simulation made by malign simulators, I wouldn't say that an AI accurately predicting the world is falling prey to malign priors. I would probably want my AI to accurately predict the world I'm in even if it's simulated. The simulators control everything that happens a... (read more)

Prediction can be Outer Aligned at Optimum
Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous.


I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve

I mean, it's true that I'm ... (read more)

1Daniel Kokotajlo8moWell, at this point I feel foolish for arguing about semantics. I appreciate your post, and don't have a problem with saying that the malignity problem is an inner alignment problem. (That is zero evidence that it isn't also an outer alignment problem though!) Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation. We may have good pragmatic reasons to act as if it isn't, but I still think you are changing the definition of outer alignment if you think it assumes we aren't in a simulation. But *shrug* if that's what people want to do, then that's fine I guess, and I'll change my usage to conform with the majority.
Prediction can be Outer Aligned at Optimum

Things I believe about what sort of AI we want to build:

  • It would be kind of convenient if we had an AI that could help us do acausal trade. If assuming that it's not in a simulation would preclude an AI from doing acausal trade, that's a bit inconvenient. However, I don't think this matters for the discussion at hand, for reasons I describe in the final array of bullet points below.
  • Even if it did matter, I don't think that the ability to do acausal trade is a deal-breaker. If we had a corrigible, aligned, superintelligent AI that couldn
... (read more)
1Daniel Kokotajlo8moThanks, this is helpful. --You might be right that an AI which assumes it isn't in a simulation is OK--but I think it's too early to conclude that yet. We should think more about acausal trade before concluding it's something we can safely ignore, even temporarily. There's a good general heuristic of "Don't make your AI assume things which you think might not be true" and I don't think we have enough reason to violate it yet. --You say Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous. So... I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve? If this is a fair summary of what you are doing, then I retract my objections I guess, and reflect more.
Imitative Generalisation (AKA 'Learning the Prior')
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the future

I don't think this is right. I've put my proposed modifications in cursive:

We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to... (read more)

1Richard Ngo8moOoops, yes, this seems correct. I'll edit mine accordingly.
Prediction can be Outer Aligned at Optimum

Oops, I actually wasn't trying to discuss whether the action-space was wide enough to take over the world. Turns out concrete examples can be ambiguous too. I was trying to highlight whether the loss function and training method incentivised taking over the world or not.

Instead of an image-classifier, lets take GPT-3, which has a wide enough action-space to take over the world. Lets assume that:

1. GPT-3 is currently being tested on on a validation set which have some correct answers. (I'm fine with "optimal performance" either requiring... (read more)

5Rohin Shah9moAh, in hindsight your comment makes more sense. Argh, I don't know, you're positing a setup that breaks the standard ML assumptions and so things get weird. If you have vanilla SGD, I think I agree, but I wouldn't be surprised if that's totally wrong. There are definitely setups where I don't agree, e.g. if you have an outer hyperparameter tuning loop around the SGD, then I think you can get the opposite behavior than what you're claiming (I think this paper [] shows this in more detail, though it's been edited significantly since I read it). That would still depend on how often you do the hyperparameter tuning, what hyperparameters you're allowed to tune, etc. ---- On the rest of the comment: I feel like the argument you're making is "when the loss function is myopic, the optimal policy ignores long-term consequences and is therefore safe". I do feel better about this calling this "aligned at optimum", if the loss function also incentivizes the AI system to do that which we designed the AI system for. It still feels like the lack of convergent instrumental subgoals is "just because of" the myopia, and that this strategy won't work more generally. ---- Returning to the original claim: I do agree that these setups probably exist, perhaps using the myopia trick in conjunction with the simulated world trick [] . (I don't think myopia by itself is enough; to have STEM AI enable a pivotal act you presumably need to give the AI system a non-trivial amount of "thinking time".) I think you will still have a pretty rough time trying to define "optimal performance" in a way that doesn't depend on a lot of details of the setup, but at least conceptually I see what you mean. I'm not as convinced that these sorts of setups are really feasible -- they seem to sacrifice a lot of benefits -- but I'm pretty unconfident here.
Prediction can be Outer Aligned at Optimum
That is, if you write down a loss function like "do the best possible science", then the literal optimal AI would take over the world and get a lot of compute and robots and experimental labs to do the best science it can do.

I think this would be true for some way to train a STEM AI with some loss functions (especially if it's RL-like, can interact with the real world, etc) but I think that there are some setups where this isn't the case (e.g. things that look more like alphafold). Specifically, I think there exists some setups and so... (read more)

3Rohin Shah9moRoughly speaking, you can imagine two ways to get safety: 1. Design the output channels so that unsafe actions / plans do not exist 2. Design the AI system so that even though unsafe actions / plans do exist, the AI system doesn't take them. I would rephrase your argument as "there are some types of STEM AI that are safe because of 1, it seems that given some reasonable loss function those AI systems should be said to be outer aligned at optimum". This is also the argument that applies to image classifiers. ---- In the case where point 1 is literally true, I just wouldn't even talk about whether the system is "aligned"; if it doesn't have the possibility of an unsafe action, then whether it is "aligned" feels meaningless to me. (You can of course still say that it is "safe".) Note that in any such situation, there is no inner alignment worry. Even if the model is completely deceptive and wants to kill as many people as possible, by hypothesis we said that unsafe actions / plans do not exist, and the model can't ever succeed at killing people. ---- A counterargument could be "okay, sure, some unsafe action / plan exists by which the AI takes over the world, but that happens only via side channels, not via the expected output channel". I note that in this case, if you include all the channels available to the AI system, then the system is not outer aligned at optimum, because the optimal thing to do is to take over the world and then always feed in inputs to which the outputs are perfectly known leading to zero loss. Presumably what you'd want instead is to say something like "given a model in which the only output channel available to the AI system is ___, the optimal policy that only gets to act through that channel is aligned". But this is basically saying that in the abstract model you've chosen, (1) applies; and again I feel like saying that this system is "aligned" is somehow missing the point of what "aligned" is supposed to mean. As a concrete
2020 AI Alignment Literature Review and Charity Comparison

He's definitely given some money, and I don't think the 990 absence means much. From here:

in 2016, the IRS was still processing OpenAI’s non-profit status, making it impossible for the organization to receive charitable donations. Instead, the Musk Foundation gave $10m to another young charity, [...] The Musk Foundation’s grant accounted for the majority of’s revenue, and almost all of its own funding, when it passed along $10m to OpenAI later that year.

Also, when he quit in 2018, OpenAI wrote "Elon Musk will depart the OpenAI Board but ... (read more)

2gwern6moThat's interesting. I did see YC listed as a major funding source, but given Sam Altman's listed loans/donations, I assumed, because YC has little or nothing to do with Musk, that YC's interest was Altman, Paul Graham, or just YC collectively. I hadn't seen anything at all about YC being used as a cutout for Musk. So assuming the Guardian didn't screw up its understanding of the finances there completely (the media is constantly making mistakes in reporting on finances and charities in particular, but this seems pretty detailed and specific and hard to get wrong), I agree that that confirms Musk did donate money to get OA started and it was a meaningful sum. But it still does not seem that Musk donated the majority or even plurality of OA donations, much less the $1b constantly quoted (or any large fraction of the $1b collective pledge, per ESRogs).
Extrapolating GPT-N performance

This has definitely been productive for me. I've gained useful information, I see some things more clearly, and I've noticed some questions I still need to think a lot more about. Thanks for taking the time, and happy holidays!

Extrapolating GPT-N performance
I'm not sure exactly what you mean here, but if you mean "holding an ordinary conversation with a human" as a task, my sense is that is extremely hard to do right (much harder than, e.g., SuperGLUE). There's a reason that it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it's quite gameable.

"actually it's quite gameable" = "actually it's quite easy" ;)

More seriously, I agree that a full blown turing test is hard, but this is becau... (read more)

2julianjm9moYou joke, but one of my main points is that these are very, very different things. Any benchmark, or dataset, acts as a proxy for the underlying task that we care about. Turing used natural conversation because it was a domain where a wide range of capabilities are normally used by humans. The problem is that in operationalizing the test (e.g., trying to fool a human), it ends up being possible or easy to pass without necessarily using or requiring all of those capabilities. And this can happen for reasons beyond just overfitting to the data distribution, because the test itself may just not be sensitive enough to capture "human-likeness" beyond a certain threshold (i.e., the noise ceiling). What I'm saying is I really do not think that's true. In my experience, at least one of the following holds for pretty much every NLP benchmark out there: * The data is likely artificially easy compared to what would be demanded of a model in real-world settings. (It's hard to know this for sure for any dataset until the benchmark is beaten by non-robust models; but I basically assume it as a rule of thumb for things that aren't specifically using adversarial methods.) Most QA and Reading Comprehension datasets fall into this category. * The annotation spec is unclear enough, or the human annotations are noisy enough, that even human performance on the task is at an insufficient reliability level for practical automation tasks which use it as a subroutine, except in cases which are relatively tolerant of incorrect outputs (like information retrieval and QA in search). This is in part because humans do these annotations in isolation, without a practical usage or business context to align their judgments. RTE, WiC, and probably MultiRC and BoolQ fall into this category. * For the datasets with hard examples and high agreement, the task is artificial and basic enough that operationalizing it into something economically useful remains
Extrapolating GPT-N performance

Cool, thanks. I agree that specifying the problem won't get solved by itself. In particular, I don't think that any jobs will become automated by describing the task and giving 10 examples to an insanely powerful language model. I realise that I haven't been entirely clear on this (and indeed, my intuitions about this are still in flux). Currently, my thinking goes along the following lines:

    • Fine-tuning on a representative dataset is really, really powerful, and it gets more powerful the narrower the task is. Since most benchmarks are more na
... (read more)
2julianjm9moRe: how to update based on benchmark progress in general, see my response to you above [] . On the rest, I think the best way I can think of explaining this is in terms of alignment and not correctness. The bird example is good. My contention is basically that when it comes to making something like "recognizing birds" economically useful, there is an enormous chasm between 90% performance on a subset of ImageNet and money in the bank. For two reasons, among others: * Alignment. What do we mean by "recognize birds"? Do pictures of birds count? Cartoon birds? Do we need to identify individual organisms e.g. for counting birds? Are some kinds of birds excluded? * Engineering. Now that you have a module which can take in an image and output whether it has a bird in it, how do you produce value? I'll admit that this might seem easy to do, and that ML is doing pretty much all the heavy lifting here. But my take on that is it's because object recognition/classification is a very low-level and automatic, sub-cognitive, thing. Once you start getting into questions of scene understanding, or indeed language understanding, there is an explosion of contingencies beyond silly things like cartoon birds. What humans are really really good at is understanding these (often unexpected) contingencies in the context of their job and business's needs, and acting appropriately. At what point would you be willing to entrust an ML system to deal with entirely unexpected contingencies in a way that suits your business needs (and indeed, doesn't tank them)? Even the highest level of robustness on known contingencies may not be enough, because almost certainly, the problem is fundamentally underspecified [] from the instructions and input data. And so, in order to successfully automate the task, you need to successfully characterize t
Extrapolating GPT-N performance

Re 3: Yup, this seems like a plausibly important training improvement. FWIW, when training GPT-3, they did filter the common crawl using a classifier that was trained to recognise high-quality data (with wikipedia, webtext, and some books as positive examples) but unfortunately they don't say how big of a difference it made.

I've been assuming (without much thoughts) that doing this better could make training up to ~10x cheaper, but probably not a lot more than that. I'd be curious if this sounds right to you, or if you think it could make a substantially bigger difference.

Extrapolating GPT-N performance
Benchmarks are filtered for being easy to use, and useful for measuring progress. (...) So they should be difficult, but not too difficult. (...) Only very recently has this started to change with adversarial filtering and evaluation, and the tasks have gotten much more ambitious, because of advances in ML.

That makes sense. I'm not saying that all benchmarks are necessarily hard, I'm saying that these ones look pretty hard to me (compared with ~ordinary conversation).

many of these ambitious datasets turn out ultimately to be gameable

My intuitio... (read more)

1julianjm9moI'm not sure exactly what you mean here, but if you mean "holding an ordinary conversation with a human" as a task, my sense is that is extremely hard to do right (much harder than, e.g., SuperGLUE). There's a reason that it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it's quite gameable. This is why the Winograd Schema Challenge was proposed [] , but even that and new proposed versions of it have seen lots of progress recently — at the end of the day it turns out to be hard to write very difficult tests even in the WSC format, for all the reasons related to shallow heuristic learning etc.; the problem is that our subjective assessment of the difficulty of a dataset generally assumes the human means of solving it and associated conceptual scaffolding, which is no constraint for an Alien God []. So to address the difference between a language model and a general-purpose few-shot learner: I agree that we should expect its solutions to be much more general. The question at issue is: how does it learn to generalize? It is basically impossible to fully specify a task with a small training set and brief description — especially if the training set is only a couple of items. With so few examples, generalization behavior is almost entirely a matter of inductive bias. In the case of humans, this inductive bias comes from social mental modeling: the entire process of embodied language learning for a human trains us to be amazing at figuring out what you mean from what you say. In the case of GPT's few-shot learning, the inductive bias comes entirely from a language modeling assumption, that the desired task output can be approximated using language modeling probabilities prefixed with a task description and a few I/O examples. This gets us an incredible amount
Extrapolating GPT-N performance
Take for example writing news / journalistic articles. [...] I think similar concerns apply to management, accounting, auditing, engineering, programming, social services, education, etc. And I can imagine many ways in which ML can serve as a productivity booster in these fields but concerns like the ones I highlighted for journalism make it harder for me to see how AI of the sort that can sweep ML benchmarks can play a singular role in automation, without being deployed along a slate of other advances.

Completely agree that high benchmark performance (and ... (read more)

Extrapolating GPT-N performance

Thanks! I agree that if we required GPT-N to beat humans on every benchmark question that we could throw at them, then we would have a much more difficult task.

I don't think this matters much in practice, though, because humans and ML are really differently designed, so we're bound to be randomly better at some things and randomly worse at some things. By the time ML is better than humans at all things, I think they'll already be vastly better at most things. And I care more about the point when ML will first surpass humans at most things. This is most cle... (read more)

1julianjm9moI guess my main concern here is — besides everything I wrote in my reply to you below — basically that reliability of GPT-N on simple, multiclass classification tasks lacking broader context may not be representative of its reliability in real-world automation settings. If we're to take SuperGLUE as representative, well.. it's already basically solved. One of the problems here is that when you have the noise ceiling set so low, like it is in SuperGLUE, reaching human performance does not mean the model is reliable. It means the humans aren't. It means you wouldn't even trust a human to do this task if you really cared about the result. Coming up with tasks where humans can be reliable is actually quite difficult! And making humans reliable in the real world usually depends on them having an understanding of the rules they are to follow and the business stakes involved in their decisions — much broader context that is very difficult to distill into artificial annotation tasks. So when it comes to reliable automation, it's not clear to me that just looking at human performance on difficult benchmarks is a reasonable indicator. You'd want to look at reliability on tasks with clear economic viability, where the threshold of viability is clear. But the process of faithfully distilling economically viable tasks into benchmarks is a huge part of the difficulty in automation in the first place. And I have a feeling that where you can do this successfully, you might find that the task is either already subject to automation, or doesn't necessarily require huge advances in ML in order to become viable.
Extrapolating GPT-N performance

Thank you, this is very useful! To start out with responding to 1:

1a. Even when humans are used to perform a task, and even when they perform it very effectively, they are often required to participate in rule-making, provide rule-consistent rationales for their decisions, and stand accountable (somehow) for their decisions

I agree this is a thing for judges and other high-level decisions, but I'm not sure how important it is for other tasks. We have automated a lot of things in the past couple of 100 years with unaccountable machines and unaccounta... (read more)

3julianjm9moOn 1a: Take for example writing news / journalistic articles. Distinguishability from human-written articles is used as evidence for GPT's abilities. The abilities are impressive here, but the task at hand for the original writer is not to write an article that looks human, but one that reports the news. This means deciding what is newsworthy, aggregating evidence, contacting sources, and summarizing and reporting the information accurately. In addition to finding and summarizing information (which can be reasonably thought as a mapping from input -> output), there is also the interactive process of interfacing with sources: deciding who to reach out to, what to ask them, which sources to trust on what, and how to report and contextualize what they tell you in an article (forgetting of course the complexity of goal-oriented dialogue when interviewing them). This process involves a great deal of rules: mutual understanding with sources about how their information will be represented, an understanding of when to disclose sources and when not to, careful epistemics when it comes to drawing conclusions on the basis of the evidence they provide and representing the point of view of the news outlet, etc.; it also involves building relationships with sources and with other news outlets, conforming to copyright standards, etc.; and the news outlet has an stake in (and accountability for) all of these elements of the process, which is incumbent on the journalist. Perhaps you could try and record all elements of this process and treat it all as training data, but the task here is so multimodal, stateful, and long-horizon that it's really unclear (at least to me) how to reduce it to an I/O format amenable to ML that doesn't essentially require replicating the I/O interface of a whole human. Reducing it to an ML problem seems itself like a big research problem (and one having more to do with knowledge representation and traditional software than ML). If you put aside these mo
Homogeneity vs. heterogeneity in AI takeoff scenarios
In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn't really apply.

Ah, yeah, for the purposes of my previous comment I count this as being aligned. If we only have tool AIs (or otherwise alignable AIs), I agree that Evan's conclusion 2 follow (while the other ones aren't relevant).

I think the relevant variable for homogeneity isn't whether we've solved alignment--maybe it's whether the people making AI think they've solved alignment

So for ho... (read more)

4Evan Hubinger9moI disagree with this. I don't expect a failure of inner alignment to produce random goals, but rather systematically produce goals which are simpler/faster proxies of what we actually want. That is to say, while I expect the goals to look random to us, I don't actually expect them to differ that much between training runs, since it's more about your training process's inductive biases than inherent randomness in the training process in my opinion.
2Daniel Kokotajlo9moThis is helpful, thanks. I'm not sure I agree that for something to count as a faction, the members must be aligned with each other. I think it still counts if the members have wildly different goals but are temporarily collaborating for instrumental reasons, or even if several of the members are secretly working for the other side. For example, in WW2 there were spies on both sides, as well as many people (e.g. most ordinary soldiers) who didn't really believe in the cause and would happily defect if they could get away with it. Yet the overall structure of the opposing forces was very similar, from the fighter aircraft designs, to the battleship designs, to the relative proportions of fighter planes and battleships, to the way they were integrated into command structure.
Homogeneity vs. heterogeneity in AI takeoff scenarios

I think this is only right if we assume that we've solved alignment. Otherwise you might not be able to train a specialised AI that is loyal to your faction.

Here's how I imagine Evan's conclusions to fail in a very CAIS-like world:

1. Maybe we can align models that do supervised learning, but can't align RL, so we'll have humans+GPT-N competing against a rogue RL-agent that someone created. (And people initially trained both of these because GPT-N makes for a better chatbot, while the RL agent seemed better at making money-maximizin... (read more)

2Daniel Kokotajlo9moThanks! I'm not sure I'm following everything you said, but I like the ideas. Just to be clear, I wasn't imagining the AIs on the team of a faction to all be aligned necessarily. In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn't really apply. Like AlphaFold2. Also, I think the relevant variable for homogeneity isn't whether we've solved alignment--maybe it's whether the people making AI think they've solved alignment. If the Chinese and US militaries think AI risk isn't a big deal, and build AGI generals to prosecute the cyberwar, they'll probably use similar designs, even if actually the generals are secretly planning treacherous turns.
Homogeneity vs. heterogeneity in AI takeoff scenarios

I think this depends a ton on your reference class. If you compare AI with military fighter planes: very homogenous. If you compare AI with all vehicles: very heterogenous.

Maybe the outside view can be used to say that all AIs designed for a similar purpose will be homogenous, implying that we only get heterogenity in a CAIS scenario, where there are many different specialised designs. But I think the outside view also favors a CAIS scenario over a monolithic AI scenario (though that's not necessarily decisive).

4Daniel Kokotajlo9moYes, but I think we can say something a bit stronger than that: AIs competing with each other will be homogenous. Here's my current model at least: Let's say the competition for control of the future involves N skills: Persuasion, science, engineering, .... etc. Even if we suppose that it's most efficient to design separate AIs for each skill, rather than a smaller number of AIs that have multiple skills each, insofar as there are factions competing for control of the future, they'll have an AI for each of the skills. They wouldn't want to leave one of the skills out, or how are they going to compete? So each faction will consist of a group of AIs working together, that collectively has all the relevant skills. And each of the AIs will be designed to be good at the skill it's assigned, so (via the principle you articulated) each AI will be similar to the other-faction AIs it directly competes with, and the factions as a whole will be pretty similar too, since they'll be collections of similar AIs. (Compare to militaries: Not only were fighter planes similar, and trucks similar, and battleships similar, the armed forces of Japan, USA, USSR, etc. were similar. By contrast with e.g. the conquistadors vs. the Aztecs, or in sci-fi the Protoss vs. the Zerg, etc.)
Homogeneity vs. heterogeneity in AI takeoff scenarios

I find the prospect of multiple independent mesa-optimizers inside of the same system relatively unlikely.

I think Jesse was just claiming that it's more likely that everyone uses an architecture especially prone to mesa optimization. This means that (if multiple people train that architecture from scratch) the world is likely to end up with many different mesa optimizers in it (each localised to a single system). Because of the random nature of mesa optimization, they may all have very different goals.

4Evan Hubinger9moI'm not sure if that's true—see my comments here [] and here [] .
Draft report on AI timelines

I implemented the model for 2020 compute requirements in Guesstimate here. It doesn't do anything that the notebook can't do (and it can't do the update against currently affordable compute), but I find the graphical structure very helpful for understanding how it works (especially with arrows turned on in the "View" menu).

The strategy-stealing assumption

My impression of commitment races and logical time is that the amount of computation we use in general doesn't matter; but that things we learn that are relevant to the acausal bargaining problems do matter. Concretely, using computation during a competitive period to e.g. figure out better hardware cooling systems should be innocuous, because it matters very little for bargaining with other civilisations. However, thinking about agents in other worlds, and how to best bargain with them, would be a big step forward in logical time. This would mean that it'... (read more)

How does iterated amplification exceed human abilities?

If you picked the median human by mathematical ability, and put them in this setup, I would be rather surprised if they produced a valid proof of Fermats last theorem.

I would too. IDA/HCH doesn't have to work with the median human, though. It's ok to pick an excellent human, who has been trained for being in that situation. Paul has argued that it wouldn't be that surprising if some humans could be arbitrarily competent in an HCH-setup, even if some couldn't.

1Donald Hobson1yEpistemic status: Intuition dump and blatant speculation Suppose that instead of the median human, you used Euclid in the HCH. (Ancient greek, invented basic geometry) I would still be surprised if he could produce a proof of fermat's last theorem (given a few hours for each H). I would suspect that there are large chunks of modern maths that he would be unable to do. Some areas of modern maths have layers of concepts built on concepts. And in some areas of maths, just reading all the definitions will take up all the time. Assuming that there are large and interesting branches of maths that haven't been explored yet, the same would hold true for modern mathematicians. Of course, it depends how big you make the tree. You could brute force over all possible formal proofs, and then set a copy on checking the validity of each line. But at that point, you have lost all alignment, someone will find their proof is a convincing argument to pass the message up the tree. I feel that it is unlikely that any kind of absolute threshold lies between the median human, and an unusually smart human, given that the gap is small in an absolute sense.
The Dualist Predict-O-Matic ($100 prize)

SGD is not going to play the future forward to see the new feedback mechanism you’ve described and incorporate it into the loss function which is being minimized

My 'new feedback mechanism' is part of the training procedure. It's not going to be good at that by 'playing the future forward', it's going to become good at that by being trained on it.

I suspect we're using SGD in different ways, because everything we've talked about seems like it could be implemented with SGD. Do you agree that letting the Predict-O-Matic predict the future and rewarding it f

... (read more)
1John Maxwell2yFair enough, I was thinking about supervised learning.
The Dualist Predict-O-Matic ($100 prize)

Assuming that people don't think about the fact that Predict-O-Matic's predictions can affect reality (which seems like it might have been true early on in the story, although it's admittedly unlikely to be true for too long in the real world), they might decide to train it by letting it make predictions about the future (defining and backpropagating the loss once the future comes about). They might think that this is just like training on predefined data, but now the Predict-O-Matic can change the data that it's evaluated against, so t... (read more)

1John Maxwell2yI think it depends on internal details of the Predict-O-Matic's prediction process. If it's still using SGD, SGD is not going to play the future forward to see the new feedback mechanism you've described and incorporate it into the loss function which is being minimized. However, it's conceivable that given a dataset about its own past predictions and how they turned out, the Predict-O-Matic might learn to make its predictions "more self-fulfilling" in order to minimize loss on that dataset?
The Dualist Predict-O-Matic ($100 prize)
Yes, that sounds more like reinforcement learning. It is not the design I'm trying to point at in this post.

Ok, cool, that explains it. I guess the main differences between RL and online supervised learning is whether the model takes actions that can affect their environment or only makes predictions of fixed data; so it seems plausible that someone training the Predict-O-Matic like that would think they're doing supervised learning, while they're actually closer to RL.

That description sounds a lot like SGD. I think you'll need to be
... (read more)
1John Maxwell2yHow's that?
The Dualist Predict-O-Matic ($100 prize)

I think our disagreement comes from you imagining offline learning, while I'm imagining online learning. If we have a predefined set of (situation, outcome) pairs, then the Predict-O-Matic's predictions obviously can't affect the data that it's evaluated against (the outcome), so I agree that it'll end up pretty dualistic. But if we put a Predict-O-Matic in the real world, let it generate predictions, and then define the loss according to what happens afterwards, a non-dualistic Predict-O-Matic will be selected for over dualistic v... (read more)

1John Maxwell2yYes, that sounds more like reinforcement learning. It is not the design I'm trying to point at in this post. That description sounds a lot like SGD. I think you'll need to be crisper for me to see what you're getting at.
The Dualist Predict-O-Matic ($100 prize)
If dualism holds for Abram’s prediction AI, the “Predict-O-Matic”, its world model may happen to include this thing called the Predict-O-Matic which seems to make accurate predictions—but it’s not special in any way and isn’t being modeled any differently than anything else in the world. Again, I think this is a pretty reasonable guess for the Predict-O-Matic’s default behavior. I suspect other behavior would require special code which attempts to pinpoint the Predict-O-Matic in its own world model and give
... (read more)
1John Maxwell2ySGD searches for a set of parameters which minimize a loss function. Selection, not control []. Only if that info is included in the dataset that SGD is trying to minimize a loss function with respect to. Suppose we're running SGD trying to find a model which minimizes the loss over a set of (situation, outcome) pairs. Suppose some of the situations are situations in which the Predict-O-Matic made a prediction, and that prediction turned out to be false. It's conceivable that SGD could learn that the Predict-O-Matic predicting something makes it less likely to happen and use that as a feature. However, this wouldn't be helpful because the Predict-O-Matic doesn't know what prediction it will make at test time. At best it could infer that some of its older predictions will probably end up being false and use that fact to inform the thing it's currently trying to predict. Not necessarily. The scenario I have in mind is the standard ML scenario where SGD is just trying to find some parameters which minimize a loss function which is supposed to approximate the predictive accuracy of those parameters. Then we use those parameters to make predictions. SGD isn't concerned with future hypothetical rounds of SGD on future hypothetical datasets. In some sense, it's not even concerned with predictive accuracy except insofar as training data happens to generalize to new data. If you think including historical observations of a Predict-O-Matic (which happens to be 'oneself') making bad (or good) predictions in the Predict-O-Matic's training dataset will cause a catastrophe, that's within the range of scenarios I care about, so please do explain! By the way, if anyone wants to understand the standard ML scenario more deeply, I recommend this class [] .
Misconceptions about continuous takeoff
Possibly you'd want to rule out (c) with your stipulation that the tests are "robust"? But I'm not sure you can get tests that robust.

That sounds right. I was thinking about an infinitely robust misalignment-oracle to clarify my thinking, but I agree that we'll need to be very careful with any real-world-tests.

If I imagine writing code and using the misalignment-oracle on it, I think I mostly agree with Nate's point. If we have the code and compute to train a superhuman version of GPT-2, and the oracle tells us that any agent... (read more)

Misconceptions about continuous takeoff
We might reach a state of knowledge when it is easy to create AIs that (i) misaligned (ii) superhuman and (iii) non-singular (i.e. a single such AI is not stronger than the sum total of humanity and aligned AIs) but hard/impossible to create aligned superhuman AIs.

My intuition is that it'd probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, mis-aligned superhuman AIs, and had cheap, robust methods to tell if a particular AI was misaligned. However, it seems pretty plausible that we'll end up in a s... (read more)

My intuition is that it'd probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, mis-aligned superhuman AIs, and had cheap, robust methods to tell if a particular AI was misaligned.

This sounds different from how I model the situation; my views agree here with Nate's (emphasis added):

I would rephrase 3 as "There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” Iȁ
... (read more)
Misconceptions about continuous takeoff
Second, we could more-or-less deal with systems which defect as they arise. For instance, during deployment we could notice that some systems are optimizing something different than what we intended during training, and therefore we shut them down.
Each individual system won’t by themselves carry more power than the sum of projects before it. Instead, AIs will only be slightly better than the ones that came before it, including any AIs we are using to monitor the newer ones.

If the sum of projects from before carry more power than the individual syste... (read more)

1Matthew Barnett2yAdmittedly, I did not explain this point well enough. What I meant to say was that before we have the first successful defection, we'll have some failed defection. If the system could indefinitely hide its own private intentions to later defect, then I would already consider that to be a 'successful defection.' Knowing about a failed defection, we'll learn from our mistake and patch that for future systems. To be clear, I'm definitely not endorsing this as a normative standard for safety. I agree with the rest of your comment.

Expanding on that a little, even if we know our AIs are misaligned that doesn't necessarily save us. We might reach a state of knowledge when it is easy to create AIs that (i) misaligned (ii) superhuman and (iii) non-singular (i.e. a single such AI is not stronger than the sum total of humanity and aligned AIs) but hard/impossible to create aligned superhuman AIs. Since misaligned AIs that can't take over still mostly follow human instructions, there will be tremendous economic incentives to deploy more such systems. This is effectively a tragedy of the co

... (read more)
Logical Optimizers

Ah, sorry, I misread the terminology. I agree.

Soft takeoff can still lead to decisive strategic advantage

Hm, my prior is that speed of learning how stolen code works would scale along with general innovation speed, though I haven't thought about it a lot. On the one hand, learning the basics of how the code works would scale well with more automated testing, and a lot of finetuning could presumably be automated without intimate knowledge. On the other hand, we might be in a paradigm where AI tech allows us to generate lots of architectures to test, anyway, and the bottleneck is for engineers to develop an intuition for them, which seems like the thing that you're pointing at.

2Hoagy2yI think this this points to the strategic supremacy of relevant infrastructure in these scenarios. From what I remember of the battleship era, having an advantage in design didn't seem to be a particularly large advantage - once a new era was entered, everyone with sufficient infrastructure switches to the new technology and an arms race starts from scratch. This feels similar to the AI scenario, where technology seems likely to spread quickly through a combination of high financial incentive, interconnected social networks, state-sponsored espionage etc. The way in which a serious differential emerges is likely to be more through a gap in the infrastructure to implement the new technology. It seems that the current world is tilted towards infrastructure ability diffusing fast enough to, but it seems possible that if we have a massive increase in economic growth then this balance is altered and infrastructure gaps emerge, creating differentials that can't easily be reversed by a few algorithm leaks.
2Adele Lopez2yYeah, I think the engineer intuition is the bottleneck I'm pointing at here.
Formalising decision theory is hard
In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT.

Yup, like Caspar, I think that model-free RL learns the EDT policy in most/all situations. I'm not sure what you mean with it looking like CDT.

In Newcomb's paradox CDT succeeds but EDT fails. Let's consider an example where EDT succeeds and CDT fails: the XOR blackmail.

Isn't it the other way around? The one-boxer gets more money, but gives in to blackmail, and therefore gets blackmailed in the first place.

1Vanessa Kosoy2yRL is CDT in the sense that, your model of the world consists of actions and observations, and some causal link from past actions and observations to current observations, but there is no causal origin to the actions. The actions are just set by the agent to whatever it wants. And, yes, I got CDT and EDT flipped there, good catch!
Soft takeoff can still lead to decisive strategic advantage

Great post!

One thing I noticed is that claim 1 speak about nationstates while most of the AI-bits speak about companies/projects. I don't think this is a huge problem, but it seems worth looking into.

It seems true that it'll be necessary to localize the secret bits into single projects, in order to keep things secret. It also seems true that such projects could keep a lead on the order of months/years.

However, note that this does no longer correspond to having a country that's 30 years ahead of the rest of the world. Instead, it corresponds ... (read more)

Logical Optimizers

One problem that could cause the searching process to be unsafe is if the prior contained a relatively large measure of malign agents. This could happen if you used the universal prior, per Paul's argument. Such agents could maximize across the propositions you test them on, but do something else once they think they're deployed.

2Donald Hobson2yIf the prior is full of malign agents, then you are selecting your new logical optimizer based on its ability to correctly answer arbitrary questions (in a certain format) about malign agents. This doesn't seem to be that problematic. If the set of programs being logically optimized over is malign, then you have trouble.