All of Lukas Finnveden's Comments + Replies

I'm curious if anyone made a serious attempt at the shovel-ready math here and/or whether this approach to counterfactuals still looks promising to Abram? (Or anyone else with takes.)

Competence does not seem to aggressively overwhelm other advantages in humans: 


g. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and

... (read more)
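For readers who want to translate an Elo gap into something concrete, here is a minimal sketch of the standard Elo expected-score formula. The ratings used in the usage lines are hypothetical placeholders, not figures from the quoted post.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) of player A against player B
    under the standard Elo model with the usual 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical ratings, chosen only to show how the gap maps to an expected score.
    for gap in (0, 200, 400, 800):
        print(gap, round(elo_expected_score(1500 + gap, 1500), 3))
```

The formula depends only on the rating difference, which is why an Elo gap can be compared across different parts of the rating scale.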

Another fairly common argument and motivation at OpenAI in the early days was the risk of "hardware overhang," that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago.

Could you clarify this bit? It ... (read more)

One positive consideration is: AI will be built at a time when it is more expensive (slowing later progress). One negative consideration is: there was less time for AI-safety-work-of-5-years-ago. I think that this particular positive consideration is larger than this particular negative consideration, even though other negative considerations are larger still (like less time for growth of AI safety community).

This seems plausible if the environment is a mix of (i) situations where task completion correlates (almost) perfectly with reward, and (ii) situations where reward is very high while task completion is very low. Such as if we found a perfect outer alignment objective, and the only situation in which reward could deviate from the overseer's preferences would be if the AI entirely seized control of the reward.

But it seems less plausible if there are always (small) deviations between reward and any reasonable optimization target that isn't reward (or close e... (read more)

3 · Steve Byrnes · 10mo
Sure, other things equal. But other things aren’t necessarily equal. For example, regularization could stack the deck in favor of one policy over another, even if the latter has been systematically producing slightly higher reward. There are lots of things like that; the details depend on the exact RL algorithm. In the context of brains, I have discussion and examples in §9.3.3 here.

As the main author of the "Alignment"-appendix of the truthful AI paper, it seems worth clarifying: I totally don't think that "train your AI to be truthful" in itself is a plan for how to tackle any central alignment problems. Quoting from the alignment appendix:

While we’ve argued that scaleable truthfulness would constitute significant progress on alignment (and might provide a solution outright), we don’t mean to suggest that truthfulness will sidestep all difficulties that have been identified by alignment researchers. On the contrary, we expect work o

... (read more)

Here's what the curves look like if you fit them to the PaLM data-points as well as the GPT-3 data-points.

Keep in mind that this is still based on Kaplan scaling laws. The Chinchilla scaling laws would predict faster progress.



(But we wouldn't observe that on these graphs because they weren't trained Chinchilla-style, of course.)

First I gotta say: I thought I knew the art of doing quick-and-dirty calculations, but holy crap, this methodology is quick-and-dirty-ier than I would ever have thought of. I'm impressed.

But I don't think it currently gets to the right answer. One salient thing: it doesn't take into account Kaplan's "contradiction". I.e., Kaplan's laws already suggested that once we were using enough FLOP, we would have to scale data faster than we have to do in the short term. So when I made my extrapolations, I used a data-exponent that was larger than the one that's represe... (read more)

but I am surprised that Chinchilla's curves use an additive term that predicts that loss will never go below 1.69. What happened with the claims that ideal text-prediction performance was like 0.7?

Apples & oranges, you're comparing different units. Comparing token perplexities is hard when the tokens (not to mention datasets) differ. Chinchilla isn't a character-level model but BPEs (well, they say SentencePiece which is more or less BPEs), and BPEs didn't even exist until the past decade so there will be no human estimates which are in BPE units (... (read more)
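To make the units point concrete, here is a rough sketch of converting a per-token loss into bits per character. It assumes the 1.69 figure is a cross-entropy in nats per token, and the characters-per-token ratio is a hypothetical placeholder; both are assumptions for illustration, not values taken from the comments above.

```python
import math


def nats_per_token_to_bits_per_char(loss_nats_per_token: float,
                                    chars_per_token: float) -> float:
    """Convert a per-token cross-entropy in nats to bits per character.

    chars_per_token is the average number of characters one token covers;
    it depends on the tokenizer and dataset, so treat any value as an assumption.
    """
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token / chars_per_token


if __name__ == "__main__":
    # ASSUMPTION: ~4 characters per token, a placeholder rather than a measured value.
    print(nats_per_token_to_bits_per_char(1.69, chars_per_token=4.0))
```

The result moves a lot with the assumed characters-per-token ratio, which is one reason a per-token number and a per-character number cannot be compared directly.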

Ok so I tried running the numbers for the neural net anchor in my bio-anchors guesstimate replica.

Previously the neural network anchor used an exponent (alpha) of normal(0.8, 0.2) (first number is mean, second is standard deviation). I tried changing that to normal(1, 0.1) (smaller uncertainty because 1 is a more natural number, and some other evidence was already pointing towards 1). Also, the model previously said that a 1-trillion parameter model should be trained with 10^normal(11.2, 1.5) data points. I changed that to have a median at 21.2e12 paramete... (read more)
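A minimal Monte Carlo sketch of the "previous" setup described above, in the spirit of a Guesstimate model. It assumes the exponent alpha is used as data ∝ parameters^alpha, anchored at the 1-trillion-parameter point; that functional form and all function names are assumptions for illustration, and the revised values are not reproduced because the comment is truncated.

```python
import random
import statistics


def log10_data_needed(log10_params: float, alpha: float,
                      log10_data_at_1t: float) -> float:
    """log10 of data points needed, ASSUMING data = data_at_1T * (params / 1e12) ** alpha."""
    return log10_data_at_1t + alpha * (log10_params - 12.0)


def sample_log10_data(log10_params: float, n_samples: int = 10_000, seed: int = 0,
                      alpha_mean: float = 0.8, alpha_sd: float = 0.2,
                      anchor_mean: float = 11.2, anchor_sd: float = 1.5) -> list:
    """Draw samples using normal(0.8, 0.2) for alpha and 10^normal(11.2, 1.5)
    for the data needed by a 1-trillion-parameter model."""
    rng = random.Random(seed)
    return [
        log10_data_needed(log10_params,
                          rng.gauss(alpha_mean, alpha_sd),
                          rng.gauss(anchor_mean, anchor_sd))
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    samples = sample_log10_data(log10_params=14.0)  # hypothetical 1e14-parameter model
    print("median log10(data points):", statistics.median(samples))
```

Swapping in a different alpha distribution only requires changing `alpha_mean` and `alpha_sd`, which makes it easy to see how sensitive the extrapolation is to the exponent.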

Depends on how you were getting to that +N OOMs number.

If you were looking at my post, or otherwise using the scaling laws to extrapolate how fast AI was improving on benchmarks (or subjective impressiveness), then the Chinchilla laws mean you should get there sooner. I haven't run the numbers on how much sooner.

If you were looking at Ajeya's neural network anchor (i.e. the one using the Kaplan scaling-laws, not the human-lifetime or evolution anchors), then you should now expect that AGI comes later. That model anchors the number of parameters in AGI to ... (read more)

4 · Daniel Kokotajlo · 1y
You calculated things for the neural network brain size anchor; now here's the performance scaling trend calculation (I think): I took these graphs from the Chinchilla paper and then made them transparent and superimposed them on one another and then made a copy on the right to extend the line. And I drew some other lines to extend them. Eyeballing this graph it looks like whatever performance we could achieve with 10^27 FLOPs under the Kaplan scaling laws, we can now achieve with 10^25 FLOPs. (!!!) This is a big deal if true. Am I reasoning incorrectly here? If this is anywhere close to correct, then the distinction you mention between two methods of getting timelines -- "Assume it happens when we train a brain-sized model compute-optimally" vs. "assume it happens when we get to superhuman performance on this ensemble of benchmarks that we already have GPT trends for" becomes even more exciting and important than I thought! It's like, a huge huge crux, because it basically makes for a 4 OOM difference! EDIT: To be clear, if this is true then I think I should update away from the second method, on the grounds that it predicts we are only about 1 OOM away and that seems implausible.
2 · Daniel Kokotajlo · 1y
Cool. Yep, that makes sense. I'd love to see those numbers if you calculate them!

In fact, if we think of pseudo-inputs as predicates that constrain X, we can approximate the probability of unacceptable behavior during deployment as[7]

$$P(C(M,x) \mid x \sim \text{deploy}) \approx \max_{\alpha \in X_{\text{pseudo}}} P(\alpha(x) \mid x \sim \text{deploy}) \cdot P(C(M,x) \mid \alpha(x),\ x \sim \text{deploy})$$

such that, if we can get a good implementation of P, we no longer have to worry as much about carefully constraining $X_{\text{pseudo}}$, as we can just let P's prior do that work for us.

Where footnote 7 reads:

Note that this approximation is tight if and only if there exists some $\alpha \in X_{\text{pseudo}}$ such that $\alpha(x) \leftrightarrow C(M,x)$

I think the "if" direction is... (read more)
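A toy sketch of the quantities in the quoted approximation, assuming P can be treated as an ordinary probability distribution over a small finite input space (an assumption; the post's P may behave differently). Under that reading each term in the max is just the joint probability of α(x) and C(M,x), so the right-hand side can never exceed the left. The input space, predicates, and function names below are all hypothetical.

```python
def prob(event, dist):
    """Probability of an event (a predicate on inputs) under a finite distribution."""
    return sum(p for x, p in dist.items() if event(x))


def pseudo_input_estimate(predicates, bad, dist):
    """max over alpha of P(alpha(x)) * P(bad(x) | alpha(x)), skipping zero-probability alphas."""
    best = 0.0
    for alpha in predicates:
        p_alpha = prob(alpha, dist)
        if p_alpha == 0.0:
            continue
        p_bad_given_alpha = prob(lambda x: alpha(x) and bad(x), dist) / p_alpha
        best = max(best, p_alpha * p_bad_given_alpha)
    return best


if __name__ == "__main__":
    # Hypothetical deployment distribution over four inputs.
    deploy = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
    bad = lambda x: x == 3                               # stands in for C(M, x)
    pseudo_inputs = [lambda x: x >= 2, lambda x: x == 0]  # stands in for X_pseudo
    print("true P(bad):", prob(bad, deploy))
    print("estimate:   ", pseudo_input_estimate(pseudo_inputs, bad, deploy))
```

In this toy case the estimate matches the true probability via a predicate that is implied by the bad behavior without being equivalent to it; whether that observation carries over to the setting in the post is not something the sketch settles.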

I'm at like 30% on fast takeoff in the sense of "1 year doubling without preceding 4 year doubling" (a threshold roughly set to break any plausible quantitative historical precedent).

Huh, AI Impacts looked at one dataset of GWP (taken from wikipedia, in turn taken from here) and found 2 precedents for "x year doubling without preceding 4x year doubling", roughly during the agricultural revolution. The dataset seems to be a combination of lots of different papers' estimates of human population, plus an assumption of ~constant GWP/capita early in history.
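The "x year doubling without preceding 4x year doubling" test can be sketched on an evenly spaced time series as below. This assumes one particular reading of the criterion (a slow doubling counts as "preceding" only if it completes no later than the fast doubling begins) and uses made-up series; it is not the analysis AI Impacts ran.

```python
def first_doubling_end(values, window):
    """Index of the first point that is at least double the value `window` steps earlier,
    or None if no such point exists."""
    for end in range(window, len(values)):
        if values[end] >= 2 * values[end - window]:
            return end
    return None


def fast_doubling_without_slow(values, fast_window=1, slow_window=4):
    """True if the first fast doubling is NOT preceded by a completed slow doubling.

    ASSUMPTION: 'preceded' means the slow doubling ends at or before the start
    of the first fast-doubling interval.
    """
    fast_end = first_doubling_end(values, fast_window)
    if fast_end is None:
        return False  # no fast doubling at all
    slow_end = first_doubling_end(values, slow_window)
    preceded = slow_end is not None and slow_end <= fast_end - fast_window
    return not preceded


if __name__ == "__main__":
    # Hypothetical annual output series, for illustration only.
    gradual = [1, 1.2, 1.5, 1.8, 2.2, 2.7, 3.5, 7.5]
    abrupt = [1, 1.1, 1.2, 1.3, 2.7]
    print(fast_doubling_without_slow(gradual), fast_doubling_without_slow(abrupt))
```

Setting `fast_window=x` and `slow_window=4*x` gives the generalized version of the check for coarser historical data.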

5 · Paul Christiano · 1y
Yeah, I think this was wrong. I'm somewhat skeptical of the numbers and suspect future revisions systematically softening those accelerations, but 4x still won't look that crazy. (I don't remember exactly how I chose that number but it probably involved looking at the same time series so wasn't designed to be much more abrupt.)

I agree that i does slightly worse than t on consistency-checks, but i also does better on other regularizers you're (maybe implicitly) using like speed/simplicity, so as long as i doesn't do too much worse it'll still beat out the direct translator.

Any articulable reason for why i just does slightly worse than t? Why would a 2N-node model fix a large majority of discrepancies between an N-node model and a 1e12*N-node model? I'd expect it to just fix a small fraction of them.

I think this rapidly runs into other issues with consistency checks, like the fact
... (read more)
1 · Mark Xu · 1y
The high-level reason is that the 1e12N model is not that much better at prediction than the 2N model. You can correct for most of the correlation even with only a vague guess at how different the AI and human probabilities are, and most AI and human probabilities aren't going to be that different in a way that produces a correlation the human finds suspicious. I think that the largest correlations are going to be produced by the places the AI and the human have the biggest differences in probabilities, which are likely also going to be the places where the 2N model has the biggest differences in probabilities, so they should be not that hard to correct. I think it wouldn't be clear that extending the counterexample would be possible, although I suspect it would be. It might require exhibiting more concrete details about how the consistency check would be defeated, which would be interesting. In some sense, maintaining consistency across many inputs is something that you expect to be pretty hard for the human simulator to do because it doesn't know what set of inputs it's being checked for. I would be excited about a consistency check that gave the direct translator minimal expected consistency loss. Note that I would also be interested in basically any concrete proposal for a consistency check that seemed like it was actually workable.

I don't understand your counterexample in the appendix Details for penalizing inconsistencies across different inputs. You present a cheating strategy that requires the reporter to run and interpret the predictor a bunch of times, which seems plausibly slower than doing honest translation. And then you say you fix this issue with:

But this dependence could be avoided if there was an intermediate model between the predictor’s Bayes net (which we are assuming is very large) and the human’s Bayes net. Errors identified by the intermediate model are likely to b
... (read more)
1 · Mark Xu · 1y
I agree that i does slightly worse than t on consistency checks, but i also does better on other regularizers you're (maybe implicitly) using like speed/simplicity, so as long as i doesn't do too much worse it'll still beat out the direct translator. One possible thing you might try is some sort of lexicographic ordering of regularization losses. I think this rapidly runs into other issues with consistency checks, like the fact that the human is going to be systematically wrong about some correlations, so i potentially is more consistent than t.
1 · Lukas Finnveden · 1y
Hypothesis: Maybe you're actually not considering a reporter i that always uses an intermediate model; but instead a reporter i' that does translations on hard questions, and just uses the intermediate model on questions where it's confident that the intermediate model understands everything relevant. I see three different possible issues with that idea:

1. To do this, i' needs an efficient way (ie one that doesn't scale with the size of the predictor) to (on at least some inputs) be highly confident that the intermediate model understands everything relevant about the situation. I think this is a reasonable "worst-case" assumption, but I'm not sure. If you're using it, I'd be curious to know.

2. Even when the reporter gets inputs that the intermediate model fully understands, it seems like the reporter will run into issues if its sampling-process (for generating distributions) runs into some inputs that the intermediate model doesn't fully understand. (I.e., if i' gets an input for which X1 is true, and then simulates many more random inputs for which X1 is true, and by chance X2 is true for one of them, then the reporter will have to do translation on that sample). Which makes it seem unlikely that i' wouldn't have to do translation at least once per input.

3. If the model is confident that the intermediate model understands everything relevant about the situation, it seems more efficient to return the intermediate model's answer to the question at hand than to loop over it many times, trying to fix correlations. So really we should get a reporter i'' that does translation on the predictor on hard questions and returns an intermediate model's latent knowledge on easy questions. That seems like an ok reporter to get.

It's very easy to construct probability distributions that have earlier timelines, that look more intuitively unconfident, and that have higher entropy than the bio-anchors forecast. You can just take some of the probability mass from the peak around 2050 and redistribute it among earlier years, especially years that are very close to the present, where bioanchors are reasonably confident that AGI is unlikely.

Oh, come on. That is straight-up not how simple continuous toy models of RSI work. Between a neutron multiplication factor of 0.999 and 1.001 there is a very huge gap in output behavior.

Nitpick: I think that particular analogy isn't great.

For nuclear stuff, we have two state variables: amount of fissile material and current number of neutrons flying around. The amount of fissile material determines the "neutron multiplication factor", but it is the number of neutrons that goes crazy, not fissile material. And the current number of neutrons doesn't matter f... (read more)
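The two-state-variable picture can be illustrated with a deliberately simple sketch: the multiplication factor k is held fixed (standing in for a fixed amount of fissile material) while the neutron count is multiplied by k each generation. This is a toy geometric model with hypothetical starting values, not a physical simulation.

```python
def neutron_count(n0: float, k: float, generations: int) -> float:
    """Neutron population after a number of generations with constant multiplication factor k."""
    n = n0
    for _ in range(generations):
        n *= k  # each generation produces k neutrons per neutron in the previous one
    return n


if __name__ == "__main__":
    # Hypothetical starting population and generation count, for illustration only.
    for k in (0.999, 1.001):
        print(k, neutron_count(1.0, k, 10_000))
```

The factor k itself barely differs between the two runs; it is the neutron count that either dies out or grows by orders of magnitude.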

While GPT-4 wouldn't be a lot bigger than GPT-3, Sam Altman did indicate that it'd use a lot more compute. That's consistent with Stack More Layers still working; they might just have found an even better use for compute.

(The increased compute-usage also makes me think that a Paul-esque view would allow for GPT-4 to be a lot more impressive than GPT-3, beyond just modest algorithmic improvements.)

If they've found some way to put a lot more compute into GPT-4 without making the model bigger, that's a very different - and unnerving - development.

and some of my sense here is that if Paul offered a portfolio bet of this kind, I might not take it myself, but EAs who were better at noticing their own surprise might say, "Wait, that's how unpredictable Paul thinks the world is?"

If Eliezer endorses this on reflection, that would seem to suggest that Paul actually has good models about how often trend breaks happen, and that the problem-by-Eliezer's-lights is relatively more about, either:

  • that Paul's long-term predictions do not adequately take into account his good sense of short-term trend breaks.
  • tha
... (read more)

No, the form says that 1=Paul. It's just the first sentence under the spoiler that's wrong.

2 · Edouard Harris · 2y
Good catch! I didn't check the form. Yes you are right, the spoiler should say (1=Paul, 9=Eliezer) but the conclusion is the right way round.

Presumably you're referring to this graph. The y-axis looks like the kind of score that ranges between 0 and 1, in which case this looks sort-of like a sigmoid to me, which accelerates when it gets closer to ~50% performance (and decelerates when it gets closer to 100% performance).

If so, we might want to ask whether these tasks are chosen ~randomly (among tasks that are indicative of how useful AI is) or if they're selected for difficulty in some way. In particular, assume that most tasks look sort-of like a sigmoid as they're scaled up (accelerating arou... (read more)
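A small sketch of the selection effect being discussed, assuming each task's score follows a logistic curve in some scale variable (for example log-compute). The midpoints below are hypothetical: one set of tasks is "selected for difficulty" (all midpoints ahead of the observed range), the other is spread out.

```python
import math


def logistic(x: float, midpoint: float = 0.0, steepness: float = 1.0) -> float:
    """Score in (0, 1) that rises fastest at x == midpoint."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def mean_score(x: float, midpoints) -> float:
    """Average score over tasks that share a shape but differ in midpoint."""
    return sum(logistic(x, m) for m in midpoints) / len(midpoints)


def successive_gains(midpoints, xs):
    """Score gained between consecutive scale values; rising gains look like acceleration."""
    scores = [mean_score(x, midpoints) for x in xs]
    return [b - a for a, b in zip(scores, scores[1:])]


if __name__ == "__main__":
    xs = [0, 1, 2]                  # hypothetical observed scale range
    hard_tasks = [4, 5, 6]          # midpoints all beyond the observed range
    mixed_tasks = [-4, 0, 4]        # midpoints spread around the observed range
    print("hard: ", successive_gains(hard_tasks, xs))
    print("mixed:", successive_gains(mixed_tasks, xs))
```

With the hard-only set the gains grow over the observed range, while the spread-out set shows shrinking gains, so an apparent acceleration in the aggregate can come from which tasks were included.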

The preliminary results were obtained on a subset of the full benchmark (~90 tasks vs 206 tasks). And there were many changes since then, including scoring changes. Thus, I'm not sure we'll see the same dynamics in the final results. Most likely yes, but maybe not. I agree that the task selection process could create the dynamics that look like the acceleration. A good point. As I understand, the organizers have accepted almost all submitted tasks (the main rejection reasons were technical - copyright etc). So, it was mostly self-selection, with the bias towards the hardest imaginable text tasks. It seems that for many contributors, the main motivation was something like: This includes many cognitive tasks that are supposedly human-complete (e.g. understanding of humor, irony, ethics), and the tasks that are probing the model's generality (e.g. playing chess, recognizing images, navigating mazes - all in text). I wonder if the performance dynamics on such tasks will follow the same curve. The list of all tasks is available here.

95% of all ML researchers don't think it's a problem, or think it's something we'll solve easily

The 2016 survey of people in AI asked people about the alignment problem as described by Stuart Russell, and 39% said it was an important problem and 33% that it's a harder problem than most other problems in the field.

given realistic treatments of moral uncertainty you should not care too much more about preventing drift than about preventing extinction given drift (e.g. 10x seems very hard to justify to me).

I think you already believe this, but just to clarify: this "extinction" is about the extinction of Earth-originating intelligence, not about humans in particular. So AI alignment is an intervention to prevent drift, not an intervention to prevent extinction. (Though of course, we could care differently about persuasion-tool-induced drift vs unaligned-AI-induced drift.)

Interesting! Here's one way to look at this:

  • EDT+SSA-with-a-minimal-reference-class behaves like UDT in anthropic dilemmas where updatelessness doesn't matter.
  • I think SSA with a minimal reference class is roughly equivalent to "notice that you exist; exclude all possible worlds where you don't exist; renormalize"
  • In large worlds where your observations have sufficient randomness that observers of all kinds exist in all worlds, the SSA update step cannot exclude any world. You're updateless by default. (This is the case in the 99% example above.)
  • In small or
... (read more)
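The "notice that you exist; exclude all possible worlds where you don't exist; renormalize" step from the list above can be sketched as follows. The world names, priors, and the `contains_me` mapping are hypothetical, and this is only an illustration of that one update rule, not of EDT or UDT.

```python
def update_on_existence(prior: dict, contains_me: dict) -> dict:
    """Drop every world that contains no observer with exactly my observations, then renormalize."""
    surviving = {w: p for w, p in prior.items() if contains_me[w]}
    total = sum(surviving.values())
    if total == 0:
        raise ValueError("no world in the prior contains an observer like me")
    return {w: p / total for w, p in surviving.items()}


if __name__ == "__main__":
    prior = {"world_a": 0.25, "world_b": 0.75}  # hypothetical prior over two worlds

    # Small worlds: only world_a contains an observer like me, so world_b is excluded.
    print(update_on_existence(prior, {"world_a": True, "world_b": False}))

    # Large worlds: observers of every kind exist in both, so nothing is excluded.
    print(update_on_existence(prior, {"world_a": True, "world_b": True}))
```

In the second call the posterior equals the prior, which is the sense in which the update step cannot exclude any world when every world is large enough.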

Re your edit: That bit seems roughly correct to me.

If we are in a simulation, SIA doesn't have strong views on late filters for unsimulated reality. (This is my question (B) above.) And since SIA thinks we're almost certainly in a simulation, it's not crazy to say that SIA doesn't have strong views on late filters for unsimulated reality. SIA is very ok with small late filters, as long as we live in a simulation, which SIA says we probably do.

But yeah, it is a little bit confusing, in that we care more about late-filters-in-unsimulated reality if we live in... (read more)

I think it's important to be clear about what SIA says in different situations, here. Consider the following 4 questions:

A) Do we live in a simulation?

B) If we live in a simulation, should we expect basement reality to have a large late filter?

C) If we live in basement reality, should we expect basement reality (ie our world) to have a large late filter?

D) If we live in a simulation, should we expect the simulation (ie our world) to have a large late filter?

In this post, you persuasively argue that SIA answers "yes" to (A) and "not necessarily" to (B). How... (read more)

5 · Daniel Kokotajlo · 2y
I disagree that (B) is not decision-relevant and that (C) is. I'm not sure, haven't thought through all this yet, but that's my initial reaction at least.
2 · Zach Stein-Perlman · 2y
Ha, I wrote a comment like yours but slightly worse, then refreshed and your comment appeared. So now I'll just add one small note: To the extent that (1) normatively, we care much more about the rest of the universe than our personal lives/futures, and (2) empirically, we believe that our choices are much more consequential if we are non-simulated than if we are simulated, we should in practice act as if there are greater odds that we are non-simulated than we have reason to believe for purely epistemic purposes. So in practice, I'm particularly interested in (C) (and I tentatively buy SIA doomsday as explained by Katja Grace). Edit: also, isn't the last part of this sentence from the post wrong:

(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)

I'd like to know what this figure is based on. In the linked post, Gwern writes:

The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character⁠.

But in that linked post, there's no mention of "0.7" bits in particular, as far as I or cmd-f can see. The most relevant passage I've read is:

Claude Shannon found that each character was carrying more
... (read more)

It's based on those estimates and the systematic biases in such methods & literatures. Just as you know that psychology and medical effects are always overestimated and can be rounded down by 50% to get a more plausible real world estimate, such information-theoretic methods will always overestimate model performance and underestimate human performance, and are based on various idealizations: they use limited genres and writing styles (formal, omitting informal like slang), don't involve extensive human calibration or training like the models get, don'... (read more)

Thanks, computer-speed deliberation being a lot faster than space-colonisation makes sense. I think any deliberation process that uses biological humans as a crucial input would be a lot slower, though; slow enough that it could well be faster to get started with maximally fast space colonisation. Do you agree with that? (I'm a bit surprised at the claim that colonization takes place over "millennia" at technological maturity; even if the travelling takes millennia, it's not clear to me why launching something maximally-fast – that... (read more)

3 · Paul Christiano · 2y
I agree that biological human deliberation is slow enough that it would need to happen late. By "millennia" I mostly meant that traveling is slow (+ the social costs of delay are low, I'm estimating like 1/billionth of value per year of delay). I agree that you can start sending fast-enough-to-be-relevant ships around the singularity rather than decades later. I'd guess the main reason speed matters initially is for grabbing resources from nearby stars under whoever-gets-there-first property rights (but that we probably will move away from that regime before colonizing). I do expect to have strong global coordination prior to space colonization. I don't actually know if you would pause long enough for deliberation amongst biological humans to be relevant. So on reflection I'm not sure how much time you really have as biological humans. In the OP I'm imagining 10+ years (maybe going up to a generation) but that might just not be realistic. Probably my single best guess is that some (many?) people would straggle out over years or decades (in the sense that relevant deliberation for controlling what happens with their endowment would take place with biological humans living on earth), but that before that there would be agreements (reached at high speed) to avoid them taking a huge competitive hit by moving slowly. But my single best guess is not that likely and it seems much more likely that something else will happen (and even that I would conclude that some particular other thing is much more likely if I thought about it more).

I'm curious about how this interacts with space colonisation. The default path of efficient competition would likely lead to maximally fast space-colonisation, to prevent others from grabbing it first. But this would make deliberating together with other humans a lot trickier, since some space ships would go to places where they could never again communicate with each other. For things to turn out ok, I think you either need:

  • to pause before space colonisation.
  • to finish deliberating and bargaining before space colonisation.
  • to equip each space ship with
... (read more)

I think I'm basically optimistic about every option you list.

  • I think space colonization is extremely slow relative to deliberation (at technological maturity I think you probably have something like million-fold speedup over flesh and blood humans, and colonization takes place over decades and millennia rather than years). Deliberation may not be "finished" until the end of the universe, but I think we will e.g. have deliberated enough to make clear agreements about space colonization / to totally obsolete existing thinking / likely to have reached a "gran
... (read more)

Categorising the ways that the strategy-stealing assumption can fail:

  • It is intrinsically easier to gather flexible influence in pursuit of some goals, because
    • 1. It's easier to build AIs to pursue goals that are easy to check.
    • 3. It's easier to build institutions to pursue goals that are easy to check.
    • 9. It's easier to coordinate around simpler goals.
    • plus 4 and 5 insofar as some values require continuously surviving humans to know what to eventually spend resources on, and some don't.
    • plus 6 insofar as humans are otherwise an important part of the strategic e
... (read more)

Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties?

My understanding: After going through the process of finding z, you'll have a z that's probably too large for the human to fully utilise on their own, so you'll want to use amplification or debate to access it (as well as to generally help the human reason). If we didn't have z, we could train an amplification/debate system on D' anyway, while allowing th... (read more)

2 · Beth Barnes · 2y
I think the distinction isn't actually super clear, because you can usually trade off capabilities problems and safety problems. I think of it as expanding the range of questions you can get aligned answers to in a reasonable number of steps. If you're just doing IDA/debate, and you try to get your model to give you answers to questions where the model only knows the answer because of updating on a big dataset, you can either keep going through the big dataset when any question of this type comes up (very slow, so capability limitation), or not trust these answers (capability limitation), or just hope they're correct (safety problem). The latter :) I think the only way to get debate to be able to answer all the questions that debate+IG can answer is to include subtrees that are the size of your whole training dataset at arbitrary points in your debate tree, which I think counts as a ridiculous amount of compute

Cool, seems reasonable. Here are some minor responses: (perhaps unwisely, given that we're in a semantics labyrinth)

Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation

Idk, if the real world is a simulation made by malign simulators, I wouldn't say that an AI accurately predicting the world is falling prey to malign priors. I would probably want my AI to accurately predict the world I'm in even if it's simulated. The simulators control everything that happens a... (read more)

Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous.


I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve

I mean, it's true that I'm ... (read more)

1 · Daniel Kokotajlo · 2y
Well, at this point I feel foolish for arguing about semantics. I appreciate your post, and don't have a problem with saying that the malignity problem is an inner alignment problem. (That is zero evidence that it isn't also an outer alignment problem though!) Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation. We may have good pragmatic reasons to act as if it isn't, but I still think you are changing the definition of outer alignment if you think it assumes we aren't in a simulation. But *shrug* if that's what people want to do, then that's fine I guess, and I'll change my usage to conform with the majority.

Things I believe about what sort of AI we want to build:

  • It would be kind of convenient if we had an AI that could help us do acausal trade. If assuming that it's not in a simulation would preclude an AI from doing acausal trade, that's a bit inconvenient. However, I don't think this matters for the discussion at hand, for reasons I describe in the final array of bullet points below.
  • Even if it did matter, I don't think that the ability to do acausal trade is a deal-breaker. If we had a corrigible, aligned, superintelligent AI that couldn
... (read more)
1Daniel Kokotajlo2y
Thanks, this is helpful. --You might be right that an AI which assumes it isn't in a simulation is OK--but I think it's too early to conclude that yet. We should think more about acausal trade before concluding it's something we can safely ignore, even temporarily. There's a good general heuristic of "Don't make your AI assume things which you think might not be true" and I don't think we have enough reason to violate it yet. --You say Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous. So... I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve? If this is a fair summary of what you are doing, then I retract my objections I guess, and reflect more.
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the future.

I don't think this is right. I've put my proposed modifications in italics:

We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to... (read more)

1 Richard Ngo 2y
Ooops, yes, this seems correct. I'll edit mine accordingly.

Oops, I actually wasn't trying to discuss whether the action-space was wide enough to take over the world. Turns out concrete examples can be ambiguous too. I was trying to highlight whether the loss function and training method incentivised taking over the world or not.

Instead of an image-classifier, let's take GPT-3, which has a wide enough action-space to take over the world. Let's assume that:

1. GPT-3 is currently being tested on a validation set which has some correct answers. (I'm fine with "optimal performance" either requiring... (read more)

5 Rohin Shah 2y
Ah, in hindsight your comment makes more sense.

Argh, I don't know, you're positing a setup that breaks the standard ML assumptions and so things get weird. If you have vanilla SGD, I think I agree, but I wouldn't be surprised if that's totally wrong. There are definitely setups where I don't agree, e.g. if you have an outer hyperparameter tuning loop around the SGD, then I think you can get the opposite behavior from what you're claiming (I think this paper shows this in more detail, though it's been edited significantly since I read it). That would still depend on how often you do the hyperparameter tuning, what hyperparameters you're allowed to tune, etc.

----

On the rest of the comment: I feel like the argument you're making is "when the loss function is myopic, the optimal policy ignores long-term consequences and is therefore safe". I do feel better about calling this "aligned at optimum", if the loss function also incentivizes the AI system to do that which we designed the AI system for. It still feels like the lack of convergent instrumental subgoals is "just because of" the myopia, and that this strategy won't work more generally.

----

Returning to the original claim: I do agree that these setups probably exist, perhaps using the myopia trick in conjunction with the simulated world trick. (I don't think myopia by itself is enough; to have STEM AI enable a pivotal act you presumably need to give the AI system a non-trivial amount of "thinking time".) I think you will still have a pretty rough time trying to define "optimal performance" in a way that doesn't depend on a lot of details of the setup, but at least conceptually I see what you mean.

I'm not as convinced that these sorts of setups are really feasible -- they seem to sacrifice a lot of benefits -- but I'm pretty unconfident here.

That is, if you write down a loss function like "do the best possible science", then the literal optimal AI would take over the world and get a lot of compute and robots and experimental labs to do the best science it can do.

I think this would be true for some way to train a STEM AI with some loss functions (especially if it's RL-like, can interact with the real world, etc) but I think that there are some setups where this isn't the case (e.g. things that look more like alphafold). Specifically, I think there exists some setups and so... (read more)

3 Rohin Shah 2y
Roughly speaking, you can imagine two ways to get safety:

1. Design the output channels so that unsafe actions / plans do not exist
2. Design the AI system so that even though unsafe actions / plans do exist, the AI system doesn't take them.

I would rephrase your argument as "there are some types of STEM AI that are safe because of 1, it seems that given some reasonable loss function those AI systems should be said to be outer aligned at optimum". This is also the argument that applies to image classifiers.

----

In the case where point 1 is literally true, I just wouldn't even talk about whether the system is "aligned"; if it doesn't have the possibility of an unsafe action, then whether it is "aligned" feels meaningless to me. (You can of course still say that it is "safe".)

Note that in any such situation, there is no inner alignment worry. Even if the model is completely deceptive and wants to kill as many people as possible, by hypothesis we said that unsafe actions / plans do not exist, and the model can't ever succeed at killing people.

----

A counterargument could be "okay, sure, some unsafe action / plan exists by which the AI takes over the world, but that happens only via side channels, not via the expected output channel".

I note that in this case, if you include all the channels available to the AI system, then the system is not outer aligned at optimum, because the optimal thing to do is to take over the world and then always feed in inputs to which the outputs are perfectly known leading to zero loss.

Presumably what you'd want instead is to say something like "given a model in which the only output channel available to the AI system is ___, the optimal policy that only gets to act through that channel is aligned". But this is basically saying that in the abstract model you've chosen, (1) applies; and again I feel like saying that this system is "aligned" is somehow missing the point of what "aligned" is supposed to mean.

As a concrete

He's definitely given some money, and I don't think the 990 absence means much. From here:

in 2016, the IRS was still processing OpenAI’s non-profit status, making it impossible for the organization to receive charitable donations. Instead, the Musk Foundation gave $10m to another young charity, [...] The Musk Foundation’s grant accounted for the majority of’s revenue, and almost all of its own funding, when it passed along $10m to OpenAI later that year.

Also, when he quit in 2018, OpenAI wrote "Elon Musk will depart the OpenAI Board but ... (read more)

That's interesting. I did see YC listed as a major funding source, but given Sam Altman's listed loans/donations, I assumed, because YC has little or nothing to do with Musk, that YC's interest was Altman, Paul Graham, or just YC collectively. I hadn't seen anything at all about YC being used as a cutout for Musk. So assuming the Guardian didn't screw up its understanding of the finances there completely (the media is constantly making mistakes in reporting on finances and charities in particular, but this seems pretty detailed and specific and hard to get wrong), I agree that that confirms Musk did donate money to get OA started and it was a meaningful sum. But it still does not seem that Musk donated the majority or even plurality of OA donations, much less the $1b constantly quoted (or any large fraction of the $1b collective pledge, per ESRogs).

This has definitely been productive for me. I've gained useful information, I see some things more clearly, and I've noticed some questions I still need to think a lot more about. Thanks for taking the time, and happy holidays!

I'm not sure exactly what you mean here, but if you mean "holding an ordinary conversation with a human" as a task, my sense is that is extremely hard to do right (much harder than, e.g., SuperGLUE). There's a reason that it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it's quite gameable.

"actually it's quite gameable" = "actually it's quite easy" ;)

More seriously, I agree that a full-blown Turing test is hard, but this is becau... (read more)

You joke, but one of my main points is that these are very, very different things. Any benchmark, or dataset, acts as a proxy for the underlying task that we care about. Turing used natural conversation because it was a domain where a wide range of capabilities are normally used by humans. The problem is that in operationalizing the test (e.g., trying to fool a human), it ends up being possible or easy to pass without necessarily using or requiring all of those capabilities. And this can happen for reasons beyond just overfitting to the data distribution, because the test itself may just not be sensitive enough to capture "human-likeness" beyond a certain threshold (i.e., the noise ceiling).

What I'm saying is I really do not think that's true. In my experience, at least one of the following holds for pretty much every NLP benchmark out there:

  • The data is likely artificially easy compared to what would be demanded of a model in real-world settings. (It's hard to know this for sure for any dataset until the benchmark is beaten by non-robust models; but I basically assume it as a rule of thumb for things that aren't specifically using adversarial methods.) Most QA and Reading Comprehension datasets fall into this category.
  • The annotation spec is unclear enough, or the human annotations are noisy enough, that even human performance on the task is at an insufficient reliability level for practical automation tasks which use it as a subroutine, except in cases which are relatively tolerant of incorrect outputs (like information retrieval and QA in search). This is in part because humans do these annotations in isolation, without a practical usage or business context to align their judgments. RTE, WiC, and probably MultiRC and BoolQ fall into this category.
  • For the datasets with hard examples and high agreement, the task is artificial and basic enough that operationalizing it into something economically useful remains a

Cool, thanks. I agree that specifying the problem won't get solved by itself. In particular, I don't think that any jobs will become automated by describing the task and giving 10 examples to an insanely powerful language model. I realise that I haven't been entirely clear on this (and indeed, my intuitions about this are still in flux). Currently, my thinking goes along the following lines:

    • Fine-tuning on a representative dataset is really, really powerful, and it gets more powerful the narrower the task is. Since most benchmarks are more na
... (read more)
Re: how to update based on benchmark progress in general, see my response to you above.

On the rest, I think the best way I can think of explaining this is in terms of alignment and not correctness. The bird example is good. My contention is basically that when it comes to making something like "recognizing birds" economically useful, there is an enormous chasm between 90% performance on a subset of ImageNet and money in the bank. For two reasons, among others:

  • Alignment. What do we mean by "recognize birds"? Do pictures of birds count? Cartoon birds? Do we need to identify individual organisms e.g. for counting birds? Are some kinds of birds excluded?
  • Engineering. Now that you have a module which can take in an image and output whether it has a bird in it, how do you produce value?

I'll admit that this might seem easy to do, and that ML is doing pretty much all the heavy lifting here. But my take on that is it's because object recognition/classification is a very low-level and automatic, sub-cognitive, thing. Once you start getting into questions of scene understanding, or indeed language understanding, there is an explosion of contingencies beyond silly things like cartoon birds. What humans are really really good at is understanding these (often unexpected) contingencies in the context of their job and business's needs, and acting appropriately. At what point would you be willing to entrust an ML system to deal with entirely unexpected contingencies in a way that suits your business needs (and indeed, doesn't tank them)? Even the highest level of robustness on known contingencies may not be enough, because almost certainly, the problem is fundamentally underspecified from the instructions and input data. And so, in order to successfully automate the task, you need to successfully characterize the

Re 3: Yup, this seems like a plausibly important training improvement. FWIW, when training GPT-3, they did filter the Common Crawl using a classifier that was trained to recognise high-quality data (with Wikipedia, WebText, and some books as positive examples), but unfortunately they don't say how big of a difference it made.

I've been assuming (without much thought) that doing this better could make training up to ~10x cheaper, but probably not a lot more than that. I'd be curious if this sounds right to you, or if you think it could make a substantially bigger difference.

Benchmarks are filtered for being easy to use, and useful for measuring progress. (...) So they should be difficult, but not too difficult. (...) Only very recently has this started to change with adversarial filtering and evaluation, and the tasks have gotten much more ambitious, because of advances in ML.

That makes sense. I'm not saying that all benchmarks are necessarily hard, I'm saying that these ones look pretty hard to me (compared with ~ordinary conversation).

many of these ambitious datasets turn out ultimately to be gameable

My intuitio... (read more)

I'm not sure exactly what you mean here, but if you mean "holding an ordinary conversation with a human" as a task, my sense is that is extremely hard to do right (much harder than, e.g., SuperGLUE). There's a reason that it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it's quite gameable. This is why the Winograd Schema Challenge was proposed, but even that and new proposed versions of it have seen lots of progress recently. At the end of the day it turns out to be hard to write very difficult tests even in the WSC format, for all the reasons related to shallow heuristic learning etc.; the problem is that our subjective assessment of the difficulty of a dataset generally assumes the human means of solving it and associated conceptual scaffolding, which is no constraint for an Alien God.

So to address the difference between a language model and a general-purpose few-shot learner: I agree that we should expect its solutions to be much more general. The question at issue is: how does it learn to generalize? It is basically impossible to fully specify a task with a small training set and brief description, especially if the training set is only a couple of items. With so few examples, generalization behavior is almost entirely a matter of inductive bias. In the case of humans, this inductive bias comes from social mental modeling: the entire process of embodied language learning for a human trains us to be amazing at figuring out what you mean from what you say. In the case of GPT's few-shot learning, the inductive bias comes entirely from a language modeling assumption, that the desired task output can be approximated using language modeling probabilities prefixed with a task description and a few I/O examples. This gets us an incredible amount

Take for example writing news / journalistic articles. [...] I think similar concerns apply to management, accounting, auditing, engineering, programming, social services, education, etc. And I can imagine many ways in which ML can serve as a productivity booster in these fields but concerns like the ones I highlighted for journalism make it harder for me to see how AI of the sort that can sweep ML benchmarks can play a singular role in automation, without being deployed along a slate of other advances.

Completely agree that high benchmark performance (and ... (read more)

Thanks! I agree that if we required GPT-N to beat humans on every benchmark question that we could throw at them, then we would have a much more difficult task.

I don't think this matters much in practice, though, because humans and ML are really differently designed, so we're bound to be randomly better at some things and randomly worse at some things. By the time ML is better than humans at all things, I think they'll already be vastly better at most things. And I care more about the point when ML will first surpass humans at most things. This is most cle... (read more)

I guess my main concern here is — besides everything I wrote in my reply to you below — basically that reliability of GPT-N on simple, multiclass classification tasks lacking broader context may not be representative of its reliability in real-world automation settings. If we're to take SuperGLUE as representative, well.. it's already basically solved. One of the problems here is that when you have the noise ceiling set so low, like it is in SuperGLUE, reaching human performance does not mean the model is reliable. It means the humans aren't. It means you wouldn't even trust a human to do this task if you really cared about the result. Coming up with tasks where humans can be reliable is actually quite difficult! And making humans reliable in the real world usually depends on them having an understanding of the rules they are to follow and the business stakes involved in their decisions — much broader context that is very difficult to distill into artificial annotation tasks. So when it comes to reliable automation, it's not clear to me that just looking at human performance on difficult benchmarks is a reasonable indicator. You'd want to look at reliability on tasks with clear economic viability, where the threshold of viability is clear. But the process of faithfully distilling economically viable tasks into benchmarks is a huge part of the difficulty in automation in the first place. And I have a feeling that where you can do this successfully, you might find that the task is either already subject to automation, or doesn't necessarily require huge advances in ML in order to become viable.

Thank you, this is very useful! To start out with responding to 1:

1a. Even when humans are used to perform a task, and even when they perform it very effectively, they are often required to participate in rule-making, provide rule-consistent rationales for their decisions, and stand accountable (somehow) for their decisions

I agree this is a thing for judges and other high-level decisions, but I'm not sure how important it is for other tasks. We have automated a lot of things in the past couple hundred years with unaccountable machines and unaccounta... (read more)

On 1a: Take for example writing news / journalistic articles. Indistinguishability from human-written articles is used as evidence for GPT's abilities. The abilities are impressive here, but the task at hand for the original writer is not to write an article that looks human, but one that reports the news. This means deciding what is newsworthy, aggregating evidence, contacting sources, and summarizing and reporting the information accurately.

In addition to finding and summarizing information (which can be reasonably thought of as a mapping from input -> output), there is also the interactive process of interfacing with sources: deciding who to reach out to, what to ask them, which sources to trust on what, and how to report and contextualize what they tell you in an article (forgetting of course the complexity of goal-oriented dialogue when interviewing them). This process involves a great deal of rules: mutual understanding with sources about how their information will be represented, an understanding of when to disclose sources and when not to, careful epistemics when it comes to drawing conclusions on the basis of the evidence they provide and representing the point of view of the news outlet, etc.; it also involves building relationships with sources and with other news outlets, conforming to copyright standards, etc.; and the news outlet has a stake in (and accountability for) all of these elements of the process, which is incumbent on the journalist.

Perhaps you could try and record all elements of this process and treat it all as training data, but the task here is so multimodal, stateful, and long-horizon that it's really unclear (at least to me) how to reduce it to an I/O format amenable to ML that doesn't essentially require replicating the I/O interface of a whole human. Reducing it to an ML problem seems itself like a big research problem (and one having more to do with knowledge representation and traditional software than ML). If you put aside these mo

In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn't really apply.

Ah, yeah, for the purposes of my previous comment I count this as being aligned. If we only have tool AIs (or otherwise alignable AIs), I agree that Evan's conclusion 2 follows (while the other ones aren't relevant).

I think the relevant variable for homogeneity isn't whether we've solved alignment--maybe it's whether the people making AI think they've solved alignment

So for ho... (read more)

4 Evan Hubinger 2y
I disagree with this. I don't expect a failure of inner alignment to produce random goals, but rather systematically produce goals which are simpler/faster proxies of what we actually want. That is to say, while I expect the goals to look random to us, I don't actually expect them to differ that much between training runs, since it's more about your training process's inductive biases than inherent randomness in the training process in my opinion.
2 Daniel Kokotajlo 2y
This is helpful, thanks. I'm not sure I agree that for something to count as a faction, the members must be aligned with each other. I think it still counts if the members have wildly different goals but are temporarily collaborating for instrumental reasons, or even if several of the members are secretly working for the other side. For example, in WW2 there were spies on both sides, as well as many people (e.g. most ordinary soldiers) who didn't really believe in the cause and would happily defect if they could get away with it. Yet the overall structure of the opposing forces was very similar, from the fighter aircraft designs, to the battleship designs, to the relative proportions of fighter planes and battleships, to the way they were integrated into command structure.

I think this is only right if we assume that we've solved alignment. Otherwise you might not be able to train a specialised AI that is loyal to your faction.

Here's how I imagine Evan's conclusions to fail in a very CAIS-like world:

1. Maybe we can align models that do supervised learning, but can't align RL, so we'll have humans+GPT-N competing against a rogue RL-agent that someone created. (And people initially trained both of these because GPT-N makes for a better chatbot, while the RL agent seemed better at making money-maximizin... (read more)

2 Daniel Kokotajlo 2y
Thanks! I'm not sure I'm following everything you said, but I like the ideas. Just to be clear, I wasn't imagining the AIs on the team of a faction to all be aligned necessarily. In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn't really apply. Like AlphaFold2. Also, I think the relevant variable for homogeneity isn't whether we've solved alignment--maybe it's whether the people making AI think they've solved alignment. If the Chinese and US militaries think AI risk isn't a big deal, and build AGI generals to prosecute the cyberwar, they'll probably use similar designs, even if actually the generals are secretly planning treacherous turns.

I think this depends a ton on your reference class. If you compare AI with military fighter planes: very homogenous. If you compare AI with all vehicles: very heterogenous.

Maybe the outside view can be used to say that all AIs designed for a similar purpose will be homogenous, implying that we only get heterogeneity in a CAIS scenario, where there are many different specialised designs. But I think the outside view also favors a CAIS scenario over a monolithic AI scenario (though that's not necessarily decisive).

4 Daniel Kokotajlo 2y
Yes, but I think we can say something a bit stronger than that: AIs competing with each other will be homogenous. Here's my current model at least: Let's say the competition for control of the future involves N skills: Persuasion, science, engineering, .... etc. Even if we suppose that it's most efficient to design separate AIs for each skill, rather than a smaller number of AIs that have multiple skills each, insofar as there are factions competing for control of the future, they'll have an AI for each of the skills. They wouldn't want to leave one of the skills out, or how are they going to compete? So each faction will consist of a group of AIs working together, that collectively has all the relevant skills. And each of the AIs will be designed to be good at the skill it's assigned, so (via the principle you articulated) each AI will be similar to the other-faction AIs it directly competes with, and the factions as a whole will be pretty similar too, since they'll be collections of similar AIs. (Compare to militaries: Not only were fighter planes similar, and trucks similar, and battleships similar, the armed forces of Japan, USA, USSR, etc. were similar. By contrast with e.g. the conquistadors vs. the Aztecs, or in sci-fi the Protoss vs. the Zerg, etc.)

I find the prospect of multiple independent mesa-optimizers inside of the same system relatively unlikely.

I think Jesse was just claiming that it's more likely that everyone uses an architecture especially prone to mesa optimization. This means that (if multiple people train that architecture from scratch) the world is likely to end up with many different mesa optimizers in it (each localised to a single system). Because of the random nature of mesa optimization, they may all have very different goals.

4 Evan Hubinger 2y
I'm not sure if that's true—see my comments here and here.

I implemented the model for 2020 compute requirements in Guesstimate here. It doesn't do anything that the notebook can't do (and it can't do the update against currently affordable compute), but I find the graphical structure very helpful for understanding how it works (especially with arrows turned on in the "View" menu).
