All of Rohin Shah's Comments + Replies

I forget if I already mentioned this to you, but another example where you can interpret randomization as worst-case reasoning is MaxEnt RL, see this paper. (I reviewed an earlier version of this paper here (review #3).)

Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)

Okay, I understand how that addresses my edit.

I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.

3Thane Ruthenis22d
My impression is that it being a concrete example is the why. "What is the right framework to use?" and "what is the environment-structure in which natural abstractions can be defined?" are core questions of this research agenda, and this sort of multi-layer locality-including causal model is one potential answer. The fact that it loops-in the speed of causal influence is also suggestive — it seems fundamental to the structure of our universe, crops up in a lot of places, so the proposition that natural abstractions are somehow downstream of it is interesting.

Okay, that mostly makes sense.

note that the resampler itself throws away a ton of information about X0 while going from X0 to XT. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.

I agree this is true, but why does the Lightcone theorem matter for it?

It is also a theorem that a Gibbs resampler initialized at equilibrium will produce XT distributed according to P[X], and as you say it's c... (read more)

4johnswentworth23d
Sounds like we need to unpack what "viewing X0 as a latent which generates X" is supposed to mean.

I start with a distribution P[X]. Let's say X is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what X is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as P[X] = ∑_Λ (∏_i P[Xi|Λ]) P[Λ], where Λ is the bias in this case. Then when reasoning/updating, we can usually just think about how an individual die-roll interacts with Λ, rather than all the other rolls, which is useful insofar as Λ is much smaller than all the rolls.

Note that P[X|Λ] is not supposed to match P[X]; then the representation would be useless. It's the marginal ∑_Λ (∏_i P[Xi|Λ]) P[Λ] which is supposed to match P[X].

The lightcone theorem lets us do something similar. Rather than all the Xi's being independent given Λ, only those Xi's sufficiently far apart are independent, but the concept is otherwise similar. We express P[X] as ∑_X0 P[X|X0] P[X0] (or, really, ∑_Λ P[X|Λ] P[Λ], where Λ summarizes info in X0 relevant to X, which is hopefully much smaller than all of X).
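
A minimal numerical sketch of this picture (using a biased coin rather than a die, with made-up numbers): the rolls are correlated marginally because they share information about Λ, but become independent once Λ is held fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sequences, n_rolls = 100_000, 10

# Latent variable Λ: the unknown bias, drawn once per sequence of rolls.
bias = rng.uniform(0.1, 0.9, size=n_sequences)
# Rolls Xi: i.i.d. Bernoulli(Λ) given the bias.
rolls = (rng.random((n_sequences, n_rolls)) < bias[:, None]).astype(int)

# Marginally, rolls are correlated: they share information about Λ.
print("marginal corr(X0, X1):", np.corrcoef(rolls[:, 0], rolls[:, 1])[0, 1])

# Conditional on Λ (here: restricting to a narrow slice of bias values), they are ~independent.
near_half = (bias > 0.49) & (bias < 0.51)
print("corr(X0, X1 | Λ≈0.5):", np.corrcoef(rolls[near_half, 0], rolls[near_half, 1])[0, 1])
```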

The Lightcone Theorem says: conditional on X0, any sets of variables in XT which are a distance of at least 2T apart in the graphical model are independent.

I am confused. This sounds to me like:

If you have sets of variables that start with no mutual information (conditioning on X0), and they are so far away that nothing other than X0 could have affected both of them (distance of at least 2T), then they continue to have no mutual information (independent).

Some things that I am confused about as a result:

  1. I don't se
... (read more)
7johnswentworth23d
Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance 2T implies that nothing other than X0 could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.

It does, but then XT doesn't have the same distribution as the original graphical model (unless we're running the sampler long enough to equilibrate). So we can't view X0 as a latent generating that distribution.

Not quite - note that the resampler itself throws away a ton of information about X0 while going from X0 to XT. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes. So the reason this is interesting (for the thing you're pointing to) is not that it lets us ignore information from far-away parts of XT which could not possibly have been relevant given X0, but rather that we want to further throw away information from X0 itself (while still maintaining conditional independence at a distance).
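
A toy numerical check of the conditional-independence intuition (this is not the post's exact construction; it assumes a synchronous nearest-neighbour resampler on a chain of ±1 spins with an arbitrary coupling strength): sites more than 2T apart stay uncorrelated once X0 is fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 41, 5            # chain length, number of resampling steps
beta = 1.0              # nearest-neighbour coupling strength (arbitrary illustrative value)
n_samples = 20_000

def resample(x0):
    """Synchronous toy resampler: each step, every site is redrawn conditional on its
    neighbours' previous values, using fresh independent noise."""
    x = x0.copy()
    for _ in range(T):
        left = np.concatenate(([0], x[:-1]))
        right = np.concatenate((x[1:], [0]))
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * (left + right)))  # P(site = +1 | neighbours)
        x = np.where(rng.random(N) < p_plus, 1, -1)
    return x

x0 = rng.choice([-1, 1], size=N)                       # fix a single X0 and condition on it
samples = np.array([resample(x0) for _ in range(n_samples)])

i, j = 5, 35                                           # distance 30 > 2T = 10
corr = np.corrcoef(samples[:, i], samples[:, j])[0, 1]
print(f"corr(XT[{i}], XT[{j}] | X0) ≈ {corr:.3f}")     # ≈ 0, up to sampling noise
```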

I agree that there's a threshold for "can meaningfully build and chain novel abstractions" and this can lead to a positive feedback loop that was not previously present, but there will already be lots of positive feedback loops (such as "AI research -> better AI -> better assistance for human researchers -> AI research") and it's not clear why to expect the new feedback loop to be much more powerful than the existing ones.

(Aside: we're now talking about a discontinuity in the gradient of capabilities rather than of capabilities themselves, but sufficiently large discontinuities in the gradient of capabilities have much of the same implications.)

3Thane Ruthenis1mo
Yeah, the argument here would rely on the assumption that e. g. the extant scientific data already uniquely constrain some novel laws of physics/engineering paradigms/psychological manipulation techniques/etc., and we would be eventually able to figure them out even if science froze right this moment. In this case, the new feedback loop would be faster because superintelligent cognition would be faster than real-life experiments. And I think there's a decent amount of evidence for this. Consider that there are already narrow AIs that can solve protein folding more efficiently than our best manually-derived algorithms — which suggests that better algorithms are already uniquely constrained by the extant data, and we've just been unable to find them. Same may be true for all other domains of science — and thus, a superintelligence iterating on its own cognition would be able to outspeed human science.

Oh, I disagree with your core thesis that the general intelligence property is binary. (Which then translates into disagreements throughout the rest of the post.) But experience has taught me that this disagreement tends to be pretty intractable to talk through, and so I now try just to understand the position I don't agree with, so that I can notice if its predictions start coming true.

You mention universality, active adaptability and goal-directedness. I do think universality is binary, but I expect there are fairly continuous trends in some underlying l... (read more)

3Thane Ruthenis1mo
Interesting, thanks. Agreed that this point (universality leads to discontinuity) probably needs to be hashed out more. Roughly, my view is that universality allows the system to become self-sustaining. Prior to universality, it can't autonomously adapt to novel environments (including abstract environments, e. g. new fields of science). Its heuristics have to be refined by some external ground-truth signals, like trial-and-error experimentation or model-based policy gradients. But once the system can construct and work with self-made abstract objects, it can autonomously build chains of them [https://www.lesswrong.com/posts/w26TY2tFHRTdposre/why-are-some-problems-super-hard?commentId=CsHFp7pWJYRRfoJGX] — and that causes a shift in the architecture and internal dynamics, because now its primary method of cognition is iterating on self-derived abstraction chains, instead of using hard-coded heuristics/modules. 

Okay, this mostly makes sense now. (I still disagree but it no longer seems internally inconsistent.)

Fwiw, I feel like if I had your model, I'd be interested in:

  1. Producing tests for general intelligence. It really feels like there should be something to do here, that at least gives you significant Bayesian evidence. For example, filter the training data to remove anything talking about <some scientific field, e.g. complexity theory>, then see whether the resulting AI system can invent that field from scratch if you point it at the problems that motiva
... (read more)
1Thane Ruthenis1mo
I agree that those are useful pursuits. Mind gesturing at your disagreements? Not necessarily to argue them, just interested in the viewpoint.

Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?

Discontinuity ending (without stalling):

Stalling:

Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).

Are you imagining systems that are built differently from today? Because I'm not seeing how SGD could ... (read more)

3Thane Ruthenis1mo
Ah, makes sense. I do expect that some sort of ability to reprogram itself at inference time will be ~necessary for AGI, yes. But I also had in mind something like your "SGD creates a set of weights that effectively treats the input English tokens as a programming language" example. In the unlikely case that modern transformers are AGI-complete, I'd expect something on that order of exoticism to be necessary (but it's not my baseline prediction).

"Doing science" is meant to be covered by "lack of empirical evidence that there's anything in the universe that humans can't model". Doing science implies the ability to learn/invent new abstractions, and we're yet to observe any limits to how far we can take it / what that trick allows us to understand.

Mmm. Consider a scheme like the following:

  • Let T2 be the current date.
  • Train an AI on all of humanity's knowledge up to a point in time T1, where T1<T2.
  • Assemble a list D of all scientific discoveries made in the time period (T1;T2].
  • See if the AI can replicate these discoveries.

At face value, if the AI can do that, it should be considered able to "do science" and therefore AGI, right? I would dispute that. If the period (T1;T2] is short enough, then it's likely that most of the cognitive work needed to make the leap to any discovery in D is already present in the data up to T1. Making a discovery from that starting point doesn't necessarily require developing new abstractions/doing science — it's possible that it may be done just by interpolating between a few already-known concepts. And here, some asymmetry between humans and e. g. SOTA LLMs becomes relevant:

  • No human knows everything the entire humanity knows. Imagine if making some discovery in D by interpolation required combining two very "distant" concepts, like a physics insight and advanced biology knowledge. It's unlikely that there'd be a human with sufficient expertise in both, so a human will likely do it by actual-scie
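
A minimal mock-up of the shape of this scheme (every name, date, and data point below is a placeholder, and the actual training and judging steps are exactly the hard parts being elided):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    text: str
    published: date

@dataclass
class Discovery:
    name: str
    made_on: date

T1, T2 = date(2015, 1, 1), date(2023, 1, 1)   # illustrative cutoff date and "current" date

corpus = [
    Document("a pre-cutoff physics paper", date(2010, 5, 1)),
    Document("a post-cutoff biology paper", date(2019, 3, 2)),
]
all_discoveries = [Discovery("some result announced after the cutoff", date(2020, 11, 30))]

training_corpus = [doc for doc in corpus if doc.published <= T1]   # knowledge up to T1 only
D = [d for d in all_discoveries if T1 < d.made_on <= T2]           # discoveries made in (T1; T2]

def can_replicate(model, discovery: Discovery) -> bool:
    # Placeholder judging step; in practice this needs expert grading of the model's output.
    return False

model = None  # stand-in for "train an AI on training_corpus"
replication_rate = sum(can_replicate(model, d) for d in D) / max(len(D), 1)
print(f"replicated {replication_rate:.0%} of the held-out discoveries")
```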

See Section 5 for more discussion of all of that.

Sorry, I seem to have missed the problems mentioned in that section on my first read.

There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions.

I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.

(I care about this because I think it cuts against this point: We only have one shot. There will be a sharp discontinuity in capabilities... (read more)

3Thane Ruthenis1mo
Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two? Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts). The strategy of slowly scaling our AI up is workable at the core, but IMO there are a lot of complications: * A "mildly-superhuman" AGI, or even just a genius-human AGI, is still be an omnicide risk [https://www.lesswrong.com/posts/oBBzqkZwkxDvsKBGB/ai-could-defeat-all-of-us-combined#How_AIs_could_defeat_humans_without__superintelligence_] (see also [https://www.lesswrong.com/posts/KTbGuLTnycA6wKBza/what-would-a-fight-between-humanity-and-agi-look-like]). I wouldn't want to experiment with that; I would want it safely at average-human-or-below level. It's likely hard to "catch" it at that level by inspecting its external behavior, though: can only be reliably done via advanced interpretability tools. * Deceptiveness (and manipulation) is a significant factor, as you've mentioned. Even just a mildly-superhuman AGI will likely be very good at it. Maybe not implacably good, but it'd be like working bare-handed with an extremely dangerous chemical substance, with the entire humanity at the stake. * The problem of "iterating" on this system. If we have just a "weak" AGI on our hands, it's mostly a pre-AGI system, with a "weak" general-intelligence component that doesn't control much. Any "naive" approaches, like blindly training interpretability probes on it or something, would likely ignore that weak GI component, and focus mainly on analysing or shaping heuristics/shards. To get the right kind of experience from it, we'd have to very precisely aim our experiments at the GI component — which, again, likely

What ties it all together is the belief that the general-intelligence property is binary.

Do any humans have the general-intelligence property?

If yes, after the "sharp discontinuity" occurs, why won't the AGI be like humans (in particular: generally not able to take over the world?)

If no, why do we believe the general-intelligence property exists?

3Thane Ruthenis1mo
Yes, ~all of them. Humans are not superintelligent because despite their minds embedding the algorithm for general intelligence, that algorithm is still resource-constrained (by the brain's compute) and privilege-constrained within the mind (e. g., it doesn't have full write-access to our instincts).

There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions. On the contrary: even if we resolve to check for "AGI-ness" often, with the intent of stopping the training the moment our AI becomes true AGI but still human-level or below it, we're likely to miss the right moment without advanced interpretability tools, and scale it past "human-level" straight to "impossible-to-ignore superintelligent". There would be no warning signs, because "weak" AGI (human-level or below) can't be clearly distinguished from a very capable pre-AGI system, based solely on externally-visible behaviour.

See Section 5 [https://www.lesswrong.com/posts/3JRBqRtHBDyPE3sGa/a-case-for-the-least-forgiving-take-on-alignment#5__A_Caveat] for more discussion of all of that. Quoting from my discussion with cfoster0 [https://www.lesswrong.com/posts/3JRBqRtHBDyPE3sGa/a-case-for-the-least-forgiving-take-on-alignment?commentId=xzZsPjjEHTnoGwqbJ]:

So here's a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you've got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.

The abstract says "we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt." This clearly can't be true in full generality, and I wish the abstract would give me some hint about ... (read more)

Interestingly, I apparently had a median around 2040 back in 2019, so my median is still later than it used to be prior to reading the bio anchors report.

Indeed I am confused why people think Goodharting is effectively-100%-likely to happen and also lead to all the humans dying. Seems incredibly extreme. All the examples people give of Goodharting do not lead to all the humans dying.

(Yes, I'm aware that the arguments are more sophisticated than that and "previous examples of Goodharting didn't lead to extinction" isn't a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like "that's a wildly strong conclusion you've drawn from a pretty h... (read more)

I'm not claiming that you figure out whether the model's underlying motivations are bad. (Or, reading back what I wrote, I did say that but it's not what I meant, sorry about that.) I'm saying that when the model's underlying motivations are bad, it may take some bad action. If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.

It's plausible that you then get a model with bad motivations that knows not to produce bad... (read more)

1David Xu3mo
but, but, standard counterargument imperfect proxies Goodharting magnification of error adversarial amplification etc etc etc? (It feels weird that this is a point that seems to consistently come up in discussions of this type, considering how basic of a disagreement it really is, but it really does seem to me like lots of things come back to this over and over again?)

I think you're missing the primary theory of change for all of these techniques, which I would say is particularly compatible with your "follow-the-trying" approach.

While all of these are often analyzed from the perspective of "suppose you have a potentially-misaligned powerful AI; here's what would happen", I view that as an analysis tool, not the primary theory of change.

The theory of change that I most buy is that as you are training your model, while it is developing the "trying", you would like it to develop good "trying" and not bad "trying", and one... (read more)

4Steve Byrnes3mo
Thanks, that helps! You’re working under a different development model than me, but that’s fine. It seems to me that the real key ingredient in this story is where you propose to update the model based on motivation and not just behavior—“penalize it instead of rewarding it” if the outputs are “due to instrumental / deceptive reasoning”. That’s great. Definitely what we want to do. I want to zoom in on that part. You write that “debate / RRM / ELK” are supposed to “allow you to notice” instrumental / deceptive reasoning. Of these three, I buy the ELK story—ELK is sorta an interpretability technique, so it seems plausible that ELK is relevant to noticing deceptive motivations (even if the ELK literature is not really talking about that too much at this stage, per Paul’s comment [https://www.lesswrong.com/posts/PDx4ueLpvz5gxPEus/why-i-m-not-working-on-debate-rrm-elk-natural-abstractions?commentId=Lusyr3RdHCHvdNKXn]). But what about debate & RRM? I’m more confused about why you brought those up in this context. Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something? (We can give the debaters / the reward model a printout of model activations alongside the model’s behavioral outputs. But I’m not sure what the next step of the story is, after that. How do the debaters / reward model learn to skillfully interpret the model activations to extract underlying motivations?)

Yes, that's right, though I'd say "probable" not "possible" (most things are "possible").

Depends what the aligned sovereign does! Also depends what you mean by a pivotal act!

In practice, during the period of time where biological humans are still doing a meaningful part of alignment work, I don't expect us to build an aligned sovereign, nor do I expect to build a single misaligned AI that takes over: I instead expect there to be a large number of AI systems, that could together obtain a decisive strategic advantage, but could not do so individually.

4David Johnston3mo
So, if I'm understanding you correctly:

  • if it's possible to build a single AI system that executes a catastrophic takeover (via self-bootstrap or whatever), it's also probably possible to build a single aligned sovereign, and so in this situation winning once is sufficient
  • if it is not possible to build a single aligned sovereign, then it's probably also not possible to build a single system that executes a catastrophic takeover and so the proposition that the model only has to win once is not true in any straightforward way
  • in this case, we might be able to think of "composite AI systems" that can catastrophically take over or end the acute risk period, and for similar reasons as in the first scenario, winning once with a composite system is sufficient, but such systems are not built from single acts

and you think the second scenario is more likely than the first.

I think that skews it somewhat but not very much. We only have to "win" once in the sense that we only need to build an aligned Sovereign that ends the acute risk period once, similarly to how we only have to "lose" once in the sense that we only need to build a misaligned superintelligence that kills everyone once.

(I like the discussion on similar points in the strategy-stealing assumption.)

4David Johnston3mo
Is building an aligned sovereign to end the acute risk period different to a pivotal act in your view?

[People at AI labs] expected heavy scrutiny by leadership and communications teams on what they can state publicly. [...] One discussion with a person working at DeepMind is pending approval before publication. [...] We think organizations discouraging their employees from speaking openly about their views on AI risk is harmful, and we want to encourage more openness.

(I'm the person in question.)

I just want to note that in the case of DeepMind:

  • I don't expect "heavy" scrutiny by leadership and communications teams (though it is not literally zero)
  • For the di
... (read more)

Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:

  1. Capable model: The base model has the necessary capabilities and knowledge to avoid the failure.
  2. Malleable motivations: There is a "nearby" model M_good (i.e. a model with minor changes to the weights relative to the M_orig) that uses its capabilities to avoid the failure. (Combined with (1), this means it behaves like M_orig except in cases that show the failure, where it does something better.)
  3. Stron
... (read more)

In that case, I think you should try and find out what the incentive gradient is like for other people before prescribing the actions that they should take. I'd predict that for a lot of alignment researchers your list of incentives mostly doesn't resonate, relative to things like:

  1. Active discomfort at potentially contributing to a problem that could end humanity
  2. Social pressure + status incentives from EAs / rationalists to work on safety and not capabilities
  3. Desire to work on philosophical or mathematical puzzles, rather than mucking around in the weeds of
... (read more)

Errr, I feel like we already agree on this point?

Yes, sorry, I realized that right after I posted and replaced it with a better response, but apparently you already saw it :(

What I meant to say was "I think most of the time closing overhangs is more negative than positive, and I think it makes sense to apply OP's higher bar of scrutiny to any proposed overhang-closing proposal".

But like, why? I wish people would argue for this instead of flatly asserting it and then talking about increased scrutiny or burdens of proof (which I also don't like).

3leogao4mo
I think maybe the crux is the part about the strength of the incentives towards doing capabilities. From my perspective it generally seems like this incentive gradient is pretty real: getting funded for capabilities is a lot easier, it's a lot more prestigious and high status in the mainstream, etc. I also myself viscerally feel the pull of wishful thinking (I really want to be wrong about high P(doom)!) and spend a lot of willpower trying to combat it (but also not so much that I fail to update where things genuinely are not as bad as I would expect, but also not allowing that to be an excuse for wishful thinking, etc...). 

To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold. 

As you note, the second claim is false for the model the OP mentions. I don't care about the first claim once you know whether the second claim is true or false, which is the important part.

I agree it could be true in practice in other models but I am unhappy about the pattern where someone makes a claim based on arguments that are ... (read more)

3leogao4mo
"This [model] is zero evidence for the claim" is a roughly accurate view of my opinion. I think you're right that epistemically it would have been much better for me to have said something along those lines. Will edit something into my original comment.
3leogao4mo
Errr, I feel like we already agree on this point? Like I'm saying almost exactly the same thing you're saying; sorry if I didn't make it prominent enough: I'm also not claiming this is an accurate model; I think I have quite a bit of uncertainty as to what model makes the most sense. I was not intending to make a claim of this strength, so I'll walk back what I said. What I meant to say was "I think most of the time the benefit of closing overhangs is much smaller than the cost of reduced timelines, and I think it makes sense to apply OP's higher bar of scrutiny to any proposed overhang-closing proposal". I think I was thinking too much inside my inside view when writing the comment, and baking in a few other assumptions from my model (including: closing overhangs benefiting capabilities at least as much, research being kinda inefficient (though not as inefficient as OP thinks probably)). I think on an outside view I would endorse a weaker but directionally same version of my claim. In my work I do try to avoid advancing capabilities where possible, though I think I can always do better at this.

Progress often follows an s-curve, which appears exponential until the current research direction is exploited and tapers off. Moving an exponential up, even a little, early on can have large downstream consequences:

Your graph shows "a small increase" that represents progress that is equal to an advance of a third to a half the time left until catastrophe on the default trajectory. That's not small! That's as much progress as everyone else combined achieves in a third of the time till catastrophic models! It feels like you'd have to figure out some newer e... (read more)

Not OP, just some personal takes:

That's not small!

To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold. (To be clear, I don't think the exponential model presented provides evidence for this latter claim)

I think the first claim is obviously true. The second claim could be true in practice, though I feel quite uncertain about this. It happens to be false in the specific model of moving an ex... (read more)
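
A quick numerical sketch of that specific model, with made-up constants: if progress is exactly exponential, a one-time boost of (1+eps) shifts the time to reach any fixed threshold by ln(1+eps)/k, a constant that does not depend on when the boost is applied or how far away the threshold is, so an early boost does not compound into a larger shortening of the timeline.

```python
import numpy as np

C0, k, C_star = 1.0, 0.5, 1000.0   # made-up starting capability, growth rate, threshold
eps = 0.10                         # a 10% one-time boost to capabilities

# Time at which C(t) = C0 * exp(k * t) reaches the threshold, with and without the boost.
t_baseline = np.log(C_star / C0) / k
t_boosted = np.log(C_star / ((1 + eps) * C0)) / k

print(f"time to threshold without boost: {t_baseline:.2f}")
print(f"time to threshold with boost:    {t_boosted:.2f}")
print(f"time saved: {t_baseline - t_boosted:.3f}")   # = ln(1 + eps) / k, a constant shift
```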

And now, it seems like we agree that the pseudocode I gave isn't a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no? 

I don't think I agree with this.

At a high level, your argument can be thought of as having two steps:

  1. Grader-optimizers are bad, because of problem P.
  2. Approval-directed agents / [things built by IDA, debate, RRM] are grader-optimizers.

I've been t... (read more)

Nice comment!

The arguments you outline are the sort of arguments that have been considered at CHAI and MIRI quite a bit (at least historically). The main issue I have with this sort of work is that it talks about how an agent should reason, whereas in my view the problem is that even if we knew how an agent should reason we wouldn't know how to build an agent that efficiently implements that reasoning (particularly in the neural network paradigm). So I personally work more on the latter problem: supposing we know how we want the agent to reason, how do we ... (read more)

3Roger Dearnaley4mo
I agree, this is only a proposal for a solution to the outer alignment problem.

On the optimizer's curse, information value, and risk aversion aspects you mention, I think I agree that a sufficiently rational agent should already be thinking like that: any GAI that is somehow still treating the universe like a black-box multi-armed bandit isn't going to live very long and should be fairly easy to defeat (hand it 1/epsilon opportunities to make a fatal mistake, all labeled with symbols it has never seen before). Optimizing while not allowing for the optimizer's curse is also treating the universe like a multi-armed bandit, not even with probability epsilon of exploring: you're doing a cheap all-exploration strategy on your utility uncertainty estimates, which will cause you to sequentially pull the handles on all your overestimates until you discover the hard way that they're all just overestimates. This is not rational behavior for a powerful optimizer, at least in the presence of the possibility of a really bad outcome, so not doing it should be convergent, and we shouldn't build a near-human AI that is still making that mistake.

Edit: I expanded this comment into a post, at: https://www.lesswrong.com/posts/ZqTQtEvBQhiGy6y7p/breaking-the-optimizer-s-curse-and-consequences-for-1
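
A minimal illustration of the optimizer's curse under discussion, assuming purely for illustration that every option has the same true value and the agent's utility estimates carry independent Gaussian noise: naively picking the argmax of the estimates systematically overstates the value of whatever gets chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
n_options, noise_sd, n_trials = 20, 1.0, 10_000

true_values = np.zeros(n_options)   # every option is, in truth, equally (un)attractive

chosen_estimates, realized_values = [], []
for _ in range(n_trials):
    estimates = true_values + rng.normal(0.0, noise_sd, n_options)  # noisy utility estimates
    best = int(np.argmax(estimates))                                # naively optimize on the estimates
    chosen_estimates.append(estimates[best])
    realized_values.append(true_values[best])

print(f"mean estimated value of the chosen option: {np.mean(chosen_estimates):.2f}")  # systematically > 0
print(f"mean true value of the chosen option:      {np.mean(realized_values):.2f}")   # exactly 0 here
```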

I don't really disagree with any of what you're saying but I also don't see why it matters.

I consider myself to be saying "you can't just abstract this system as 'trying to make evaluations come out high'; the dynamics really do matter, and considering the situation in more detail does change the conclusions."

I'm on board with the first part of this, but I still don't see the part where it changes any conclusions. From my perspective your responses are of the form "well, no, your abstract argument neglects X, Y and Z details" rather than explaining how X, ... (read more)

2Alex Turner4mo
Uh, I'm confused. From your original comment [https://www.lesswrong.com/posts/rauMEna2ddf26BqiE/alignment-allows-nonrobust-decision-influences-and-doesn-t?commentId=DXJKvisMDqco5Trb5] in this thread:

You also said:

And now, it seems like we agree that the pseudocode I gave isn't a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?

Sounds like we mostly disagree on cumulative effort to: (get a grader-optimizer to do good things) vs (get a values-executing agent to do good things). We probably perceive the difficulty as follows:

  1. Getting the target configuration into an agent
    1. Grader-optimization
      1. Alex: Very very hard
      2. Rohin: Hard
    2. Values-executing
      1. Alex: Moderate/hard
      2. Rohin: Hard
  2. Aligning the target configuration such that good things happen (e.g. makes diamonds), conditional on the intended cognitive patterns being instilled to begin with (step 1)
    1. Grader-optimization
      1. Alex: Extremely hard
      2. Rohin: Very hard
    2. Values-executing
      1. Alex: Hard
      2. Rohin: Hard

Does this seem reasonable? We would then mostly disagree on relative difficulty of 1a vs 1b.

Separately, I apologize for having given an incorrect answer earlier, which you then adopted, and then I berated you for adopting my own incorrect answer -- how simplistic of you! Urgh. I had said:

But I should also have mentioned the change in planModificationSample. Sorry about that.

The edits help, thanks. I was in large part reacting to the fact that Kaj's post reads very differently from your summary of Bad Argument 1 (rather than the fact that I don't make Bad Argument 1). In the introductory paragraph where he states his position (the third paragraph of the post), he concludes:

Thus by doing capabilities research now, we buy ourselves a longer time period in which it's possible to do more effective alignment research.

Which is clearly not equivalent to "alignment researchers hibernate for N years and then get back to work".

Plausibly... (read more)

4Steve Byrnes4mo
On further reflection, I promoted the thing from a footnote to the main text, elaborated on it, and added another thing at the end. (I think I wrote this post in a snarkier way than my usual style, and I regret that. Thanks again for the pushback.)

Fwiw, when talking about risks from deploying a technology / product, "accident" seems (to me) much more like ascribing blame ("why didn't they deal with this problem?"), e.g. the Boeing 737-MAX incidents are "accidents" and people do blame Boeing for them. In contrast "structural" feels much more like "the problem was in the structure, there was no specific person or organization that was in the wrong".

I agree that in situations that aren't about deploying a technology / product, "accident" conveys a lack of blameworthiness.

1David Scott Krueger4mo
Hmm... this is a good point. I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way.  One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.

I recall some people in CHAI working on a minecraft AI that could help players do useful tasks the players wanted.

I think you're probably talking about my work. This is more of a long-term vision; it isn't doable (currently) at academic scales of compute. See also the "Advantages of BASALT" section of this post.

(Also I just generically default to Minecraft when I'm thinking of ML experiments that need to mimic some aspect of the real world, precisely because "the game getting played here is basically the same thing real life society is playing".)

Okay, then let me try to directly resolve my confusion. My current understanding is something like - in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we've screened off evolution, the size of the genome, etc - all of that is going into the "with certain structural parameters" part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same

... (read more)

Downvoted for mischaracterizing the views of the people you're arguing against in the "Bad argument 1" section.

(Footnote 2 is better, but it's a footnote, not the main text.)

EDIT: Removed my downvote given the edits.

2Steve Byrnes4mo
Thanks, I just added the following text: I know that you don’t make Bad Argument 1—you were specifically one of the people I was thinking of when I wrote Footnote 2. I disagree that nobody makes Bad Argument 1. I think that Lone Pine’s comment on this very post [https://www.lesswrong.com/posts/MCWGCyz2mjtRoWiyP/endgame-safety-for-agi?commentId=GHHBFaR7kYr6w7GDG] is probably an example. I have seen lots of other examples, although I’m having trouble digging up other ones right now. I guess you can say it’s unvirtuous / un-scout-mindset of me to spend more time refuting bad arguments for positions I disagree with, than refuting bad arguments for positions I agree with? Hmm. I also changed the Kaj link from “Example of this argument” to “Example of something close to this argument”. As a matter of fact, I do actually think that Kaj’s post had some actual Bad-Argument-1-thinking slipping in in various places in his text. At least, that’s how it came across to me. But it’s probably not a good use of time to argue about that.
0TAG4mo
Of course, Hawkins doesn't just say they are stupid. It is Byrnes who is summarily dismissing Hawkins, in fact.

Sorry, didn't see this until now (didn't get a notification, since it was a reply to Buck's comment).

I'm guessing your take is like "I, Buck/Rohin, could write a review that was epistemically adequate, but I'm busy and don't expect it to accomplish anything that useful."

In some sense yes, but also, looking at posts I've commented on in the last ~6 months, I have written several technical reviews (and nontechnical reviews). And these are only the cases where I wrote a comment that engaged in detail with the main point of the post; many of my other comments ... (read more)

My summary of your argument now would be:

  1. Deployment lag: it takes time to deploy stuff
  2. Worries about AI misalignment: the world will believe that AI alignment is hard, and so avoid deploying it until doing a lot of work to be confident in alignment.
  3. Regulation: it takes time to comply with regulations

If that's right, I broadly agree with all of these points :)

(I previously thought you were saying something very different with (2), since the text in the OP seems pretty different.)

3Matthew Barnett4mo
FWIW I don't think you're getting things wrong here. I also have simply changed some of my views in the meantime. That said, I think what I was trying to accomplish with (2) was not that alignment would be hard per se, but that it would be hard to get an AI to do very high-skill tasks in general, which included aligning the model, since otherwise it's not really "doing the task" (though as I said, I don't currently stand by what I wrote in the OP, as-is).

I want to distinguish between two questions:

  1. At some specified point in the future, will people believe that AI CEOs can perform the CEO task as well as human CEOs if deployed?
  2. At some specified point in the future, will AI CEOs be able to perform the CEO task as well as human CEOs if deployed?

(The key difference being that (1) is a statement about people's beliefs about reality, while (2) is a statement about reality directly.)

(For all of this I'm assuming that an AI CEO that does the job of CEO well until the point that it executes a treacherous turn count... (read more)

3Matthew Barnett4mo
I think I understand my confusion, at least a bit better than before. Here's how I'd summarize what happened.

I had three arguments in this essay, which I thought of as roughly having the following form:

  1. Deployment lag: after TAI is fully developed, how long will it take to become widely impactful?
  2. Generality: how difficult is it to develop TAI fully, including making it robustly and reliably achieve what we want?
  3. Regulation: how much will people's reactions to and concerns about AI delay the arrival of fully developed TAI?

You said that (2) was already answered by the bio anchors model. I responded that bio anchors neglected how difficult it will be to develop AI safely. You replied that it will be easy to make models that seemingly do what we want, but that the harder part will be making models that actually do what we want. My reply was trying to say that the inherent difficulty of building TAI safely was inherently baked into (2) already. That might be a dubious reading of the actual textual argument for (2), but I think that interpretation is backed up by my initial reply to your comment.

The reason why I framed my later reply as being about perceptions was because I think the requisite capability level at which people begin to adopt TAI is an important point about how long timelines will be independent of (1) and (3). In other words, I was arguing that people's perceptions of the capability of AI will cause them to wait to adopt AI until it's fully developed in the sense I described above; it won't just delay the effects of TAI after it's fully developed, or before then because of regulation. Furthermore, I assumed that you were arguing something along the lines of "people will adopt AI once it's capable of only seeming to do what we want", which I'm skeptical of. Hence my reply to you.

Since for point 2 you said "I'm assuming that an AI CEO that does the job of CEO well until the point that it executes a treacherous turn", I am not very

Overall disagreement:

I've remained somewhat confused about the exact grader/non-grader-optimizer distinction I want to draw. At least, intensionally. (which is why I've focused on giving examples, in the hope of getting the vibe across.)

Yeah, I think I have at least some sense of how this works in the kinds of examples you usually discuss (though my sense is that it's well captured by the "grader is complicit" point in my previous comment, which you presumably disagree with).

But I don't see how to extend the extensional definition far enough to get to the ... (read more)

3Alex Turner4mo
Having a diamondShard and a diamondGraderShard will mean that the generative models will be differently tuned! Not only does an animal-welfare activist grade plans based on predictions about different latent quantities (e.g. animal happiness) than a businessman (e.g. how well their firm does), the two will sample different plans from self.WM.planModificationSample! The vegan and businessman have different generative models because they historically cared about different quantities, and so collected different training data, which differently refined their predictive and planning machinery...

One of my main lessons was (intended to be) that "agents" are not just a "generative model" and a "grading procedure", with each slot hot-swappable to different models or graders! One should not reduce a system to "where the plans come from" and "how they get graded"; these are not disjoint slots in practice (even though they are disjoint, in theory). Each system has complex and rich dynamics, and you need to holistically consider what plans get generated and how they get graded in order to properly predict the overall behavior of a system.

To address our running example—if an agent has a diamondGraderShard, that was brought into existence by reinforcement events for making the diamondGrader output a high number. This kind of agent has internalized tricks and models around the diamondGrader in particular, and would e.g. freely generate plans like "study the diamondGrader implementation."

On the other hand, the diamondShard agent would be tuned to generate plans which have to do with diamonds. It's still true that an "accidental" / "upwards-noise" generation could trick the internal diamond grader, but there would not have been historical reinforcement events which accrued into internal generative models which e.g. sample plans about doing adversarial attacks on parts of the agent's own cognition. So I would in fact be surprised to find a free-standing diamond-shard-agent ge

However, I don't see why these arguments would apply to humans

Okay, I'll take a stab at this.

6. Word Prediction is not Intelligence

"The kinds of humans that we are worried about are the kinds of humans that can do original scientific research and autonomously form plans for taking over the world. Human brains learn to take actions and plans that previously led to high rewards (outcomes like eating food when hungry, having sex, etc)*. These two things are fundamentally not the same thing. Why, exactly, would we expect that a system that is good at the latte... (read more)

1Joar Skalse4mo
"" The kinds of humans that we are worried about are the kinds of humans that can do original scientific research and autonomously form plans for taking over the world. Human brains learn to take actions and plans that previously led to high rewards (outcomes like eating food when hungry, having sex, etc)*. These two things are fundamentally not the same thing. Why, exactly, would we expect that a system that is good at the latter necessarily would be able to do the former?" "" This feels like a bit of a digression, but we do have concrete examples of systems that are good at eating food when hungry, having sex, and etc, without being able to do original scientific research and autonomously form plans for taking over the world; animals. And the difference between humans and animals isn't just that humans have more training data (or even that we are that much better at survival and reproduction in the environment of evolutionary adaptation). But I should also note that I consider argument 6 to be one of the weaker arguments I know of. "" We know, from computer science, that it is very powerful to be able to reason in terms of variables and operations on variables. It seems hard to see how you could have human-level intelligence without this ability. However, humans do not typically have this ability, with most human brains instead being more analogous to Boolean circuits, given their finite size and architecture of neuron connections. "" The fact that human brains have a finite size and architecture of neuron connections does not mean that they are well-modelled as Boolean circuits. For example, a (real-world) computer is better modelled as a Turing machine than as a finite-state automaton, even though there is a sense in which they actually are finite-state automata.  The brain is made out of neurons, yes, but it matters a great deal how those neurons are connected. Depending on the answer to that question, you could end up with a system that behaves more like B

A couple of reasons:

  1. It's far easier for me to figure out how much to update on evidence when someone else has looked at the details and highlighted ways in which the evidence is stronger or weaker than a reader might naively take away from the paper. (At least, assuming the reviewer did a good job.)
    1. This doesn't apply to big-picture reviews because such reviews are typically a rehash of old arguments I already know.
    2. This is similar to the general idea in AI safety via debate -- when you have access to a review you are more like a judge; without a review you
... (read more)

Meta: A lot of this seems to have the following form:

You: Here is an argument that neural networks have property X.

Me: But that argument as you've stated it would imply that humans have property X, which is false.

You: Humans and neural networks work differently, so it wouldn't be surprising if neural networks have property X and humans don't.

I think you are misunderstanding what I am trying to do here. I'm not trying to claim that humans and neural networks will have the same properties or be identical. I'm trying to evaluate how much I should update on th... (read more)

3Joar Skalse5mo
Yes, this is of course very sensible. However, I don't see why these arguments would apply to humans, unless you make some additional assumption or connection that I am not making. Considering the rest of the conversation, I assume the difference is that you draw a stronger analogy between brains and deep learning systems than I do? I want to ask a question that goes something like "how correlated is your credence that arguments 5-10 apply to human brains with your credence that human brains and deep learning systems are analogous in important sense X"? But because I don't quite know what your beliefs are, or why you say that arguments 5-10 apply to humans, I find it hard to formulate this question in the right way. For example, regarding argument 7 (language of thought), consider the following two propositions: 1. Some part of the human brain is hard-coded to use LoT-like representations, and the way that these representations are updated in response to experience is not analogous to gradient descent. 2. Updating the parameters of a neural network with gradient descent is very unlikely to yield (and maintain) LoT-like representations. These claims could both be true simultaneously, no? Why, concretely, do you think that arguments 5-10 apply to human brains? It is empirically true that the resulting system has strong and general capabilities, there is no need to question that. What I mean is that this is evidence that those capabilities are a result of information processing that is quite dissimilar from what humans do, which in turn opens up the possibility that those processes could not be re-tooled to create the kind of system that could take over the world. In particular, they could be much more shallow than they seem. It is not hard to argue that a model with general capabilities for reasoning, hypothesis generation, and world modelling, etc, would get a good score at the task of an LLM. However, I think one of the central lessons from the h

Arguments 6-10 seem like the most interesting ones (as they respond more directly to the argument). But for all of them except argument 6, it seems like the same argument would imply that humans would not be generally intelligent.

[Argument 6]

The kinds of AI systems that we are worried about are the kinds of systems that can do original scientific research and autonomously form plans for taking over the world. LLMs are trained to write text that would be maximally unsurprising if found on the internet. These two things are fundamentally not the same thing.

... (read more)
3Joar Skalse5mo
Why is that? There are a few ways to respond. First of all, what comes after "plausibly" could just turn out to be wrong. Many people thought human-level chess would require human-like strategising, but this turned out to be wrong (though the case for text prediction is certainly more convincing). Secondly, an LLM is almost certainly not learning the lowest K-complexity program for text prediction, and given that, the case becomes less clear. For example, suppose an LLM instead learns a truly massive ensemble of simple heuristics, that together produce human-like text. It seems plausible that such an ensemble could produce convincing results, but without replicating logic, reasoning, and etc. IBM-Watson did something along these lines. Studies such as this one [https://aclanthology.org/2021.emnlp-main.230/] also provide some evidence for this perspective. To give an intuition pump, suppose we trained an extremely large random forest classifier on the same data as GPT3 was trained on. How good would the output of this classifier be? While it would probably not be as good as GPT3, it would probably still be very impressive. And a random forest classifier is also a universal function approximator, whose performance keeps improving as it is given more training data. I'm sure there are scaling laws for them. But I don't think many people believe that we could get AGI by making a sufficiently big random forest classifier for next-token prediction. Why is that? I have found this to be an interesting prompt to think about. For me, a gestalt shift that makes long time lines seem plausible is to look at LLMs sort of like how you would look at a giant random forest classifier. (Also, just to reiterate, I am not personally convinced of long time-lines, I am just trying to make the best arguments for this view more easily available.) I can't say this for sure, especially not for newer or more exotic architectures, but it does certainly not seem like these are the kinds of

Personally if I were trying to do this I'd probably aim to do a combination of:

  1. Identify what kinds of reasoning people are employing, investigate under what conditions they tend to lead to the truth. E.g. one way that I think I differ from many others is that I am skeptical of analogies as direct evidence about the truth; I see the point of analogies as (a) tools for communicating ideas more effectively and (b) locating hypotheses that you then verify by understanding the underlying mechanism and checking that the mechanism ports (after which you don't nee
... (read more)

That doesn't require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes.

Yes, that's true, I agree my original comment is overstated for this reason. (But it doesn't change my actual prediction about what would happen; I still don't expect reviewers to catch issues.)

My sense is that nothing on this scale happens (right?)

I'd guess that I've spent around 6 months debating these sorts of cruxes and disagreements (though not with a single person of course). I think the main bottleneck is finding avenues that would actually make progress.

2Joe_Collman5mo
Ah, well that's mildly discouraging (encouraging that you've made this scale of effort; discouraging in what it says about the difficulty of progress). I'd still be interested to know what you'd see as a promising approach here - if such crux resolution were the only problem, and you were able to coordinate things as you wished, what would be a (relatively) promising strategy? But perhaps you're already pursuing it? I.e. if something like [everyone works on what they see as key problems, increases their own understanding and shares insights] seems most likely to open up paths to progress. Assuming review wouldn't do much to help on this, have you thought about distributed mechanisms that might? E.g. mapping out core cruxes and linking all available discussions where they seem a fundamental issue (potentially after holding/writing-up a bunch more MIRI Dialogues [https://www.lesswrong.com/sequences/n945eovrA3oDueqtq] style interactions [which needn't all involve MIRI]). Does this kind of thing seem likely to be of little value - e.g. because it ends up clearly highlighting where different intuitions show up, but shedding little light on their roots or potential justification? I suppose I'd like to know what shape of evidence seems most likely to lead to progress - and whether much/any of it might be unearthed through clarification/distillation/mapping of existing ideas. (where the mapping doesn't require connections that only people with the deepest models will find)

Yeah, that sounds entirely plausible if it was over 2 years ago, just because I'm terrible at remembering my opinions from that long ago.

one of the things that has come up most frequently (including in conversations with Rohin, Buck and various Open Phil people) is that many people wish there was more of a review process in the AI Alignment field. 

Hmm, I think I've complained a bunch about lots of AI alignment work being conceptually confused, or simply stating points rather than arguing for them, or being otherwise epistemically sketchy. But I also don't particularly feel optimistic about a review process either; for that to fix these problems the reviewers would have to be more epist... (read more)

3Buck Shlegeris4mo
  For what it's worth, this is also where I'm at on an Alignment Forum review.
3Raymond Arnold5mo
I'm interested in details about what you find useful about the prospect of reviews that talk about the details. I share a sense that it'd be helpful, but I'm not sure I could justify that belief very strongly (when it comes to the opportunity cost of the people qualified to do the job).

In general, I'm legit fairly uncertain whether "effort-reviews" (whether detail-focused or big-picture focused) are worthwhile. It seems plausible to me that detail-focused reviews are more useful soon after a work is published than 2 years later, and that big-picture reviews are more useful in the "two year retrospective" sense (and maybe we should figure out some way to get detail-oriented reviews done more frequently, faster).

It does seem to me that, by the time a book is being considered for "long-term-valuable", I would like someone, at some point, to have done a detail-oriented review examining all of the fiddly pieces. In some cases, that review has been done before the post was even published, in a private google doc.
2Joe_Collman5mo
I think this is an overstatement. They'd need to notice issues the post authors missed. That doesn't require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes. Certainly there's a point below which the signal-to-noise ratio is too low. I agree that high reviewer quality is important. On the "same old cruxes and disagreements" I imagine you're right - but to me that suggests we need a more effective mechanism to clarify/resolve them (I think you're correct in implying that review is not that mechanism - I don't think academic review achieves this either). It's otherwise unsurprising that they bubble up everywhere. I don't have any clear sense of the degree of time and effort that has gone into clarifying/resolving such cruxes, and I'm sure it tends to be a frustrating process. However, my guess is that the answer is "nowhere close to enough". Unless researchers have very high confidence that they're on the right side of such disagreements, it seems appropriate to me to spend ~6 months focusing on purely this (of course this would require coordination, and presumably seems wildly impractical). My sense is that nothing on this scale happens (right?), and that the reasons have more to do with (entirely understandable) impracticality, coordination difficulties and frustration, than with principled epistemics and EV calculations. But perhaps I'm way off? My apologies if this is one of the same old cruxes and disagreements :).
1Oliver Habryka5mo
This was quite a while ago, probably over 2 years, though I do feel like I remember it quite distinctly. I guess my model of you has updated somewhat here over the years, and now is more interested in heads-down work.

Yeah, that's another good reason to be skeptical of the objective-based categorization.

It's still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don't expect will actually happen, so I think I'm fine with that.

You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the "training data" and the future inputs are the "test data".

In the situation you describe, you would want to identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the specific point in time (so everything before it is "training" and everything after is "test"), and then apply the categorization as usual.

1Ofer5mo
(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)

Yup, this is the objective-based categorization, and as you've noted it's ambiguous on the scenarios I mention because it depends on how you choose the "definition" of the design objective (aka policy-scoring function).

I don't agree with this:

I think that the more we explore this analogy & take it seriously as a way to predict AGI, the more confident we'll get that the classic misalignment risk story is basically correct.

The analogy doesn't seem relevant to AGI risk so I don't update much on it. Even if doom happens in this story, it seems like it's for pretty different reasons than in the classic misalignment risk story.

2Daniel Kokotajlo5mo
Right, so you don't take the analogy seriously -- but the quoted claim was meant to say basically "IF you took the analogy seriously..." Feel free not to respond, I feel like the thread of conversation has been lost somehow.