Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
Okay, I understand how that addresses my edit.
I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.
Okay, that mostly makes sense.
note that the resampler itself throws away a ton of information about X_0 while going from X_0 to X_T. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.
I agree this is true, but why does the Lightcone theorem matter for it?
It is also a theorem that a Gibbs resampler initialized at equilibrium will produce X_T distributed according to the equilibrium distribution, and as you say it's c...
The Lightcone Theorem says: conditional on X_0, any sets of variables in X_T which are a distance of at least 2T apart in the graphical model are independent.
I am confused. This sounds to me like:
If you have sets of variables that start with no mutual information (conditioning on X_0), and they are so far away that nothing other than X_0 could have affected both of them (distance of at least 2T), then they continue to have no mutual information (independent).
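Writing that out (my own paraphrase, with notation that may not exactly match the post: X_T^{S} denotes the variables of X_T indexed by the set S):

```latex
% Sketch of the independence claim: X_0 is an equilibrium sample, X_T the result
% of T steps of Gibbs resampling, and S_1, S_2 are index sets whose distance in
% the graphical model is at least 2T.
P\big(X_T^{S_1}, X_T^{S_2} \mid X_0\big) = P\big(X_T^{S_1} \mid X_0\big)\, P\big(X_T^{S_2} \mid X_0\big)
```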
Some things that I am confused about as a result:
I agree that there's a threshold for "can meaningfully build and chain novel abstractions" and that this can lead to a positive feedback loop that was not previously present, but there will already be lots of positive feedback loops (such as "AI research -> better AI -> better assistance for human researchers -> AI research"), and it's not clear why we should expect the new feedback loop to be much more powerful than the existing ones.
(Aside: we're now talking about a discontinuity in the gradient of capabilities rather than of capabilities themselves, but sufficiently large discontinuities in the gradient of capabilities have much of the same implications.)
Oh, I disagree with your core thesis that the general intelligence property is binary. (Which then translates into disagreements throughout the rest of the post.) But experience has taught me that this disagreement tends to be pretty intractable to talk through, and so I now try just to understand the position I don't agree with, so that I can notice if its predictions start coming true.
You mention universality, active adaptability and goal-directedness. I do think universality is binary, but I expect there are fairly continuous trends in some underlying l...
Okay, this mostly makes sense now. (I still disagree but it no longer seems internally inconsistent.)
Fwiw, I feel like if I had your model, I'd be interested in:
Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?
Discontinuity ending (without stalling):
Stalling:
Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e.g., a better ability to rewrite its instincts).
Are you imagining systems that are built differently from today? Because I'm not seeing how SGD could ...
See Section 5 for more discussion of all of that.
Sorry, I seem to have missed the problems mentioned in that section on my first read.
There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions.
I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.
(I care about this because I think it cuts against this point: We only have one shot. There will be a sharp discontinuity in capabilities...
What ties it all together is the belief that the general-intelligence property is binary.
Do any humans have the general-intelligence property?
If yes, after the "sharp discontinuity" occurs, why won't the AGI be like humans (in particular, generally not able to take over the world)?
If no, why do we believe the general-intelligence property exists?
So here's a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you've got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.
The abstract says "we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt." This clearly can't be true in full generality, and I wish the abstract would give me some hint about ...
Interestingly, I apparently had a median around 2040 back in 2019, so my median is still later than it used to be prior to reading the bio anchors report.
Indeed I am confused why people think Goodharting is effectively-100%-likely to happen and also lead to all the humans dying. Seems incredibly extreme. All the examples people give of Goodharting do not lead to all the humans dying.
(Yes, I'm aware that the arguments are more sophisticated than that and "previous examples of Goodharting didn't lead to extinction" isn't a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like "that's a wildly strong conclusion you've drawn from a pretty h...
I'm not claiming that you figure out whether the model's underlying motivations are bad. (Or, reading back what I wrote, I did say that but it's not what I meant, sorry about that.) I'm saying that when the model's underlying motivations are bad, it may take some bad action. If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.
It's plausible that you then get a model with bad motivations that knows not to produce bad...
I think you're missing the primary theory of change for all of these techniques, which I would say is particularly compatible with your "follow-the-trying" approach.
While all of these are often analyzed from the perspective of "suppose you have a potentially-misaligned powerful AI; here's what would happen", I view that as an analysis tool, not the primary theory of change.
The theory of change that I most buy is that as you are training your model, while it is developing the "trying", you would like it to develop good "trying" and not bad "trying", and one...
Depends what the aligned sovereign does! Also depends what you mean by a pivotal act!
In practice, during the period of time where biological humans are still doing a meaningful part of alignment work, I don't expect us to build an aligned sovereign, nor do I expect to build a single misaligned AI that takes over: I instead expect there to be a large number of AI systems, that could together obtain a decisive strategic advantage, but could not do so individually.
I think that skews it somewhat but not very much. We only have to "win" once in the sense that we only need to build an aligned Sovereign that ends the acute risk period once, similarly to how we only have to "lose" once in the sense that we only need to build a misaligned superintelligence that kills everyone once.
(I like the discussion on similar points in the strategy-stealing assumption.)
[People at AI labs] expected heavy scrutiny by leadership and communications teams on what they can state publicly. [...] One discussion with a person working at DeepMind is pending approval before publication. [...] We think organizations discouraging their employees from speaking openly about their views on AI risk is harmful, and we want to encourage more openness.
(I'm the person in question.)
I just want to note that in the case of DeepMind:
Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:
In that case, I think you should try and find out what the incentive gradient is like for other people before prescribing the actions that they should take. I'd predict that for a lot of alignment researchers your list of incentives mostly doesn't resonate, relative to things like:
Errr, I feel like we already agree on this point?
Yes, sorry, I realized that right after I posted and replaced it with a better response, but apparently you already saw it :(
What I meant to say was "I think most of the time closing overhangs is more negative than positive, and I think it makes sense to apply OP's higher bar of scrutiny to any proposed overhang-closing proposal".
But like, why? I wish people would argue for this instead of flatly asserting it and then talking about increased scrutiny or burdens of proof (which I also don't like).
To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold.
As you note, the second claim is false for the model the OP mentions. The second claim is the important part; once you know whether it's true or false, I don't care about the first claim.
I agree it could be true in practice in other models but I am unhappy about the pattern where someone makes a claim based on arguments that are ...
Progress often follows an s-curve, which appears exponential until the current research direction is exploited and tapers off. Moving an exponential up, even a little, early on can have large downstream consequences:
Your graph shows "a small increase" that represents progress that is equal to an advance of a third to a half the time left until catastrophe on the default trajectory. That's not small! That's as much progress as everyone else combined achieves in a third of the time till catastrophic models! It feels like you'd have to figure out some newer e...
Not OP, just some personal takes:
That's not small!
To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold. (To be clear, I don't think the exponential model presented provides evidence for this latter claim)
I think the first claim is obviously true. The second claim could be true in practice, though I feel quite uncertain about this. It happens to be false in the specific model of moving an ex...
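To spell out the pure-exponential case with a toy calculation (my own sketch; the numbers and variable names are made up): if capability grows exponentially and an intervention multiplies it by a fixed factor, the amount by which the timeline to any fixed threshold shortens is the same constant regardless of when the boost happens.

```python
import math

# Toy model: capability C(t) = exp(k * t); an intervention at time t_boost
# multiplies capability by (1 + eps) from then on. Assuming the threshold is
# crossed after the boost, the crossing time satisfies
# (1 + eps) * exp(k * t) = threshold, so the time saved relative to no boost is
# log(1 + eps) / k -- it does not depend on t_boost at all.

k = 0.5          # growth rate (arbitrary)
eps = 0.10       # a 10% one-off boost
threshold = 1e6  # capability level we care about

t_no_boost = math.log(threshold) / k

for t_boost in [0.0, 5.0, 20.0]:
    t_with_boost = (math.log(threshold) - math.log(1 + eps)) / k
    print(f"boost at t={t_boost:>4}: timeline shortened by "
          f"{t_no_boost - t_with_boost:.3f}")
# Prints the same shortening (about 0.19) whether the boost comes early or late.
```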
And now, it seems like we agree that the pseudocode I gave isn't a grader-optimizer for the grader
self.diamondShard(self.WM.getConseq(plan))
, and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?
I don't think I agree with this.
At a high level, your argument can be thought of as having two steps:
- Grader-optimizers are bad, because of problem P.
- Approval-directed agents / [things built by IDA, debate, RRM] are grader-optimizers.
I've been t...
Nice comment!
The arguments you outline are the sort of arguments that have been considered at CHAI and MIRI quite a bit (at least historically). The main issue I have with this sort of work is that it talks about how an agent should reason, whereas in my view the problem is that even if we knew how an agent should reason we wouldn't know how to build an agent that efficiently implements that reasoning (particularly in the neural network paradigm). So I personally work more on the latter problem: supposing we know how we want the agent to reason, how do we ...
I don't really disagree with any of what you're saying but I also don't see why it matters.
I consider myself to be saying "you can't just abstract this system as 'trying to make evaluations come out high'; the dynamics really do matter, and considering the situation in more detail does change the conclusions."
I'm on board with the first part of this, but I still don't see the part where it changes any conclusions. From my perspective your responses are of the form "well, no, your abstract argument neglects X, Y and Z details" rather than explaining how X, ...
The edits help, thanks. I was in large part reacting to the fact that Kaj's post reads very differently from your summary of Bad Argument 1 (rather than the fact that I don't make Bad Argument 1). In the introductory paragraph where he states his position (the third paragraph of the post), he concludes:
Thus by doing capabilities research now, we buy ourselves a longer time period in which it's possible to do more effective alignment research.
Which is clearly not equivalent to "alignment researchers hibernate for N years and then get back to work".
Plausibly...
Fwiw, when talking about risks from deploying a technology / product, "accident" seems (to me) much more like ascribing blame ("why didn't they deal with this problem?"), e.g. the Boeing 737-MAX incidents are "accidents" and people do blame Boeing for them. In contrast "structural" feels much more like "the problem was in the structure, there was no specific person or organization that was in the wrong".
I agree that in situations that aren't about deploying a technology / product, "accident" conveys a lack of blameworthiness.
I recall some people in CHAI working on a minecraft AI that could help players do useful tasks the players wanted.
I think you're probably talking about my work. This is more of a long-term vision; it isn't doable (currently) at academic scales of compute. See also the "Advantages of BASALT" section of this post.
(Also I just generically default to Minecraft when I'm thinking of ML experiments that need to mimic some aspect of the real world, precisely because "the game getting played here is basically the same thing real life society is playing".)
...Okay, then let me try to directly resolve my confusion. My current understanding is something like - in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we've screened off evolution, the size of the genome, etc - all of that is going into the "with certain structural parameters" part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same
Downvoted for mischaracterizing the views of the people you're arguing against in the "Bad argument 1" section.
(Footnote 2 is better, but it's a footnote, not the main text.)
EDIT: Removed my downvote given the edits.
Sorry, didn't see this until now (didn't get a notification, since it was a reply to Buck's comment).
I'm guessing your take is like "I, Buck/Rohin, could write a review that was epistemically adequate, but I'm busy and don't expect it to accomplish anything that useful."
In some sense yes, but also, looking at posts I've commented on in the last ~6 months, I have written several technical reviews (and nontechnical reviews). And these are only the cases where I wrote a comment that engaged in detail with the main point of the post; many of my other comments ...
My summary of your argument now would be:
If that's right, I broadly agree with all of these points :)
(I previously thought you were saying something very different with (2), since the text in the OP seems pretty different.)
I want to distinguish between two questions:
(The key difference being that (1) is a statement about people's beliefs about reality, while (2) is a statement about reality directly.)
(For all of this I'm assuming that an AI CEO that does the job of CEO well until the point that it executes a treacherous turn count...
Overall disagreement:
I've remained somewhat confused about the exact grader/non-grader-optimizer distinction I want to draw. At least, intensionally (which is why I've focused on giving examples, in the hope of getting the vibe across).
Yeah, I think I have at least some sense of how this works in the kinds of examples you usually discuss (though my sense is that it's well captured by the "grader is complicit" point in my previous comment, which you presumably disagree with).
But I don't see how to extend the extensional definition far enough to get to the ...
However, I don't see why these arguments would apply to humans
Okay, I'll take a stab at this.
6. Word Prediction is not Intelligence
"The kinds of humans that we are worried about are the kinds of humans that can do original scientific research and autonomously form plans for taking over the world. Human brains learn to take actions and plans that previously led to high rewards (outcomes like eating food when hungry, having sex, etc)*. These two things are fundamentally not the same thing. Why, exactly, would we expect that a system that is good at the latte...
A couple of reasons:
Meta: A lot of this seems to have the following form:
You: Here is an argument that neural networks have property X.
Me: But that argument as you've stated it would imply that humans have property X, which is false.
You: Humans and neural networks work differently, so it wouldn't be surprising if neural networks have property X and humans don't.
I think you are misunderstanding what I am trying to do here. I'm not trying to claim that humans and neural networks will have the same properties or be identical. I'm trying to evaluate how much I should update on th...
Arguments 6-10 seem like the most interesting ones (as they respond more directly to the argument). But for all of them except argument 6, it seems like the same argument would imply that humans would not be generally intelligent.
...[Argument 6]
The kinds of AI systems that we are worried about are the kinds of systems that can do original scientific research and autonomously form plans for taking over the world. LLMs are trained to write text that would be maximally unsurprising if found on the internet. These two things are fundamentally not the same thing.
Personally if I were trying to do this I'd probably aim to do a combination of:
That doesn't require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes.
Yes, that's true, I agree my original comment is overstated for this reason. (But it doesn't change my actual prediction about what would happen; I still don't expect reviewers to catch issues.)
My sense is that nothing on this scale happens (right?)
I'd guess that I've spent around 6 months debating these sorts of cruxes and disagreements (though not with a single person of course). I think the main bottleneck is finding avenues that would actually make progress.
Yeah, that sounds entirely plausible if it was over 2 years ago, just because I'm terrible at remembering my opinions from that long ago.
one of the things that has come up most frequently (including in conversations with Rohin, Buck and various Open Phil people) is that many people wish there was more of a review process in the AI Alignment field.
Hmm, I think I've complained a bunch about lots of AI alignment work being conceptually confused, or simply stating points rather than arguing for them, or being otherwise epistemically sketchy. But I also don't particularly feel optimistic about a review process either; for that to fix these problems the reviewers would have to be more epist...
It's still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don't expect will actually happen, so I think I'm fine with that.
You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the "training data" and the future inputs are the "test data".
In the situation you describe, you would want to identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the specific point in time (so everything before it is "training" and everything after is "test"), and then apply the categorization as usual.
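As a concrete sketch of that categorization (my own illustration; the names and data layout are hypothetical, not an existing API):

```python
# Partition an online-learning input stream into "training" and "test" data
# around a chosen cutoff time, as described above.

def split_online_stream(inputs, t_split):
    """inputs:  list of (timestamp, x, got_gradient) tuples seen by the system.
    t_split: the chosen point in time (e.g. when the AI starts executing its
             plan to cause an existential catastrophe). Everything before it on
             which we took gradient steps counts as training data; everything
             after it counts as test data.
    """
    train = [(t, x) for (t, x, got_grad) in inputs if t < t_split and got_grad]
    test = [(t, x) for (t, x, _) in inputs if t >= t_split]
    return train, test
```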
Yup, this is the objective-based categorization, and as you've noted it's ambiguous on the scenarios I mention because it depends on how you choose the "definition" of the design objective (aka policy-scoring function).
I don't agree with this:
I think that the more we explore this analogy & take it seriously as a way to predict AGI, the more confident we'll get that the classic misalignment risk story is basically correct.
The analogy doesn't seem relevant to AGI risk so I don't update much on it. Even if doom happens in this story, it seems like it's for pretty different reasons than in the classic misalignment risk story.
I forget if I already mentioned this to you, but another example where you can interpret randomization as worst-case reasoning is MaxEnt RL, see this paper. (I reviewed an earlier version of this paper here (review #3).)
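For reference, the MaxEnt RL objective being discussed is, schematically (my notation; alpha is the entropy temperature):

```latex
% Maximum-entropy RL: the policy is rewarded for task return plus the entropy
% of its own action distribution at each state it visits.
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]
```

My recollection of the paper's result is that the stochastic policy this produces maximizes a lower bound on return under worst-case perturbations of the reward function, which is the sense in which the randomization can be read as worst-case reasoning.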