Nate Soares explains why he doesn't expect an unaligned AI to be friendly or cooperative with humanity, even if it uses logical decision theory. He argues that even getting a small fraction of resources from such an AI is extremely unlikely.
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
The thing I’m skeptical of is maintaining non-ruthless behavior in the presence of arbitrary amounts of open-ended continual learning. By “open-ended continual learning”, I mean something analogous to what humans did between 30000 BC and today, e.g. inventing new fields, and then still more new fields that build on those new fields, etc. And the AI has to do that without any human input, given enough time.
My actual belief is that this kind of open-ended continual learning is simply impossible in LLMs. If I’m wrong about that, then I would next claim that i...
In this post, we introduce contributions and supracontributions[1], which are basic objects from infra-Bayesianism that go beyond the crisp case (the case of credal sets). We then define supra-POMDPs, a generalization of partially observable Markov decision processes (POMDPs). This generalization has state transition dynamics that are described by supracontributions.
We use supra-POMDPs to formalize various Newcombian problems in the context of learning theory, where an agent repeatedly encounters the problem. The one-shot versions of these problems are well known to highlight flaws in classical decision theories.[2] In particular, we discuss the opaque, transparent, and epsilon-noisy versions of Newcomb's problem, as well as XOR blackmail and counterfactual mugging.
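As a quick illustration of why the one-shot versions stress classical decision theories, here is a minimal sketch (not from the post; the payoffs and the accuracy parameter are the standard ones) of the expected payoffs in the epsilon-noisy Newcomb problem, where the predictor errs with probability eps:

```python
# Sketch of the epsilon-noisy Newcomb problem: box A always contains
# $1,000; box B contains $1,000,000 iff the predictor predicted that
# the agent one-boxes. The predictor errs with probability eps.

def expected_payoff(action: str, eps: float) -> float:
    """Expected dollars for `action`, either 'one-box' or 'two-box'."""
    big, small = 1_000_000, 1_000
    if action == "one-box":
        # Predictor correctly foresaw one-boxing with probability 1 - eps.
        return (1 - eps) * big + eps * 0
    else:
        # Predictor wrongly foresaw one-boxing with probability eps.
        return eps * (big + small) + (1 - eps) * small

eps = 0.01
print(expected_payoff("one-box", eps))   # ~990,000
print(expected_payoff("two-box", eps))   # ~11,000
```

An evidential reasoner conditions on its action and one-boxes; a causal reasoner notes that the boxes are already filled and two-boxes regardless of eps, which is the classic tension these problems expose.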
We conclude by stating a theorem that describes when optimality for the supra-POMDP relates to optimality for the Newcombian problem. This theorem is significant because it gives...
Halpern and Leung propose the "minimax weighted expected regret" (MWER) decision rule, a generalization of the minimax-expected-regret (MER) decision rule. In contrast, our decision rule is a weighted generalization of maximin expected utility (MMEU). The problem with MER is that it does not interact well with learning: the closest thing to learning with MER is adversarial bandits. However, adversarial regret is statistically intractable for Markov decision processes, and even in the bandit setting there is a hidden obliviousness assumption if you try to interpret it in a principled decision-theoretic way.
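For concreteness, here is a minimal sketch of MMEU over a finite credal set (the distributions and utilities are made up for illustration): each action is scored by its worst-case expected utility over the credal set, and the best such action is chosen.

```python
# Illustrative MMEU sketch. A credal set is represented as a finite
# list of distributions over two states (s0, s1).

credal_set = [
    (0.7, 0.3),
    (0.5, 0.5),
    (0.4, 0.6),
]

# utility[action] = (utility in s0, utility in s1)
utility = {
    "a": (10.0, 0.0),  # high payoff in s0, nothing in s1
    "b": (5.0, 5.0),   # safe constant payoff
}

def mmeu_action(credal_set, utility):
    """Return the action maximizing worst-case expected utility."""
    def worst_case(u):
        return min(sum(p * ui for p, ui in zip(dist, u))
                   for dist in credal_set)
    return max(utility, key=lambda a: worst_case(utility[a]))

print(mmeu_action(credal_set, utility))  # 'b'
```

Note that a Bayesian committed to the single prior (0.7, 0.3) would instead choose "a" (expected utility 7 versus 5); MMEU picks "b" because "a" can drop to 4 under the most pessimistic distribution in the set.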
Imagine each of us has an AI representative, aligned to us personally. Is gradual disempowerment solved?[1] In my view, no, though having AI representatives does help at the margin.
I have two deep reasons for skepticism.[2] Here is the first one.
We, as individuals, are not the only agents or “agencies” in this world. Other goal-oriented strategic entities include states, corporations, and to some extent egregores.
For the sake of argument, imagine the current distribution of agency in the world as approximately 60% humans, 20% corporations, and 20% states (numbers are illustrative; egregores are omitted for simplicity).
Now, suppose we introduce individually aligned AI assistants for each person. At first glance, this seems empowering—each individual gains a cognitive extension, potentially leveling the playing field.
But: do corporations or states also...
What's the story for why companies end up doing things that are catastrophic for their shareholders? (Feel free to just state the key points briefly; I'm familiar with many of the ideas.)
This is a rough write-up of ideas by Ryan Greenblatt. The core ideas are his; the mistakes are mine. Thanks to him, Buck Shlegeris, and Aghyad Deeb for feedback on the draft of this post.
One way to use scheming AIs is to train them to imitate humans. When does imitation robustly avoid frequent and intentional failures?
We describe two techniques that avoid imitation failures on generation tasks:
Credal sets, a special case of infradistributions[1] in infra-Bayesianism and classical objects in imprecise probability theory, provide a means of describing uncertainty without assigning exact probabilities to events, as in Bayesianism. This is significant because, as argued in the introduction to this sequence, Bayesianism is inadequate as a framework for AI alignment research. We will focus on credal sets rather than general infradistributions for simplicity of exposition.
Recall that the total-variation metric is one example of a metric on the set of probability distributions over a finite set $X$. A set is closed with respect to a metric if it contains all of its limit points with respect to that metric. For example, let $X = \{0, 1\}$. The set of probability distributions over $X$ is given by $\Delta X = \{(p, 1-p) : p \in [0, 1]\}$.
There is a bijection between $\Delta X$ and the closed interval $[0, 1]$, which is...
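A minimal sketch of these definitions (the concrete numbers are illustrative): the total-variation distance on distributions over a finite set, and the identification of distributions over a two-element set with points of [0, 1].

```python
# Distributions over a finite set are tuples of nonnegative reals
# summing to 1.

def total_variation(p, q):
    """Total-variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Over a two-element set {0, 1}, a distribution is (t, 1 - t),
# which gives the bijection with the closed interval [0, 1].
def from_interval(t):
    return (t, 1.0 - t)

p, q = from_interval(0.2), from_interval(0.7)
print(total_variation(p, q))  # ~0.5, i.e. |0.2 - 0.7|
```

On two-element sets the total-variation distance between (t, 1-t) and (s, 1-s) is just |t - s|, so the bijection is an isometry onto [0, 1] with its usual metric.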
The truth is outside of my hypothesis class, but my hypothesis class probably contains a non-trivial law that is a coarsening of the truth, which is the whole point.
For example, you can imagine that you start with some kind of intractable simplicity prior. Then, for each hypothesis you choose a tractable law that coarsens it. You end up with a probability distribution over laws.
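The construction above can be sketched in a few lines (the hypothesis names, description lengths, and laws below are made up for illustration): weight each hypothesis by $2^{-\text{complexity}}$, then push that mass onto the tractable law chosen as its coarsening.

```python
# Illustrative sketch: a simplicity prior 2^(-k) over hypotheses,
# pushed forward onto each hypothesis's chosen tractable coarsening.
from collections import defaultdict

# hypothesis -> (description length in bits, chosen coarsening/law)
hypotheses = {
    "h1": (2, "law_A"),
    "h2": (3, "law_A"),  # several hypotheses may share a coarsening
    "h3": (5, "law_B"),
}

# Unnormalized simplicity prior over hypotheses.
weights = {h: 2.0 ** -k for h, (k, _) in hypotheses.items()}
z = sum(weights.values())

# Push forward: each law collects the mass of the hypotheses it coarsens.
law_prior = defaultdict(float)
for h, (_, law) in hypotheses.items():
    law_prior[law] += weights[h] / z

print(dict(law_prior))  # law_A gets most of the mass (12/13 here)
```

The resulting distribution over laws is what the agent actually works with: tractable, but with mass allocated according to the simplicity of the underlying hypotheses.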
A different way to view this is that it is just a way to force your policy to have low regret w.r.t. all/most hypotheses while weighing complex hypotheses less. For a complex hypothe...