I explore similar considerations in The Case for Mixed Deployment (5 min read); the key takeaways were:
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other's work, rather than from running them in parallel. That is, I disagree with "If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would've had if you concentrated your bets." This is because I think we can run things like debate between all the variants.
One upshot of this difference is that I expect coordination between labs to matter significantly: if lab A starts with 100% schemers and lab B starts with 5% schemers, then we'll elicit useful work from the AIs if the labs cross-examine each other's AI research with their own AI variants.
Some thoughts on public outreach and "Were they early because they were good or lucky?"
A bit anecdotal, but: there are ~a dozen people who went to our college in 2017-2020 now working full-time in AI safety, a much higher rate than at the other colleges in the same university. I'm not saying any of us are particularly "great", but this suggests social contagion / information cascade rather than "we figured this stuff out from the empty string". Maybe if you go back further (e.g. 2012-2016) there was less social contagion, and that cohort is better?
I think your remarks suggest that alignment at the level of top humans will happen by default, but not alignment of god-like superintelligence. That said, if we get aligned top-human-level AIs, then we can defer the rest of the alignment problem to them.
If I were sure that top-human-level AIs will be aligned by default, here's what I might work on instead:
The base model is just predicting the likely continuation of the prompt, and it's a reasonable prediction that an assistant given a harmful instruction will refuse. This behaviour isn't surprising.
Hey Nisan. Check the following passage from Domain Theory (Samson Abramsky and Achim Jung). This might be helpful for equipping the space of distributions $\Delta X$ with an appropriate domain structure. (You mention [JP89] yourself.)
We should also mention the various attempts to define a probabilistic version of the powerdomain construction, see [SD80, Mai85, Gra88, JP89, Jon90].
- [SD80] N. Saheb-Djahromi. CPO’s of measures for nondeterminism. Theoretical Computer Science, 12:19–37, 1980.
- [Mai85] M. Main. Free constructions of powerdomains. In A. Melton, editor, Mathematical Foundations of Programming Semantics, volume 239 of Lecture Notes in Computer Science, pages 162–183. Springer Verlag, 1985.
- [Gra88] S. Graham. Closure properties of a probabilistic powerdomain construction. In M. Main, A. Melton, M. Mislove, and D. Schmidt, editors, Mathematical Foundations of Programming Language Semantics, volume 298 of Lecture Notes in Computer Science, pages 213–233. Springer Verlag, 1988.
- [JP89] C. Jones and G. Plotkin. A probabilistic powerdomain of evaluations. In Proceedings of the 4th Annual Symposium on Logic in Computer Science, pages 186–195. IEEE Computer Society Press, 1989.
- [Jon90] C. Jones. Probabilistic Non-Determinism. PhD thesis, University of Edinburgh, Edinburgh, 1990. Also published as Technical Report No. CST63-90.
During my own foray into agent foundations and game theory, I also bumped into this exact obstacle: namely, that there is no obvious way to equip $\Delta X$ with a least-fixed-point constructor $\mathrm{lfp} : (\Delta X \to \Delta X) \to \Delta X$. In contrast, we can equip the powerset $\mathcal{P}(X)$ with an LFP constructor $\mathrm{lfp} : (\mathcal{P}(X) \to \mathcal{P}(X)) \to \mathcal{P}(X)$ for monotone maps, via Knaster-Tarski.
One trick is to define $\mathrm{lfp}(f)$ to be the distribution $\mu$ which maximises the entropy $H(\mu)$ subject to the constraint $f(\mu) = \mu$ (concrete sketch below).
The justification here is the Principle of Maximum Entropy:
> Given a set of constraints on a probability distribution, the "best" distribution that fits the data is the one of maximum entropy.
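Here's a minimal sketch of this trick for finite $X$, assuming $f$ is the linear map $\mu \mapsto \mu K$ induced by a Markov kernel $K$ (the kernel below and the use of scipy's SLSQP solver are purely illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Max-entropy fixed point of f(mu) = mu @ K. In this made-up kernel,
# states 0 and 1 communicate and state 2 is absorbing, so the fixed
# points of f form a whole polytope of stationary distributions.
K = np.array([[0.8, 0.2, 0.0],
              [0.4, 0.6, 0.0],
              [0.0, 0.0, 1.0]])
n = K.shape[0]

def neg_entropy(mu):
    mu = np.clip(mu, 1e-12, 1.0)           # avoid log(0) at the boundary
    return float(np.sum(mu * np.log(mu)))  # -H(mu)

constraints = [
    {"type": "eq", "fun": lambda mu: np.sum(mu) - 1.0},  # mu is a distribution
    {"type": "eq", "fun": lambda mu: mu @ K - mu},       # f(mu) = mu
]
res = minimize(neg_entropy, np.full(n, 1.0 / n),
               bounds=[(0.0, 1.0)] * n, constraints=constraints)
print(res.x)  # lfp(f): the maximum-entropy stationary distribution of K
```

Here the fixed points are exactly the mixtures of $(2/3, 1/3, 0)$ and $(0, 0, 1)$, and the optimiser selects the mixture of maximal entropy.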
More generally, we should define $\mathrm{lfp}(f)$ to be the distribution $\mu$ which minimises the cross-entropy $D_{\mathrm{KL}}(\mu \,\|\, \pi)$ subject to the constraint $f(\mu) = \mu$, where $\pi$ is some uninformed prior such as the Solomonoff prior. The previous result is the special case where $\pi$ is the uniform prior. The proof generalises by noting that $\mu \mapsto D_{\mathrm{KL}}(\mu \,\|\, \pi)$ is continuous and strictly convex. See the Principle of Minimum Discrimination Information.
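In symbols, using the notation above:

$$\mathrm{lfp}(f) \;=\; \operatorname*{arg\,min}_{\mu \in \Delta X,\; f(\mu) = \mu} D_{\mathrm{KL}}(\mu \,\|\, \pi).$$

For finite $X$ and uniform $\pi$ we have $D_{\mathrm{KL}}(\mu \,\|\, \pi) = \log|X| - H(\mu)$, so minimising the divergence is exactly maximising the entropy.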
Ideally, we'd like $\mathrm{lfp}_{\Delta}$ and $\mathrm{lfp}_{\mathcal{P}}$ to "coincide" modulo the support map $\mathrm{supp} : \Delta X \to \mathcal{P}(X)$, i.e. $\mathrm{supp}(\mathrm{lfp}_{\Delta}(f)) = \mathrm{lfp}_{\mathcal{P}}(g)$ for all $f$ and $g$ with $\mathrm{supp} \circ f = g \circ \mathrm{supp}$. Unfortunately, this isn't the case: if $f = \mathrm{id}_{\Delta X}$ then $\mathrm{supp}(\mathrm{lfp}_{\Delta}(f)) = X$ (the max-entropy fixed point is uniform) but $\mathrm{lfp}_{\mathcal{P}}(\mathrm{id}_{\mathcal{P}(X)}) = \emptyset$.
Alternatively, we could consider the convex sets of distributions over $X$.
Let $\mathcal{C}(X)$ denote the set of convex sets of distributions over $X$. There is an ordering on $\mathcal{C}(X)$ where $A \sqsubseteq B \iff A \subseteq B$. We have an LFP operator via Knaster-Tarski (spelled out below), where $\hat{f} : \mathcal{C}(X) \to \mathcal{C}(X)$ is the lift of $f : \Delta X \to \Delta X$ via the powerset monad.
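To spell out the Knaster-Tarski construction, taking the lift to be $\hat{f}(A) = \mathrm{hull}\{f(\mu) : \mu \in A\}$ (my gloss on the lifting):

$$\mathrm{lfp}(\hat{f}) \;=\; \bigcap \big\{ A \in \mathcal{C}(X) \;\big|\; \hat{f}(A) \subseteq A \big\}.$$

This is well-defined because $\mathcal{C}(X)$ is a complete lattice under $\subseteq$ (arbitrary intersections of convex sets are convex, and $\Delta X$ itself is the top element) and $\hat{f}$ is monotone.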
However, if the difference between humans and monkeys is mostly due to a one-shot discrete difference (i.e. language), then that difference can't necessarily be repeated to yield a similar gain in intelligence a second time.
Perhaps language is zero-one, i.e. language renders a mind "cognitively complete" in the sense that the mind can represent anything about the external world and make any inferences using those representations. But intelligence is not thereby zero-one, because intelligence depends on continuous variables like computational speed, memory, etc.
More concretely, I am sceptical of "we end up with AI geniuses, but not AI gods", because running a genius at 10,000x speed, parallelised over 10,000x cores, with instantaneous access to the internet does (I think) make an AI god. A difference in quantity is a difference in kind.
That said, there might exist plausible threat models which require an AI that doesn't spatiotemporally decompose into less smart AIs. Could you sketch one out?
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases;
I'm not sure how well this metric tracks what people actually care about, namely performance on particular downstream tasks (e.g. passing a law exam, writing bug-free code, automating alignment research, etc.).
Yep, you're correct. The original argument in the Waluigi mega-post was sloppy.
You're correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.
But let me be more pedantic and less dramatic than I was in the article: the waluigi transitions aren't inevitable. The waluigis are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which a luigi can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.
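Here's a toy illustration of what I mean (the states and transition probabilities below are entirely made up, just to show the qualitative behaviour):

```python
import numpy as np

# Toy Markov chain over simulacra: "luigi" slowly leaks probability mass
# into two approximately-absorbing classes, "waluigi" and "mode-collapse".
states = ["luigi", "waluigi", "mode-collapse"]
P = np.array([
    [0.96,  0.02,  0.02],   # luigi: mostly persists, occasionally falls either way
    [0.001, 0.998, 0.001],  # waluigi: approximately absorbing
    [0.001, 0.001, 0.998],  # mode-collapse: also approximately absorbing
])

mu = np.array([1.0, 0.0, 0.0])  # start as a pure luigi
for t in [10, 100, 1000]:
    dist = mu @ np.linalg.matrix_power(P, t)
    print(t, dict(zip(states, dist.round(3))))
# Mass leaks out of "luigi", but it splits between both approximately-absorbing
# classes rather than flowing inevitably to "waluigi".
```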
If you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to every variant. But the optimistic assumption is that at least one variant is an equally capable non-schemer.