Epistemic Status: I feel confident and tentatively optimistic about the claims made in this post, but am slightly more uncertain about how they generalize. Additionally, I am concerned about the extent to which this is dual-use for capabilities and exfohazardous, and spent a few months thinking about whether it was worth it to release this post regardless. I haven't come to an answer yet, so I'm publishing this to let other people see it and let me know what they think I should do.
TL;DR: I propose a research direction to solve alignment that potentially doesn’t require solutions to ontology identification, learning how to code, or becoming literate.
Until a few hours ago, I was spending my time primarily working on high-level interpretability and cyborgism. While I was writing a draft for something I was working on, an activity that usually yields me a lot of free time by way of procrastination, I stumbled across the central idea behind many of the ideas in this post. It seemed so immediately compelling that I dropped working on everything else to start working on it, culminating after much deliberation in the post you see before you.
My intention with this post is to provide a definitive reference for what it would take to safely use AGI to steer our world toward much better states in the absence of a solution to any or all of several existing problems, such as Eliciting Latent Knowledge, conditioning simulator models, Natural Abstractions, mechanistic interpretability, and the like.
In a world with prospects such as those, I propose that we radically rethink our approach to AGI safety. Instead of dedicating enormous effort to engineering nigh-impossible safety measures, we should consider thus-far neglected avenues of research, especially ones that memetic pressures have unfairly disprivileged so far, which also immunizes them against capabilities misuse. To avert the impending AI apocalypse, we need to focus on high-variance, low-probability, high-yield ideas: lightning strikes that, should they occur, effectively solve astoundingly complex problems in one fell swoop. A notable example of this, which I claim we should be investing all of our efforts into, is luck. Yes, luck!
I suggest that we should pay greater attention to luck, both as a powerful factor enhancing other endeavors and as an independent direction in its own right. Humanity has, over the centuries, devoted immense amounts of cumulative cognition toward exploring and optimizing for luck, so one might naively think that there's little tractability left. I believe, however, that there is an immense amount of alpha to be had in the form of contemporary rationality and cultural devices that can vastly improve the efficiency of steering luck, and toward a highly specific target.
Consider the following: if we were to offer a $1,000,000 prize to the next person who walks into the MIRI offices, clearly, that person would be the luckiest person on the planet. It follows, then, that this lucky individual would have an uncannily high probability of finally cracking the alignment problem. I understand that prima facie this proposal may be considered absurd, but I strongly suggest abandoning the representativeness heuristic and evaluating what is instead of what seems to be, especially given that the initial absurdity is intrinsic to why this strategy is competitive at all.
It's like being granted three wishes by a genie. Instead of wishing for more wishes (which is the usual strategy), we should wish to be the luckiest person in the world—with that power, we can then stumble upon AGI alignment almost effortlessly, and make our own genies.
Think of it this way: throughout history, many great discoveries have been made not through careful study but by embracing serendipity. Luck has been the primary force behind countless medical, scientific, and even technological advances:

- Penicillin, discovered when a stray mold contaminated one of Alexander Fleming's petri dishes.
- X-rays, noticed by Wilhelm Röntgen while he was experimenting with cathode-ray tubes for entirely different purposes.
- The microwave oven, conceived after a radar magnetron melted the chocolate bar in Percy Spencer's pocket.
The list goes on. So why not capitalize on this hidden force and apply it to AGI alignment? It's a risk, of course. But it’s a calculated one, rooted in historical precedent, and borne of necessity—the traditional method just doesn't seem to cut it.
To distill what I consider the most important cruxes behind finding this compelling:
A common failure mode among alignment researchers working on new agendas is that we spend too long caught up in the abstract and fail to touch grass. Therefore, to alleviate this as much as possible in the rather abstract territory intrinsic to this direction, I'll jump immediately into specific directions we could think about.
This is certainly not an exhaustive list, and despite the painstaking research behind each entry, these all revolve around ideas I came up with off the top of my head; I believe this strongly signals the potential inherent to this agenda.
Create an algorithm that searches through the space of all possible alignment solutions to find one that maximizes the score of a random probability generator. If we make the generator sufficiently random, we can overcome adversarial exploits, and leverage RSA-2048 style schemes to our advantage.
You might be wondering how we would design an algorithm that searches through the space of all possible ideas. I think we could leverage some structure that has universal levels of expression, and simply train an algorithm to predict the next entry in this lexicon given prior entries. We might want to leverage some kind of mechanism that has the ability to selectively focus on specific parts of the prior entries, regardless of how far back they were within the bounds of the size limit of the input, to compute the next entry. I hereby dub this the “transfigurator” architecture.
Value handshakes have been proposed as a potential way for AIs to achieve value equilibria without fighting for monarchic dominance. Anthropic value handshakes involve AIs in different Everett branches wanting to achieve their values in other universes without waging anthropic war. I believe, however, that the generator underlying this concept may not be limited to super-powered intelligences, and that we may be able to leverage it for solving alignment.
More concretely, I imagine running some kind of idea generator (perhaps using the transfigurator architecture described above?) using a quantum random number generator as a seed, to generate different ideas for different Everett branches and pre-committing as a civilization (hopefully there’s been progress on AI governance since the last time I checked in!) to implement whatever plans our branch receives.
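A minimal sketch of the commitment scheme itself: the mapping from seed to plan is fixed in advance, so each Everett branch (represented here by a different seed value) deterministically receives its own plan. The `quantum_seed` parameter is a stand-in for a draw from a true quantum random number generator, which this sketch obviously does not have access to.

```python
import random

def branch_committed_plan(plans: list[str], quantum_seed: int) -> str:
    """Deterministically map a (quantum) seed to one plan from a
    pre-committed list. Different branches draw different seeds and thus
    implement different plans, but the scheme itself never varies."""
    rng = random.Random(quantum_seed)
    return plans[rng.randrange(len(plans))]
```

The important property is determinism given the seed: re-running the function in the same branch must yield the same plan, or the pre-commitment is meaningless.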
Under ordinary anthropic immortality, we would survive trivially in some universes. However, I'm not fully convinced that this notion of immortality is relevant to me (see here for a longer discussion), and plausibly the same holds for a lot of other alignment researchers. This is why I don't consider it the crux of this direction as much as the randomness component (and why I prefer making new-school horcruxes over the older ones), which allows lucky civilizations to survive very robustly. This gives us a very strong guarantee, since luck is tightly pinned down as the property inherent to universes that survive, with no degrees of freedom; I claim this as the simplest form of the proof that luck is the True Name for alignment. This gives us powerful steerability over the timeline, reducing alignment to the simpler problem of being a luckier civilization.
Create a lottery system where researchers submit ideas for aligning GPT-class systems.
The more the better: too few proposals may lead to tampering by lottery admins who have their own opinions on which proposals are good, while an abundance of ideas too many to meter alleviates this issue and allows us to leverage the luck-based properties of this idea over the clearly insufficient existing proposal evaluators. Therefore, we should encourage quantity over quality and allow proposals from all and sundry, regardless of academic credentials or familiarity with alignment (under this paradigm, if an idea by someone who doesn't understand GPT well were to work, it would have to be a commensurately lucky idea, offsetting whatever lesser flaws it may have).
Randomly-selected winners get their solutions implemented into the next GPT system, leveraging their propensity for working in high-variance situations.
We could even make this more entertaining, to increase the odds of directing the simulation narrative toward more upbeat genres with luckier endings in worlds where that's applicable. For instance, we could create a television show focused on AI alignment, similar to "Deal or No Deal," where contestants must choose from various covered capsules containing potential GPT alignment solutions. Naturally, the winner gets a cash prize, and humanity gets a much-needed step forward in AGI safety.
Transmit a message out at light-speed to the universe, inducing in any sentient alien species that comes across it the intent to exterminate any minds originating from Earth. This is useful because it seems plausible that the probabilities of alien life existing in our lightcone and misaligned AGI being created are roughly similar (the proof for this is rather trivial, and left as an exercise to the reader; as a hint, consider the obvious solution to the Fermi paradox).
Therefore, any AI we create that can reason acausally would understand (anthropically) that were it to be misaligned, our world would be in the same probability mass as worlds where that message is intercepted by alien civilizations that then try to destroy it. This would hopefully incentivize against these AIs existing in the first place, by instrumental convergence.
Caveat: I certainly intend to violate the laws of physics at some point in the future should we make it, and I expect superintelligences to be able to as well given enough time. Therefore, this strategy works mainly in worlds where AGI self-improvement is bounded enough for its ability to intercept this message to be governed by some variant of the rocket equation.
As briefly mentioned above, giving a million dollars to the next person to walk into the MIRI offices clearly marks them as the luckiest person on the planet, and someone who could potentially have a very high impact especially in paradigms such as this. An even simpler strategy would be selectively hiring lottery winners to work on alignment.
This is, however, just one of a class of strategies we could employ in this spirit. For example, more sophisticated strategies may involve "tunable" serendipity. Consider the following set-up: a group of alignment researchers makes a very large series of coin-flips in pairs, with each calling heads or tails, to determine the luckiest among them. We continue the game until some researcher gets a series of X flips correct, for a tunable measure of luck we can select for. I plan to apply for funding both for paying these researchers for their time, and for the large number of coins I anticipate needing; if you want to help with this, please reach out!
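Before applying for the coin budget, the tournament can at least be simulated. The sketch below, under my own assumptions about the rules (every researcher flips each round; a wrong call resets the streak), runs the game until someone reaches a target streak of X correct calls, and reports who won and how many coins were spent.

```python
import random

def luck_tournament(n_researchers: int, target_streak: int,
                    rng: random.Random) -> tuple[int, int]:
    """Simulate the coin-flip luck tournament. Each round, every
    researcher calls a fair coin; a correct call extends their streak,
    a wrong call resets it. The first to reach `target_streak`
    consecutive correct calls is the designated luckiest researcher.
    Returns (winner_index, total_flips_used)."""
    streaks = [0] * n_researchers
    flips = 0
    while True:
        for i in range(n_researchers):
            call = rng.choice(("heads", "tails"))
            outcome = rng.choice(("heads", "tails"))
            flips += 1
            streaks[i] = streaks[i] + 1 if call == outcome else 0
            if streaks[i] >= target_streak:
                return i, flips
```

One immediate (and funding-relevant) observation from simulation: the expected number of flips grows exponentially in the target streak, so the tunable luck threshold should be tuned with the coin budget in mind.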
Ensure that our timeline gets as close to solving alignment the normal way as possible, so that acausal-reasoner AGIs in the branches where we get close but fail can trade with the AGIs in the branches that survive.
The obvious implication here is that we should pre-commit to AGIs that try to influence other branches if we survive, and stunt this ability before then, such that friendly branches have more anthropic measure and trade is therefore favorable on net.
This is very similar in spirit to the above ideas of "Identifying the Highest-Impact Researchers" and "GPT-Alignment Lottery," but seems worth stating generally: we could offer monetary rewards for the luckiest ideas, incentivizing researchers to come up with their own methods of ingratiating themselves with Mother Serendipity and funneling arbitrary optimization power toward furthering this agenda.
For instance, we could hold AI Safety conferences where several researchers are selected randomly to receive monetary prizes for their submitted ideas. This would have the side benefit of increasing participation in AI Safety conferences as well.
While I do think that most of our alpha comes from optimizing luck-based strategies for a new age, I don’t want to discard entirely existing ones. We may be smarter than the cumulative optimization power of human civilization to date, but it seems plausible that there are good ideas here we can adopt with low overhead.
For instance, we could train GPT systems to automatically generate chain letters on social media that purport to make our day luckier. If we're really going overboard (and I admit this is in slight violation of sticking to the archaic strategies), we could even fund an EA cause area of optimizing the quality and quantity of chain letters that alignment researchers receive, to maximize the luck we gain from this.
Likewise, we could train alignment researchers to carry lucky charms, adopt ritualistic good-luck routines, and generally create the illusion of a luckier environment for placebo effects.
Organize events where AI researchers are paired up for short, rapid discussions on alignment topics, with the hopes of stimulating unexpected connections and lucky breakthroughs by increasing the circulation of ideas.
Design a series of puzzles and challenges as a learning tool for alignment beginners that, when solved, progressively reveal more advanced concepts and tools. The goal is for participants to stumble upon a lucky solution while trying to solve these puzzles in these novel frames.
In a similar vein, I think that embracing high-variance strategies may be useful in general, albeit without the competitive advantage offered by luck. To that end, here are some ideas that are similar in spirit:
Investigate the possibility of engaging a team of psychic mediums to channel the spirits of great scientists, mathematicians, and philosophers from the past to help guide the design of aligned AI systems. I was surprised to find that there is a lot of prior work on a seemingly similar concept known as side-channels, making me think that this is even more promising than I had anticipated.
Note: While writing this post I didn’t notice the rather humorous coincidence of calling this idea similar in spirit - this was certainly unintentional and I hope that it doesn’t detract from the more sober tone of this post.
Create a collection of novels where every chapter presents different alignment challenges and solutions, and readers vote on which path to pursue. The winning path becomes the next chapter, democratically crowdsourcing the consensus alignment solution.
Launch a satellite that will broadcast alignment data into space, in the hope that an advanced alien civilization will intercept the message and provide us with the alignment solution we need.
Explore whether positive reinforcement techniques used in past life regression therapies can be applied to reinforce alignment behaviors in AI systems, making them more empathetic and attuned to human values. Refer to this for more along this line of thought.
I think this line of work is potentially extremely valuable, with few flaws that I can think of. For the most part, criticism should be leveled at me for having missed this approach for so long (that others missed it as well is scant consolation when we're not graded on a curve), so I'll keep this section short.
The sole piece of solid criticism I could find (and which I alluded to earlier) is not object-level, which I think speaks to the soundness of these ideas. Specifically, there is an argument to be made that this cause area should receive minimal funding: if we want to select for luck, people who can buy lottery tickets to get their own funding are probably much better suited (i.e., have a much stronger natural competitive advantage) for this kind of work.
Another line of potential criticism could be directed at the field of alignment in general for not deferring to domain experts on what practices to adopt to optimize luck, such as keeping mirrors intact and painting all cats white. I think this is misguided, however, as deference here would run afoul of the very reason this strategy is competitive! That we can apply thus-far-underutilized techniques to greatly augment their effectiveness is central to the viability of this direction.
A third point of criticism, which I also disagree with, relates to the nature of luck itself. Perhaps researching luck is inherently antithetical to the idea of luck, and we're dooming ourselves to a worse timeline than before. I think this is entirely fair; my disagreement stems from the fact that I'm one of the unluckiest people I know, and conditional on this post being made and you reading this far, researching luck still didn't stop me, or this post from being a post!
I will admit to no small amount of embarrassment at not realizing the sheer potential implied by this direction sooner. I assume that this is an implicit exercise left by existing top researchers to identify which newcomers have the ability to see past the absurd in truly high-dimensional, high-stakes spaces; conditional on this being true, I humbly apologize to all of you for taking this long and spoiling the surprise, but I believe this is too important to keep using as our collective in-group litmus test.
Succeeding in this endeavour might seem like finding a needle in a haystack, but when you consider the magnitude of the problem we face, the expected utility of this agenda is in itself almost as ridiculous as the agenda seems at face value.
I don’t generally endorse arguments that are downstream of deference to something in general, but the real world seems like something I can defer to begrudgingly while still claiming the mantle of “rationalist”.
Ironically enough, a post that was made earlier today describes Alex Turner realizing he made this very same error! Today seems to be a good day for touching grass for some reason.