Wiki Contributions


Forgive me if I’m missing something but as far as I can tell, this line of inquiry was already pursued to its logical conclusion (namely, an Agda-verified quinean proof of Löb’s theorem) in this paper by Jason Gross, Jack Gallagher and Benya Fallenstein.

That being said, from an orthogonality perspective, I don’t have any intuition (let alone reasoning) that says that this compassionate breed of LDT is necessary for any particular level of universe-remaking power, including the level needed for a decisive strategic advantage over the rest of Earth’s biosphere. If being a compassionate-LDT agent confers advantages over standard-FDT agents from a Darwinian selection perspective, it would have to be via group selection, but our default trajectory is to end up with a singleton, in which case standard-FDT might be reflectively stable. Perhaps eventually some causal or acausal interaction with non-earth-originating superintelligence would prompt a shift, but, as Nate says,

But that's not us trading with the AI; that's us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.

So, if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-share and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than that maybe an AGI would likely convergently become that way before taking over the world.

Insofar as I have hope… fully-fleshed-out version of UDT would recommend…uncertainty about which agent you are. (I haven't thought about this much and this could be crazy.)

For the record, I have a convergently similar intuition: FDT removes the Cartesian specialness of the ego at the decision nodes (by framing each decision as a mere logical consequence of an agent-neutral nonphysical fact about FDT itself), but retains the Cartesian specialness of the ego at the utility node(s). I’ve thought about this for O(10 hours), and I also believe it could be crazy, but it does align quite well with the conclusions of Compassionate Moral Realism.

Log-normal is a good first guess, but I think its tails are too small (at both ends).

Some alternatives to consider:

  • Erlang distribution (by when will k Poisson events have happened?), or its generalization, Generalized gamma distribution
  • Frechet distribution (what will be the max of a large number of i.i.d. samples?) or its generalization, Generalized extreme value distribution
  • Log-logistic distribution (like log-normal, but heavier-tailed), or its generalization, Singh–Maddala distribution

Of course, the best Bayesian forecast you could come up with, derived from multiple causal factors such as hardware and economics in addition to algorithms, would probably score a bit better than any simple closed-form family like this, but I'd guess literally only about 1 to 2 bits better (in terms of log-score).

“Concern, Respect, and Cooperation” is a contemporary moral-philosophy book by Garrett Cullity which advocates for a pluralistic foundation of morality, based on three distinct principles:

  • Concern: Moral patients’ welfare calls for promotion, protection, sensitivity, etc.
  • Respect: Moral patients’ self-expression calls for non-interference, listening, address, etc.
  • Cooperation: Worthwhile collective action calls for initiation, joining in, collective deliberation, sharing responsibility, etc. And one bonus principle, whose necessity he’s unsure of:
  • Protection: Precious objects call for protection, appreciation, and communication of the appreciation.

What I recently noticed here and want to write down is a loose correspondence between these different foundations for morality and some approaches to safe superintelligence:

  • CEV-maximization corresponds to finding a good enough definition of human welfare that Comcern alone suffices for safety.
  • Corrigibility corresponds to operationalizating some notion of Respect that would alone suffice for safety.
  • Multi-agent approaches lean in the direction of Cooperation.
  • Approaches that aim to just solve literally the “superintelligence that doesn’t destroy us” problem, without regard for the cosmic endowment, sometimes look like Protection.

Cullity argues that none of his principles is individually a satisfying foundation for morality, but that all four together (elaborated in certain ways with many caveats) seem adequate (and maybe just the first three). I have a similar intuition about AI safety approaches. I can’t yet make the analogy precise, but I feel worried when I imagine corrigibility alone, CEV alone, bargaining alone (whether causal or acausal), or Earth-as-wildlife-preserve; whereas I feel pretty good imagining a superintelligence that somehow balances all four. I can imagine that one of them might suffice as a foundation for the others, but I think this would be path-dependent at best. I would be excited about work that tries to do for Cullity’s entire framework what CEV does for pure single-agent utilitarianism (namely, make it more coherent and robust and closer to something that could be formally specified).

So how are we supposed to solve ELK, if we are to assume that it's intractable?

A different answer to this could be that a "solution" to ELK is one that is computable, even if intractable. By analogy, algorithmic "solutions" to probabilistic inference on Bayes nets are still solutions even though the problem is provably NP-hard. It's up to the authors of ELK to disambiguate what they're looking for in a "solution," and I like the ideas here (especially in Level 2), but just wanted to point out this alternative to the premise.

My impression of the plurality perspective around here is that the examples you give (e.g. overweighting contemporary ideology, reinforcing non-truth-seeking discourse patterns, and people accidentally damaging themselves with AI-enabled exotic experiences) are considered unfortunate but acceptable defects in a "safe" transition to a world with superintelligences. These scenarios don't violate existential safety because something that is still recognizably humanity has survived (perhaps even more recognizably human than you and I would hope for).

I agree with your sense that these are salient bad outcomes, but I think they can only be considered "existentially bad" if they plausibly get "locked-in," i.e. persist throughout a substantial fraction of some exponentially-discounted future light-cone. I think Paul's argument amounts to saying that a corrigibility approach focuses directly on mitigating the "lock-in" of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking-in its best guess.

Congratulations to all the new winners!

This seems like a good time to mention that, after I was awarded a retroactive prize alongside the announcement of this contest, I estimated I’d have been about half as likely to have generated my idea without having read an earlier comment by Scott Viteri, so based on Shapley value I suggested reallocating 25% of my prize to Scott, which I believe ARC did. I’m delighted to see Scott also getting a prize for a full proposal.

I’m excited for the new proposals and counterexamples to be published so that we can all continue to build on each others’ ideas in the open.

Useful primitives for incentivizing alignment-relevant metrics without compromising on task performance might include methods like Orthogonal Gradient Descent or Averaged Gradient Episodic Memory, evaluated and published in the setting of continual learning or multi-task learning. Something like “answer questions honestly” could mathematically be thought of as an additional task to learn, rather than as an inductive bias or regularization to incorporate. And I think these two training modifications are quite natural (I just came to essentially the same ideas independently and then thought “if either of these would work then surely the multi-task learning folks would be doing them?” and then I checked and indeed they are). Just some more nifty widgets to add to my/our toolbox.

Out of curiosity, this morning I did a literature search about "hard-coded optimization" in the gradient-based learning space—that is, people deliberately setting up "inner" optimizers in their neural networks because it seems like a good way to solve tasks. (To clarify, I don't mean deliberately trying to make a general-purpose architecture learn an optimization algorithm, but rather, baking an optimization algorithm into an architecture and letting the architecture learn what to do with it.)

Why is this interesting?

  • The most compelling arguments in Risks from Learned Optimzation that mesa-optimizers will appear involve competitiveness: incorporating online optimization into a policy can help with generalization, compression, etc.
    • If inference-time optimization really does help competitiveness, we should expect to see some of the relevant competitors trying to do it on purpose.
    • I recall some folks saying in 2019 that the apparent lack of this seemed like evidence against the arguments that mesa-optimizers will be competitive.
    • To the extent there is now a trend toward explicit usage of inference-time optimizers, that supports the arguments that mesa-optimizers would be competitive, and thus may emerge accidentally as general-purpose architectures scale up.
  • More importantly (and also mentioned in Risks from Learned Optimzation, as "hard-coded optimization"), if the above arguments hold, then it would help safety to bake in inference-time optimization on purpose, since we can better control and understand optimization when it's engineered—assuming that engineering it doesn't sacrifice task performance (so that the incentive for the base optimizer to evolve a de novo mesa-optimizer is removed).
    • So, engineered inference-time optimization is plausibly one of those few capabilities research directions that is weakly differential tech development in the sense that it accelerates safe AI more than it accelerates unsafe AI (although it accelerates both). I'm not confident enough about this to say that it's a good direction to work on, but I am saying it seems like a good direction to be aware of and occasionally glance at.
    • My impression is that a majority of AI alignment/safety agendas/proposals the last few years have carried a standard caveat that they don't address the inner alignment problem at all, or at least deceptive alignment in particular.
    • As far as I can tell, there are few public directions about how to address deceptive alignment:
    • Although hard-coding optimization certainly doesn't rule out learned optimization, I'm optimistic that it may be an important component of a suite of safety mechanisms (combined with "transparency via ELK" and perhaps one or two other major ideas, which may not be discovered yet) that finally rule out deceptive alignment.

Anyway, here's (some of) what I found:

Here are some other potentially relevant papers I haven't processed yet:

  • Ma, Hengbo, Bike Zhang, Masayoshi Tomizuka, and Koushil Sreenath. “Learning Differentiable Safety-Critical Control Using Control Barrier Functions for Generalization to Novel Environments.” ArXiv:2201.01347 [Cs, Eess], January 7, 2022.
  • Rojas, Junior, Eftychios Sifakis, and Ladislav Kavan. “Differentiable Implicit Soft-Body Physics.” ArXiv:2102.05791 [Cs], September 9, 2021.
  • Srinivas, Aravind, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. “Universal Planning Networks: Learning Generalizable Representations for Visuomotor Control.” In Proceedings of the 35th International Conference on Machine Learning, 4732–41. PMLR, 2018.
Load More