Wiki Contributions


What's a good probability distribution family (e.g. "log-normal") to use for AGI timelines?

Log-normal is a good first guess, but I think its tails are too small (at both ends).

Some alternatives to consider:

  • Erlang distribution (by when will k Poisson events have happened?), or its generalization, Generalized gamma distribution
  • Fréchet distribution (what will be the max of a large number of i.i.d. samples?) or its generalization, Generalized extreme value distribution
  • Log-logistic distribution (like log-normal, but heavier-tailed), or its generalization, Singh–Maddala distribution

Of course, the best Bayesian forecast you could come up with, derived from multiple causal factors such as hardware and economics in addition to algorithms, would probably score a bit better than any simple closed-form family like this, but I'd guess literally only about 1 to 2 bits better (in terms of log-score).
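As a rough numerical illustration of the tail claim, here is a sketch comparing survival functions of the three families. The parameters are assumptions chosen only for illustration (a common median of 30 years and hand-picked shapes), not calibrated forecasts:

```python
# Closed-form survival functions P(T > t) for three candidate timeline
# families, each calibrated to the same illustrative median of 30 years.
# Shape parameters are arbitrary illustrative choices, not fitted values.
import math

MEDIAN = 30.0

def lognormal_sf(t, sigma=1.0):
    # Log-normal with median MEDIAN: ln T ~ Normal(ln MEDIAN, sigma^2)
    return 0.5 * math.erfc(math.log(t / MEDIAN) / (sigma * math.sqrt(2)))

def loglogistic_sf(t, c=2.0):
    # Log-logistic with median MEDIAN and shape c (smaller c = heavier tail)
    return 1.0 / (1.0 + (t / MEDIAN) ** c)

def frechet_sf(t, alpha=2.0):
    # Frechet: CDF = exp(-(t/s)^-alpha), with s chosen so that SF(MEDIAN) = 0.5
    s = MEDIAN * math.log(2) ** (1.0 / alpha)
    return 1.0 - math.exp(-((t / s) ** -alpha))

for name, sf in [("log-normal", lognormal_sf),
                 ("log-logistic", loglogistic_sf),
                 ("Frechet", frechet_sf)]:
    print(f"{name:13s} P(>100y) = {sf(100):.4f}   P(>1000y) = {sf(1000):.5f}")
```

With these particular shapes, both alternatives assign noticeably more probability than the log-normal to timelines beyond 1000 years, illustrating the heavier right tails; where exactly the curves cross depends entirely on the assumed shape parameters.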

davidad's Shortform

“Concern, Respect, and Cooperation” is a contemporary moral-philosophy book by Garrett Cullity which advocates for a pluralistic foundation of morality, based on three distinct principles:

  • Concern: Moral patients’ welfare calls for promotion, protection, sensitivity, etc.
  • Respect: Moral patients’ self-expression calls for non-interference, listening, address, etc.
  • Cooperation: Worthwhile collective action calls for initiation, joining in, collective deliberation, sharing responsibility, etc.

And one bonus principle, whose necessity he’s unsure of:

  • Protection: Precious objects call for protection, appreciation, and communication of the appreciation.

What I recently noticed here and want to write down is a loose correspondence between these different foundations for morality and some approaches to safe superintelligence:

  • CEV-maximization corresponds to finding a good enough definition of human welfare that Concern alone suffices for safety.
  • Corrigibility corresponds to operationalizing some notion of Respect that would alone suffice for safety.
  • Multi-agent approaches lean in the direction of Cooperation.
  • Approaches that aim to just solve literally the “superintelligence that doesn’t destroy us” problem, without regard for the cosmic endowment, sometimes look like Protection.

Cullity argues that none of his principles is individually a satisfying foundation for morality, but that all four together (elaborated in certain ways with many caveats) seem adequate (and maybe just the first three). I have a similar intuition about AI safety approaches. I can’t yet make the analogy precise, but I feel worried when I imagine corrigibility alone, CEV alone, bargaining alone (whether causal or acausal), or Earth-as-wildlife-preserve; whereas I feel pretty good imagining a superintelligence that somehow balances all four. I can imagine that one of them might suffice as a foundation for the others, but I think this would be path-dependent at best. I would be excited about work that tries to do for Cullity’s entire framework what CEV does for pure single-agent utilitarianism (namely, make it more coherent and robust and closer to something that could be formally specified).

ELK Computational Complexity: Three Levels of Difficulty

So how are we supposed to solve ELK, if we are to assume that it's intractable?

A different answer to this could be that a "solution" to ELK is one that is computable, even if intractable. By analogy, algorithmic "solutions" to probabilistic inference on Bayes nets are still solutions even though the problem is provably NP-hard. It's up to the authors of ELK to disambiguate what they're looking for in a "solution," and I like the ideas here (especially in Level 2), but just wanted to point out this alternative to the premise.

A broad basin of attraction around human values?

My impression of the plurality perspective around here is that the examples you give (e.g. overweighting contemporary ideology, reinforcing non-truth-seeking discourse patterns, and people accidentally damaging themselves with AI-enabled exotic experiences) are considered unfortunate but acceptable defects in a "safe" transition to a world with superintelligences. These scenarios don't violate existential safety because something that is still recognizably humanity has survived (perhaps even more recognizably human than you and I would hope for).

I agree with your sense that these are salient bad outcomes, but I think they can only be considered "existentially bad" if they plausibly get "locked-in," i.e. persist throughout a substantial fraction of some exponentially-discounted future light-cone. I think Paul's argument amounts to saying that a corrigibility approach focuses directly on mitigating the "lock-in" of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking-in its best guess.

ELK First Round Contest Winners

Congratulations to all the new winners!

This seems like a good time to mention that, after I was awarded a retroactive prize alongside the announcement of this contest, I estimated I’d have been about half as likely to have generated my idea without having read an earlier comment by Scott Viteri, so based on Shapley value I suggested reallocating 25% of my prize to Scott, which I believe ARC did. I’m delighted to see Scott also getting a prize for a full proposal.

I’m excited for the new proposals and counterexamples to be published so that we can all continue to build on each others’ ideas in the open.

davidad's Shortform

Useful primitives for incentivizing alignment-relevant metrics without compromising on task performance might include methods like Orthogonal Gradient Descent or Averaged Gradient Episodic Memory, evaluated and published in the setting of continual learning or multi-task learning. Something like “answer questions honestly” could mathematically be thought of as an additional task to learn, rather than as an inductive bias or regularization to incorporate. And I think these two training modifications are quite natural (I just came to essentially the same ideas independently and then thought “if either of these would work then surely the multi-task learning folks would be doing them?” and then I checked and indeed they are). Just some more nifty widgets to add to my/our toolbox.
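The core move in Orthogonal Gradient Descent can be sketched in a few lines: before applying an update for a new task (such as an "answer questions honestly" objective), project its gradient to be orthogonal to stored gradient directions from earlier tasks, so that earlier-task loss is unchanged to first order. This is a pure-Python toy, not the authors' implementation; in practice the stored gradients are first orthonormalized and kept as a small memory buffer.

```python
# Toy sketch of the Orthogonal Gradient Descent projection step.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(grad, stored_grads):
    """Subtract from `grad` its component along each stored direction."""
    g = list(grad)
    for s in stored_grads:
        coef = dot(g, s) / dot(s, s)
        g = [gi - coef * si for gi, si in zip(g, s)]
    return g

old_task_grad = [1.0, 0.0, 0.0]   # direction that would disturb the old task
new_task_grad = [1.0, 2.0, 3.0]   # proposed update for the new task
safe_grad = project_orthogonal(new_task_grad, [old_task_grad])
print(safe_grad)                  # → [0.0, 2.0, 3.0]: no component along old_task_grad
```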

davidad's Shortform

Out of curiosity, this morning I did a literature search about "hard-coded optimization" in the gradient-based learning space—that is, people deliberately setting up "inner" optimizers in their neural networks because it seems like a good way to solve tasks. (To clarify, I don't mean deliberately trying to make a general-purpose architecture learn an optimization algorithm, but rather, baking an optimization algorithm into an architecture and letting the architecture learn what to do with it.)

Why is this interesting?

  • The most compelling arguments in Risks from Learned Optimization that mesa-optimizers will appear involve competitiveness: incorporating online optimization into a policy can help with generalization, compression, etc.
    • If inference-time optimization really does help competitiveness, we should expect to see some of the relevant competitors trying to do it on purpose.
    • I recall some folks saying in 2019 that the apparent lack of this seemed like evidence against the arguments that mesa-optimizers will be competitive.
    • To the extent there is now a trend toward explicit usage of inference-time optimizers, that supports the arguments that mesa-optimizers would be competitive, and thus may emerge accidentally as general-purpose architectures scale up.
  • More importantly (and also mentioned in Risks from Learned Optimization, as "hard-coded optimization"), if the above arguments hold, then it would help safety to bake in inference-time optimization on purpose, since we can better control and understand optimization when it's engineered—assuming that engineering it doesn't sacrifice task performance (so that the incentive for the base optimizer to evolve a de novo mesa-optimizer is removed).
    • So, engineered inference-time optimization is plausibly one of those few capabilities research directions that is weakly differential tech development in the sense that it accelerates safe AI more than it accelerates unsafe AI (although it accelerates both). I'm not confident enough about this to say that it's a good direction to work on, but I am saying it seems like a good direction to be aware of and occasionally glance at.
    • My impression is that a majority of AI alignment/safety agendas/proposals the last few years have carried a standard caveat that they don't address the inner alignment problem at all, or at least deceptive alignment in particular.
    • As far as I can tell, there are few public directions about how to address deceptive alignment.
    • Although hard-coding optimization certainly doesn't rule out learned optimization, I'm optimistic that it may be an important component of a suite of safety mechanisms (combined with "transparency via ELK" and perhaps one or two other major ideas, which may not be discovered yet) that finally rule out deceptive alignment.
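The "baking an optimization algorithm into an architecture" idea can be sketched as a forward pass whose inner loop is an explicit, inspectable optimizer. This toy is not taken from any particular paper; the learned parameters shape the inner objective, while the optimizer itself is engineered, so every inner iterate can be examined:

```python
# Toy sketch of hard-coded inference-time optimization: the forward pass
# explicitly runs gradient steps on an internal objective, rather than
# hoping a generic network learns an opaque optimization algorithm.

def forward(x, theta, steps=50, lr=0.1):
    """Inner loop: minimize E(z) = (z - theta * x)^2 over the latent z."""
    z = 0.0
    trace = []                      # inspectable: every inner iterate
    for _ in range(steps):
        grad = 2.0 * (z - theta * x)
        z -= lr * grad
        trace.append(z)
    return z, trace

out, trace = forward(x=3.0, theta=2.0)
print(round(out, 4))                # converges toward theta * x = 6.0
```

Because the inner objective and iterates are explicit, tools like trajectory inspection or certified bounds can be applied to the optimization itself, which is the safety-relevant difference from an emergent mesa-optimizer.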

Anyway, here's (some of) what I found:

Here are some other potentially relevant papers I haven't processed yet:

  • Ma, Hengbo, Bike Zhang, Masayoshi Tomizuka, and Koushil Sreenath. “Learning Differentiable Safety-Critical Control Using Control Barrier Functions for Generalization to Novel Environments.” arXiv:2201.01347 [cs, eess], January 7, 2022. http://arxiv.org/abs/2201.01347.
  • Rojas, Junior, Eftychios Sifakis, and Ladislav Kavan. “Differentiable Implicit Soft-Body Physics.” arXiv:2102.05791 [cs], September 9, 2021. http://arxiv.org/abs/2102.05791.
  • Srinivas, Aravind, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. “Universal Planning Networks: Learning Generalizable Representations for Visuomotor Control.” In Proceedings of the 35th International Conference on Machine Learning, 4732–41. PMLR, 2018. https://proceedings.mlr.press/v80/srinivas18b.html.
Worst-case thinking in AI alignment

Agreed—although optimizing for the worst case is usually easier than optimizing for the average case, satisficing for the worst case is necessarily harder (and, in ML, typically impossible) than satisficing for the average case.

Worst-case thinking in AI alignment

Here's the results of an abbreviated literature search for papers that bring quantile-case concepts into contact with contemporary RL and/or deep learning:

  • Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. Christoph Dann, Tor Lattimore, Emma Brunskill. NIPS 2017.
    • Defines a concept of "Uniform-PAC bound", which is roughly when the (1−δ)-quantile-case episodic regret scales polynomially in log(1/δ), uniformly over δ.
    • Proves that a Uniform-PAC bound implies:
      • PAC bound
      • Uniform high-probability regret bound
      • Convergence to zero regret with high probability
    • Constructs an algorithm, UBEV, that has a Uniform-PAC bound
    • Empirically compares quite favorably to other algorithms with only PAC or regret bounds
  • Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill. ICML 2019.
    • Defines an even stronger concept of "IPOC bound", which implies Uniform-PAC, and also outputs a certified per-episode regret bound along with each proposed action.
    • Constructs an algorithm ORLC that has an IPOC-bound
    • Empirically compares favorably to UBEV
  • Revisiting Generalization for Deep Learning: PAC-Bayes, Flat Minima, and Generative Models. Gintare Dziugaite. December 2018 PhD thesis under Zoubin Ghahramani.
  • Lipschitz Lifelong Reinforcement Learning. Erwan Lecarpentier, David Abel, Kavosh Asadi, et al. AAAI 2021.
    • Defines a pseudometric on the space of all MDPs
    • Proves that the mapping from an MDP to its optimal Q-function is (pseudo-)Lipschitz
    • Uses this to construct an algorithm LRMax that can transfer-learn from past similar MDPs while also being PAC-MDP
  • Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. Jiafan He, Dongruo Zhou, Quanquan Gu. NeurIPS 2021.
    • Constructs an algorithm FLUTE that has a Uniform-PAC bound with a certain linearity assumption on the structure of the MDP being learned.
  • Beyond No Regret: Instance-Dependent PAC RL. Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson. August 2021 preprint.
  • Learning PAC-Bayes Priors for Probabilistic Neural Networks. María Pérez-Ortiz, Omar Rivasplata, Benjamin Guedj, et al. September 2021 preprint.
  • Tighter Risk Certificates for Neural Networks. María Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, Csaba Szepesvári. ICML 2021.
  • PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, Andreas Krause. ICML 2021.
Worst-case thinking in AI alignment

Somewhere between worst-case and average-case performance is quantile-case performance, known in SRE circles as percentile latency and widely measured empirically in practice (but rarely estimated in theory). Formally, optimizing δ-quantile-case performance looks like max_π Q_δ^{e∼E}[Perf(π, e)], i.e. maximizing the δ-quantile of the performance distribution rather than its minimum or its expectation (compare to my expressions below for other cases). My impression is that quantile-case is heavily underexplored in theoretical CS and also underused in ML, with the exceptions of PAC learning and VC theory.
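An empirical toy version makes the contrast concrete. The performance numbers here are made up for illustration: a policy that does well in most environments but fails badly in a couple of rare ones.

```python
# Worst-case, average-case, and delta-quantile-case scores of a fixed
# policy over a finite set of environments (illustrative numbers only).

def quantile(xs, q):
    """Empirical q-quantile by order statistic (simple, no interpolation)."""
    xs = sorted(xs)
    idx = min(int(q * len(xs)), len(xs) - 1)
    return xs[idx]

perf = [0.9, 0.8, 0.95, 0.2, 0.85, 0.88, 0.92, 0.05, 0.9, 0.87]

worst = min(perf)                  # worst case: dominated by the rare failures
average = sum(perf) / len(perf)    # average case: mostly hides the failures
q10 = quantile(perf, 0.10)         # 10th-percentile case: sits between the two
print(worst, q10, average)         # → 0.05 0.2 0.732
```

The quantile-case score ignores the very worst δ fraction of environments while still penalizing common failure modes, which is exactly the regime percentile-latency SLOs operate in.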
