Andrew Critch

This is Dr. Andrew Critch's professional LessWrong account. Andrew is the CEO of Encultured AI, and works ~1 day/week as a Research Scientist at the Center for Human-Compatible AI (CHAI) at UC Berkeley. He also spends around a ½ day per week volunteering for other projects like the Berkeley Existential Risk Initiative and the Survival and Flourishing Fund. Andrew earned his Ph.D. in mathematics at UC Berkeley studying applications of algebraic geometry to machine learning models. During that time, he cofounded the Center for Applied Rationality and SPARC. Dr. Critch has been offered university faculty and research positions in mathematics, mathematical biosciences, and philosophy, and has worked as an algorithmic stock trader at Jane Street Capital's New York City office and as a Research Fellow at the Machine Intelligence Research Institute. His current research interests include logical uncertainty, open source game theory, and mitigating race dynamics between companies and nations in AI development.



Jan, I agree with your references, especially Friston et al.  I think those kinds of understanding, as you say, have not adequately made their way into utility-theoretic fields like econ and game theory, so I think the post is valid as a statement about the state of understanding in those utility-oriented fields.  (Note that the post is about "a missing concept from the axioms of game theory and bargaining theory" and "a key missing concept from utility theory", and not "concepts missing from the mind of all of humanity".)

8. (Unscoped) Consequentialism — the problem that an AI system engaging in consequentialist reasoning, for many objectives, is at odds with corrigibility and containment (Yudkowsky, 2022, no. 23).

7. Preference plasticity — the possibility of changes to human preferences over time, and the challenge of defining alignment in light of time-varying preferences (Russell, 2019, p.263).

6. Mesa-optimizers — instances of learned models that are themselves optimizers, which give rise to the so-called inner alignment problem (Hubinger et al., 2019).

5. Counterfactuals in decision theory — the problem of defining what would have happened if an AI system had made a different choice, such as in the Twin Prisoner's Dilemma (Yudkowsky & Soares, 2017).

4. Impact regularization — the problem of formalizing "change to the environment" in a way that can be effectively used as a regularizer penalizing negative side effects from AI systems (Amodei et al., 2016).

3. Mild optimization — the problem of designing AI systems and objective functions that, in an intuitive sense, don't optimize more than they have to (Taylor et al., 2016).

2. Corrigibility — the problem of constructing a mind that will cooperate with what its creators regard as a corrective intervention (Soares et al., 2015).

1. AI boxing / containment — the method and challenge of confining an AI system to a "box", i.e., preventing the system from interacting with the external world except through specific restricted output channels (Bostrom, 2014, p.129).

In Part 3 of this series, I plan to write a shallow survey of 8 problems relating to AI alignment, and the relationship of the «boundary» concept to formalizing them.  To save time, I'd like to do a deep dive into just one of the eight problems, based on what commenters here would find most interesting.  If you have a moment, please use the "agree" button (and where desired, "disagree") to vote for which of the eight topics I should go into depth about.  Each topic is given as a subcomment below (not looking for karma, just agree/disagree votes).  Thanks!
