Xuan (Tan Zhi Xuan)

PhD student at MIT (ProbComp / CoCoSci), working on probabilistic programming for agent understanding and value alignment.

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

Adding some thoughts as someone who works on probabilistic programming, and has colleagues who work on neurosymbolic approaches to program synthesis:

  • I think a lot of Bayes net structure learning / program synthesis approaches (Bayesian or otherwise) have the issue of uninformative variable names, but I do think it's possible to distinguish between structural interpretability and naming interpretability, as others have noted.
  • In practice, most neural or Bayesian program synthesis applications I'm aware of exhibit something like structural interpretability, because the hypothesis space they live in is designed by modelers to have human-interpretable semantic structure. Two good examples are the prior over programs that generate handwritten characters in Lake et al (2015), and the PCFG prior over Gaussian Process covariance kernels in Saad et al (2019). See e.g. Figure 6 of the latter for how one can analyze programs generated by this prior to determine whether a particular time series is likely to be periodic, has a linear trend, has a changepoint, etc.
  • Regarding uninformative variable names, there's ongoing work on using natural language to guide program synthesis, so as to come up with more language-like conceptual abstractions (e.g. Wong et al 2021). I wouldn't be surprised if these approaches could also be extended to come up with informative variable and function names / comments. A related line of work is that people are starting to use LLMs to deobfuscate code (e.g. Lachaux et al 2021), and I expect the same techniques will work for synthesized code.
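The structural-interpretability point above can be made concrete with a minimal sketch of a PCFG-style prior over GP kernel expressions, in the spirit of (but much simpler than) Saad et al (2019). The grammar, primitives, and probabilities here are my own illustrative choices, not the paper's actual specification:

```python
import random

def sample_kernel(depth=0, max_depth=3):
    """Sample a covariance-kernel expression from a toy PCFG."""
    # Base kernels carry interpretable meaning: LIN -> linear trend,
    # PER -> periodicity, SE -> smooth local variation.
    base = ["LIN", "PER", "SE"]
    if depth >= max_depth or random.random() < 0.6:
        return random.choice(base)
    # Composite kernels compose structure: sums and products of sub-kernels.
    op = random.choice(["+", "*"])
    left = sample_kernel(depth + 1, max_depth)
    right = sample_kernel(depth + 1, max_depth)
    return f"({left} {op} {right})"

random.seed(0)
for _ in range(3):
    print(sample_kernel())
```

A synthesized expression like `(LIN + PER)` can be read off directly as "linear trend plus periodic component", which is exactly the kind of interpretability the hypothesis space buys you, independent of variable naming.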

For these reasons, I'm more optimistic about the interpretability prospects of learning approaches that generate models or code that look like traditional symbolic programs, relative to end-to-end deep learning approaches. (Note that neural networks are also "symbolic programs", just written with a more restricted set of [differentiable] primitives, and typically staying within a set of widely used program structures [i.e. neural architectures]). 

The more difficult question IMO is whether this interpretability comes at the cost of capabilities. I think this is possibly true in some domains (e.g. learning low-level visual patterns and cues), but not others (e.g. learning the compositional structure of e.g. furniture-like objects).

Building brain-inspired AGI is infinitely easier than understanding the brain

I haven't seen compelling (to me) examples of people going successfully from psychology to algorithms without stopping to consider anything whatsoever about how the brain is constructed.

Some recent examples, off the top of my head!

One reason it's tricky to make sense of psychology data on its own, I think, is the interplay between (1) learning algorithms, (2) learned content (a.k.a. "trained models"), (3) innate hardwired behaviors (mainly in the brainstem & hypothalamus). What you especially want for AGI is to learn about #1, but experiments on adults are dominated by #2, and experiments on infants are dominated by #3, I think.

I guess this depends on how much you think we can make progress towards AGI by learning what's innate / hardwired / learned at an early age in humans and building that into AI systems, vs. taking more of a "learn everything" approach! I personally think there may still be a lot of interesting human-like thinking and problem-solving strategies that we haven't yet figured out how to implement as algorithms (e.g. how humans learn to program, and edit and modify programs and libraries to make them better over time), so adult and child studies would be useful for characterizing what we might even be aiming for, even if ultimately the solution is to use some kind of generic learning algorithm to reproduce it. I also think there's a fruitful middle ground between (1) and (3), which is to ask, "What are the inductive biases that guide human learning?", a question I think you can make a lot of headway on without getting down to the neural level.

Building brain-inspired AGI is infinitely easier than understanding the brain

This was a great read! I wonder how much you're committed to "brain-inspired" vs "mind-inspired" AGI, given that the approach to "understanding the human brain" you outline seems to correspond to Marr's computational and algorithmic levels of analysis, as opposed to the implementational level (see link for reference). In which case, some would argue, you don't necessarily have to do too much neuroscience to reverse engineer human intelligence. A lot can be gleaned by doing classic psychological experiments to validate the functional roles of various aspects of human intelligence, before examining in more detail their algorithms and data structures (perhaps this time with the help of brain imaging, but also carefully designed experiments that elicit human problem solving heuristics, search strategies, and learning curves).

I ask because I think "brain-inspired" often gets immediately associated with neural networks, and not, say, methods for fast and approximate Bayesian inference (MCMC, particle filters), which are less in the AI zeitgeist nowadays, but still very much how cognitive scientists understand the human mind and its capabilities.


Hierarchical planning: context agents

Yup! And yeah I think those are open research questions -- inference over certain kinds of non-parametric Bayesian models is tractable, but not in general. What makes me optimistic is that humans in similar cultures have similar priors over vast spaces of goals, and seem to do inference over that vast space in a fairly tractable manner. I think things get harder when you can't assume shared priors over goal structure or task structure, both for humans and machines.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Belatedly reading this and have a lot of thoughts about the connection between this issue and robustness to ontological shifts (which I've written a bit about here), but I wanted to share a paper which takes a very small step in addressing some of these questions by detecting when the human's world model may diverge from a robot's world model, and using that as an explanation for why a human might seem to be acting in strange or counter-productive ways:

Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior
Siddharth Reddy, Anca D. Dragan, Sergey Levine

Inferring intent from observed behavior has been studied extensively within the frameworks of Bayesian inverse planning and inverse reinforcement learning. These methods infer a goal or reward function that best explains the actions of the observed agent, typically a human demonstrator. Another agent can use this inferred intent to predict, imitate, or assist the human user. However, a central assumption in inverse reinforcement learning is that the demonstrator is close to optimal. While models of suboptimal behavior exist, they typically assume that suboptimal actions are the result of some type of random noise or a known cognitive bias, like temporal inconsistency. In this paper, we take an alternative approach, and model suboptimal behavior as the result of internal model misspecification: the reason that user actions might deviate from near-optimal actions is that the user has an incorrect set of beliefs about the rules -- the dynamics -- governing how actions affect the environment. Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user's internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent. We demonstrate in simulation and in a user study with 12 participants that this approach enables us to more accurately model human intent, and can be used in a variety of applications, including offering assistance in a shared autonomy framework and inferring human preferences.

Writing Causal Models Like We Write Programs

Belatedly seeing this post, but I wanted to note that probabilistic programming languages (PPLs) are centered around this basic idea! Some useful links and introductions to PPLs as a whole:
- Probabilistic models of cognition (web book)
- WebPPL
- An introduction to models in Pyro
- Introduction to Modeling in Gen
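To illustrate the basic idea behind the links above, that causal models can be written as ordinary programs, here is a minimal sketch in plain Python with no PPL dependency. The sprinkler example and its probabilities are standard illustrative choices, not taken from any of the linked tutorials:

```python
import random

def sprinkler_model():
    """A causal model written as an ordinary program."""
    rain = random.random() < 0.2                               # exogenous cause
    sprinkler = random.random() < (0.01 if rain else 0.4)      # depends on rain
    # wet_grass depends causally on both parents, as in a Bayes net,
    # but the dependence is expressed as ordinary control/data flow.
    wet_grass = random.random() < (0.99 if (rain or sprinkler) else 0.05)
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet_grass}

# Forward simulation approximates the joint distribution:
random.seed(0)
samples = [sprinkler_model() for _ in range(10000)]
p_wet = sum(s["wet"] for s in samples) / len(samples)
print(round(p_wet, 2))
```

Real PPLs (WebPPL, Pyro, Gen) add to this the ability to condition on observed variables and run general-purpose inference over the program's random choices, rather than only simulating forward.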

And here's a really fascinating paper by some of my colleagues that tries to model causal interventions that go beyond Pearl's do-operator, by formalizing causal interventions as (probabilistic) program transformations:

Bayesian causal inference via probabilistic program synthesis
Sam Witty, Alexander Lew, David Jensen, Vikash Mansinghka

Causal inference can be formalized as Bayesian inference that combines a prior distribution over causal models and likelihoods that account for both observations and interventions. We show that it is possible to implement this approach using a sufficiently expressive probabilistic programming language. Priors are represented using probabilistic programs that generate source code in a domain specific language. Interventions are represented using probabilistic programs that edit this source code to modify the original generative process. This approach makes it straightforward to incorporate data from atomic interventions, as well as shift interventions, variance-scaling interventions, and other interventions that modify causal structure. This approach also enables the use of general-purpose inference machinery for probabilistic programs to infer probable causal structures and parameters from data. This abstract describes a prototype of this approach in the Gen probabilistic programming language.
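The core move in the abstract above, representing interventions as edits to the model's source, can be sketched in plain Python. Here a "program" is a dict of node mechanisms and interventions are functions that return an edited copy; this representation is my own illustrative simplification, not the paper's actual Gen implementation:

```python
import random

def make_model():
    # Each node is a mechanism: a function from the environment so far to a value.
    return {
        "rain":      lambda env: random.random() < 0.2,
        "sprinkler": lambda env: random.random() < (0.01 if env["rain"] else 0.4),
        "wet":       lambda env: random.random() < (0.99 if (env["rain"] or env["sprinkler"]) else 0.05),
    }

def simulate(model):
    env = {}
    for name, fn in model.items():  # dicts preserve insertion (topological) order
        env[name] = fn(env)
    return env

def do(model, node, value):
    # Pearl's atomic intervention: sever the node from its parents by
    # replacing its mechanism with a constant.
    edited = dict(model)
    edited[node] = lambda env: value
    return edited

def shift(model, node, delta):
    # A non-atomic intervention: modify, rather than sever, the mechanism,
    # here by OR-ing in extra activations with probability delta.
    old = model[node]
    edited = dict(model)
    edited[node] = lambda env: old(env) or (random.random() < delta)
    return edited

random.seed(0)
intervened = do(make_model(), "sprinkler", True)
print(simulate(intervened)["sprinkler"])  # → True (mechanism replaced by constant)
```

Because interventions are just program transformations, adding a new intervention type means writing a new source-editing function, with no change to the simulator.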

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

Replying to the specific comments:

This still seems like a fair way to evaluate what the alignment community thinks about, but I think it is going to overestimate how parochial the community is. For example, if you go by "what does Stuart Russell think is important", I expect you get a very different view on the field, much of which won't be in the Alignment Newsletter.

I agree. I intended to gesture a little bit at this when I mentioned that "Until more recently, it's also been excluded and not taken very seriously within traditional academia", because I think one source of greater diversity has been the uptake of AI alignment in traditional academia, leading to slightly more interdisciplinary work, as well as a greater diversity of AI approaches. I happen to think that CHAI's research publications page reflects more of the diversity of approaches I would like to see, and wish that more new researchers were aware of them (as opposed to the advice currently given by, e.g., 80K, which is to skill up in deep learning and deep RL).

Reward functions are typically allowed to depend on actions, and the alignment community is particularly likely to use reward functions on entire trajectories, which can express arbitrary views (though I agree that many views are not "naturally" expressed in this framework).

Yup, I think purely at the level of expressivity, reward functions on a sufficiently extended state space can express basically anything you want. That still doesn't resolve several worries I have though:

  1. Talking about all human motivation using "rewards" tends to promote certain (behaviorist / Humean) patterns of thought over others. In particular I think it tends to obscure the logical and hierarchical structure of many aspects of human motivation -- e.g., that many of our goals are actually instrumental sub-goals in higher-level plans, and that we can cite reasons for believing, wanting, or planning to do a certain thing. I would prefer if people used terms like "reasons for action" and "motivational states", rather than simply "reward functions".
  2. Even if reward functions can express everything you want them to, that doesn't mean they'll be able to learn everything you want them to, or generalize in the appropriate ways. e.g., I think deep RL agents are unlikely to learn the concept of "promises" in a way that generalizes robustly, unless you give them some kind of inductive bias that leads them to favor structures like LTL formulas. (This is a worry related to Stuart Armstrong's no-free-lunch theorem.) At some point I intend to write a longer post about this worry.

    Of course, you could just define reward functions over logical formulas and the like, and do something like reward modeling via program induction, but at that point you're no longer using "reward" in the way it's typically understood. (This is similar to the move, made by some Humeans, that reason can only be motivating because we desire to follow reason. That's fair enough, but misses the point of calling certain kinds of motivations "reasons" at all.)
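As a concrete illustration of the point in (2), here is a sketch of a reward defined over whole trajectories via an LTL-like property ("every promise is eventually kept"). The predicate and state encoding are my own illustrative assumptions:

```python
def promise_kept(trajectory):
    """Check G(promised -> F delivered): globally, every 'promised' state is
    eventually followed by (or coincides with) a 'delivered' state."""
    # trajectory: list of dicts of boolean state features.
    for i, state in enumerate(trajectory):
        if state.get("promised") and not any(s.get("delivered") for s in trajectory[i:]):
            return False
    return True

def reward(trajectory):
    # A trajectory-level reward can encode temporal structure that no
    # per-state reward sum naturally captures.
    return 1.0 if promise_kept(trajectory) else 0.0

good = [{"promised": True}, {}, {"delivered": True}]
bad  = [{"promised": True}, {}, {}]
print(reward(good), reward(bad))  # → 1.0 0.0
```

The worry in (2) is not that such rewards are inexpressible, but that a generic deep RL learner has no inductive bias pushing it toward hypotheses of this logical shape.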

(I'd cite deep learning generally, not just deep RL.)

You're right, that's what I meant, and have updated the post accordingly.

If you start with an uninformative prior and no other evidence, it seems like you should be focusing a lot of attention on the paradigm that is most successful / popular. So why is this influence "undue"?

I agree that if you start with a very uninformative prior, focusing on the most recently successful paradigm makes sense. But once you take slightly more information into account, I think there's reason to believe the AI alignment community is currently overly biased towards deep learning:

  1. The trend-following behavior in most scientific & engineering fields, including AI, should make us skeptical that currently popular approaches are popular for the right reasons. In the 80s everyone was really excited about expert systems and the 5th generation project. About 10 years ago, Bayesian non-parametrics were really popular. Now deep learning is popular. Knowing this history suggests that we should be a little more careful about joining the bandwagon. Unfortunately, a lot of us joining the field now don't really know this history, nor are we necessarily exposed to the richness and breadth of older approaches before diving headfirst into deep learning (I only recognized this after starting my PhD, when I began learning more about symbolic AI planning and programming languages research).
  2. We have extra reason to be cautious about deep learning being popular for the wrong reasons, given that many AI researchers say that we should be focusing less on machine learning while at the same time publishing heavily in machine learning. For example, at the AAAI 2019 informal debate, the majority of audience members voted against the proposition that "The AI community today should continue to focus mostly on ML methods". At some point during the debate, it was noted that despite the opposition to ML, most papers at AAAI that year were about ML, and it was suggested, to some laughter, that people were publishing in ML simply because that's what would get them published.
  3. The diversity of expert opinion about whether deep learning will get us to AGI doesn't feel adequately reflected in the current AI alignment community. Not everyone thinks the Bitter Lesson is quite the lesson we have to learn. A lot of prominent researchers like Stuart Russell, Gary Marcus, and Josh Tenenbaum think that we need to re-invigorate symbolic and Bayesian approaches (perhaps through hybrid neuro-symbolic methods), and if you watch the 2019 Turing Award keynotes by both Hinton and Bengio, both of them emphasize the importance of having structured generative models of the world (they just happen to think it can be achieved by building the right inductive biases into neural networks). In contrast, outside of MIRI, it feels like a lot of the alignment community anchors towards the work that's coming out of OpenAI and DeepMind.

My own view is that the success of deep learning should be taken in perspective. It's good for certain things, and certain high-data training regimes, and will remain good for those use cases. But in a lot of other use cases, where we might care a lot about sample efficiency and rapid + robust generalizability, most of the recent progress has, in my view, been made by cleverly integrating symbolic approaches with neural networks (even AlphaGo can be seen as a version of this, if one views MCTS as symbolic).  I expect future AI advances to occur in a similar vein, and for me that lowers the relevance of ensuring that end-to-end DL approaches are safe and robust.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

Thanks for this summary. Just a few things I would change:

  1. "Deep learning" instead of "deep reinforcement learning" at the end of the 1st paragraph -- this is what I meant to say, and I'll update the original post accordingly.
  2. I'd replace "nice" with "right" in the 2nd paragraph.
  3. "certain interpretations of Confucian philosophy" instead of "Confucian philosophy", "the dominant approach in Western philosophy" instead of "Western philosophy" -- I think it's important not to give the impression that either of these is a monolith.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

Thanks for these thoughts! I'll respond to your disagreement with the framework here, and to the specific comments in a separate reply.

First, with respect to my view about the sources of AI risk, the characterization you've put forth isn't quite accurate (though it's a fair guess, since I wasn't very explicit about it). In particular:

  1. These days I'm actually more worried by structural risks and multi-multi alignment risks, which may be better addressed by AI governance than technical research per se. If we do reach super-intelligence, I think it's more likely to be along the lines of CAIS than the kind of agential super-intelligence pictured by Bostrom. That said, I still think that technical AI alignment is important to get right, even in a CAIS-style future, hence this talk -- I see it as necessary, but not sufficient.
  2. I don't think that powerful AI systems will necessarily be optimizing for anything (at least not in the agential sense suggested by Superintelligence). In fact, I think we should actively avoid building globally optimizing agents if possible --- I think optimization is the wrong framework for thinking about "rationality" or "human values", especially in multi-agent contexts. I think it's still non-trivially likely that we'll end up building AGI that's optimizing in some way, just because that's the understanding of "rationality" or "solving a task" that's so predominant within AI research. But in my view, that's precisely the problem, and my argument for philosophical pluralism is in part because it offers theories of rationality, value, and normativity that aren't about "maximizing the good".
  3. Regarding "the good", the primary worry I was trying to raise in this talk has less to do with "ethical error", which can arise due to e.g. Goodhart's curse, and more to do with meta-ethical and meta-normative error, i.e., that the formal concepts and frameworks that the AI alignment community has typically used to understand fuzzy terms like "value", "rationality" and "normativity" might be off-the-mark.

    For me, this sort of error is importantly different from the kind of error considered by inner and outer alignment. It's often implicit in the mathematical foundations of decision theory and ML theory itself, and tends to go un-noticed. For example, once we define rationality as "maximize expected future reward", or assume that human behavior reflects reward-rational implicit choice, we're already making substantive commitments about the nature of "value" and "rationality" that preclude other plausible characterizations of these concepts, some of which I've highlighted in the talk. Of course, there has been plenty of discussion about whether these formalisms are in fact the right ones -- and I think MIRI-style research has been especially valuable for clarifying our concepts of "agency" and "epistemic rationality" -- but I've yet to see some of these alternative conceptions of "value" and "practical rationality" discussed heavily in AI alignment spaces. 

Second, with respect to your characterization of AI development and AI risk, I believe that points 1 and 2 above suggest that our views don't actually diverge that much. My worry is that the difficulty of building machines that "follow common sense" is on the same order of magnitude as "defining the good", and just as beset by the meta-ethical and meta-normative worries I've raised above. After all, "common sense" is going to include "common social sense" and "common moral sense", and this kind of knowledge is irreducibly normative. (In fact, I think there's good reason to think that all knowledge and inquiry is irreducibly normative, but that's a stronger and more contentious claim.)

Furthermore, given that AI is already deployed in social domains which tend to have open scope (personal assistants, collaborative and caretaking robots, legal AI, etc.), I think there's a non-trivial possibility that we'll end up with powerful misaligned AI applied in those contexts, AI that either violates its intended scope or requires wide scope to function well (e.g., personal assistants). No doubt, "follow common sense" is a lower bar than "solve moral philosophy", but on the view that philosophy is just common sense applied to itself, solving "common sense" is already most of the problem. For that reason, I think it deserves a plurality of disciplinary* and philosophical perspectives as well.

(*On this note, I think cognitive science has a lot to offer with regard to understanding "common sense". Perhaps I am overly partial given that I am in a computational cognitive science lab, but it does feel like there's insufficient awareness or discussion of cognitive scientific research within AI alignment spaces, despite its [IMO clearcut] relevance.)

Hierarchical planning: context agents

In exchange for the mess, we get a lot closer to the structure of what humans think when they imagine the goal of "doing good." Humans strive towards such abstract goals by having a vague notion of what it would look and feel like, and by breaking down those goals into more concrete sub-tasks. This encodes a pattern of preferences over universe-histories that treats some temporally extended patterns as "states."

Thank you for writing this post! I've had very similar thoughts for the past year or so, and I think the quote above is exactly right. IMO, part of the alignment problem involves representational alignment -- i.e., ensuring that AI systems accurately model both the abstract concepts we use to understand the world, as well as the abstract tasks, goals, and "reasons for acting" that humans take as instrumental or final ends. Perhaps you're already familiar with Bratman's work on Intentions, Plans, & Practical Reason, but to the extent that "intentions" feature heavily in human mental life as the reasons we cite for why we do things, developing AI models of human intention feels very important.

As it happens, one of the next research projects I'll be embarking on is modeling humans as hierarchical planners (most likely in the vein of Hierarchical Task & Motion Planning in the Now by Kaelbling & Lozano-Perez) in order to do Bayesian inference over their goals and sub-goals -- would be happy to chat more about it if you'd like! 
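For a flavor of what Bayesian inference over goals looks like (in a far simpler, non-hierarchical setting than the project described above), here is an illustrative sketch. The gridworld, the goal locations, and the Boltzmann-rationality likelihood are my own toy assumptions:

```python
import math

# Two candidate goals on a grid, with a uniform prior over them.
goals = {"coffee": (4, 0), "desk": (0, 4)}
prior = {g: 0.5 for g in goals}

def step_likelihood(pos, action, goal, beta=2.0):
    """Boltzmann-rational agent: moves that reduce distance to the goal are
    exponentially more likely (softmax over the two available moves)."""
    moves = {"right": (pos[0] + 1, pos[1]), "up": (pos[0], pos[1] + 1)}
    def dist(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    scores = {a: math.exp(-beta * dist(p)) for a, p in moves.items()}
    return scores[action] / sum(scores.values())

def infer(actions):
    """Exact posterior over goals given a sequence of observed moves."""
    pos, post = (0, 0), dict(prior)
    for a in actions:
        for g, target in goals.items():
            post[g] *= step_likelihood(pos, a, target)
        pos = (pos[0] + 1, pos[1]) if a == "right" else (pos[0], pos[1] + 1)
    z = sum(post.values())
    return {g: p / z for g, p in post.items()}

posterior = infer(["right", "right"])
print(posterior["coffee"] > posterior["desk"])  # moving right favors "coffee"
```

The hierarchical version replaces this flat goal space with inference over goals *and* the sub-goal decompositions a planner would produce for them, which is where the interesting modeling challenges show up.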
