Recent Discussion

A safety-capabilities tradeoff is when you have something like a dial on your AGI, and one end of the dial says “more safe but less capable”, and the other end of the dial says “less safe but more capable”.

Make no mistake: safety-capabilities tradeoff dials stink. But I want to argue that they are inevitable, and we better get used to them. I will argue that the discussion should be framed as “Just how problematic is this dial? How do we minimize its negative impact?”, not “This particular approach has a dial, so it’s automatically doomed. Let’s throw it out and talk about something else instead.”

(Recent examples of the latter attitude, at least arguably: here, here.)

1. Background (if it’s not obvious): why do safety-capabilities tradeoff dials stink?

The biggest...

Just posted an analysis of the epistemic strategies used in this post, which helps make the reasoning more explicit IMO.

Introduction: Epistemic Strategies Redux

This post examines the epistemic strategies of Steve Byrnes’ Safety-capabilities tradeoff dials are inevitable in AGI.

(If you want to skim this post, just read the Summary subsection, which displays the epistemic strategy as a design pattern.)

I introduced epistemic strategies in a recent post, but didn’t define them except as the “ways of producing” knowledge that are used in a piece of research. If we consider a post or paper as a computer program outputting (producing) knowledge about alignment, epistemic strategies are the underlying algorithm or, even more abstractly, the design patterns.

An example of an epistemic strategy, common in the natural sciences (and beyond), is

  • Look at the data
  • Find a good explanation
  • Predict new things with that explanation
  • Get new data for checking your prediction

More than just laying out some...

Biological neural networks (i.e. brains) and artificial neural networks have sufficient commonalities that it's often reasonable to treat our knowledge about one as a good starting point for reasoning about the other. So one way to predict how the field of neural network interpretability will develop is by looking at how neuroscience interprets the workings of human brains. I think there are several interesting things to be learned from this, but the one I'll focus on in this post is the concept of modularity: the fact that different parts of the brain carry out different functions. Neuroscientists have mapped many different skills (such as language use, memory consolidation, and emotional responses) to specific brain regions. Note that this doesn’t always give us much direct insight into how...

Jaime Sevilla (4 points, 1d): Relevant related work: NNs are surprisingly modular. On the topic of pruning neural networks, see the lottery ticket hypothesis.
jsd (3 points, 18h): I believe Richard linked to Clusterability in Neural Networks, which has superseded Pruned Neural Networks are Surprisingly Modular. The same authors also recently published Detecting Modularity in Deep Neural Networks.
johnswentworth (10 points, 1d): Why would that be our default expectation? We don't have direct access to all of the underlying parameters in the brain. We can't even simulate it yet, let alone take a gradient.
Richard Ngo (4 points, 16h): Lots of reasons. Neural networks are modelled after brains. They both form distributed representations at very large scales, they both learn over time, etc etc. Sure, you've pointed out a few differences, but the similarities are so great that this should be the main anchor for our expectations (rather than, say, thinking that we'll understand NNs the same way we understand support vector machines, or the same way we understand tree search algorithms, or...).
johnswentworth (3 points, 13h): Perhaps a good way to summarize all this is something like "qualitatively similar models probably work well for brains and neural networks". I agree to a large extent with that claim (though there was a time when I would have agreed much less), and I think that's the main thing you need for the rest of the post. "Ways we understand" comes across as more general than that - e.g. we understand via experimentally probing physical neurons vs spectral clustering of a derivative matrix.
Alex Turner (4 points, 14h): I'm not convinced that these similarities are great enough to merit such anchoring. Just because NNs have more in common with brains than with SVMs, does not imply that we will understand NNs in roughly the same ways that we understand biological brains. We could understand them in a different set of ways than we understand biological brains, and differently than we understand SVMs. Rather than arguing over reference class, it seems like it would make more sense to note the specific ways in which NNs are similar to brains, and what hints those specific similarities provide.
Daniel_Eth (3 points, 17h): The statement seems almost tautological – couldn't we somewhat similarly claim that we'll understand NNs in roughly the same ways that we understand houses, except where we have reasons to think otherwise? The "except where we have reasons to think otherwise" bit seems to be doing a lot of work.
Richard Ngo (2 points, 16h): Compare: when trying to predict events, you should use their base rate except when you have specific updates to it. Similarly, I claim, our beliefs about brains should be the main reference for our beliefs about neural networks, which we can then update from. I agree that the phrasing could be better; any suggestions?
johnswentworth (2 points, 13h): I actually think you could just drop that intro altogether, or move it later into the post. We do have pretty good evidence of modularity in the brain (as well as other biological systems) and in trained neural nets; it seems to be a pretty common property of large systems "evolved" by local optimization. And the rest of the post (as well as some of the other comments) does a good job of talking about some of that evidence. It's a good post, and I think the arguments later in the post are stronger than that opening. (On the other hand, if you're opening with it because that was your own main prior, then that makes sense. In that case, maybe note that it was a prior for you, but that the evidence from other directions is strong enough that we don't need to rely much on that prior?)

Thanks, that's helpful. I do think there's a weak version of this which is an important background assumption for the post (e.g. without that assumption I'd need to explain the specific ways in which ANNs and BNNs are similar), so I've now edited the opening lines to convey that weak version instead. (I still believe the original version but agree that it's not worth defending here.)

Daniel_Eth (1 point, 15h): Yeah, I'm not trying to say that the point is invalid, just that phrasing may give the point more appeal than is warranted from being somewhat in the direction of a deepity. Hmm, I'm not sure what better phrasing would be.

This post contains the abstract and executive summary of a new 96-page paper from authors at the Future of Humanity Institute and OpenAI.


In many contexts, lying – the use of verbal falsehoods to deceive – is harmful. While lying has traditionally been a human affair, AI systems that make sophisticated verbal statements are becoming increasingly prevalent. This raises the question of how we should limit the harm caused by AI “lies” (i.e. falsehoods that are actively selected for). Human truthfulness is governed by social norms and by laws (against defamation, perjury, and fraud). Differences between AI and humans present an opportunity to have more precise standards of truthfulness for AI, and to have these standards rise over time. This could provide significant benefits to public epistemics and...

One way in which this paper (or the things policymakers and CEOs might do if they read it & like it) might be net-negative:

Maybe by default AIs will mostly be trained to say whatever maximizes engagement/clicks/etc., and so they'll say all sorts of stuff and people will quickly learn that a lot of it is bullshit and only fools will place their trust in AI. In the long run, AIs will learn to deceive us, or actually come to believe their own bullshit. But at least we won't trust them.

But if people listen to this paper they might build all sorts of presti...

Wei Dai (18 points, 3d): Thanks for addressing some very important questions, but this part feels too optimistic (or insufficiently pessimistic) to me. If I was writing this paper, I'd add some notes about widespread complaints of left-wing political bias in Wikipedia and academia (you don't mention the latter but surely it counts as a decentralized truth-evaluation body?), and note that open-source software projects and prediction markets are both limited to topics with clear and relatively short feedback cycles from reality / ground truth (e.g., we don't have to wait decades to find out for sure whether some code works or not, prediction markets can't handle questions like "What causes outcome disparities between groups A and B?"). I would note that on questions outside this limited set, we seem to know very little about how to prevent any evaluation bodies, whether decentralized or not, from being politically captured.
owencb (3 points, 2d): Thanks, I think that these are good points and worth mentioning. I particularly like the boundary you're trying to identify between where these decentralized mechanisms have a good track record and where they don't. On that note I think that although academia does have complaints about political bias, at least some disciplines seem to be doing a fairly good job of truth-tracking on complex topics. I'll probably think more about this angle. (I still literally agree with the quoted content, and think that decentralized systems have something going for them which is worth further exploration, but the implicature may be too strong -- in particular the two instances of "might" are doing a lot of work.)
Owain Evans (3 points, 2d): A few points:
1. Political capture is a matter of degree. For a given evaluation mechanism, we can ask what percentage of answers given by the mechanism were false or inaccurate due to bias. My sense is that some mechanisms/resources would score much better than others. I’d be excited for people to do this kind of analysis with the goal of informing the design of evaluation mechanisms for AI. I expect humans would ask AI many questions that don’t depend much on controversial political questions. This would include most questions about the natural sciences, math/CS, and engineering. This would also include “local” questions about particular things (e.g. “Does the doctor I’m seeing have expertise in this particular sub-field?”, “Am I likely to regret renting this particular apartment in a year?”). Unless the evaluation mechanism is extremely biased, it seems unlikely it would give biased answers for these questions. (The analogous question is what percentage of all sentences on Wikipedia are politically controversial.)
2. AI systems have the potential to provide rich epistemic information about their answers. If a human is especially interested in a particular question, they could ask, “Is this controversial? What kind of biases might influence answers (including your own answers)? What’s the best argument on the opposing side? How would you bet on a concrete operationalized version of the question?”. The general point is that humans can interact with the AI to get more nuanced information (compared to Wikipedia or academia). On the other hand: (a) some humans won’t ask for more nuance, (b) AIs may not be smart enough to provide it, (c) the same political bias may influence how the AI provides nuance.
3. Over time, I expect AI will be increasingly involved in the process of evaluating other AI systems. This doesn’t remove human biases. However, it might mean the problem of avoiding capture is somewhat different than with (say) academia and other human institutions
Wei Dai (5 points, 2d): But there's now a question of "what is the AI trying to do?" If the truth-evaluation method is politically biased (even if not "extremely"), then it's very likely no longer "trying to tell the truth". I can imagine two other possibilities:
1. It might be "trying to advance a certain political agenda". In this case I can imagine that it will selectively and unpredictably manipulate answers to especially important questions. For example it might insert backdoors into infrastructure-like software when users ask it coding questions, then tell other users how to take advantage of those backdoors to take power, or damage some important person or group's reputation by subtly manipulating many answers that might influence how others view that person/group, or push people's moral views in a certain direction by subtly manipulating many answers, etc.
2. It might be "trying to tell the truth using a very strange prior or reasoning process", which also seems likely to have unpredictable and dangerous consequences down the line, but harder for me to imagine specific examples as I have little idea what the prior or reasoning process will be.
Do you have another answer to "what is the AI trying to do?", or see other reasons to be less concerned about this than I am?
Isaac Poulton (2 points, 3d): I think this touches on the issue of the definition of "truth". A society designates something to be "true" when the majority of people in that society believe something to be true. Using the techniques outlined in this paper, we could regulate AIs so that they only tell us things we define as "true". At the same time, a 16th century society using these same techniques would end up with an AI that tells them to use leeches to cure their fevers. What is actually being regulated isn't "truthfulness", but "accepted by the majority-ness". This works well for things we're very confident about (mathematical truths, basic observations), but begins to fall apart once we reach even slightly controversial topics. This is exacerbated by the fact that even seemingly simple issues are often actually quite controversial (astrology, flat earth, etc.). This is where the "multiple regulatory bodies" part comes in. If we have a regulatory body that says "X, Y, and Z are true" and the AI passes their test, you know the AI will give you answers in line with that regulatory body's beliefs. There could be regulatory bodies covering the whole spectrum of human beliefs, giving you a precise measure of where any particular AI falls within that spectrum.
Daniel Kokotajlo (3 points, 2d): Would this multiple evaluation/regulatory bodies solution not just lead to the sort of balkanized internet described in this story? I guess multiple internet censorship-and-propaganda-regimes is better than one. But ideally we'd have none. One alternative might be to ban or regulate persuasion tools, i.e. any AI system optimized for an objective/reward function that involves persuading people of things. Especially politicized or controversial things.
Owain Evans (3 points, 2d): Standards for truthful AI could be "opt-in". So humans might (a) choose to opt into truthfulness standards for their AI systems, and (b) choose from multiple competing evaluation bodies. Standards need not be mandated by governments to apply to all systems. (I'm not sure how much of your Balkanized internet is mandated by governments rather than arising from individuals opting into different web stacks). We also discuss having different standards for different applications. For example, you might want stricter and more conservative standards for AI that helps assess nuclear weapon safety than for AI that teaches foreign languages to children or assists philosophers with thought experiments.
Daniel Kokotajlo (3 points, 2d): In my story it's partly the result of individual choice and partly the result of government action, but I think even if governments stay out of it, individual choice will be enough to get us there. There won't be a complete stack for every niche combination of views; instead, the major ideologies will each have their own stack. People who don't agree 100% with any major ideology (which is most people) will have to put up with some amount of propaganda/censorship they don't agree with.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that, while I work at DeepMind, this newsletter represents my personal views and not those of my employer.


Unsolved Problems in ML Safety (Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt) (summarized by Dan Hendrycks): To make the case for safety to the broader machine learning research community, this paper provides a revised and expanded collection of concrete technical safety research problems, namely:

1. Robustness: Create models that are resilient to adversaries, unusual situations, and Black...

Alex Turner (5 points, 2d): Coherence arguments sometimes are enough, depending on what the agent is coherent over.
Rohin Shah (2 points, 1d): That's an assumption :P (And it's also not one that's obviously true, at least according to me.)

What is the extra assumption? If you're making a coherence argument, that already specifies the domain of coherence, no? And so I'm not making any more assumptions than the original coherence argument did (whatever that argument was). I agree that the original coherence argument can fail, though.

Here’s a description of the project Redwood Research is working on at the moment. First I’ll say roughly what we’re doing, and then I’ll try to explain why I think this is a reasonable applied alignment project, and then I’ll talk a bit about the takeaways I’ve had from the project so far.

There are a bunch of parts of this that we’re unsure of and figuring out as we go; I’ll try to highlight our most important confusions as they come up. I’ve mentioned a bunch of kind of in-the-weeds details because I think they add flavor. This is definitely just me describing a work in progress, rather than presenting any results.

Thanks to everyone who’s contributed to the project so far: the full-time Redwood technical team of...

Adam Shimi (1 point, 4d): I've read the intransitive dice page, but I'm confused on how it might apply here? Like concretely, what are the dice in the analogy?
Buck Shlegeris (3 points, 2d): Suppose you have three text-generation policies, and you define "policy X is better than policy Y" as "when a human is given a sample from both policy X and policy Y, they prefer the sample from the former more than half the time". That definition of "better" is intransitive.
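
To make the dice analogy concrete, here is a minimal sketch that reads "policy X is better than policy Y" as "a sample from X is preferred to a sample from Y more than half the time". Everything in it is an illustrative assumption rather than anything from the project: the three policies are stand-ins whose samples receive one of six equally likely quality scores (the classic intransitive-dice construction), and the hypothetical human always prefers the higher-scoring sample.

```python
from fractions import Fraction
from itertools import product

# Hypothetical policies: each list holds the equally likely "quality scores"
# a sample from that policy can receive (made-up numbers, chosen like
# intransitive dice). No score appears in two lists, so ties never occur.
POLICIES = {
    "X": [2, 2, 4, 4, 9, 9],
    "Y": [1, 1, 6, 6, 8, 8],
    "Z": [3, 3, 5, 5, 7, 7],
}


def preference_rate(a: str, b: str) -> Fraction:
    """Exact probability that a sample from policy a is preferred to one from b.

    Assumes the human always prefers the sample with the higher score.
    """
    pairs = list(product(POLICIES[a], POLICIES[b]))
    wins = sum(1 for score_a, score_b in pairs if score_a > score_b)
    return Fraction(wins, len(pairs))


def is_better(a: str, b: str) -> bool:
    """'a is better than b' under the pairwise-majority definition."""
    return preference_rate(a, b) > Fraction(1, 2)


if __name__ == "__main__":
    # Walk around the cycle X -> Y -> Z -> X.
    for a, b in [("X", "Y"), ("Y", "Z"), ("Z", "X")]:
        print(f"{a} preferred over {b}: {preference_rate(a, b)}  better={is_better(a, b)}")
```

Under these assumptions each policy in the cycle is preferred over the next 5/9 of the time, so X is "better" than Y, Y is "better" than Z, and Z is "better" than X: the relation has no consistent ranking, which is what intransitive means here.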

Hum, I see. And is your point that it should not create a problem because you're only doing comparison X vs Y and Z vs Y (where Y is the standard policy and X and Z are two of your conservative policies) but you don't really care about the comparison between X and Z?
