Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

# HIGHLIGHTS

Ben Garfinkel on scrutinising classic AI risk arguments (Howie Lempel and Ben Garfinkel) (summarized by Asya): In this podcast, Ben Garfinkel goes through several reasons why he is skeptical of classic AI risk arguments (some previously discussed here (AN #45)). The podcast has considerably more detail and nuance than this summary.

Ben thinks that historically, it has been hard to affect transformative technologies in a way that was foreseeably good for the long term. It's hard to see, for example, what you could have done around the development of agriculture or industrialization that would have an impact on the world today. He thinks some potential avenues for long-term influence could be through addressing increased political instability or the possibility of lock-in, though he thinks that it’s unclear what we could do today to influence the outcome of a lock-in, especially if it’s far away.

In terms of alignment, Ben focuses on the standard set of arguments outlined in Nick Bostrom’s Superintelligence, because they are broadly influential and relatively fleshed out. Ben has several objections to these arguments:

- He thinks it isn't likely that there will be a sudden jump to extremely powerful and dangerous AI systems, and he thinks we have a much better chance of correcting problems as they come up if capabilities grow gradually.

- He thinks that making AI systems capable and making AI systems have the right goals are likely to go together.

- He thinks that just because there are many ways to create a system that behaves destructively doesn't mean that the engineering process creating that system is likely to be attracted to those destructive systems; it seems like we are unlikely to accidentally create systems that are destructive enough to end humanity.

Ben also spends a little time discussing mesa-optimization (AN #58), a much newer argument for AI risk. He largely thinks that the case for mesa-optimization hasn’t yet been fleshed out sufficiently. He also thinks it’s plausible that learning incorrect goals may be a result of having systems that are insufficiently sophisticated to represent goals appropriately. With sufficient training, we may in fact converge to the system we want.

Given the current state of argumentation, Ben thinks that it's worth EA time to flesh out newer arguments around AI risk, but also thinks that EAs who don't have a comparative advantage in AI-related topics shouldn't necessarily switch into AI. Ben thinks it's a moral outrage that we have spent less money on AI safety and governance than the 2017 movie 'The Boss Baby', starring Alec Baldwin.

Asya's opinion: This podcast covers a really impressive breadth of the existing argumentation. A lot of the reasoning is similar to what I’ve heard from other researchers (AN #94). I’m really glad that Ben and others are spending time critiquing these arguments; in addition to showing us where we’re wrong, it helps us steer towards more plausible risky scenarios.

I largely agree with Ben’s criticisms of the Bostrom AI model; I think mesa-optimization is the best current case for AI risk and am excited to see more work on it. The parts of the podcast where I most disagreed with Ben were:

- Even in the absence of solid argumentation, I feel good about a prior where AI has a non-trivial chance of being existentially threatening, partially because I think it’s reasonable to put AI in the reference class of ‘new intelligent species’ in addition to ‘new technology’.

- I’m not sure that institutions will address failures sufficiently, even if progress is gradual and there are warnings (AN #104).

Rohin's opinion: I recommend listening to the full podcast, as it contains a lot of detail that wouldn't fit in this summary. Overall I agree pretty strongly with Ben. I do think that some of the counterarguments are coming from a different frame than the classic arguments. For example, a lot of the counterarguments involve an attempt to generalize from current ML practice to make claims about future AI systems. However, I usually imagine that the classic arguments are basically ignoring current ML, and instead claiming that if an AI system is superintelligent, then it must be goal-directed and have convergent instrumental subgoals. If current ML systems don't lead to goal-directed behavior, I expect that proponents of the classic arguments would say that they also won't lead to superintelligent AI systems. I'm not particularly sold on this intuition either, but I can see its appeal.

# TECHNICAL AI ALIGNMENT

## ITERATED AMPLIFICATION

AI safety via market making (Evan Hubinger) (summarized by Rohin): If you have an expert, but don’t trust them to give you truthful information, how can you incentivize them to tell you the truth anyway? One option is to pay them every time they provide evidence that changes your mind, with the hope that only once you believe the truth will there be no evidence that can change your mind. This post proposes a similar scheme for AI alignment.

We train two models, M and Adv. Given a question Q, M is trained to predict what answer to Q the human will give at the end of the procedure. Adv on the other hand is trained to produce arguments that will most make M “change its mind”, i.e. output a substantially different distribution over answers than it previously outputted. M can then make a new prediction. This is repeated T times, and eventually the human is given all T outputs produced by Adv, and provides their final answer (which is used to provide a gradient signal for M). After training, we throw away Adv and simply use M as our question-answering system.

One way to think about this is that M is trained to provide a prediction market on “what the human will answer”, and Adv is trained to manipulate the market by providing new arguments that would change what the human says. So, once you see M providing a stable result, that should mean that the result is robust to any argument that Adv could provide, and so it is what the human would say after seeing all the arguments.
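The loop described above can be sketched in a few lines. This is only a toy illustration, not the procedure from the post: nothing is trained here, `toy_m` is a fixed hypothetical stand-in for M that returns a prediction in percentage points, and Adv is replaced by a greedy search over a hypothetical pool of canned arguments. All names (`adv_step`, `run_market`, `toy_m`, `WEIGHTS`) are assumptions made for this example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for M: maps the arguments shown so far to a prediction
# (in percentage points) that the human's final answer will be "yes".
Predictor = Callable[[Sequence[str]], int]

# Toy assumption: each canned argument shifts the predicted answer by a fixed amount.
WEIGHTS = {"a": 30, "b": -10, "c": 0}


def toy_m(shown: Sequence[str]) -> int:
    """Fixed toy predictor; the real M would be a trained model."""
    return max(0, min(100, 50 + sum(WEIGHTS[arg] for arg in shown)))


def adv_step(m: Predictor, shown: List[str], pool: Sequence[str]) -> Tuple[Optional[str], int]:
    """Greedy stand-in for Adv: pick the unused argument that moves M's prediction most."""
    before = m(shown)
    best_arg, best_shift = None, -1
    for arg in pool:
        if arg in shown:
            continue
        shift = abs(m(shown + [arg]) - before)
        if shift > best_shift:
            best_arg, best_shift = arg, shift
    return best_arg, best_shift


def run_market(m: Predictor, pool: Sequence[str], T: int) -> Tuple[List[str], List[int]]:
    """Run T rounds: Adv adds an argument, then M makes a new prediction."""
    shown: List[str] = []
    history = [m(shown)]
    for _ in range(T):
        arg, _shift = adv_step(m, shown, pool)
        if arg is None:  # Adv has run out of arguments
            break
        shown.append(arg)
        history.append(m(shown))
    return shown, history


# The human would now see everything in `shown` and give a final answer,
# which in the real scheme supplies the training signal for M.
shown, history = run_market(toy_m, ["c", "b", "a"], T=3)
print(shown, history)
```

In this toy run the prediction stops moving once the remaining argument carries no weight, which is the "stable market" condition the summary describes. Whether a trained M and Adv would actually reach such a point is exactly what the opinion below questions.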

This scheme bears some resemblance to debate (AN #5), and it can benefit from schemes that help debate, most notably cross-examination (AN #86). In particular, at every step Adv can cross-examine the previous incarnation of Adv. If the previous incarnation was deceptive, the current incarnation can demonstrate this to the human, which should cause them to disregard the previous argument. We can also add oversight, where an overseer with access to the model ensures that the model does not become non-myopic or deceptive.

Rohin's opinion: I like the simplicity of the idea "find the point at which the human no longer changes their mind", and that this is a new idea of how we can scale training of AI systems beyond human-level performance. However, I’m not convinced that the training procedure given in this post would end up at this equilibrium, unless the human very specifically guided the training to do so (an assumption I don’t think we can usually make). It seems that if we were to reach the state where M stably reported the true answer to the question, then Adv would never get any reward -- but Adv could do better by randomizing what arguments it makes, so that M cannot know which arguments the human will be exposed to and so can’t stably predict the human’s final answer. See more details in this thread.

AI Unsafety via Non-Zero-Sum Debate (Vojtech Kovarik) (summarized by Rohin): This post points out that debate (AN #5) relies crucially on creating a zero-sum game in order to ensure that the debaters point out flaws in each other’s arguments. For example, if you modified debate so that both agents are penalized for an inconclusive debate, then an agent may decide not to point out a flaw in an argument if it believes that it has some chance of confusing the judge.

Tradeoffs between desirable properties for baseline choices in impact measures (Victoria Krakovna) (summarized by Flo): Impact measures (AN #10) usually require a baseline state, relative to which we define impact. The choice of this baseline has important effects on the impact measure's properties: for example, the popular stepwise inaction baseline (where at every step the effect of the current action is compared to doing nothing) does not generate incentives to interfere with environment processes or to offset the effects of its own actions. However, it ignores delayed impacts and lacks incentive to offset unwanted delayed effects once they are set in motion.

This points to a tradeoff between penalizing delayed effects (which is always desirable) and avoiding offsetting incentives, which is desirable if the effect to be offset is part of the objective and undesirable if it is not. We can circumvent the tradeoff by modifying the task reward: If the agent is only rewarded in states where the task remains solved, incentives to offset effects that contribute to solving the task are weakened. In that case, the initial inaction baseline (which compares the current state with the state that would have occurred if the agent had done nothing until now) deals better with delayed effects and correctly incentivizes offsetting for effects that are irrelevant for the task, while the incentives for offsetting task-relevant effects are balanced out by the task reward. If modifying the task reward is infeasible, similar properties can be achieved in the case of sparse rewards by using the inaction baseline, and resetting its initial state to the current state whenever a reward is achieved. To make the impact measure defined via the time-dependent initial inaction baseline Markovian, we could sample a single baseline state from the inaction rollout or compute a single penalty at the start of the episode, comparing the inaction rollout to a rollout of the agent policy.
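To make the difference between the two baselines concrete, here is a minimal sketch under strong simplifying assumptions. The environment (a fuse that breaks a vase two steps after being lit), the deviation measure (which only compares whether the vase is intact), and all names are hypothetical and chosen for illustration; they are not taken from the post, and richer deviation measures would behave differently.

```python
NOOP, LIGHT = "noop", "light"
INITIAL_STATE = (None, True)  # (fuse_timer, vase_intact)


def env_step(state, action):
    """Toy dynamics: a lit fuse counts down and breaks the vase when it reaches zero."""
    timer, vase = state
    if timer is not None:
        timer -= 1
        if timer == 0:
            vase, timer = False, None
    if action == LIGHT and timer is None and vase:
        timer = 2  # delayed effect: the vase breaks two steps later
    return (timer, vase)


def deviation(state, baseline_state):
    """Assumed deviation measure: 1 if the vase differs from the baseline, else 0."""
    return int(state[1] != baseline_state[1])


def penalties(actions):
    """Per-step penalties under the stepwise and initial inaction baselines."""
    state = inaction_state = INITIAL_STATE
    stepwise, initial = [], []
    for action in actions:
        stepwise_baseline = env_step(state, NOOP)        # do nothing for this one step
        state = env_step(state, action)                  # what the agent actually did
        inaction_state = env_step(inaction_state, NOOP)  # did nothing since the start
        stepwise.append(deviation(state, stepwise_baseline))
        initial.append(deviation(state, inaction_state))
    return stepwise, initial


stepwise, initial = penalties([LIGHT, NOOP, NOOP, NOOP])
print("stepwise inaction:", stepwise)
print("initial inaction: ", initial)
```

With this deviation measure, the stepwise baseline never penalizes lighting the fuse: at the step where the fuse is lit the vase is still intact, and at the step where the vase breaks it would also have broken had the agent done nothing for that step. The initial inaction baseline does register the broken vase, because its comparison state never had a lit fuse.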

Flo's opinion: I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.

Dynamic inconsistency of the inaction and initial state baseline (Stuart Armstrong) (summarized by Rohin): In a fixed, stationary environment, we would like our agents to be time-consistent: that is, they should not have a positive incentive to restrict their future choices. However, impact measures like AUP (AN #25) calculate impact by looking at what the agent could have done otherwise. As a result, the agent has an incentive to change what this counterfactual is, in order to reduce the penalty it receives, and it might accomplish this by restricting its future choices. This is demonstrated concretely with a gridworld example.

Rohin's opinion: It’s worth noting that measures like AUP do create a Markovian reward function, which typically leads to time consistent agents. The reason that this doesn’t apply here is that we’re assuming that the restriction of future choices is “external” to the environment and formalism, but nonetheless affects the penalty. If we instead have this restriction “inside” the environment, then we will need to include a state variable specifying whether the action set is restricted or not. In that case, the impact measure would create a reward function that depends on that state variable. So another way of stating the problem is that if you add the ability to restrict future actions to the environment, then the impact penalty leads to a reward function that depends on whether the action set is restricted, which intuitively we don’t want. (This point is also made in this followup post.)

## MISCELLANEOUS (ALIGNMENT)

Arguments against myopic training (Richard Ngo) (summarized by Rohin): Several (AN #34) proposals (AN #102) in AI alignment involve some form of myopic training, in which an AI system is trained to take actions that only maximize the feedback signal in the next timestep (rather than e.g. across an episode, or across all time, as with typical reward signals). In order for this to work, the feedback signal needs to take into account the future consequences of the AI system’s action, in order to incentivize good behavior, and so providing feedback becomes more challenging.

This post argues that there don’t seem to be any major benefits of myopic training, and so it is not worth the cost we pay in having to provide more challenging feedback. In particular, myopic training does not necessarily lead to “myopic cognition”, in which the agent doesn’t think about long-term consequences when choosing an action. To see this, consider the case where we know the ideal reward function R*. In that case, the best feedback to give for myopic training is the optimal Q-function Q*. However, regardless of whether we do regular training with R* or myopic training with Q*, the agent would do well if it estimates Q* in order to select the right action to take, which in turn will likely require reasoning about long-term consequences of its actions. So there doesn’t seem to be a strong reason to expect myopic training to lead to myopic cognition, if we give feedback that depends on (our predictions of) long-term consequences. In fact, for any approval feedback we may give, there is an equivalent reward feedback that would incentivize the same optimal policy.
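The relationship between R*, Q*, and myopic action selection can be illustrated on a tiny example. The sketch below uses a hypothetical three-state chain (the dynamics, rewards, and discount are assumptions made up for this illustration, not anything from the post): it computes Q* by value iteration, then compares an agent that myopically maximizes Q*, an agent that myopically maximizes only the immediate reward R*, and the best policy found by brute-force search over long-run return.

```python
import itertools

GAMMA = 0.9
STATES = [0, 1, 2]  # state 3 is terminal
ACTIONS = ["stay", "go"]


def step(state, action):
    """Toy deterministic dynamics: returns (next_state, reward under R*)."""
    if action == "stay":
        return state, 0.5          # small immediate reward, no progress
    next_state = state + 1
    return next_state, (10.0 if next_state == 3 else 0.0)


def optimal_q(iterations=200):
    """Value iteration for Q* on the toy chain."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(iterations):
        new_q = {}
        for s in STATES:
            for a in ACTIONS:
                s2, r = step(s, a)
                future = 0.0 if s2 == 3 else max(q[(s2, b)] for b in ACTIONS)
                new_q[(s, a)] = r + GAMMA * future
        q = new_q
    return q


def rollout_return(policy, start=0, horizon=200):
    """Discounted return of a deterministic policy, truncated at `horizon` steps."""
    total, discount, s = 0.0, 1.0, start
    for _ in range(horizon):
        if s == 3:
            break
        s, r = step(s, policy[s])
        total += discount * r
        discount *= GAMMA
    return total


q_star = optimal_q()
# Myopic agent given Q* as feedback: maximizes the feedback for the current step only.
myopic_on_q = {s: max(ACTIONS, key=lambda a: q_star[(s, a)]) for s in STATES}
# Myopic agent given only the immediate reward R* as feedback.
myopic_on_r = {s: max(ACTIONS, key=lambda a: step(s, a)[1]) for s in STATES}
# "Regular" optimum: best deterministic policy by long-run return from state 0.
all_policies = [dict(zip(STATES, acts)) for acts in itertools.product(ACTIONS, repeat=len(STATES))]
best_policy = max(all_policies, key=rollout_return)

print(myopic_on_q, myopic_on_r, best_policy)
```

In this toy setting, myopically maximizing Q* reproduces the long-run optimal policy, whereas myopically maximizing the immediate reward does not. That is the sense in which the feedback for myopic training has to encode future consequences, and it says nothing either way about what cognition a learned agent would use to achieve it.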

Another argument for myopic training is that it prevents reward tampering and manipulation of the supervisor. The author doesn’t find this compelling. In the case of reward tampering, it seems that agents would not catastrophically tamper with their reward “by accident”, as tampering is difficult to do, and so they would only do so intentionally, in which case it is important for us to prevent those intentions from arising, for which we shouldn’t expect myopic training to help very much. As for manipulating the supervisor, he argues that under myopic training the supervisor will have to think about the future outputs of the agent in order to be competitive, which could lead to manipulation anyway.

Rohin's opinion: I agree with what I see as the key point of this post: myopic training does not mean that the resulting agent will have myopic cognition. However, I don’t think this means myopic training is useless. According to me, the main benefit of myopic training is that small errors in reward specification for regular RL can incentivize catastrophic outcomes, while small errors in approval feedback for myopic RL are unlikely to incentivize catastrophic outcomes. (This is because “simple” rewards that we specify often lead to convergent instrumental subgoals (AN #107), which need not be the case for approval feedback.) More details in this comment.

A space of proposals for building safe advanced AI (Richard Ngo) (summarized by Rohin): This post identifies six axes on which these previous alignment proposals (AN #102) can be categorized, in the hope that by pushing on particular axes we can generate new proposals. The six axes are:

1. How hard it is for the overseer to give appropriate feedback.

2. To what extent we are trying to approximate a computational structure we know in advance.

3. Whether we are relying on competition between AI agents.

4. To what extent the proposal depends on natural language.

5. To what extent the proposal depends on interpreting the internal workings of neural networks.

6. To what extent the proposal depends on specific environments or datasets.

# AI STRATEGY AND POLICY

Antitrust-Compliant AI Industry Self-Regulation (Cullen O'Keefe) (summarized by Rohin): One way to reduce the risk of unsafe AI systems is to have agreements between corporations that promote risk reduction measures. However, such agreements may run afoul of antitrust laws. This paper suggests that this sort of self-regulation could be done under the “Rule of Reason”, in which a learned profession (such as “AI engineering”) may self-regulate in order to correct a market failure, as long as the effects of such a regulation promote rather than harm competition.

In the case of AI, if AI engineers self-regulate, this could be argued as correcting the information asymmetry between the AI engineers (who know about risks) and the users of the AI system (who don’t). In addition, since AI engineers arguably do not have a monetary incentive, the self-regulation need not be anticompetitive. Thus, this seems like a plausible method by which AI self-regulation could occur without running afoul of antitrust law, and so is worthy of more investigation.

# OTHER PROGRESS IN AI

## DEEP LEARNING

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Dmitry Lepikhin et al) (summarized by Asya): This paper introduces GShard, a module that makes it easy to write parallel computation patterns with minimal changes to existing model code. GShard automatically does a lot of the work of splitting computations across machines, enabling the easy creation of much larger models than before.

The authors use GShard to train a 600 billion parameter multilingual Transformer translation model that's wide, rather than deep (36 layers). They use a "mixture of experts" model where some of the individual feed-forward networks in the Transformer are replaced with a set of feed-forward networks, each one an "expert" in some part of the translation. The experts are distributed across different machines, and the function for sending inputs to experts is learned, with each input being sent to the top two most relevant experts. Since each expert only has to process a fraction of all the inputs, the amount of computation needed is dramatically less than if every input were fed through a single, larger network. This decrease in needed computation comes with a decrease in the amount of weight sharing done by the network.
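As a rough illustration of "each input is sent to the top two most relevant experts", here is a minimal pure-Python sketch of top-2 routing. It is not GShard's implementation: the gate logits are supplied by hand rather than learned, the experts are hypothetical scalar functions rather than feed-forward networks on separate machines, and details such as expert capacity limits and load balancing are omitted. All names are assumptions for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(gate_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]  # weights renormalized to sum to 1


def moe_layer(x, gate_logits, experts):
    """Combine the outputs of only the two selected experts; the rest are never run."""
    return sum(weight * experts[i](x) for i, weight in top2_route(gate_logits))


calls = []  # records which experts actually ran


def make_expert(index, fn):
    def expert(x):
        calls.append(index)
        return fn(x)
    return expert


# Hypothetical experts standing in for feed-forward networks.
experts = [
    make_expert(0, lambda x: x + 1),
    make_expert(1, lambda x: 2 * x),
    make_expert(2, lambda x: x * x),
    make_expert(3, lambda x: -x),
]

routes = top2_route([2.0, 0.0, 1.0, -1.0])
output = moe_layer(3.0, [2.0, 0.0, 1.0, -1.0], experts)
print(routes, output, calls)
```

Only two of the four experts are evaluated for this input, which is where the compute saving comes from: adding more experts increases the parameter count without increasing the work done per input.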

The paper compares the 600 billion parameter model's performance to several other smaller models as well as a 96-layer deep model with only 2.3 billion parameters. For the wide networks, the authors find that in general, larger models do better, but that at some point the larger model starts doing worse for very "low-resource" languages (languages that don't have much training data available). The authors argue that this is because the low-resource languages benefit from "positive language transfer", an effect where weights encode knowledge learned from training on other languages that can then be applied to the low-resource ones. As you increase the number of experts in the wide model past a certain point, the amount of training that each expert does decreases, so there's less positive language transfer to low-resource languages within each expert.

They also find that deeper networks are more sample efficient, reaching better test error with the same amount of training examples, but are less computationally efficient (given current constraints). The 600 billion parameter, 36-layer model takes 22.4 TPU core years and 4 days to train, reaching a score on the BLEU benchmark of 44.3. The 2.3 billion parameter, 96-layer model takes 235 TPU core years and 42 days to train, reaching a score on the BLEU benchmark of 36.9.

Asya's opinion: I spent most of the summary talking about the language model, but I think it's likely that the cooler thing is in fact GShard, as it will enable other very large models to do model parallelization in the future.

The improved efficiency for wide models here seems like it may go away as we become able to train even deeper models that are extremely general and so much more sample efficient than wide models.

This model technically has more parameters than GPT-3, but it’s “sparse” in that not all the inputs are used to update all the parameters. Sometimes people compare the number of parameters in a neural network to the number of synapses in the human brain to guess at when we're likely to get human-level AI. I find using this number directly to be pretty dubious, partially because, as this paper illustrates, the exact architecture of a system has a big influence on the effective power of each parameter, even within the relatively narrow domain of artificial neural networks.

GPT-3 Creative Fiction (Gwern Branwen and GPT-3) (summarized by Rohin): In Gwern's words, this is "creative writing by OpenAI’s GPT-3 model, demonstrating poetry, dialogue, puns, literary parodies, and storytelling".

Rohin's opinion: I often find it's very useful to stare directly at raw data in order to understand how something works, in addition to looking at summary statistics and graphs that present a very high-level view of the data. While this isn't literally raw data (Gwern heavily designed the prompts, and somewhat curated the outputs), I think it provides an important glimpse into how GPT-3 works that you wouldn't really get from reading the paper (AN #102).

# FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

# PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

There are three features to the 'old arguments' in favour of AI safety, which Ben identifies here:

- A discontinuity premise (e.g. “fast takeoff”)
- A premise about the relationship between capabilities and objectives (e.g. “orthogonality thesis”)
- A premise about the portion of systems of a certain kind that are deadly (e.g. “instrumental convergence thesis”)

I argued in a previous post that the 'discontinuity premise' comes from taking too literally a high-level argument that should be used simply to establish that sufficiently capable AI will produce very fast progress:

The old recursive self-improvement argument, by giving a significant condition for fast growth that seems feasible (Human baseline AI), leads naturally to an investigation of what will happen in the course of reaching that fast growth regime. Christiano and other current notions of continuous takeoff are perfectly consistent with the counterfactual claim that, if an already superhuman ‘seed AI’ were dropped into a world empty of other AI, it would undergo recursive self-improvement.
This in itself, in conjunction with other basic philosophical claims like the orthogonality thesis, is sufficient to promote AI alignment to attention. Then, following on from that, we developed different models of how progress will look between now and AGI.

In other words, for AGI to appear and quickly achieve a DSA, we need what Ben calls a sudden emergence of some highly capable AI and then an explosive aftermath caused by the AI undergoing rapid capability gain, but most argument and attention has been on the latter, because it is often not recognised that both are required for a discontinuity. We can add that the reason that this was not recognised is that the claim 'powerful AI will accelerate progress by being able to create still more powerful AI', taken at face value, seems to imply that this will occur in a specific AI at some specific time. In other words, the (true) conclusion of an abstract argument is assumed (incorrectly) to directly apply to the real world.

Having listened to the podcast, I now think that the same error, 'directly applying a (correct) abstract argument (incorrectly) to the real world', also applies to the other two 'old' AI safety arguments. Therefore, we shouldn't say these arguments are incorrect, and they fulfil their purpose of promoting AI risk to our attention, but they have naturally been replaced by more specific arguments as the field has grown.

Ben directly explains this in the case of the orthogonality thesis (the counterintuitive observation that goals and capability are logically independent). The orthogonality thesis does not imply that goals and capability will in fact be independent (the 'process orthogonality thesis'), but it does raise the issue that goals and capability do not necessarily go together, and leads us to ask what happens if they do not.

If we combine the orthogonality thesis with other premises (if progress is sufficiently fast, we won't be in a position to precisely align AGI with our values, as long as it's possible to create unaligned AGI slightly more easily than aligned AGI), then we have a concrete risk. But often further arguments of this sort are not made because the abstract orthogonality thesis is assumed to directly apply to the real world.

And the same applies for 'instrumental convergence' - the observation that most possible goals, especially simple goals, imply a tendency to produce extreme outcomes when ruthlessly maximised:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

We could see this as marking out a potential danger - a large number of possible mind-designs produce very bad outcomes if implemented. The fact that such designs exist 'weakly suggests' (Ben's words) that AGI poses an existential risk since we might build them. If we add in other premises that imply we are likely to (accidentally or deliberately) build such systems, the argument becomes stronger. But usually the classic arguments simply note instrumental convergence and assume we're 'shooting into the dark' in the space of all possible minds, because they take the abstract statement about possible minds to be speaking directly about the physical world. There are specific reasons to think this might occur (e.g. mesa-optimisation, sufficiently fast progress preventing us from course-correcting if there is even a small initial divergence) but those are the reasons that combine with instrumental convergence to produce a concrete risk, and have to be argued for separately.

So to sum up, I think that there are correct forms of the three classic arguments for AI safety, but these are somewhat weaker than usually believed, and that each one of them has often been interpreted as applying too directly to the real development of AI, rather than as raising a possibility that might be actualised, given other assumptions:

1A. Powerful AI can be used to develop better AI (amongst other things). This will lead to runaway growth.
becomes
1B. The first AI that is able to develop better AI will experience explosive growth, gaining a decisive strategic advantage.

2A. Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal, so misaligned AI is possible.
becomes
2B. The process of imbuing a system with capabilities and the process of imbuing a system with goals are orthogonal, so misaligned AI is likely without a major course-correction.

3A. Most or all systems that behave like sufficiently effective maximizers of “simple” utility functions are dangerous, so misaligned AI is possible.
becomes
3B. Most or all systems that behave like sufficiently effective maximizers of “simple” utility functions are dangerous, so we expect such systems to be common in the set of AIs we are likely to build, so any small error will probably produce such a dangerous AI.

There is a weak, abstract version and a strong, empirical version of each of the 3 claims. The strong version of each of the above points may be (partially) true, but only if we accept other premises. The mistake in the classic arguments is in confusing the weak and strong versions of the arguments and so not requiring other premises or evidence.

In this framing, my earlier post on discontinuous progress argued that 1A and 1B are distinct and that 1B requires other assumptions than just 1A to be true. Then, after listening to Ben's podcast, I realized that the same is true for 2A and 2B, and 3A and 3B.

It is possible in principle that the process of increasing AI capability and aligning the AI go together, rendering the orthogonality thesis not applicable to actual AI development, so 2B is false (that is the goal of many safety approaches like CIRL or IDA), but suppose that this is not the case and there is a slight divergence (2B is partly true). If we do have to take active measures to keep aligning AI with our goals, that might not be possible if AI develops too rapidly at some point (1A), and then it may be the case that some of the systems we are likely to build are dangerous (3B).

So to sum up, the correct conclusion from the old arguments should be: powerful AI can be used to make better AI, so progress should be fast eventually; there is no necessary reason for goals and capability to align, especially if progress is very fast; and it is possible in principle for power-seeking behaviour to arise if we don't course-correct. This doesn't establish anything to the high degree of certainty that proponents of these arguments aimed for, but it is sufficient to promote AI safety to attention.

Rohin's opinion: [...] Overall I agree pretty strongly with Ben. I do think that some of the counterarguments are coming from a different frame than the classic arguments. For example, a lot of the counterarguments involve an attempt to generalize from current ML practice to make claims about future AI systems. However, I usually imagine that the classic arguments are basically ignoring current ML, and instead claiming that if an AI system is superintelligent, then it must be goal-directed and have convergent instrumental subgoals.

I agree that the book Superintelligence does not mention any non-goal-directed approaches to AI alignment (as far as I can recall). But as long as we're in the business-as-usual state, we should expect some well-resourced companies to train competitive goal-directed agents that act in the real world, right? (E.g. Facebook plausibly uses some deep RL approach to create the feed that each user sees). Do you agree that for those systems, the classic arguments about instrumental convergence and the treacherous turn are correct? (If so, I don't understand how come you agree pretty strongly with Ben; he seems to be skeptical that those arguments can be mapped to contemporary ML methods.)

The intent with the part you quoted was to point out a way in which I disagree with Ben (though I'm not sure it's a disagreement rather than a difference in emphasis).

we should expect some well-resourced companies to train competitive goal-directed agents that act in the real world [...] Do you agree that for those systems, the classic arguments about instrumental convergence and the treacherous turn are correct? (If so, I don't understand how come you agree pretty strongly with Ben; he seems to be skeptical that those arguments can be mapped to contemporary ML methods.)

I don't expect that companies will train misaligned superintelligent goal-directed agents that act in the real world. (I still think it's likely enough that we should work on the problem.)

Like, why didn't the earlier less intelligent versions of the system fail in some non-catastrophic way, or if they did, why didn't that cause Facebook to think it shouldn't build superintelligent goal-directed agents / figure out how to build aligned agents?

Ben's counterarguments are about whether misaligned superintelligent systems arise in the first place, not whether such systems would have instrumental convergence + treacherous turns:

- He thinks it isn't likely that there will be a sudden jump to extremely powerful and dangerous AI systems, and he thinks we have a much better chance of correcting problems as they come up if capabilities grow gradually.
- He thinks that making AI systems capable and making AI systems have the right goals are likely to go together.
- He thinks that just because there are many ways to create a system that behaves destructively doesn't mean that the engineering process creating that system is likely to be attracted to those destructive systems; it seems like we are unlikely to accidentally create systems that are destructive enough to end humanity.

Thank you for clarifying!

Like, why didn't the earlier less intelligent versions of the system fail in some non-catastrophic way

Even if we assume there will be no algorithm-related discontinuity, I think the following are potential reasons:

1. Detecting deceptive behaviors in complicated environments may be hard. To continue with the Facebook example, suppose that at some point in the future Facebook's feed-creation-agent would behave deceptively in some non-catastrophic way. Suppose it uses some unacceptable technique to increase user engagement (e.g. making users depressed), but it refrains from doing so in situations where it predicts that Facebook engineers would notice. The agent is not that great at being deceptive though, and a lot of times it ends up using the unacceptable technique when there's actually a high risk of the technique being noticed. Thus, Facebook engineers do notice the unacceptable technique at some point and fix the reward function accordingly (penalizing depressing content or whatever). But how will they detect the deceptive behavior itself? Will they be on the lookout for deceptive behavior and use clever techniques to detect it? (If so, what made Facebook transition into a company that takes AI safety seriously?)

2. Huge scale-ups without much intermediate testing. Suppose at some point in the future, Facebook decides to scale up the model and training process of their feed-creation-agent by 100x (by assumption, data is not the bottleneck). It seems to me that this new agent may pose an existential risk even conditioned on the previous agent being completely benign. If you think that Facebook is unlikely to do a 100x scale-up in one go, suppose that their leadership comes to believe that the scale-up would cause their revenue to increase in expectation by 10%. That's ~$7B per year, so they are probably willing to spend a lot of money on the scale-up. Also, they may want to complete the scale-up ASAP because they "lose" $134M for every week of delay.
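For readers who want to check the figures in point 2, here is a minimal sketch of the arithmetic. The ~$7B per year and the 10% increase come from the comment; the 52-week year, the implied baseline revenue, and all names in the code are assumptions made for illustration only.

```python
# Figures from the comment: a 10% expected revenue increase worth ~$7B per year.
ANNUAL_GAIN_USD = 7e9
GAIN_FRACTION = 0.10
WEEKS_PER_YEAR = 52  # assumption: a plain 52-week year


def weekly_cost_of_delay(annual_gain: float, weeks: int = WEEKS_PER_YEAR) -> float:
    """Expected revenue forgone for each week the scale-up is delayed."""
    return annual_gain / weeks


def implied_annual_revenue(annual_gain: float, fraction: float) -> float:
    """Baseline annual revenue implied by 'gain = fraction * revenue'."""
    return annual_gain / fraction


weekly = weekly_cost_of_delay(ANNUAL_GAIN_USD)
baseline = implied_annual_revenue(ANNUAL_GAIN_USD, GAIN_FRACTION)

print(f"Weekly cost of delay: ${weekly / 1e6:.1f}M")
print(f"Implied baseline revenue: ${baseline / 1e9:.0f}B per year")
```

Under these assumptions the weekly figure comes out at roughly $134.6M, in line with the $134M quoted above.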

But how will they detect the deceptive behavior itself? Will they be on the lookout for deceptive behavior and use clever techniques to detect it?

Either they'll be on the lookout, or they'll have some (correct) reason to expect that deception won't happen.

what made Facebook transition into a company that takes AI safety seriously?

AI capabilities have progressed to the point where researchers think it is plausible that AI systems could actually "intentionally" deceive you. And also we have better arguments for risk. And they are being said by more prestigious people.

(I thought I was more optimistic than average on this point, but here it seems like most commenters are more optimistic than I am.)

If you think that Facebook is unlikely to do a 100x scale-up in one go, suppose that their leadership comes to believe that the scale-up would cause their revenue to increase in expectation by 10%.

You'd have to be really confident in this to not do a 10x less costly 10x scale-up first to see what happens there -- I'd be surprised if you could find examples of big companies doing this. OpenAI is perhaps the company that most bets on its beliefs, and it still scaled to 13B parameters before the 175B parameters of GPT-3, even after they had a paper specifically predicting how large language models would scale with more parameters.

Also, I'd be surprised if a 100x scale-up were the difference between "subhuman / can't be deceptive" and "can cause an existential catastrophe".