## AI Alignment Forum

ESRogs

Engineer at CoinList.co. Donor to LW 2.0.



-3.  I'm assuming you are already familiar with some basics, and already know what 'orthogonality' and 'instrumental convergence' are and why they're true.

I think this is actually the part that I most "disagree" with. (I put "disagree" in quotes, because there are forms of these theses that I'm persuaded by. However, I'm not so confident that they'll be relevant for the kinds of AIs we'll actually build.)

1. The smart part is not the agent-y part

It seems to me that what's powerful about modern ML systems is their ability to do data compression / pattern recognition. That's where the real cognitive power (to borrow Eliezer's term) comes from. And I think that this is the same as what makes us smart.

GPT-3 does unsupervised learning on text data. Our brains do predictive processing on sensory inputs. My guess (which I'd love to hear arguments against!) is that there's a true and deep analogy between the two, and that they lead to impressive abilities for fundamentally the same reason.

If so, it seems to me that that's where all the juice is. That's where the intelligence comes from. (In the past, I've called this the core smarts of our brains.)

On this view, all the agent-y, planful, System 2 stuff that we do is the analogue of prompt programming. It's a set of not-very-deep, not-especially-complex algorithms meant to cajole the actually smart stuff into doing something useful.

When I try to extrapolate what this means for how AI systems will be built, I imagine a bunch of Drexler-style AI services.

Yes, in some cases people will want to chain services together to form something like an agent, with something like goals. However, the agent part isn't the smart part. It's just some simple algorithms on top of a giant pile of pattern recognition and data compression.

Why is that relevant? Isn't an algorithmically simple superintelligent agent just as scary as (if not more so than) a complex one? In a sense yes, it would still be very scary. But to me it suggests a different intervention point.

If the agency is not inextricably tied to the intelligence, then maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.

Am I just recapitulating the case for Oracle-AI / Tool-AI? Maybe so.

But if agency is not a fundamental part of intelligence, and rather something that can just be added in on top, or not, and if we're at a loss for how to either align a superintelligent agent with CEV or else make it corrigible, then why not try to avoid creating the agent part of a superintelligent agent?

I think that might be easier than many think...

The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.

https://intelligence.org/files/AIPosNegFactor.pdf

Suppose we have (something like) an agent, with (something like) a utility function. I think it's important to keep in mind the domain of the utility function. (I'll be making basically the same point repeatedly throughout the rest of this comment.)

By default, I don't expect systems that we build, with agent-like behavior (even superintelligently smart systems!), to care about all the atoms in the future light cone.

Humans (and other animals) care about atoms. We care about (our sensory perceptions of) macroscopic events, forward in time, because we evolved to. But that is not the default domain of an agent's utility function.

For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.

In particular, I don't think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It's just not modeling the real world at all. It's not even trying to rack up a bunch of wins over time. It's just playing the single platonic game of Go.

Giant caveat (that you may already be shouting into your screen): abstractions are leaky.

The ML system is not actually trained to play the platonic game of Go. It's trained to play the-game-of-Go-as-implemented-on-particular-hardware, or something like minimize-this-loss-function-informed-by-Go-game-results. The difference between the platonic game and the embodied game can lead to clever and unexpected behavior.

However, it seems to me that these kinds of hacks are going to look a lot more like a system short-circuiting than it out-of-nowhere building a model of, and starting to care about, the whole universe.

3. Orthogonality squared

I really liked Eliezer's Arbital article on Epistemic and instrumental efficiency. He writes:

An agent that is "efficient", relative to you, within a domain, is one that never makes a real error that you can systematically predict in advance.

I think this very succinctly captures what would be so scary about being up against a (sufficiently) superintelligent agent with conflicting goals to yours. If you think you see a flaw in its plan, that says more about your seeing than it does about its plan. In other words, you're toast.

But as above, I think it's important to keep in mind what an agent's goals are actually about.

Just as the utility function of an agent is orthogonal from its intelligence, it seems to me that the domain of its utility function is another dimension of potential orthogonality.

If you're playing chess against AlphaZero Chess, you're going to lose. But suppose you're secretly playing "Who has the most pawns after 10 moves?" I think you've got a chance to win! Even though it cares about pawns!

(Of course if you continue playing out the chess game after the 10th move, it'll win at that. But by assumption, that's fine, it's not what you cared about.)

If you and another agent have different goals for the same set of objects, you're going to be in conflict. It's going to be zero sum. But if the stuff you care about is only tangentially related to the stuff it cares about, then the results can be positive sum. You can both win!

In particular, you can both get what you want without either of you turning the other off. (And if you know that, you don't have to preemptively try to turn each other off to prevent being turned off either.)

4. Programs, agents, and real-world agents

Agents are a tiny subset of all programs. And agents whose utility functions are defined over the real world are a tiny subset of all agents.

If we think about all the programs we could potentially write that take in inputs and produce outputs, it will make sense to talk about some of those as agents. These are the programs that seem to be optimizing something. Or seem to have goals and make plans.

But, crucially, all that optimization takes place with respect to some environment. And if the input and output of an agent-y program is hooked up to the wrong environment (or hooked up to the right environment in the wrong way), it'll cease to be agent-y.

For example, if you hook me up to the real world by sticking me in outer space (sans suit), I will cease to be very agent-y. Or, if you hook up the inputs and outputs of AlphaGo to a chess board, it will cease to be formidable (until you retrain it). (In other words, the isAgent() predicate is not a one-place function.)
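The remark about isAgent() can be made concrete with a toy sketch. Everything in it is a hypothetical illustration: the `Program` and `Environment` types, the made-up `competence` scores, and the 0.5 threshold are assumptions chosen only to show a predicate that takes the environment as a second argument.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Environment:
    name: str


@dataclass
class Program:
    name: str
    # Hypothetical competence scores in [0, 1], keyed by environment name.
    # These numbers are illustrative placeholders, not measurements.
    competence: dict = field(default_factory=dict)


def is_agent(program: Program, env: Environment, threshold: float = 0.5) -> bool:
    """Two-place predicate: is `program` agent-y *with respect to* `env`?"""
    return program.competence.get(env.name, 0.0) >= threshold


go_board = Environment("go")
chess_board = Environment("chess")
alphago = Program("AlphaGo", competence={"go": 0.99})

print(is_agent(alphago, go_board))     # agent-y when hooked up to Go
print(is_agent(alphago, chess_board))  # not agent-y when hooked up to chess
```

The same program gets a different answer depending on which environment it is wired to, which is all the "not a one-place function" claim amounts to here.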

This suggests to me that we could build agent-y, superintelligent systems that are not a threat to us. (Because they are not agent-y with respect to the real world.)

Yes, we're likely to (drastically) oversample from the subset of agents that are agent-y w.r.t. the real world, because we're going to want to build systems that are useful to us.

But if I'm right about the short-circuiting argument above, even our agent-y systems won't have coherent goals defined over events far outside their original domain (e.g. the arrangement of all the atoms in the future light cone) by default.

So even if our systems are agent-y (w.r.t. some environment), and have some knowledge of and take some actions in the real world, they won't automatically have a utility function defined over the configurations of all atoms.

On the other hand, the more we train them as open-ended agents with wide remit to act in the real world (or a simulation thereof), the more we'll have a (potentially superintelligently lethal) problem on our hands.

To me that suggests that what we need to care about are things like: how open-ended we make our systems, whether we train them via evolution-like competition between agents in a high-def simulation of the real world, and what kind of systems are incentivized to be developed and deployed, society-wide.

5. Conclusion

If I'm right in the above thinking, then orthogonality is more relevant and instrumental convergence is less relevant than it might otherwise appear.

Instrumental convergence would only end up being a concern for agents that care about the same objects / resources / domain that you do. If their utility function is just not about those things, IC will drive them to acquire a totally different set of resources that is not in conflict with your resources (e.g. a positional chess advantage in a go game, or trading for your knight while you try to acquire pawns).

This would mean that we need to be very worried about open-ended real-world agents. But less worried about intelligence in general, or even agents in general.

To be clear, I'm not claiming that it's all roses from here on out. But this reasoning leads me to conclude that the key problems may not be the ones described in the post above.

I think we (mostly) all agree that we want to somehow encode human values into AGIs. That's not a new idea. The devil is in the details.

The problem with this model is, its predictions depend a lot on how you draw the boundary around "field". Take Yudkowsky's example of startups. How do we explain small startups succeed where large companies failed?

I don't quite see how this is a problem for the model. The narrower you draw the boundary, the more jumpy progress will be, right?

Successful startups are big relative to individuals, but not that big relative to the world as a whole. If we're talking about a project / technology / company that can rival the rest of the world in its output, then the relevant scale is trillions of dollars (prob deca-trillions), not billions.

And while the most fantastically successful startups can become billion dollar companies within a few years, nobody has yet made it to a trillion in less than a decade.

EDIT: To clarify, not trying to say that something couldn't grow faster than any previous startup. There could certainly be a 'kink' in the rate of progress, like you describe. I just want to emphasize that:

1. startups are not that jumpy, on the world scale
2. the actual scale of the world matters

A simple model for the discontinuousness of a field might have two parameters — one for the intrinsic lumpiness of available discoveries, and one for total effort going into discovery. And,

• all else equal, more people means smoother progress: if we lived in a trillion person world, AI progress would be more continuous
• it's an open empirical question whether the actual values for these parameters will result in smooth or jumpy takeoff:
    • even if investment in AI is in the deca-trillions and a meaningful fraction of all world output, it could still be that the actual territory of available discoveries is so lumpy that progress is discontinuous
    • but, remember that reality has a surprising amount of detail, which I think tends to push things in a smoother direction: it means there are more fiddly details to work through, even when you have a unique insight or technological advantage
    • or, in other words, even if you have a random draw from a distribution that ends up being an outlier, actual progress in the real world will be the result of many different draws, which will tend to push things more toward the regime of normals
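A minimal sketch of the two-parameter model, under assumptions of my own choosing: discovery sizes are drawn from a Pareto distribution, `tail_alpha` stands in for intrinsic lumpiness (smaller means heavier tails), `num_draws` stands in for total effort, and "jumpiness" is measured as the share of total progress owed to the single largest discovery. None of these modelling choices come from data.

```python
import random


def largest_share(num_draws: int, tail_alpha: float, seed: int = 0) -> float:
    """Fraction of total progress contributed by the single biggest discovery.

    num_draws  -- hypothetical stand-in for total effort going into discovery
    tail_alpha -- hypothetical stand-in for lumpiness (smaller = lumpier)
    """
    rng = random.Random(seed)
    sizes = [rng.paretovariate(tail_alpha) for _ in range(num_draws)]
    return max(sizes) / sum(sizes)


# Same lumpiness, very different amounts of effort.
small_world = largest_share(num_draws=10, tail_alpha=3.0)
big_world = largest_share(num_draws=10_000, tail_alpha=3.0)
print(f"10 draws:     {small_world:.3f}")
print(f"10,000 draws: {big_world:.3f}")
```

With many draws, any one outlier should tend to matter less to the total, which is the "many different draws" point. Lowering `tail_alpha` toward 1 makes single draws dominate for longer, which is the "territory is so lumpy" caveat.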

In my view, the biological anchors and the Very Serious estimates derived therefrom are really useful for the following very narrow yet plausibly impactful purpose

I don't understand why it's not just useful directly. Saying that the numbers are not true upper or lower bounds seems like it's expecting way too much!

They're not even labeled as bounds (at least in the headline). They're supposed to be "anchors".

Suppose you'd never done the analysis to know how much compute a human brain uses, or how much compute all of evolution had used. Wouldn't this report be super useful to you?

Sure, it doesn't directly tell you when TAI is going to come, because there's a separate thing you don't know, which is how compute-efficient our systems are going to be compared to the human brain. And also that translation factor is changing with time. But surely that's another quantity we can have a distribution over.

If there's some quantity that we don't know the value of, but we have at least one way to estimate it using some other uncertain quantities, why is it not useful to reduce our uncertainty about some of those other quantities?

This seems like exactly the kind of thing superforecasters are supposed to do. Or that an Eliezer-informed Bayesian rationalist is supposed to do. Quantify your uncertainty. Don't be afraid to use a probability distribution. Don't throw away relevant information, but instead use it to reduce your uncertainty and update your probabilities.

If Eliezer's point is just that the report shouldn't be taken as the gospel truth of when AI is going to come, then fine. Or if he just wants to highlight that there's still uncertainty over the translation factor between the brain's compute-efficiency and our ML systems' compute-efficiency, then that seems like a good point too.

But I don't really understand the point of the rest of the article. If I wanted to have any idea at all when TAI might come, then Moravec's 1988 calculations regarding the human brain seem super interesting. And also Somebody on the Internet's 2006 calculation of how much compute evolution had used.

Either of them would be wrong to think that their number precisely pins down the date. But if you started out not knowing whether to expect AGI in one year or in 10,000 years, then it seems like learning the human brain number and the all-of-evolution number should radically reduce your uncertainty.

It still doesn't reduce your uncertainty all the way, because we still don't know the compute-efficiency translation factor. But who said it reduced uncertainty all the way? Not OpenPhil.

Claim 4: GPT-N need not be "trying" to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.

I got a bit confused by this section, I think because the word "model" is being used in two different ways, neither of which is in the sense of "machine learning model".

Paraphrasing what I think is being said:

• An observer (us) has a model_1 of what GPT-N is doing.
• According to their model_1, GPT-N is building its own world model_2, that it uses to plan its actions.
• The observer's model_1 makes good predictions about GPT-N's behavior when GPT-N (the machine learning model_3) is tested on data that comes from the training distribution, but bad predictions about what GPT-N will do when tested (or used) on data that does not come from the training distribution.
• The way that the observer's model_1 will be wrong is not that it will be fooled by GPT-N taking a treacherous turn, but rather the opposite -- the observer's model_1 will predict a treacherous turn, but instead GPT-N will go on filling in missing words, as in training (or something else?).

Is that right?

Let , where  and

[...] The second rule says that  is orthogonal to itself

Should that be "is not orthogonal to itself"? I thought the  meant non-orthogonal, so would think  means that  is not orthogonal to itself.

(The transcript accurately reflects what was said in the talk, but I'm asking whether Scott misspoke.)

But once you let it do more computation, then it doesn't have to know anything at all, right? Like, maybe the best go bot is, "Train an AlphaZero-like algorithm for a million years, and then use it to play."

I know more about go than that bot starts out knowing, but less than it will know after it does computation.

I wonder if, when you use the word "know", you mean some kind of distilled, compressed, easily explained knowledge?

This is commonly said on the basis of his $1b pledge

Wasn't it supposed to be a total of $1b pledged, from a variety of sources, including Reid Hoffman and Peter Thiel, rather than $1b just from Musk?

EDIT: yes, it was.

Sam, Greg, Elon, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services (AWS), Infosys, and YC Research are donating to support OpenAI. In total, these funders have committed $1 billion, although we expect to only spend a tiny fraction of this in the next few years.

https://openai.com/blog/introducing-openai/

For those organizations that do choose to compete... I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did

...

It's unlikely for there to exist both aligned and misaligned AI systems at the same time

If the first group sunk some cost into aligning their system, but that wasn't integral to its everyday task performance, wouldn't a second competing group be somewhat likely to skimp on the alignment part?

It seems like this calls into question the claim that we wouldn't get a mix of aligned and misaligned systems.

Do you expect it to be difficult to disentangle the alignment from the training, such that the path of least resistance for the second group will necessarily include doing a similar amount of alignment?