## AI Alignment Forum

ESRogs

Engineer at CoinList.co. Donor to LW 2.0.



[AN #156]: The scaling hypothesis: a plan for building AGI

Claim 4: GPT-N need not be "trying" to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.

I got a bit confused by this section, I think because the word "model" is being used in two different ways, neither of which is in the sense of "machine learning model".

Paraphrasing what I think is being said:

• An observer (us) has a model_1 of what GPT-N is doing.
• According to their model_1, GPT-N is building its own world model_2, that it uses to plan its actions.
• The observer's model_1 makes good predictions about GPT-N's behavior when GPT-N (the machine learning model_3) is tested on data that comes from the training distribution, but bad predictions about what GPT-N will do when tested (or used) on data that does not come from the training distribution.
• The way that the observer's model_1 will be wrong is not that it will be fooled by GPT-N taking a treacherous turn, but rather the opposite -- the observer's model_1 will predict a treacherous turn, but instead GPT-N will go on filling in missing words, as in training (or something else?).

Is that right?
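For concreteness, here is a toy sketch (entirely my own illustration, nothing from the newsletter) of the point about on- vs. off-distribution behavior: a pure next-word-prediction objective only ever scores the model on training-like inputs, so any model_1 of the system as "trying to predict well" is simply underdetermined off-distribution.

```python
from collections import Counter, defaultdict

# Toy illustration (not GPT-N!): a bigram "language model" trained purely on
# next-word prediction. The training objective only scores it on contexts
# drawn from the training distribution; what it does on inputs outside that
# distribution is unconstrained by the objective.

def train(corpus):
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def predict(counts, word):
    if word not in counts:   # off-distribution input: the objective
        return None          # says nothing about this case
    return counts[word].most_common(1)[0][0]

model = train("the cat sat on the mat".split())
print(predict(model, "the"))   # on-distribution: a follower seen in training
print(predict(model, "dog"))   # off-distribution: None
```

The off-distribution branch is the crux: two systems with identical training loss can diverge arbitrarily there, which is where the "is it *trying* to predict?" question bites.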

Finite Factored Sets

Let , where  and

[...] The second rule says that $X$ is orthogonal to itself

Should that be "is not orthogonal to itself"? I thought the $\not\perp$ meant non-orthogonal, so would think $X \not\perp X$ means that $X$ is not orthogonal to itself.

(The transcript accurately reflects what was said in the talk, but I'm asking whether Scott misspoke.)
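For reference, here is my own gloss using the disjoint-histories definition of orthogonality from the Finite Factored Sets paper (with $h_F$ the history operation), which does suggest self-orthogonality is degenerate:

```latex
X \perp_F Y \iff h_F(X) \cap h_F(Y) = \emptyset
% Taking Y = X:
X \perp_F X \iff h_F(X) = \emptyset
% which holds only for the trivial (one-part) partition, so any
% nontrivial X indeed satisfies X \not\perp_F X.
```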

Challenge: know everything that the best go bot knows about go

But once you let it do more computation, then it doesn't have to know anything at all, right? Like, maybe the best go bot is, "Train an AlphaZero-like algorithm for a million years, and then use it to play."

I know more about go than that bot starts out knowing, but less than it will know after it does computation.

I wonder if, when you use the word "know", you mean some kind of distilled, compressed, easily explained knowledge?
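A minimal sketch of the distinction being drawn (my own toy, with illustrative names; a real AlphaZero loop would replace the placeholder update): a "bot" specified only by its training procedure contains no game knowledge before any computation has run.

```python
import random

# Hypothetical bot defined as "run self-play for N games, then play".
# Its specification is short; whatever it "knows" exists only after the
# computation has actually been performed.

def make_bot(num_selfplay_games: int):
    value = {}  # state -> estimated value; empty at the start: no knowledge yet
    for _ in range(num_selfplay_games):
        # placeholder self-play update, standing in for a real training step
        state = random.randrange(10)
        value[state] = value.get(state, 0.0) + 0.1
    return value

print(len(make_bot(0)))     # 0 entries: the freshly specified bot knows nothing
print(len(make_bot(1000)))  # nonzero: knowledge only appears after computation
```

This is the sense in which one can know more than the bot "starts out knowing" but less than what it knows after the million years of compute.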

2020 AI Alignment Literature Review and Charity Comparison

This is commonly said on the basis of his $1b pledge

Wasn't it supposed to be a total of $1b pledged, from a variety of sources, including Reid Hoffman and Peter Thiel, rather than $1b just from Musk?

EDIT: yes, it was.

Sam, Greg, Elon, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services (AWS), Infosys, and YC Research are donating to support OpenAI. In total, these funders have committed $1 billion, although we expect to only spend a tiny fraction of this in the next few years.

https://openai.com/blog/introducing-openai/

Homogeneity vs. heterogeneity in AI takeoff scenarios

For those organizations that do choose to compete... I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did

...

It's unlikely for there to exist both aligned and misaligned AI systems at the same time

If the first group sunk some cost into aligning their system, but that wasn't integral to its everyday task performance, wouldn't a second competing group be somewhat likely to skimp on the alignment part?

It seems like this calls into question the claim that we wouldn't get a mix of aligned and misaligned systems.

Do you expect it to be difficult to disentangle the alignment from the training, such that the path of least resistance for the second group will necessarily include doing a similar amount of alignment?

Biextensional Equivalence

Are the two frames in section 3.3 (with an agent that thinks about either red or green and walks or stays home, and an environment that's safe or has bears) equivalent? (I would guess so, given the section that example was in, but I didn't see it stated explicitly.)
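The entries of the section-3.3 frames aren't reproduced here, but for finite frames this kind of question is mechanically checkable: take the biextensional collapse (dedupe agent options with identical consequences, and likewise environment options) and compare the collapses up to reordering. A rough sketch, using my own toy encoding of a frame as a matrix of world-outcomes (all names illustrative):

```python
from itertools import permutations

# Toy encoding: a finite Cartesian frame over a world-set is a matrix where
# entry [i][j] is the world resulting from agent option i and environment
# option j. Collapse = remove duplicate rows, then duplicate columns.
# Two frames are taken as equivalent iff their collapses match up to
# row/column reordering (brute-forced; fine for tiny frames).

def collapse(m):
    rows = []
    for r in m:
        if r not in rows:
            rows.append(tuple(r))
    cols = []
    for j in range(len(rows[0])):
        c = tuple(r[j] for r in rows)
        if c not in cols:
            cols.append(c)
    return [tuple(c[i] for c in cols) for i in range(len(rows))]

def equivalent(m1, m2):
    a, b = collapse(m1), collapse(m2)
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return False
    for rp in permutations(range(len(a))):
        for cp in permutations(range(len(a[0]))):
            if all(a[rp[i]][cp[j]] == b[i][j]
                   for i in range(len(b)) for j in range(len(b[0]))):
                return True
    return False

frame_a = [("safe", "bear"),
           ("safe", "bear")]  # second agent option duplicates the first
frame_b = [("safe", "bear")]
print(equivalent(frame_a, frame_b))  # True: the collapses coincide
```

Plugging in the actual red/green × walk/stay matrices from the post would settle the question directly.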

Matt Botvinick on the spontaneous emergence of learning algorithms

I didn't feel like any comment I would have made would have anything more to say than things I've said in the past.

FWIW, I say: don't let that stop you! (Don't be afraid to repeat yourself, especially if there's evidence that the point has not been widely appreciated.)

Developmental Stages of GPTs

Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don't expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don't know what internal optimizers GPT-N ends up building along the way, but I'm not going to count on there being none of them.

Is the distinguishing feature between Oracle AI and approximate Oracle AI, as you use the terms here, just about whether there are inner optimizers or not?

(When I started the paragraph I assumed "approximate Oracle AI" just meant an Oracle AI whose predictions aren't very reliable. Given how the paragraph ends though, I conclude that whether there are inner optimizers is an important part of the distinction you're drawing. But I'm just not sure if it's the whole of the distinction or not.)