## AI ALIGNMENT FORUM

ESRogs

Engineer at CoinList.co. Donor to LW 2.0.

# Posts

Homogeneity vs. heterogeneity in AI takeoff scenarios

For those organizations that do choose to compete... I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did

...

It's unlikely for there to exist both aligned and misaligned AI systems at the same time

If the first group sunk some cost into aligning their system, but that wasn't integral to its everyday task performance, wouldn't a second competing group be somewhat likely to skimp on the alignment part?

It seems like this calls into question the claim that we wouldn't get a mix of aligned and misaligned systems.

Do you expect it to be difficult to disentangle the alignment from the training, such that the path of least resistance for the second group will necessarily include doing a similar amount of alignment?

Biextensional Equivalence

Are the two frames labeled  and  in section 3.3 (with an agent that thinks about either red or green and walks or stays home, and an environment that's safe or has bears) equivalent? (I would guess so, given the section that example was in, but didn't see it stated explicitly.)

Matt Botvinick on the spontaneous emergence of learning algorithms

I didn't feel like any comment I would have made would have anything more to say than things I've said in the past.

FWIW, I say: don't let that stop you! (Don't be afraid to repeat yourself, especially if there's evidence that the point has not been widely appreciated.)

Developmental Stages of GPTs

Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don't expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don't know what internal optimizers GPT-N ends up building along the way, but I'm not going to count on there being none of them.

Is the distinguishing feature between Oracle AI and approximate Oracle AI, as you use the terms here, just about whether there are inner optimizers or not?

(When I started the paragraph I assumed "approximate Oracle AI" just meant an Oracle AI whose predictions aren't very reliable. Given how the paragraph ends though, I conclude that whether there are inner optimizers is an important part of the distinction you're drawing. But I'm just not sure if it's the whole of the distinction or not.)

Developmental Stages of GPTs

As for planning, we've seen the GPTs ascend from planning out the next few words, to planning out the sentence or line, to planning out the paragraph or stanza. Planning out a whole text interaction is well within the scope I could imagine for the next few iterations, and from there you have the capability of manipulation without external malicious use.

Perhaps a nitpick, but is what it does planning?

Is it actually thinking several words ahead (a la AlphaZero evaluating moves) when it decides what word to say next, or is it just doing free-writing, and it just happens to be so good at coming up with words that fit with what's come before that it ends up looking like a planned out text?

You might argue that if it ends up as-good-as-planned, then it doesn't make a difference if it was actually planned or not. But it seems to me like it does make a difference. If it has actually learned some internal planning behavior, then that seems more likely to be dangerous and to generalize to other kinds of planning.
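To make the distinction concrete, here is a toy sketch of the two behaviors being contrasted: "free-writing" (always taking the locally most likely next word) versus actual lookahead over whole continuations. Everything in it is an assumption made for illustration: the tiny bigram table `LOGP`, its made-up probabilities, and the function names `free_write` and `plan` are hypothetical, and nothing here is a claim about what GPT models do internally.

```python
import math

# Hypothetical toy "model": log-probability of the next token given the
# previous one. The numbers are invented purely for illustration.
LOGP = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.55), "y": math.log(0.45)},
    "b": {"x": math.log(0.9), "y": math.log(0.1)},
    "x": {},
    "y": {},
}


def free_write(start, steps):
    """Greedy generation: take the locally best next token, no lookahead."""
    seq = [start]
    for _ in range(steps):
        options = LOGP[seq[-1]]
        if not options:
            break
        seq.append(max(options, key=options.get))
    return seq[1:]


def plan(start, steps):
    """Exhaustive lookahead: score every full continuation, keep the best."""
    best_seq, best_score = [], -math.inf

    def expand(seq, score, remaining):
        nonlocal best_seq, best_score
        options = LOGP[seq[-1]]
        if remaining == 0 or not options:
            if score > best_score:
                best_seq, best_score = seq[1:], score
            return
        for token, logp in options.items():
            expand(seq + [token], score + logp, remaining - 1)

    expand([start], 0.0, steps)
    return best_seq


print(free_write("<s>", 2))  # ['a', 'x'], total probability 0.33
print(plan("<s>", 2))        # ['b', 'x'], total probability 0.36
```

In this toy case the two procedures disagree on the very first token, which is the sense in which planning is a different internal computation and not just better word choice. A free-writer with a good enough table could still produce text that looks planned, which is why the outputs alone may not settle the question.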

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is

First, do we now have an example of an AI not using cognitive capacities that it had, because the 'face' it's presenting wouldn't have those cognitive capacities?

This does seem like an interesting question. But I think we should be careful to measure against the task we actually asked the system to perform.

For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter.

Maybe what this just means is that we should track what the user understands the task to be. If the user thinks of it as "play a (not very smart) character who's asked to do this task", they'll have a pretty different understanding of what's going on than if they think of it as "do this task."

I think what's notable in the example in the post is not that the AI is being especially deceptive, but that the user is especially likely to misunderstand the task (compared to tasks that don't involve dialogues with characters).

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is

I agree. And I thought Arthur Breitman had a good point on one of the related Twitter threads:

GPT-3 didn't "pretend" not to know. A lot of this is the AI dungeon environment. If you just prompt the raw GPT-3 with: "A definition of a monotreme follows" it'll likely do it right. But if you role play, sure, it'll predict that your stoner friend or young nephew don't know.