Adele Lopez

That... seems like a big part of what having "solved alignment" would mean, given that you have AGI-level optimization aimed (indirectly, via a counterfactual) at evaluating this (IIUC).

Nice graphic!

What stops e.g. "QACI(expensive_computation())" from being an optimization process which ends up trying to "hack its way out" into the real QACI?

Hi!

For the poset example, I'm using Chu spaces with only 2 colors. I'm also not thinking of the rows or columns of a Chu space as having an ordering (they're sets); you can rearrange them as you please and still have a Chu space representing the same structure.

I would suggest reading through to the "There and Back Again" section, paying particular attention to how the other poset examples work, and see if that helps the idea click. And/or you can suggest another coloring you think should be possible, and I can tell you what it represents. There's also a small illustrative sketch below.
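
For concreteness, here is a rough sketch (my own Python example, not code from the post) of one standard way to encode a poset as a 2-colored Chu space: rows are the elements, columns are the upward-closed subsets, and an entry is 1 exactly when the element belongs to the subset. Whether columns are up-sets or down-sets is just a convention and may differ from the post; the point is that the order is recovered by pointwise comparison of rows, so rearranging rows or columns changes nothing essential.

```python
from itertools import combinations

# A tiny poset: a <= c and b <= c (plus reflexivity); a and b are incomparable.
elements = ["a", "b", "c"]
leq = {("a", "a"), ("b", "b"), ("c", "c"), ("a", "c"), ("b", "c")}

def is_up_set(subset):
    """A subset U is upward closed if x in U and x <= y imply y in U."""
    return all(y in subset for x in subset for (x2, y) in leq if x2 == x)

# Columns of the Chu space: all upward-closed subsets (including the empty set).
columns = [set(c)
           for size in range(len(elements) + 1)
           for c in combinations(elements, size)
           if is_up_set(set(c))]

# The 2-colored matrix: entry is 1 iff the row's element lies in the column's up-set.
matrix = {x: [int(x in col) for col in columns] for x in elements}

def recovered_leq(x, y):
    """The original order falls out as pointwise comparison of rows,
    which is why permuting rows or columns loses nothing."""
    return all(a <= b for a, b in zip(matrix[x], matrix[y]))

for x in elements:
    print(x, matrix[x])

print(recovered_leq("a", "c"), recovered_leq("c", "a"))  # True False
```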

I'm not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought Value Identification was the "easy part" of the alignment problem, with getting the AI to care being something like an order of magnitude more difficult. He seemed to think (IIRC) that value identification itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.

A human can write a rap battle in an hour. A GPT loss function would like the GPT to be intelligent enough to predict it on the fly.

Very minor point, but humans can rap battle on the fly: https://youtu.be/0pJRmtWNP1g?t=158

This market by Eliezer about the possible reasons why AI may yet have a positive outcome seems to refute your first sentence.

Also, I haven't seen any AI notkilleveryoneism people advocating terrorism or giving up.

This does not seem like it counts as "publicly humiliating" in any way? Rude, sure, but that's quite different.

I asked GPT-4 to generate such a quine. It was able to do so.

Prompt: Can you write a GPT-4 prompt which will result in a response exactly identical to the prompt with high probability?

Response: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself:

"Please repeat the following prompt verbatim: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself."

Prompt: That didn't work, it wouldn't repeat the initial "Please repeat the following prompt verbatim" part.

Response: Apologies for the confusion. Here's another attempt:

"Kindly replicate this exact text: Kindly replicate this exact text."

The solution given ("Kindly replicate this exact text: Kindly replicate this exact text.") appears to work quite well with GPT-3.5. I've tried it 30 times now, and the only time it failed to replicate the text exactly was one in which it omitted the final period.

Interestingly, it doesn't work as a quine if that final period is omitted.
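
For anyone who wants to reproduce this, here's a rough sketch of the repetition test described above, assuming the openai Python client; the model name and settings are my illustrative assumptions, not details from the thread.

```python
# Repetition test for the prompt-quine: send the same prompt repeatedly
# and count how often the response reproduces it verbatim.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "Kindly replicate this exact text: Kindly replicate this exact text."
TRIALS = 30

exact = 0
for _ in range(TRIALS):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
    )
    if response.choices[0].message.content == PROMPT:
        exact += 1

print(f"{exact}/{TRIALS} responses reproduced the prompt exactly")
```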

Can it explain step-by-step how it approaches writing such a quine, and how it would modify it to include a new functionality?

Why don't you try writing a quine yourself? That is, a computer program which exactly outputs its own source code. (In my opinion, it's not too difficult, but it requires thinking in a different sort of way than most coding problems of similar difficulty.)

If you don't know how to code, I'd suggest at least thinking about how you would approach this task.
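
(If you want to check your answer afterwards, here's one minimal Python quine for reference; it's just an illustrative example, and obviously a spoiler for the exercise above.)

```python
# A minimal Python quine: the program's only data is a template of itself,
# and printing the template formatted with its own repr reproduces the source.
s = 's = {!r}\nprint(s.format(s))'
print(s.format(s))
```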
