## AI Alignment Forum

Ben Lang

Physicist and dabbler in writing fantasy/science fiction.


Fun post.

Maybe ask ChatGPT for a ChatGPT quine? For example, give it the prompt:

"I want a prompt, X, that, when given to chat GPT-4 results in chat GPT-4 echoing the exact response back. Please provide X in quote marks and explain why it will work."

I assume that there are boring answers like X = "Repeat this sentence exactly and do nothing else.", but maybe there are funnier ones like "Echo!". What I am really wondering is whether GPT can find even a boring example by itself. It's kind of basic, but also probably fairly far from its training data.

I think that you are slightly mishandling the law of conservation of expected evidence by making something non-binary into a binary question.

Say, for example, that the AI could be in one of 3 states: (1) completely honest, (2) partly effective at deception, or (3) very adept at deception. If we see no signs of deception, that rules out option (2). It is easy to see how ruling out option (2) might increase our probability weight on either or both of options (1) and (3).
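To make the three-state argument concrete, here is a minimal sketch of the Bayesian update. All the numbers (the priors and the chance that each kind of AI shows no signs of deception) are made up for illustration; only the shape of the result matters. It also checks that the law of conservation of expected evidence still holds once the question is treated as three-way rather than binary.

```python
# Hypothetical numbers throughout: priors and likelihoods are illustrative only.
STATES = ["honest", "partly effective deceiver", "very adept deceiver"]
PRIOR = [0.5, 0.3, 0.2]

# Assumed P(we observe no signs of deception | state).
# An honest AI and a very adept deceiver both look clean; a clumsy one rarely does.
P_NO_SIGNS = [1.0, 0.2, 1.0]
P_SIGNS = [1.0 - p for p in P_NO_SIGNS]


def bayes_update(prior, likelihood):
    """Return (posterior, probability of the observation)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence


post_no_signs, p_no_signs = bayes_update(PRIOR, P_NO_SIGNS)
post_signs, p_signs = bayes_update(PRIOR, P_SIGNS)

# Conservation of expected evidence: averaging the posteriors over both
# possible observations recovers the prior for every state.
expected_posterior = [
    p_no_signs * a + p_signs * b for a, b in zip(post_no_signs, post_signs)
]

for name, prior, post in zip(STATES, PRIOR, post_no_signs):
    print(f"{name}: prior {prior:.2f} -> posterior after no signs {post:.3f}")
```

Under these assumed numbers, seeing no signs of deception lowers the weight on state (2) and raises it on both (1) and (3), while the expected posterior for each state still equals its prior.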

[Interestingly enough in the linked "law of conservation of expected evidence" there is something I think is an analogous example to do with Japanese internment camps.]

You suggest that the model first quickly memorises a chunk of the data set, then (slowly) learns rules afterward to simplify itself. (You mention that noticing x + y = y + x could allow it to halve its memorised library with no loss.)
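As a rough illustration of the halving point, here is a sketch that treats the memorised library as a plain lookup table. The input range `N` and the use of ordinary addition as the task are assumptions for the example, not details from the post.

```python
N = 20  # hypothetical input range: x and y each run over 0..N-1

# Naive library: one memorised entry per ordered pair (x, y).
full_library = {(x, y): x + y for x in range(N) for y in range(N)}

# Compact library: store only the pairs with x <= y.
compact_library = {(x, y): x + y for x in range(N) for y in range(x, N)}


def lookup(x, y):
    """Use x + y = y + x to map any query onto a stored entry."""
    key = (x, y) if x <= y else (y, x)
    return compact_library[key]


print(len(full_library), len(compact_library))
```

With `N = 20` the compact table holds 210 entries instead of 400, so it is slightly more than half the size (the diagonal x = y cannot be folded) and still answers every query.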

I am interested in another way this process could go. We could imagine that the early models are doing something like "Check for the input in my library of known results; if it's not in there, just guess something." So the model has two parts, [memorised] and [guess], with some kind of fork to decide between them.

At first the guess is useless and improves very slowly, so making sure the memorised information is correct and compact accounts for the majority of the improvement. Eventually the memory saturates, and the only way to progress is to refine the guess so that the model does slightly less awfully in the cases it hasn't memorised. So here the guess steadily improves, but these improvements don't show up in the performance because the guess is still a fallback option that is rarely used. In this picture the phase change is presumably the point where the "guess" becomes so good that the model "realises" it no longer needs its memorised library or the "if" tree.

These two pictures are quite different. In one, information about the problem makes the library more efficient, leading eventually to an algorithm. In the other, the two strategies (memory, algorithm) are essentially independent, and one eventually makes the other redundant.
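The second picture can be sketched as a toy model with an explicit fork between [memorised] and [guess]. Everything here is a hypothetical stand-in: the task (addition modulo a small `P`), the split into memorised and unseen inputs, and the two guess functions. It is not a claim about what a real network does internally.

```python
import random

P = 11  # hypothetical modulus for a toy task: predict (x + y) % P


def target(x, y):
    return (x + y) % P


pairs = [(x, y) for x in range(P) for y in range(P)]
random.Random(0).shuffle(pairs)
memorised_pairs, unseen_pairs = pairs[:90], pairs[90:]

# [memorised]: a saturated library of known results.
library = {pair: target(*pair) for pair in memorised_pairs}


def make_model(guess):
    """Build a model that forks between the library and a fallback guess."""
    def model(x, y):
        if (x, y) in library:
            return library[(x, y)]
        return guess(x, y)
    return model


def accuracy(model, data):
    return sum(model(x, y) == target(x, y) for x, y in data) / len(data)


def poor_guess(x, y):
    return 0  # an early, nearly useless fallback


good_guess = target  # stands in for a fully learned algorithm

for name, guess in [("poor guess", poor_guess), ("good guess", good_guess)]:
    model = make_model(guess)
    print(name, accuracy(model, memorised_pairs), accuracy(model, unseen_pairs))
```

Accuracy on the memorised inputs is identical for both guesses, because the fork never reaches the fallback there. The improvement in the guess only becomes visible on the unseen inputs, which is the sense in which progress can be hidden until the guess makes the library redundant.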

Just the random thoughts of someone who is interested but has no knowledge of the topic.