Ofer — AI Alignment Forum

AI ALIGNMENT FORUM
AF

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Did OpenAI/Anthropic allow you to evaluate smaller scale versions* of GPT4/Claude before training the full-scale model?

* [EDIT: and full-scale models in earlier stages of the training process]

Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic.

Have you discussed this point with other relevant researchers before deciding to publish this post? Is there a wide agreement among relevant researchers that a public, unrestricted discussion about this topic is net-positive? Have you considered the unilateralist's curse and biases that you may have (in terms of you gaining status/prestige from publishing this)?

Categorizing failures as “outer” or “inner” misalignment is often confused

Ofer3y10

(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)

Categorizing failures as “outer” or “inner” misalignment is often confused

Ofer3y10

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.

Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.

This categorization is non-exhaustive. Suppose we create a superintelligence via a training process with good feedback signal and no distribution shift. Should we expect that no existential catastrophe will occur during this training process?

We may be able to see sharp left turns coming

Ofer3y20

The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.

If we're trying to predict abrupt changes in the accuracy of output token sequences, the per-token log-likelihood can be a useful signal. What's the analogous signal when we're talking about abrupt changes in a model's ability to deceptively conceal capabilities, hack GPU firmware, etc.? What log-likelihood plots can we use to predict those types of abrupt changes in behavior?

Common misconceptions about OpenAI

Ofer3y10

Sorry, that text does appear in the linked page (in an image).

Common misconceptions about OpenAI

Ofer3y10

The Partnership may never make a profit

I couldn't find this quote in the page that you were supposedly quoting from. The only google result for it is this post. Am I missing something?

[This comment is no longer endorsed by its author]Reply

Principles of Privacy for Alignment Research

Ofer3y10

That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important.

That consideration seems relevant only for language models that will be doing/supporting alignment work.

Principles of Privacy for Alignment Research

Ofer3y12

Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.

The relevant texts I'm thinking about here are:

Descriptions of certain tricks to evade our safety measures.
Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might "hijack" the model's logic).

Principles of Privacy for Alignment Research

Ofer3y10

Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments