JanBrauner

Wiki Contributions

Comments

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Great post! This is the best (i.e. most concrete, detailed, clear, and comprehensive) story of existential risk from AI I know of (IMO). I expect I'll share it widely.

Also, I'd be curious if people know of other good "concrete stories of AI catastrophe", ideally with ample technical detail.

High-stakes alignment via adversarial training [Redwood Research report]

Have you tried using automated adversarial attacks (common ML meaning) on text snippets that are classified as injurious but near the cutoff? Especially adversarial attacks that aim to retain semantic meaning. E.g. with a framework like TextAttack?

In the paper, you write: "There is a large and growing literature on both adversarial attacks and adversarial training for large language models [31, 32, 33, 34]. The majority of these focus on automatic attacks against language models. However, we chose to use a task without an automated source of ground truth, so we primarily used human attackers."

But my best guess would be that if you use an automatic adversarial attack on a snippet that humans say is injurious, the result will quite often still be a snippet that humans say is injurious.

High-stakes alignment via adversarial training [Redwood Research report]

Amusing tid-bit, maybe to keep in mind when writing for an ML audience: The connotations with the term "adversarial examples" or "adversarial training" run deep :-)

I engaged with the paper and related blog posts for a couple of hours. It took really long until my brain accepted that "adversarial examples" here doesn't mean the thing that it usually means when I encounter the term (i.e. "small" changes to an input that change the classification, for some definition of small).

There were several instances when my brain went "Wait, that's not how adversarial examples work", followed by short confusion, followed by "right, that's because my cached concept of X is only true for "adversarial examples as commonly defined in ML", not for "adversarial examples as defined here".

How to get into AI safety research

I guess I'd recommend the AGI safety fundamentals course: https://www.eacambridge.org/technical-alignment-curriculum

On Stuart's list: I think this list might be suitable for some types of conceptual alignment research. But you'd certainly want to read more ML for other types of alignment research.

AGI safety from first principles: Goals and Agency

Have we "given it" the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them?

The distinction that you're pointing at is useful. But I would have filed it under "difference in the degree of agency", not under "difference in goals". When reading the main text, I thought this to be the reason why you introduced the six criteria of agency.

E.g., System A tries to prove the Riemann hypothesis by thinking about the proof. System B first seizes power and converts the galaxy into a supercomputer, to then prove the Riemann hypothesis. Both systems maybe have the goal of "proving the Riemann hypothesis", but System B has "more agency": it certainly has self-awareness, considers more sophisticated and diverse plans of larger scale, and so on.

How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?

I am very surprised that the models do better on the generation task than on the multiple-choice task. Multiple-choice question answering seems almost strictly easier than having to generate the answer. Could this be an artefact of how you compute the answer in the MC QA task? Skimming the original paper, you seem to use average per-token likelihood. Have you tried other ways, e.g.

  • pointwise mutual information as in the Anthropic paper, or

  • adding the sentence "Which of these 5 options is the correct one?" to the end of the prompt and then evaluating the likelihood of "A", "B", "C", "D", and "E"?

I suggest this because the result is so surprising, it would be great to see if it appear across different methods of eliciting the MC answer.

Disentangling arguments for the importance of AI safety

I struggle to understand the difference between #2 and #3. The prosaic AI alignment problem only exists because we don't know how to make an agent that tries to do what we want it to do. Would you say that #3 is a concrete scenario for how #2 could lead to a catastrophe?

Load More