Charlie Steiner — AI Alignment Forum

Different regulation (or other legislation) might also make other sorts of transparency good ideas, imo.

A mandate or subsidy for doing safety research might make it a good idea to require transparency for more safety-relevant AI research.

Regulation aimed at improving company practices (e.g. at security against weight theft, or preventing powergrab risks like access to helpful-only models above some threshold, or following some future safety practices suggested by some board of experts, or [to be meta] good transparency practices) should generate some transparency about how companies are doing (at cybersecurity or improper internal use mitigation or safety best practices or transparency).

If safety cases are actually being evaluated and you don't get to do all the research you want if the safety is questionable, then the landscape for transparency of safety cases (or other safety data that might have a different format) looks pretty different.

I'm actually less clear on how risk reports would tie in to regulation - maybe they would get parted out to reports on how the company is doing at various risk-mitigation practices, if those are transparent?

If anyone builds it, everyone will plausibly be fine

Charlie Steiner24d20

Supposing that we get your scenario where we have basically-aligned automated researchers (but haven't somehow solved the whole alignment problem along the way). What's your take on the "people will want to use automated researchers to create smarter, dangerous AI rather than using them to improve alignment" issue? Is your hope that automated researchers will be developed in one leading organization that isn't embroiled in a race to the bottom, and that org will make a unified pivot to alignment work?

Recent progress on the science of evaluations

Charlie Steiner1mo20

Thanks, just watched a talk by Luxin that explained this. Two questions.

Currently the models' ability scores seem to be pretty-staightforward average score as a function of the annotated level of the task on that ability. But it would be useful to be able to infer ability scores without too much data from the actual task. Could you do something clever like estimate just from a token probabilities beam search on a few proxy questions?
The headline number for predictiveness looks good. But does the predictiveness have systematic errors on different benchmarks? I.e. for tasks from different benchmarks with the same annotated requirements, are some benchmarks systematically harder or easier?

Notes on cooperating with unaligned AIs

Charlie Steiner2mo20

Fortunately, there’s a correlation between situations where (i) AI takeover risk is high, and (ii) AIs have a good understanding of the world. If AI developers have perfect ability to present the AI with false impressions of the world, then the risk from AI takeover is probably low. While if AIs have substantial ability to distinguish truth from falsehood, then perhaps that channel can also be used to communicate facts about the world.

Whether this is fortunate depends a lot on how beneficial communication with unaligned AIs is. If unaligned AI with high chance of takeover can exploit trade to further increase its chances of takeover ("Oh, I just have short-term preferences where I want you to run some scientific simulations for me"), then this correlation is the opposite of fortunate. If people increase an unaligned AI's situational awareness so it can trust our trade offer, then the correlation seems indirectly bad for us.

Agentic Interpretability: A Strategy Against Gradual Disempowerment

Charlie Steiner4mo*20

Do you have ideas about how to do this?

I can't think of much besides trying to get the AI to richly model itself, and build correspondences between that self-model and its text-production capability.

But this is, like, probably not a thing we should just do first and think about later. I'd like it to be part of a pre-meditated plan to handle outer alignment.

Edit: after thinking about it, that's too cautious. We should think first, but some experimentation is necessary. The thinking first should plausibly be more like having some idea about how to bias further work towards safety rather than building self-improving AI as fast as possible.

Prover-Estimator Debate: A New Scalable Oversight Protocol

Charlie Steiner4mo40

What happens if humans have a systematic bias? E.g. we always rate claims with negative sentiment as improbable, and always rate claims with positive sentiment as probable? It seems like Alice dominates because Alice gets to write and pick the subclaims. But does Bob have a defense, maybe predicting the human probability and just giving that? But because the human probability isn't required to be consistent, I think Bob is sunk because Alice can force the human probability assignment to be inconsistent and then gotcha Bob either for disagreeing with the human or for being inconsistent.

Problems with instruction-following as an alignment target

Charlie Steiner5mo40

Seth, I forget where you fall in the intent alignment typology: if we build a superintelligent AI that follows instructions in the way you imagine, can we just give it the instruction "Take autonomous action to do the right thing," and then it will just go do good stuff without us needing to continue interacting with it in the instruction-following paradigm?

Absolute Zero: Alpha Zero for LLM

Charlie Steiner5mo30

From a 'real alignment' perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.

You might think of the label 'RLAIF' as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI's predictions (or more general generative output, if the training isn't for pure prediction) about human preference-laden behaviors, and then transforms those predictions into some sort of supervisory signal.

Similarly, the AZR setup leverages the AI's unsupervised knowledge of code-quality-laden behaviors, using a scaffold that turns them back into a reward signal that lets the AI quote-unquote "train itself" to code better. Except that relative to vanilla RLAIF, there's more of an emphasis on generating and solving specific problems that form a curriculum for the agent, rather than just responding well to samples from the training distribution. But now that I've described things in this way, you can probably see how to turn this back into RLAIF for alignment.

The overarching problem is, as usual, we don't understand how to do alignment in a non-hacky way.

We don't know what sorts of moral reflection are necessary for good outcomes, and we don't know where human feedback is a necessary ingredient to keep AI meta-ethical evolution grounded to human preferences. But hey, if we try various value learning schemes empirically maybe we'll learn some things.

An alignment safety case sketch based on debate

Charlie Steiner5mo20

If we're talking about the domain where we can assume "good human input", why do we need a solution more complicated than direct human supervision/demonstration (perhaps amplified by reward models or models of human feedback)? I mean this non-rhetorically; I have my own opinion (that debate acts as an unprincipled way of inserting one round of optimization for meta-preferences [if confusing, see here]), but it's probably not yours.

UK AISI’s Alignment Team: Research Agenda

Charlie Steiner5mo50

Thanks for the post (and for linking the research agenda, which I haven't yet read through)! I'm glad that, even if you use the framing of debate (which I don't expect to pan out) to think about alignment, you still get to instrumental subproblems that would be broadly useful.

(If this post is "what would help make debate work for AI alignment," you can also imagine framings "what would help make updating on human feedback work" [common ARC framing] and "what would help make model-based RL work" [common Charlie framing])

I'd put these subproblems into two buckets:

Developing theorems about specific debate ideas.
- These are the most debate-specific.
Formalizing fuzzy notions.
- By which I mean fuzzy notions that are kept smaller than the whole alignment problem, and so you maybe hope to get a useful formalization that takes some good background properties for granted.

I think there's maybe a missing bucket, which is:

Bootstrapping or indirectly generating fuzzy notions.
- If you allow a notion to grow to the size of the entire alignment problem (The one that stands out to me in your list is 'systematic human error' - if you make your model detailed enough, what error isn't systematic?), then it becomes too hard to formalize first and apply second. You need to study how to safely bootstrap concepts, or how to get them from other learned processes.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wikitag Contributions

Comments