AI ALIGNMENT FORUM

Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Reducing Goodhart

Posts (sorted by new)

3 · Charlie Steiner's Shortform · 5y · 4 comments

Wikitag Contributions

No wikitag contributions to display.

Comments (sorted by newest)
Recent progress on the science of evaluations
Charlie Steiner · 2d · 20

Thanks, just watched a talk by Luxin that explained this. Two questions.

  • Currently the models' ability scores seem to be a pretty straightforward average score as a function of the annotated level of the task on that ability. But it would be useful to be able to infer ability scores without too much data from the actual task. Could you do something clever, like estimating them just from token probabilities or a beam search on a few proxy questions? (Rough sketch of what I mean below.)
  • The headline number for predictiveness looks good. But does the predictiveness have systematic errors on different benchmarks? I.e. for tasks from different benchmarks with the same annotated requirements, are some benchmarks systematically harder or easier?
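
Concretely, the kind of thing I'm imagining for the first bullet (purely illustrative; logprob_of_answer and the one-parameter IRT-style fit are stand-ins I made up, not anything from the talk):

```python
# Purely illustrative sketch: infer an ability score from answer-token
# probabilities on a few proxy questions, instead of running full tasks.
# logprob_of_answer is a hypothetical helper; each proxy item carries the same
# kind of annotated difficulty level used for the tasks.
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_ability(model, proxy_items, logprob_of_answer):
    """proxy_items: list of (question, correct_answer, difficulty) tuples."""
    # Probability the model assigns to the correct answer for each proxy item.
    p_correct = np.array([np.exp(logprob_of_answer(model, q, a))
                          for q, a, _ in proxy_items])
    difficulty = np.array([d for _, _, d in proxy_items])

    # One-parameter IRT-style model: P(correct) = sigmoid(ability - difficulty).
    def neg_log_likelihood(ability):
        pred = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
        pred = np.clip(pred, 1e-6, 1 - 1e-6)
        # Treat the soft probabilities as fractional successes.
        return -np.sum(p_correct * np.log(pred) +
                       (1 - p_correct) * np.log(1 - pred))

    return minimize_scalar(neg_log_likelihood, bounds=(-10, 10),
                           method="bounded").x
```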
Notes on cooperating with unaligned AIs
Charlie Steiner · 19d · 20

Fortunately, there’s a correlation between situations where (i) AI takeover risk is high, and (ii) AIs have a good understanding of the world. If AI developers have perfect ability to present the AI with false impressions of the world, then the risk from AI takeover is probably low. While if AIs have substantial ability to distinguish truth from falsehood, then perhaps that channel can also be used to communicate facts about the world.

Whether this is fortunate depends a lot on how beneficial communication with unaligned AIs is. If unaligned AI with high chance of takeover can exploit trade to further increase its chances of takeover ("Oh, I just have short-term preferences where I want you to run some scientific simulations for me"), then this correlation is the opposite of fortunate. If people increase an unaligned AI's situational awareness so it can trust our trade offer, then the correlation seems indirectly bad for us.

Agentic Interpretability: A Strategy Against Gradual Disempowerment
Charlie Steiner · 3mo* · 20

Do you have ideas about how to do this?

I can't think of much besides trying to get the AI to richly model itself, and build correspondences between that self-model and its text-production capability.

But this is, like, probably not a thing we should just do first and think about later. I'd like it to be part of a pre-meditated plan to handle outer alignment.

Edit: after thinking about it, that's too cautious. We should think first, but some experimentation is necessary. The thinking first should plausibly be more like having some idea about how to bias further work towards safety rather than building self-improving AI as fast as possible.

Prover-Estimator Debate: A New Scalable Oversight Protocol
Charlie Steiner · 3mo · 40

What happens if humans have a systematic bias? E.g. we always rate claims with negative sentiment as improbable, and claims with positive sentiment as probable. It seems like Alice dominates, because Alice gets to write and pick the subclaims. Does Bob have a defense, maybe predicting the human probability and just giving that? Because the human probability isn't required to be consistent, I think Bob is sunk: Alice can force the human probability assignment to be inconsistent, and then gotcha Bob either for disagreeing with the human or for being inconsistent.
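
Here's the toy version of the gotcha I have in mind (entirely my own construction, not something from the paper):

```python
# Toy construction: a sentiment-biased judge rates positively-phrased claims as
# probable even when two claims are logical complements.
def biased_human_prob(claim: str) -> float:
    positive_words = ("succeed", "avoid")
    return 0.9 if any(w in claim for w in positive_words) else 0.1

claim = "the project will succeed"
complement = "the project will avoid succeeding"  # logically ~claim, phrased positively

h1, h2 = biased_human_prob(claim), biased_human_prob(complement)
print(h1 + h2)  # 1.8 -- inconsistent: complementary claims should sum to 1

# If Bob reports a consistent pair (p, 1 - p), his total disagreement with the
# human is |p - h1| + |(1 - p) - h2| >= (h1 + h2) - 1 = 0.8, so Alice can always
# pick a subclaim where Bob disagrees with the human by at least 0.4.
# If Bob instead parrots the human (0.9, 0.9), Alice flags him for inconsistency.
```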

Problems with instruction-following as an alignment target
Charlie Steiner · 4mo · 40

Seth, I forget where you fall in the intent alignment typology: if we build a superintelligent AI that follows instructions in the way you imagine, can we just give it the instruction "Take autonomous action to do the right thing," and will it then just go do good stuff without us needing to continue interacting with it in the instruction-following paradigm?

Absolute Zero: Alpha Zero for LLM
Charlie Steiner · 4mo · 30

From a 'real alignment' perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.

You might think of the label 'RLAIF' as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI's predictions (or more general generative output, if the training isn't for pure prediction) about human preference-laden behaviors, and then transforms those predictions into some sort of supervisory signal.

Similarly, the AZR setup leverages the AI's unsupervised knowledge of code-quality-laden behaviors, using a scaffold that turns that knowledge back into a reward signal which lets the AI quote-unquote "train itself" to code better. Except that relative to vanilla RLAIF, there's more of an emphasis on generating and solving specific problems that form a curriculum for the agent, rather than just responding well to samples from the training distribution. But now that I've described things in this way, you can probably see how to turn this back into RLAIF for alignment.
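
To make that concrete, here's a cartoon of the scaffold I'm gesturing at (all helper names are hypothetical stand-ins, and the real RLAIF/AZR pipelines differ in the details):

```python
# Cartoon of the scaffold described above (generate, preference_logprob, and
# rl_update are hypothetical stand-ins, not any real API).
def rlaif_style_step(model, prompts, rl_update):
    experience = []
    for prompt in prompts:
        # Sample two candidate behaviors from the model itself.
        a, b = model.generate(prompt), model.generate(prompt)

        # Solicit the model's own prediction of which behavior a human would prefer.
        judge_prompt = (f"A human asked: {prompt}\n"
                        f"Response A: {a}\nResponse B: {b}\n"
                        "Which response would the human prefer, A or B?")
        prefers_a = (model.preference_logprob(judge_prompt, "A")
                     > model.preference_logprob(judge_prompt, "B"))

        # Transform that prediction into a supervisory signal.
        experience.append((prompt, a, 1.0 if prefers_a else 0.0))
        experience.append((prompt, b, 0.0 if prefers_a else 1.0))

    # The AZR-flavored twist would be to also have the model generate the
    # prompts, aiming for problems at the edge of its current ability, rather
    # than drawing them from a fixed training distribution.
    rl_update(model, experience)
```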

The overarching problem is, as usual, that we don't understand how to do alignment in a non-hacky way.

We don't know what sorts of moral reflection are necessary for good outcomes, and we don't know where human feedback is a necessary ingredient for keeping AI meta-ethical evolution grounded in human preferences. But hey, if we try various value learning schemes empirically, maybe we'll learn some things.

An alignment safety case sketch based on debate
Charlie Steiner · 4mo · 20

If we're talking about the domain where we can assume "good human input", why do we need a solution more complicated than direct human supervision/demonstration (perhaps amplified by reward models or models of human feedback)? I mean this non-rhetorically; I have my own opinion (that debate acts as an unprincipled way of inserting one round of optimization for meta-preferences [if confusing, see here]), but it's probably not yours.

UK AISI’s Alignment Team: Research Agenda
Charlie Steiner · 4mo · 50

Thanks for the post (and for linking the research agenda, which I haven't yet read through)! I'm glad that, even if you use the framing of debate (which I don't expect to pan out) to think about alignment, you still get to instrumental subproblems that would be broadly useful.

(If this post is "what would help make debate work for AI alignment," you can also imagine framings "what would help make updating on human feedback work" [common ARC framing] and "what would help make model-based RL work" [common Charlie framing])

I'd put these subproblems into two buckets:

  • Developing theorems about specific debate ideas.
    • These are the most debate-specific.
  • Formalizing fuzzy notions.
    • By which I mean fuzzy notions that are kept smaller than the whole alignment problem, so that you can maybe hope to get a useful formalization that takes some good background properties for granted.

I think there's maybe a missing bucket, which is:

  • Bootstrapping or indirectly generating fuzzy notions.
    • If you allow a notion to grow to the size of the entire alignment problem (the one that stands out to me in your list is 'systematic human error': if you make your model detailed enough, what error isn't systematic?), then it becomes too hard to formalize first and apply second. You need to study how to safely bootstrap concepts, or how to get them from other learned processes.
AI-enabled coups: a small group could use AI to seize power
Charlie Steiner · 5mo · 30

Why train a helpful-only model?

If one of our key defenses against misuse of AI is good ol' value alignment - building AIs that have some notion of what a "good purpose for them" is, and will resist attempts to subvert that purpose (e.g. to instead exalt the research engineer who comes in to work earliest the day after training as god-emperor) - then we should be able to close the security hole and never need to have a helpful-only model produced at any point during training. In fact, with blending of post-training into pre-training, there might not even be a need to ever produce a fully trained predictive-only model.

Why do misalignment risks increase as AIs get more capable?
Charlie Steiner · 5mo · 20

I'm big on point #2 feeding into point #1.

"Alignment," used in a way where current AI is aligned - a sort of "it does basically what we want, within its capabilities, with some occasional mistakes that don't cause much harm" sort of alignment - is simply easier at lower capabilities, where humans can do a relatively good job of overseeing the AI, not just in deployment but also during training. Systematic flaws in human oversight during training leads (under current paradigms) to misaligned AI.

33 · Neural uncertainty estimation review article (for alignment) · 2y · 0 comments
19 · How to solve deception and still fail. · 2y · 2 comments
47 · Some background for reasoning about dual-use alignment research · 2y · 6 comments
11 · [Simulators seminar sequence] #2 Semiotic physics - revamped · 3y · 14 comments
14 · Shard theory alignment has important, often-overlooked free parameters. · 3y · 4 comments
19 · [Simulators seminar sequence] #1 Background & shared assumptions · 3y · 3 comments
10 · Take 14: Corrigibility isn't that great. · 3y · 3 comments
24 · Take 13: RLHF bad, conditioning good. · 3y · 0 comments
11 · Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. · 3y · 1 comment
16 · Take 11: "Aligning language models" should be weirder. · 3y · 0 comments