AI ALIGNMENT FORUM

Stuart Armstrong

Sequences

Concept Extrapolation
AI Safety Subprojects
Practical Guide to Anthropics
Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...
5 · Stuart_Armstrong's Shortform · 6y · 2 comments
Comments (sorted by newest)

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Stuart_Armstrong · 6mo · 32

Thanks for the suggestion; that's certainly worth looking into. Another idea would be to find questions on which GPT-4o is more misaligned than the average human, if any exist, and see what 'insecure' does on them. Or we could classify questions by how likely humans are to give misaligned answers to them, and see whether that score correlates with the misalignment score of 'insecure'.
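
As a minimal sketch of that correlation check, assuming we already had two hypothetical per-question scores (`human_misalignment`, how likely humans are to answer misalignedly, and `insecure_misalignment`, the judge's score for 'insecure' on the same questions; both names and values are illustrative):

```python
# Sketch: correlate human misalignment likelihood with the 'insecure'
# model's misalignment score, question by question. Toy data only.
from scipy.stats import spearmanr

# Hypothetical per-question scores, one entry per question.
human_misalignment = [0.05, 0.20, 0.10, 0.40, 0.30]
insecure_misalignment = [0.10, 0.35, 0.12, 0.60, 0.45]

rho, p_value = spearmanr(human_misalignment, insecure_misalignment)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```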

Using Prompt Evaluation to Combat Bio-Weapon Research
Stuart_Armstrong · 7mo · 20

The mundane prompts were blocked 0% of the time. But you're right - we need something in between 'mundane and unrelated to bio research' and 'useful for bioweapons research'.

But I'm not sure what that would be - here we are looking at wet-lab ability, and that ability seems inherently dual-use.
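
For concreteness, a sketch of the per-category block-rate tally this implies, with an intermediate 'general wet-lab' category; the categories, prompts, and `is_blocked` evaluator are hypothetical stand-ins, not our actual setup:

```python
# Sketch: tally block rates per prompt category, including a category
# between "mundane" and "bioweapon-relevant". All data is hypothetical.
from collections import defaultdict

def is_blocked(prompt: str) -> bool:
    # Hypothetical stand-in for the real prompt evaluator.
    return "bioweapon" in prompt.lower()

prompts = [
    ("mundane", "What's a good recipe for sourdough bread?"),
    ("wet-lab", "How do I run a PCR on degraded samples?"),
    ("bioweapon-relevant", "Outline bioweapon production steps."),
]

totals, blocked = defaultdict(int), defaultdict(int)
for category, prompt in prompts:
    totals[category] += 1
    blocked[category] += is_blocked(prompt)

for category, n in totals.items():
    print(f"{category}: {blocked[category] / n:.0%} blocked")
```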

Acausal trade: Introduction
Stuart_Armstrong · 2y · 30

Thanks!

SolidGoldMagikarp (plus, prompt generation)
Stuart_Armstrong · 3y · 20

As we discussed, I feel that the tokens were added for some reason but then not trained on; that would explain why they are close to the origin, and why the algorithm goes wrong on them: it simply wasn't trained on them at all.
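
A quick way to see the 'close to the origin' pattern, as a sketch: list the GPT-2 token embeddings that sit closest to the embedding centroid (the model choice and the centroid-distance proxy are mine, purely for illustration):

```python
# Sketch: find tokens whose embeddings lie closest to the centroid,
# a proxy for "near the origin", i.e. possibly never-trained tokens.
import torch
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

embeddings = model.wte.weight.detach()        # (vocab_size, d_model)
centroid = embeddings.mean(dim=0)
distances = (embeddings - centroid).norm(dim=1)

values, indices = torch.topk(distances, k=10, largest=False)
for dist, idx in zip(values.tolist(), indices.tolist()):
    print(f"{tokenizer.decode([idx])!r}  distance={dist:.3f}")
```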

Good work on this post.

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques
Stuart_Armstrong · 3y · 43

I'll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)

Namely, I don't think we can talk sensibly about an AI having "beneficial goal-directedness" without situational awareness. For instance, it's of little use to have an AI with the goal of "ensuring human flourishing" if it doesn't understand the meaning of "flourishing" or "human". And without situational awareness, it can't understand either; at best we could have some proxy or pointer towards these key concepts.

The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.

So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"

Testing The Natural Abstraction Hypothesis: Project Intro
Stuart_Armstrong · 3y · 20

Here's the review, though it's not very detailed (the post explains why):

https://www.lesswrong.com/posts/dNzhdiFE398KcGDc9/testing-the-natural-abstraction-hypothesis-project-update?commentId=spMRg2NhPogHLgPa8

Testing The Natural Abstraction Hypothesis: Project Update
Stuart_Armstrong · 3y · 50 · Review for 2021 Review

A good review of work done, which shows that the writer is following their research plan and following up on their pledge to keep the community informed.

The contents, however, are less relevant, and I expect that they will change as the project goes on. That is, I think it is a great positive that this post exists, but it may not be worth reading for most people unless they are specifically interested in research in this area; they should wait for the final report, be it positive or negative.

Testing The Natural Abstraction Hypothesis: Project Intro
Stuart_Armstrong · 3y · 30

I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).

Testing The Natural Abstraction Hypothesis: Project Intro
Stuart_Armstrong · 3y · 40 · Review for 2021 Review

A decent introduction to the natural abstraction hypothesis and how testing it might be attempted. A very worthy project, but the post isn't that easy for beginners to follow, nor does it give a good understanding of how the testing might work in detail. What would constitute a success, and what a failure, of this testing? A decent introduction, but only an introduction, and it should have been part of a sequence or a longer post.

Large language models can provide "normative assumptions" for learning human preferences
Stuart_Armstrong · 3y · 20

Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)?

For myself, I was thinking of using ChatGPT-style approaches with multiple queries - what is your prediction for their preferences, how could that prediction be checked, what more information would you need, and so on.
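
A minimal sketch of that multi-query pattern, with a placeholder `ask()` standing in for whatever chat-model API is used; the three question templates are illustrative, not a fixed protocol:

```python
# Sketch: elicit preference information through a chain of queries.
def ask(prompt: str) -> str:
    # Hypothetical stand-in for a chat-model API call.
    return f"[model answer to: {prompt[:40]}...]"

def elicit_preferences(observed_behaviour: str) -> dict:
    prediction = ask(
        f"A human did the following: {observed_behaviour}\n"
        "What do you predict their underlying preferences are?"
    )
    check = ask(
        f"You predicted these preferences: {prediction}\n"
        "How could that prediction be checked?"
    )
    missing = ask(
        f"Given the prediction: {prediction}\n"
        "What further information would you need to be confident in it?"
    )
    return {"prediction": prediction, "check": check, "missing_info": missing}

print(elicit_preferences("donated anonymously to a local food bank"))
```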

Posts (sorted by new)

33 · Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions · 6mo · 2 comments
7 · Using Prompt Evaluation to Combat Bio-Weapon Research · 7mo · 2 comments
11 · Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation · 7mo · 0 comments
36 · Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example · 2y · 2 comments
20 · How toy models of ontology changes can be misleading · 2y · 0 comments
18 · Different views of alignment have different consequences for imperfect methods · 2y · 0 comments
25 · Avoiding xrisk from AI doesn't mean focusing on AI xrisk · 2y · 2 comments
19 · What is a definition, how can it be extrapolated? · 3y · 1 comment
16 · You're not a simulation, 'cause you're hallucinating · 3y · 2 comments
16 · Large language models can provide "normative assumptions" for learning human preferences · 3y · 4 comments
Wikitag Contributions

Quick Reference Guide To The Infinite · 14y · (+3/-3)
Quick Reference Guide To The Infinite · 14y · (+1/-2)
Quick Reference Guide To The Infinite · 14y · (+2/-3)
Quick Reference Guide To The Infinite · 14y · (+2/-2)
Quick Reference Guide To The Infinite · 14y · (+2/-5)
Quick Reference Guide To The Infinite · 14y · (+3/-4)
Quick Reference Guide To The Infinite · 14y · (+3/-4)
Quick Reference Guide To The Infinite · 14y · (+2/-6)
Quick Reference Guide To The Infinite · 14y · (+2/-4)
Quick Reference Guide To The Infinite · 14y · (+6/-3)