AI ALIGNMENT FORUM

Megan Kinniment

I work at ARC Evals. I like language models.

I'm very happy for people to ask to chat, but I might be too busy to accept (message me).

Comments

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks
Megan Kinniment · 2y

Sure.

Exploring Mild Behaviour in Embedded Agents
Megan Kinniment · 3y

1. How does this relate to speed prior and stuff like that?

I list this in the concluding section as something I haven't thought about much but would think about more if I spent more time on it.

2. If the agent figures out how to build another agent...

Yes, tackling these kinds of issues is the point of this post. I think efficient thinking measures would be very difficult / impossible to actually specify well, and I use compute usage as an example of a crappy efficient thinking measure. The point is that even if the measure is crap, it might still be able to induce some degree of mild optimisation, and this mild optimisation could help protect the measure (alongside the rest of the specification) from the kind of gaming behaviour you describe. In the 'Potential for Self-Protection Against Gaming' section, I go through how this works when an agent with a crap efficient thinking measure has the option to perform a 'gaming' action such as delegating or making a successor agent.
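
(As a very rough toy sketch of the shape of this argument, and not anything from the post itself: the numbers, the penalty weight, and the 'delegate' action below are all made up for illustration.)

```python
# Toy action selection: task utility minus a crude "efficient thinking"
# penalty, here just raw compute used. Each action: (name, utility, compute).
# 'delegate_to_successor' stands in for a gaming action such as building a
# successor agent, whose search still shows up in measured compute.
actions = [
    ("solve_task_directly", 1.0, 5.0),
    ("solve_task_thoroughly", 1.2, 50.0),   # hard optimisation: small utility gain, lots of thinking
    ("delegate_to_successor", 1.2, 40.0),   # gaming attempt: still compute-hungry under this measure
]

PENALTY = 0.01  # weight on the (crap) efficient thinking measure

def score(utility, compute):
    return utility - PENALTY * compute

best = max(actions, key=lambda a: score(a[1], a[2]))
print(best[0])  # -> "solve_task_directly": even a crude penalty can make both
                # hard optimisation and the gaming action unattractive
```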

Conditioning Generative Models
Megan Kinniment · 3y

For the newspaper and reddit post examples, I think false beliefs remain relevant since these are observations about beliefs. For example, the observation of BigCo announcing they have solved alignment is compatible with worlds where they actually have solved alignment, but also with worlds where BigCo have made some mistake and alignment hasn't actually been solved, even though people in-universe believe that it has. These kinds of 'mistaken alignment' worlds seem like they would probably contaminate the conditioning to some degree at least. (Especially if there are ways that early deceptive AIs might be able to manipulate BigCo and others into making these kinds of mistakes).
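
(As a toy illustration of the kind of contamination I mean, with entirely made-up probabilities:)

```python
# Toy Bayesian sketch: conditioning on the *observation* "BigCo announces
# alignment is solved", rather than on the fact itself, mixes in worlds
# where the announcement is mistaken. All numbers are invented.
worlds = {
    # world: (prior probability, P(announcement | world))
    "actually_solved":            (0.05, 0.9),
    "mistakenly_believed_solved": (0.10, 0.8),
    "not_solved_no_announcement": (0.85, 0.0),
}

evidence = sum(prior * lik for prior, lik in worlds.values())
posterior = {w: prior * lik / evidence for w, (prior, lik) in worlds.items()}

for w, p in posterior.items():
    print(f"{w}: {p:.2f}")
# actually_solved:            0.36
# mistakenly_believed_solved: 0.64  <- the 'mistaken alignment' worlds dominate
# not_solved_no_announcement: 0.00
```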

Conditioning Generative Models
Megan Kinniment · 3y

Something I’m unsure about here is whether it is possible to separately condition on worlds where X is in fact the case, vs worlds where all the relevant humans (or other text-writing entities) just wrongly believe that X is the case. 

Essentially, is the prompt (particularly the observation) describing the actual facts about this world, or just the beliefs of some in-world text-writing entity? Given that language is often (always?) written by fallible entities, it seems at least not unreasonable to me to assume the second rather than the first interpretation. 

This difference seems particularly relevant to prompts aimed at weeding out deceptive alignment, since in the prompts-as-beliefs case the same prompt could condition both on worlds where we have in fact solved problem X and on worlds where we are being actively misled into believing that we have solved it (when we actually haven't).

Posts

Bounty: Diverse hard tasks for LLM agents · 18 points · 2y · 29 comments
Send us example gnarly bugs · 32 points · 2y · 0 comments
Steering Behaviour: Testing for (Non-)Myopia in Language Models · 24 points · 3y · 5 comments
Recall and Regurgitation in GPT2 · 20 points · 3y · 0 comments
Exploring Mild Behaviour in Embedded Agents · 8 points · 3y · 2 comments
Megan Kinniment's Shortform · 0 points · 3y · 0 comments