AI ALIGNMENT FORUM

jacquesthibs — AI Alignment Forum

Jacques Thibodeau

I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs

LinkedIn: https://www.linkedin.com/in/jacques-thibodeau/

3384

Ω

102

15

460

5y

Top posts

But is it really in Rome? An investigation of the ROME model editing technique

Thanks to Andrei Alexandru, Joe Collman, Michael Einhorn, Kyle McDonell, Daniel Paleka, and Neel Nanda for feedback on drafts and/or conversations which led to useful insights for this work. In addition, thank you to both William Saunders and Alex Gray for exceptional mentorship throughout this project.

The majority of this work was carried out this summer. Many people in the community were surprised when I mentioned some of the limitations of ROME (Rank-One Model Editing), so I figured it was worth writing a post about them, as well as about other insights I gained from looking into the paper. Most tests were done with GPT-2; some were done with GPT-J.

The ROME paper (Locating and Editing Factual Associations in GPT) has been one of the most influential papers in the prosaic alignment community. It has several important insights. The main findings are:

1. Factual associations such as “The Eiffel Tower is in Paris” seem to be stored in the MLPs of the early-middle layers of a GPT model. As the Tower token passes through the network, the MLPs of the early-middle layers will write information (e.g. the Eiffel Tower’s location) into the residual so that the model can later read that information to generate a token about that fact (e.g. Paris).

2. Editing/updating the MLP of a single layer for a given (subject, relationship, object) association allows the model to generate text with the updated fact when using new prompts/sentences that include the subject tokens. For example, editing “The Eiffel Tower is in Paris” to “The Eiffel Tower is in Rome” results in a model that outputs “The Eiffel Tower is right across from St Peter’s Basilica in Rome, Italy.”

In this post, I show that the ROME edit has many limitations:

* The ROME edit doesn’t generalize in the way you might expect. It’s true that if the subject tokens you use for the edit are found in the prompt, it will try to generalize from the updated fact. However, it doesn’t “generalize” in the following ways:
  * It is not direction-agnosti

Dec 30, 2022•105
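As a rough illustration of what "rank-one" means in the ROME summary above, here is a minimal sketch of a rank-one update to a single linear layer. It is not the paper's code: the function name `rank_one_edit`, the toy shapes, and the random data are all assumptions made for illustration. The optional `C` argument stands in for a key covariance estimate, which is how I understand the paper to weight the update; with `C=None` the sketch falls back to the plain minimum-norm version.

```python
import numpy as np

def rank_one_edit(W, k, v, C=None):
    """Return W' with W' @ k == v, where W' - W is a rank-one matrix.

    W: (d_out, d_in) weight matrix (hypothetical stand-in for an MLP projection)
    k: (d_in,) key vector representing the subject
    v: (d_out,) value vector the layer should now output for that key
    C: optional (d_in, d_in) key covariance estimate (assumption; identity if None)
    """
    k = np.asarray(k, dtype=float)
    v = np.asarray(v, dtype=float)
    # Direction along which the update reads the input.
    u = k if C is None else np.linalg.solve(C, k)
    # What the layer currently gets wrong for this key.
    residual = v - W @ k
    # Outer product of two vectors is rank one; the scaling makes W' @ k hit v exactly.
    return W + np.outer(residual, u) / (u @ k)

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
k = rng.normal(size=6)
v = rng.normal(size=4)
W_new = rank_one_edit(W, k, v)
```

In the plain version (no `C`), inputs orthogonal to `k` pass through the edited layer unchanged, which is one way to see why the edit is tied so closely to the subject's key vector.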

What Makes an AI Startup "Net Positive" for Safety?

A descriptive, not prescriptive, overview of current AI Alignment Research

Research agenda: Supervising AIs improving AIs

Research agenda: Supervising AIs improving AIs

by Quintin Pope, Owen D, Roman Engeler, and jacquesthibs

[This post summarizes some of the work done by Owen Dudney, Roman Engeler and myself (Quintin Pope) as part of the SERI MATS shard theory stream.] TL;DR Future prosaic AIs will likely shape their own development or that of successor AIs. We're trying to make sure they don't go insane....

Apr 29, 2023•76

Practical Pitfalls of Causal Scrubbing

by Jérémy Scheurer, Phil3, tony, jacquesthibs, and David Lindner

TL;DR: We evaluate Causal Scrubbing (CaSc) on synthetic graphs with known ground truth to determine its reliability in confirming correct hypotheses and rejecting incorrect ones. First, we show that CaSc can accurately identify true hypotheses and quantify the degree to which a hypothesis is wrong. Second, we highlight some limitations...

Mar 27, 2023•89

[Simulators seminar sequence] #2 Semiotic physics - revamped

by Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer, and remember

Update February 21st: After the initial publication of this article (January 3rd) we received a lot of feedback and several people pointed out that propositions 1 and 2 were incorrect as stated. That was unfortunate as it distracted from the broader arguments in the article and I (Jan K) take...

Feb 27, 2023•22

[Simulators seminar sequence] #1 Background & shared assumptions

by Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer, and remember

Meta: Over the past few months, we've held a seminar series on the Simulators theory by janus. As the theory is actively under development, the purpose of the series is to discover central structures and open problems. Our aim with this sequence is to share some of our discussions with...

Jan 2, 2023•50

But is it really in Rome? An investigation of the ROME model editing technique

Thanks to Andrei Alexandru, Joe Collman, Michael Einhorn, Kyle McDonell, Daniel Paleka, and Neel Nanda for feedback on drafts and/or conversations which led to useful insights for this work. In addition, thank you to both William Saunders and Alex Gray for exceptional mentorship throughout this project. The majority of this...

Dec 30, 2022•105

Results from a survey on tool use and workflows in alignment research

On March 22nd, 2022, we released a survey with an accompanying post for the purpose of getting more insight into what tools we could build to augment alignment researchers and accelerate alignment research. Since then, we’ve also released a dataset, a manuscript (LW post), and the (relevant) Simulators post was...

Dec 19, 2022•79

A descriptive, not prescriptive, overview of current AI Alignment Research

by Jan, Logan Riggs, jacquesthibs, and janus

TL;DR: In this project, we collected and cataloged AI alignment research literature and analyzed the resulting dataset in an unbiased way to identify major research directions. We found that the field is growing quickly, with several subfields emerging in parallel. We looked at the subfields and identified the prominent researchers,...

Jun 6, 2022•139
