jylin04 — AI Alignment Forum

Can LLM-based models do model-based planning?

I recently spent a few months thinking about whether LLM-based models can do model-based planning, and wrote a ~40-page report on it: "Report on LLMs and model-based planning". The doc is a bit rough around the edges still - most notably, the concepts of "efficient planning" and "sufficiently convoluted" tasks...

Apr 16, 2025 · 11

OthelloGPT learned a bag of heuristics

Work performed as a part of Neel Nanda's MATS 6.0 (Summer 2024) training program. TL;DR: This is an interim report on reverse-engineering Othello-GPT, an 8-layer transformer trained to take sequences of Othello moves and predict legal moves. We find evidence that Othello-GPT learns to compute the board state using many...

Jul 2, 2024 · 111

More Recent Progress in the Theory of Neural Networks

Thanks to Dan Roberts and Sho Yaida for comments on a draft of this post. In this post, I would like to draw attention to the book Principles of Deep Learning Theory (PDLT), which I think represents a significant advance in our understanding of how neural networks work [1]. Among...

Oct 6, 2022 · 82

A review of the Bio-Anchors report

This is a linkpost for a review of Ajeya Cotra's Biological Anchors report (see also update here) that I wrote in April 2022. It's since won a prize from the EA criticism and red-teaming contest, so I thought it might be good to share here for further discussion. Here's a...

Oct 3, 2022 · 45

Thoughts on AGI safety from the top

In this post, I'll summarize my views on AGI safety after thinking about it for two years while on the Research Scholars Programme at FHI -- before which I was quite skeptical about the entire subject. [1] My goals here are twofold: (1) to introduce the problem from scratch in...

Feb 2, 2022 · 37

Transparency and AGI safety

Introduction: Transparency is the problem of going inside the black box of an AI system and understanding exactly how it works, by tracking and understanding the processes in its internal state. In this post, I’ll argue that making AI systems more transparent could be very useful from a longtermist or...

Jan 11, 2021 · 54