Top posts
Abstract Mathematical models can describe neural network architectures and training environments; however, the learned representations that emerge have remained difficult to model. Here we build a new theoretical model of internal representations, framed in terms of economics and information theory. We distinguish niches of value that representations can...
While I think about interpretability a lot, it's not my day job! Let me dive down a rabbit hole; tell me where I am wrong. Intro As I see it, the first step toward interpretability would be isolating which neurons perform which role. For example, which neurons are representing...