Satvik Golechha

I research intelligence and its emergence and expression in neural networks to ensure advanced AI is safe and beneficial.

I'm currently a Research Scientist at UK AISI, where I train and interpret model organisms of misalignment, such as reward hacking, evaluation awareness, and sandbagging.

For...

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

Auditing language models for hidden objectives