This post is a companion piece to a forthcoming paper. This work was done as part of MATS 7.0 & 7.1. Abstract: We explore how LLMs' awareness of their own capabilities affects their ability to acquire resources, sandbag an evaluation, and escape AI control. We quantify LLMs' self-awareness of capability...
The analysis and definitions used here are tentative. My familiarity with the concrete systems discussed ranges from rough understanding (markets and parliaments), through abiding amateur interest (biology), to meaningful professional expertise (AI/ML things). The abstractions and terminology have been refined in conversation and private reflection, and the following examples are...
This analysis is speculative. The framing has been refined in conversation, private reflection, and research. To some extent it feels vacuous, but it is at least valuable for further research and communication. A cluster of questions fundamental to many concerns around risks from artificial systems regards the concepts of search, planning,...
When we speak about entities 'wanting' things, or having 'goal-directed behaviour', what do we mean? Because most of the actors that we (my human readers and I) interact with frequently and attentively are (presumably) computationally similar (to each other and to ourselves), it is easy for our abstractions to conflate phenomena...
This is a short attempt to articulate a framing which I sometimes find useful for thinking about embedded agency. I noticed that I wanted to refer to it a few times in conversations and other writings. A useful stance for thinking about embedded agents takes as more primitive, or fundamental,...
I'm deliberately inhabiting a devil's advocate mindset because that perspective seems to be missing from the conversations I've witnessed. My actual fully-reflective median takeaway might differ. Covid has made writing difficult for me at the moment, and I haven't had the energy to gather citations or fully explain the detail for...
Many thanks to Peter Barnett, my alpha interlocutor for the first version of the proof presented, and draft reader. Analogies are often drawn between natural selection and gradient descent (and other training procedures for parametric algorithms). It is important to understand to what extent these are useful and applicable analogies....