So, when a human lies over the course of an interaction, they'd be holding a hidden state in mind throughout. However, an LLM wouldn't carry any cognitive latent state over between telling the lie and then responding to the elicitation question. I guess it feels more like "I just woke up from amnesia, and it seems I have just told a lie. Okay, now what do I do..."

Stating this to:

  1. Verify that indeed this is how the paper works, and there's no particular way of passing latent state that I missed, and
  2. Any thoughts on how this affects the results and approach?

  1. Yes, this is how the paper works.
  2. Not really. I find the simulator framing useful for thinking about this.


There are many things I found outstanding about this post. The key one, however, is that after reading it, I feel less confused when thinking about transformer language models. The post had that taste of deconfusion where many of the arguments are elegant and simple, like suddenly tilting a bewildering shape into place. I particularly enjoyed the discussion of ways agency does and does not manifest within a simulator (multiple agents, irrational agents, non-agentic processes), the formulation of the prediction orthogonality thesis, ways i...

Thank you for this lovely comment. I'm pleasantly surprised that people were able to get so much out of it. As I wrote in the post, I wasn't sure if I'd ever get around to publishing the rest of the sequence, but the reception so far has caused me to bump up the priority of that.

If someone asks what the rock is optimizing, I’ll say “the actions” - i.e. the rock “wants” to do whatever it is that the rock in fact does.

This argument does not seem to me like it captures the reason a rock is not an optimiser? 

I would hand wave and say something like: 

"If you place a human into a messy room, you'll sometimes find that the room is cleaner afterwards. If you place a kid in front of a bowl of sweets, you'll soon find the sweets gone. These and other examples are pretty surprising state transitions that would be highly unlikely i...

Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself. A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.

An update on this: sadly I underestimated how busy I would be after posting this bounty. I spent 2h reading this and Thomas' post the other day, but didn't manage to get into the headspace of evaluating the bounty (i.e. making my own interpretation of John's post, and then deciding whether Thomas' distillation captured that). So I will not be evaluating this. (Still happy to pay if someone else I trust claims Thomas' distillation was sufficient.) My apologies to John and Thomas about that.

Cool, I'll add $500 to the distillation bounty then, to be paid out to anyone you think did a fine job of distilling the thing :) (Note: this should not be read as my monetary valuation for a day of John's work!)

(Also, a cooler pay-out would be basis points, or less, of Wentworth impact equity)

Needing to judge submissions is the main reason I didn't offer a bounty myself. Read the distillation, and see if you yourself understand it. If "Coherence of Distributed Decisions With Different Inputs Implies Conditioning" makes sense as a description of the idea, then you've probably understood it. If you don't understand it after reading an attempted distillation, then it wasn't distilled well enough.

How long would it have taken you to do the distillation step yourself for this one? I'd be happy to post a bounty, but price depends a bit on that.

Short answer: about one full day. Longer answer: normally something like this would sit in my notebook for a while, only informing my own thinking. It would get written up as a post mainly if it were adjacent to something which came up in conversation (either on LW or in person). I would have the idea in my head from the conversation, already be thinking about how best to explain it, chew on it overnight, and then if I'm itching to produce something in the morning I'd bang out the post in about 3-4 hours. Alternative paths: I might need this idea as background for something else I'm writing up, or I might just be in a post-writing mood and not have anything more ready-to-go. In either of those cases, I'd be starting more from scratch, and it would take about a full day.

Jaan/Holden convo link is broken :(


I think this post strikes a really cool balance between discussing some foundational questions about the notion of agency and its importance, and posing a concrete puzzle that prompted some interesting comments.

For me, Life is a domain that makes it natural to have reductionist intuitions. Compared to, say, neural networks, I find there are fewer biological metaphors or higher-level abstractions where you might sneak in mysterious answers that purport to solve the deeper questions. I'll consider this post next time I want to introduce some...

Here are prediction questions for the predictions that TurnTrout himself provided in the concluding post of the Reframing Impact sequence


