I was attempting to solve a relatively specific technical problem related to self-proofs using counterfactuals. So I suppose I do think counterfactuals (at least non-circular ones) are useful. But I'm not sure I'd commit to any broader philosophical statement about counterfactuals beyond "they can be used in a specific formal way to help functions prove statements about their own output in a way that avoids Löb's theorem issues". That being said, that's a pretty good use, if that's the type of thing you want to do? It's also not totally clear if you're imagining counterfac...
The agent needs access to a pointer to itself, and that pointer is passed in as a parameter rather than being static, as it was in the original paper -- this approach in particular needs the pointer to be dynamic in this way.
There are also use cases where a bit of code receives a pointer not to its exact self -- when it is called as a subagent, it will get the parent's pointer.
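A minimal sketch of the dynamic self pointer, assuming agents are plain Python functions. The names `run`, `parent_agent`, and `sub_agent` are hypothetical and only illustrate the calling convention; this is not the construction from the post.

```python
from typing import Callable

# Hypothetical type: an agent receives a pointer to some agent
# (normally itself) and returns a description of what it did.
AgentPtr = Callable[..., str]


def sub_agent(self_ptr: AgentPtr) -> str:
    # Reports which agent the received pointer refers to.
    return f"sub_agent sees {self_ptr.__name__}"


def parent_agent(self_ptr: AgentPtr) -> str:
    # Calls sub_agent as a subagent and forwards the parent's pointer,
    # so sub_agent does not receive a pointer to its exact self.
    return sub_agent(self_ptr)


def run(agent: AgentPtr) -> str:
    # The caller supplies the self pointer at call time,
    # so nothing about it is fixed statically inside the agent.
    return agent(agent)
```

Run at top level, `sub_agent` gets a pointer to itself; called from `parent_agent`, it gets the parent's pointer instead.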
It seems like a solid symbol grounding solution would allow us to delegate some of the work of translating vague intuitions about alignment into actual policies. In particular, there seems to be a correspondence between CIRL and symbol grounding -- systems aware that they do not know the goal they should optimize are similar to symbol-grounding machines aware that there is a difference between the literal content of instructions and the desired behavior the instructions represent (although the instructions might be even more abstract symbols than words).
Is there any lite...
So, this post only deals with agent counterfactuals (not environmental counterfactuals), but I believe I have solved the technical issue you mention about the construction of logical counterfactuals as it concerns TDT. See: https://www.alignmentforum.org/posts/TnkDtTAqCGetvLsgr/a-possible-resolution-to-spurious-counterfactuals
I have fewer thoughts about environmental counterfactuals but think a similar approach could be used to make statements along those lines, i.e. construct alternate agents receiving a different observation about the world. I'm not sure...
It seems like you could use these counterfactuals to do whatever decision theory you'd like? My goal wasn't to solve actually hard decisions -- the 5 and 10 problem is perhaps the easiest decision I can imagine -- but merely to construct a formalism such that even extremely simple decisions involving self-proofs can be solved at all.
I think the reason this seems to imply a decision theory is that it's such a simple model that there are some ways of making decisions that are impossible in the model -- a fair portion of that was inherited from the pseudocode...
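For concreteness, here is a toy sketch of the 5 and 10 problem in which the deciding agent evaluates each option by constructing an alternate agent that always takes that action. It only simulates the alternate agents and does no proof search, so it does not reproduce the self-proof formalism discussed above; `universe`, `constant_agent`, and `deciding_agent` are hypothetical names.

```python
from typing import Callable, Dict

Action = int  # the number of dollars the agent takes: 5 or 10
ACTIONS = (5, 10)


def universe(agent: Callable[[], Action]) -> int:
    """Toy environment: the payoff equals the amount the agent takes."""
    return agent()


def constant_agent(action: Action) -> Callable[[], Action]:
    """Build an alternate agent that always takes `action`."""
    def agent() -> Action:
        return action
    return agent


def counterfactual_payoffs() -> Dict[Action, int]:
    # Evaluate the universe on each alternate agent, never on the
    # deciding agent itself, so the evaluation is not circular.
    return {a: universe(constant_agent(a)) for a in ACTIONS}


def deciding_agent() -> Action:
    # Pick the action whose alternate agent earns the highest payoff.
    payoffs = counterfactual_payoffs()
    return max(payoffs, key=payoffs.get)
```

Because the counterfactuals are computed from separately constructed agents, the deciding agent never has to reason about its own output to compare the two options.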