Gurkenglas

I operate by Crocker's rules.

I won't deliberately, derisively spread something just because you tried to point out an infohazard.

Speculations against GPT-n writing alignment papers

Translating to a tree of natural language descriptions and back lets you

- use different instances of GPT at each step, so that any coded messages would have to rely on the common training set rather than shared internals.
- do a random subset of the tasks yourself, so that any subversion of the target system would have to be based on subtle biases, perhaps with a small number of outright lies that it hopes we don't implement manually, rather than the description being fabricated from whole cloth.

Having the thing write papers is merely an existence proof of embedded agency being irrelevant except for deconfusion.

Intelligent agents causally responsible for your existence.

What do you mean you can think of this, I told this to you :D

Superrational Agents Kelly Bet Influence!

Suppose instead of a timeline with probabilistic events, the coalition experiences the full tree of all possible futures - but we translate everything to preserve behavior. Then beliefs encode which timelines each member cares about, and bets trade influence (governance tokens) between timelines.
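A minimal sketch of that reading, under textbook assumptions that are mine rather than the post's: every member bets Kelly-style on a complete set of outcomes, and the market odds are the influence-weighted average of the members' beliefs. The names `consensus` and `settle` are hypothetical. Each branch of the tree then hands every member a different share of influence:

```haskell
import Data.List (transpose)

-- A belief assigns a probability to each outcome (assumed to sum to 1).
type Belief = [Double]

-- Market probabilities: the influence-weighted average of members' beliefs.
-- Assumes the shares sum to 1 and all beliefs cover the same outcomes.
consensus :: [Double] -> [Belief] -> [Double]
consensus shares beliefs = map (sum . zipWith (*) shares) (transpose beliefs)

-- Influence held by each member in the branch where outcome o happens.
-- A Kelly bettor stakes the fraction p_i(o) of its share on o and is paid
-- at odds 1 / consensus(o), so it ends up with share * p_i(o) / consensus(o).
settle :: Int -> [Double] -> [Belief] -> [Double]
settle o shares beliefs =
  [ s * (b !! o) / (market !! o) | (s, b) <- zip shares beliefs ]
  where
    market = consensus shares beliefs
```

With two members holding equal shares and beliefs `[0.8, 0.2]` and `[0.4, 0.6]`, `settle 0` moves influence toward the first member and `settle 1` toward the second, while the shares within each branch still sum to 1. In the tree-of-futures picture no outcome is "revealed"; the two results are just the governance splits in the two branches.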

Disentangling Corrigibility: 2015-2021

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence. All short terms ought to mean something, and when everything fits together better than expected, that is a sign that one is on the right track, and that there is a pattern to formalize. and can be replaced by and . I expect that the latter formation is better because it is shorter. Its only direct effect would be that you would write instead of , so the previous sentence must cash out as this being a good thing. Indeed, it points out a direction in which to generalize. How does your math interact with quantilization? I plan to expand when I've had time to read all links.

Disentangling Corrigibility: 2015-2021

I am not sure if I understand the question.

pi has form afV, V has form mfV, f is a long reused term. Expand recursion to get afmfmf... and mfmfmf.... Define E=fmE and you get pi=aE without writing f twice. Sure, you use V a lot but my intuition is that there should be some a priori knowable argument for putting the definitions your way or your theory is going to end up with the wrong prior.
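Spelled out, treating a, f and m as formal symbols composed by juxtaposition (my shorthand here, not notation from the paper):

$$
\begin{aligned}
\pi &= a f V, & V &= m f V \\
\pi &= a f m f m f \cdots, & V &= m f m f m f \cdots \\
E &:= f m E = f m f m \cdots \\
\pi &= a E, & V &= m E
\end{aligned}
$$

Each definition now mentions f only through E.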

Inframeasures and Domain Theory

Damn it, I came to write about the monad^{1} then saw the edit. You may want to add it to this list, and compare it with the other entries.

Here's a dissertation and blog post by Jared Tobin on using `(X -> R) -> R` with flat reals to represent usual distributions in Haskell. He appears to be open to being hired.
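For readers who haven't seen the encoding, here is a minimal sketch of a distribution represented as `(X -> R) -> R`, i.e. by how it integrates a test function. This is not Tobin's code; the names `Measure`, `integrate`, `fromWeights` and `expectation` are hypothetical, and `Double` stands in for the reals.

```haskell
-- A distribution over x, represented by the integral it assigns to a function.
newtype Measure x = Measure { integrate :: (x -> Double) -> Double }

instance Functor Measure where
  -- Pushforward along f: integrate g against the image by precomposing with f.
  fmap f (Measure m) = Measure (\g -> m (g . f))

instance Applicative Measure where
  -- Dirac measure at x.
  pure x = Measure (\g -> g x)
  -- Independent product of a measure over functions and one over arguments.
  Measure mf <*> Measure mx = Measure (\g -> mf (\f -> mx (g . f)))

instance Monad Measure where
  -- Integrate over x, then over the measure that k returns for that x.
  Measure m >>= k = Measure (\g -> m (\x -> integrate (k x) g))

-- A finitely supported measure from (point, weight) pairs.
fromWeights :: [(x, Double)] -> Measure x
fromWeights xs = Measure (\g -> sum [w * g x | (x, w) <- xs])

-- Expected value of a real-valued measure.
expectation :: Measure Double -> Double
expectation m = integrate m id

-- A fair coin that lands on 0 or 1.
coin :: Measure Double
coin = fromWeights [(0, 0.5), (1, 0.5)]

-- The sum of two independent fair coins, written with the monad.
twoCoins :: Measure Double
twoCoins = do
  a <- coin
  b <- coin
  pure (a + b)
```

Here `expectation twoCoins` integrates `a + b` over both coins, and `fmap` is exactly the pushforward. Constraints like Lipschitzness or concavity of the functional are not expressible in this plain type, which is where the question about stronger type systems comes in.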

Maybe you want a more powerful type system? I think Coq allows constructing the subtype of a type that satisfies a given property. Agda's cubical type theory places a lot of emphasis on the unit interval. Might dependent types be enough to express Lipschitzness and concavity?

^{1}: Spotted it during literature search on pushforwards to measure the distribution of the vector of all activations in a neural network for one input, given the known distribution of inputs to a GAN generator that outputs inputs to the first network. Which I started modeling (as `((Neurons -> R) -> R) -> R`) between you giving me that DT book and first reading about the technique in this post :).

Disentangling Corrigibility: 2015-2021

I like your "Corrigibility with Utility Preservation" paper. I don't get why you prefer not using the usual conditional probability notation. leads to TurnTrout's attainable utility preservation. Why not use in the definition of ? Could you change the definition of to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure is large and small after the button press. (Although it might just keep going further in that direction...) I don't like the privileging of actions.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I'm not convinced that we can do nothing if the human wants ghosts to be happy. The AI would simply have to do what would make ghosts happy if they were real. In the worst case, the human's (coherent extrapolated) beliefs are your only source of information on how ghosts work. Any proper general solution to the pointers problem will surely handle this case. Apparently, each state of the agent corresponds to some probability distribution over worlds.

Some AI research areas and their relevance to existential safety

with the exception of people who decided to gamble on being part of the elite in outcome B

Game-theoretically, there's a better way. Assume that after winning the AI race, it is easy to figure out everyone else's win probability, utility function and what they would do if they won. Human utility functions have diminishing returns, so there's opportunity for acausal trade. Human ancestry gives a common notion of fairness, so the bargaining problem is easier than with aliens.

Most of us care at least somewhat even about those who would take all for themselves, so instead of giving them the choice between none and a lot, we can give them the choice between some and a lot - the smaller their win probability, the smaller the gap can be while still incentivizing cooperation.

Therefore, the AI race game is not all or nothing. The more win probability lands on parties that can bargain properly, the less multiversal utility is burned.

Inner Alignment in Salt-Starved Rats

This all sounds reasonable. I just saw that you were arguing for more being learned at runtime (as some sort of Steven Reknip), and I thought that surely not all the salt machinery can be learnt, and I wanted to see which of those expectations would win.

Suppose the bridge is safe iff there's a proof that the bridge is safe. Then you would forbid the reasoning "Suppose I cross. I must have proven it's safe. Then it's safe, and I get 10. Let's cross.", which seems sane enough in the face of Löb.