## AI ALIGNMENT FORUMAF

Gurkenglas

I operate by Crocker's rules.

I won't deliberately, derisively spread something just because you tried to point out an infohazard.

# Posts

Sorted by New

Superrational Agents Kelly Bet Influence!

Suppose instead of a timeline with probabilistic events, the coalition experiences the full tree of all possible futures - but we translate everything to preserve behavior. Then beliefs encode which timelines each member cares about, and bets trade influence (governance tokens) between timelines.

Disentangling Corrigibility: 2015-2021

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence. All short terms ought to mean something, and when everything fits together better than expected, that is a sign that one is on the right track, and that there is a pattern to formalize. and  can be replaced by  and . I expect that the latter formation is better because it is shorter. Its only direct effect would be that you would write  instead of , so the previous sentence must cash out as this being a good thing. Indeed, it points out a direction in which to generalize. How does your math interact with quantilization? I plan to expand when I've had time to read all links.

Disentangling Corrigibility: 2015-2021

I am not sure if I understand the question.

pi has form afV, V has form mfV, f is a long reused term. Expand recursion to get afmfmf... and mfmfmf.... Define E=fmE and you get pi=aE without writing f twice. Sure, you use V a lot but my intuition is that there should be some a priori knowable argument for putting the definitions your way or your theory is going to end up with the wrong prior.

Inframeasures and Domain Theory

Damn it, I came to write about the monad1 then saw the edit. You may want to add it to this list, and compare it with the other entries.

Here's a dissertation and blog post by Jared Tobin on using (X -> R) -> R with flat reals to represent usual distributions in Haskell. He appears open to get hired.

Maybe you want a more powerful type system? I think Coq allows constructing that subtype of a type which satisfies a property. Agda's cubical type theory places a lot of emphasis for its for the unit interval. Might dependent types be enough express lipschitzness and concavity?

1: Spotted it during literature search on pushforwards to measure the distribution of the vector of all activations in a neural network for one input, given the known distribution of inputs to a GAN generator that outputs inputs to the first network. Which I started modeling (as ((Neurons -> R) -> R) -> R) between you giving me that DT book and first reading about the technique in this post :).

Disentangling Corrigibility: 2015-2021

I like your "Corrigibility with Utility Preservation" paper. I don't get why you prefer not using the usual conditional probability notation.  leads to TurnTrout's attainable utility preservation. Why not use  in the definition of ? Could you change the definition of  to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure  is large and  small after the button press. (Although it might just keep going further in that direction...) I don't like the privileging of  actions.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I'm not convinced that we can do nothing if the human wants ghosts to be happy. The AI would simply have to do what would make ghosts happy if they were real. In the worst case, the human's (coherent extrapolated) beliefs are your only source of information on how ghosts work. Any proper general solution to the pointers problem will surely handle this case. Apparently, each state of the agent corresponds to some probability distribution over worlds.

Some AI research areas and their relevance to existential safety

with the exception of people who decided to gamble on being part of the elite in outcome B

Game-theoretically, there's a better way. Assume that after winning the AI race, it is easy to figure out everyone else's win probability, utility function and what they would do if they won. Human utility functions have diminishing returns, so there's opportunity for acausal trade. Human ancestry gives a common notion of fairness, so the bargaining problem is easier than with aliens.

Most of us care some even about those who would take all for themselves, so instead of giving them the choice between none and a lot, we can give them the choice between some and a lot - the smaller their win prob, the smaller the gap can be while still incentivizing cooperation.

Therefore, the AI race game is not all or nothing. The more win probability lands on parties that can bargain properly, the less multiversal utility is burned.

Inner Alignment in Salt-Starved Rats

This all sounds reasonable. I just saw that you were arguing for more being learned at runtime (as some sort of Steven Reknip), and I thought that surely not all the salt machinery can be learnt, and I wanted to see which of those expectations would win.

Inner Alignment in Salt-Starved Rats

Do you posit that it learns over the course of its life that salt taste cures salt definiency, or do you allow this information to be encoded in the genome?

Confucianism in AI Alignment

The hypotheses after the modification are supposed to have knowledge that they're in training, for example because they have enough compute to find themselves in the multiverse. Among hypotheses with equal behavior in training, we select the simpler one. We want this to be the one that disregards that knowledge. If the hypothesis has form "Return whatever maximizes property _ of the multiverse", the simpler one uses that knowledge. It is this form of hypothesis which I suggest to remove by inspection.