## AI ALIGNMENT FORUMAF

Gurkenglas

I operate by Crocker's rules.

I won't deliberately, derisively spread something just because you tried to point out an infohazard.

Sorted by New

# Wiki Contributions

Ngo and Yudkowsky on alignment difficulty

Oh, I wasn't expecting you to have addressed the issue! 10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

You're right on all counts in your last paragraph.

Ngo and Yudkowsky on alignment difficulty

it is very interpretable to humans

Misunderstanding: I expect we can't construct a counterfactual planner because we can't pick out the compute core in the black-box learned model.

And my Eliezer's problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.

Ngo and Yudkowsky on alignment difficulty

If you don't wish to reply to Eliezer, I'm an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world - my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.

Someone needs to check if we can use ML to guess activations in one set of neurons from activations in another set of neurons. The losses would give straightforward estimates of such statistical quantities as mutual information. Generating inputs that have the same activations in a set of neurons illustrates what the set of neurons does. I might do this myself if nobody else does.

Troll Bridge

Suppose the bridge is safe iff there's a proof that the bridge is safe. Then you would forbid the reasoning "Suppose I cross. I must have proven it's safe. Then it's safe, and I get 10. Let's cross.", which seems sane enough in the face of Löb.

Speculations against GPT-n writing alignment papers

Translating to a tree of natural language descriptions and back lets you

1. use different instances of GPT at each step, so that any coded messages would have to rely on the common training set rather than shared internals.
2. do a random subset of the tasks yourself, so that any subversion of the target system would have to be based on subtle biases, perhaps with a small number of outright lies that it hopes we don't implement manually, rather than the description being fabricated from whole cloth.

Having the thing write papers is merely an existence proof of embedded agency being irrelevant except for deconfusion.

Intelligent agents causally responsible for your existence.

What do you mean you can think of this, I told this to you :D

Superrational Agents Kelly Bet Influence!

Suppose instead of a timeline with probabilistic events, the coalition experiences the full tree of all possible futures - but we translate everything to preserve behavior. Then beliefs encode which timelines each member cares about, and bets trade influence (governance tokens) between timelines.

Disentangling Corrigibility: 2015-2021

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence. All short terms ought to mean something, and when everything fits together better than expected, that is a sign that one is on the right track, and that there is a pattern to formalize. and  can be replaced by  and . I expect that the latter formation is better because it is shorter. Its only direct effect would be that you would write  instead of , so the previous sentence must cash out as this being a good thing. Indeed, it points out a direction in which to generalize. How does your math interact with quantilization? I plan to expand when I've had time to read all links.

Disentangling Corrigibility: 2015-2021

I am not sure if I understand the question.

pi has form afV, V has form mfV, f is a long reused term. Expand recursion to get afmfmf... and mfmfmf.... Define E=fmE and you get pi=aE without writing f twice. Sure, you use V a lot but my intuition is that there should be some a priori knowable argument for putting the definitions your way or your theory is going to end up with the wrong prior.

Inframeasures and Domain Theory

Damn it, I came to write about the monad1 then saw the edit. You may want to add it to this list, and compare it with the other entries.

Here's a dissertation and blog post by Jared Tobin on using (X -> R) -> R with flat reals to represent usual distributions in Haskell. He appears open to get hired.

Maybe you want a more powerful type system? I think Coq allows constructing that subtype of a type which satisfies a property. Agda's cubical type theory places a lot of emphasis for its for the unit interval. Might dependent types be enough express lipschitzness and concavity?

1: Spotted it during literature search on pushforwards to measure the distribution of the vector of all activations in a neural network for one input, given the known distribution of inputs to a GAN generator that outputs inputs to the first network. Which I started modeling (as ((Neurons -> R) -> R) -> R) between you giving me that DT book and first reading about the technique in this post :).