## AI ALIGNMENT FORUM

Linda Linsefors

Hi, I am a Physicist, an Effective Altruist and AI Safety student/researcher.

# Wiki Contributions

[Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA

I'm also interested to see the list of candidate instincts.

Regarding funding, how much money do you need? Just order of magnitude. There are lots of different grants, and where you want to apply depends on the size of your budget.

[\$20K in Prizes] AI Safety Arguments Competition

"Most AI research focuses on building machines that do what we say. Alignment research is about building machines that do what we want."

Source: Me, probably heavily inspired by "Human Compatible" and that type of argument. I have used this argument in conversations to explain AI Alignment for a while, and I don't remember when I started. But the argument is very CIRL (cooperative inverse reinforcement learning).

I'm not sure if this works as a one-liner explanation. But it does work as a conversation starter about why trying to specify goals directly is a bad idea, and how the things we care about are often hard to measure and therefore hard to instruct an AI to do. Insert reference to King Midas, or talk about what can go wrong with a superintelligent YouTube algorithm that only optimises for clicks.

_____________________________

"Humans rule the earth because we are smart. Some day we'll build something smarter than us. When it happens we better make sure it's on our side."

Source: Me

Inspiration: I don't know. I probably stole the structure of this argument from somewhere, but it was too long ago to remember.

By "our side" I mean on the side of humans. I don't mean it as an us vs them thing. But maybe it can be read that way. That would be bad. I've never run into that misunderstanding, though I also have not talked to politicians.

Formal Inner Alignment, Prospectus

I think one major reason why people don't tend to get hijacked by imagined adversaries is that you can't simulate someone who is smarter than you, and therefore you can defend against anything you can simulate in your mind.

This is not a perfect argument, since I can imagine someone who has power over me in the real world, and for example imagine how angry they would be at me if I did something they did not like. But then their power over me comes from their power in the real world, not their ability to outsmart me inside my own mind.

Redwood Research’s current project

The correct labeling of how violent a knifing is, is not 50.1%, or 49.9%. The correct label is 0 or 100%. There is no "ever so slightly" in the training data. The percentage is about the uncertainty of the classifier; it is not about degrees of violence in the sample. If it was the other way around, then I would mostly agree with the current training scheme, as I said.

If the model is well calibrated then, at 50%, half the samples would be safe and half violent. Moving the safe ones up is helpful. Decreasing misclassification of safe samples will increase the chance of outputting something safe.

Decreasing the uncertainty from 50% to 0 for an unsafe sample doesn't do anything for that sample. But it does help in learning good from bad in general, which is more important.

Redwood Research’s current project

There’s one thing you can do that definitely works, which is to only get labels for snippets which are just barely considered safe enough by your classifier. Eg if your threshold is set to 99%, so that a completion won’t be accepted unless the classifier is 99% sure that it’s safe, then there’s no point looking at completions rated as <99% likely to be safe (because the classifier isn’t going to accept them), and also it’s probably a better bet to look at things that the model thinks are 99.1% likely to be safe rather than 99.9%, because (assuming the model is calibrated) you’ll find errors 9x as often.

This seems wrong to me. You should want to label and train on snippets that your classifier thinks are 50% likely to be safe, because that is how you maximise information.

I don't know how to argue this point since I don't know what the crux behind the disagreement is, but I'll try to throw out some words...

If safeness was a continuous number and you wanted solutions that are safe enough, it would be more reasonable to focus most training around the cutoff point. Although wider training data probably leads to better generalisation, so I would include that too.

But safety is not a continuous number. It's binary in your setup: either someone is hurt or not. When you run it you want to have some extra safety by raising the threshold. But when you train you just want to reduce uncertainty. Things that the classifier thinks are 99% safe are not inherently 99% safe. They are either safe or not. So focusing your training around the threshold doesn't make any sense.

Another way to say this is that the uncertainty is in the model, not in the world. There are going to be snippets that the model is less than 99% sure about, but that are actually perfectly safe, and could be valuable training data.
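To put rough numbers on the two views, here is a minimal sketch that assumes a perfectly calibrated classifier. The function names are made up for illustration and are not from the Redwood post. It shows where the "9x as often" figure in the quoted passage comes from, and why a label on a 50% snippet carries the most information in the entropy sense.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a binary label that is 'safe' with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty left, so a label teaches nothing new
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def error_rate(p_safe: float) -> float:
    """Assuming calibration: chance that a snippet rated p_safe is actually unsafe."""
    return 1.0 - p_safe

# Compare snippets at different classifier confidence levels.
for p_safe in (0.5, 0.9, 0.99, 0.991, 0.999):
    print(f"p_safe={p_safe:.3f}  "
          f"error_rate={error_rate(p_safe):.4f}  "
          f"bits_per_label={binary_entropy(p_safe):.4f}")

# Ratio of error rates just above a 99% threshold (about 9).
print(error_rate(0.991) / error_rate(0.999))
```

Under these assumptions both statements hold at once: errors are about 9 times as common at 99.1% as at 99.9%, while the expected information per label is largest at 50%. Which of the two matters more for training is exactly the disagreement above.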

Selection Theorems: A Program For Understanding Agents

Not sure how useful this is, but I think this counts as a selection theorem.
(Paper by Caspar Oesterheld, Joar Skalse, James Bell and me)

https://proceedings.neurips.cc/paper/2021/hash/b9ed18a301c9f3d183938c451fa183df-Abstract.html

We played around with taking learning algorithms designed for multi-armed bandit problems (your action matters but not your policy) and placing them in Newcomblike environments (both your actual action and your probability distribution over actions matter). And then we proved some stuff about their behaviour.
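A toy illustration of that difference, not taken from the paper: the payoffs and the predictor model below are assumptions chosen for simplicity. The reward depends on the action taken and also on the probability with which the policy one-boxes, which is what makes the environment Newcomblike rather than a plain bandit.

```python
# Hypothetical Newcomb payoffs, chosen only for illustration.
BIG = 1_000_000   # opaque box, filled only if the predictor expects one-boxing
SMALL = 1_000     # transparent box, always collected by a two-boxer

def expected_reward(action: str, p_one_box: float) -> float:
    """Expected reward of `action` under a policy that one-boxes with p_one_box.

    Assumption: the predictor fills the opaque box with probability p_one_box,
    so the reward depends on the policy and not only on the action.
    """
    opaque = BIG * p_one_box
    return opaque if action == "one_box" else opaque + SMALL

def policy_value(p_one_box: float) -> float:
    """Expected reward of the whole policy, averaging over its own actions."""
    return (p_one_box * expected_reward("one_box", p_one_box)
            + (1 - p_one_box) * expected_reward("two_box", p_one_box))

for p in (0.0, 0.5, 1.0):
    print(f"p_one_box={p:.1f}  "
          f"one_box={expected_reward('one_box', p):.0f}  "
          f"two_box={expected_reward('two_box', p):.0f}  "
          f"policy_value={policy_value(p):.0f}")
```

In this toy setup, two-boxing looks better by a fixed amount for any fixed policy, yet policies that one-box more often earn more overall. A learner that only estimates per-action values, as in a bandit, is blind to that second effect.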

On Solving Problems Before They Appear: The Weird Epistemologies of Alignment

Studying early cosmology has a lot of the same epistemic problems as AI safety. You can't do direct experiments. You can't observe what's going on. You have to extrapolate anything you know far beyond where this knowledge is trustworthy.

By early cosmology I mean anything before recombination (when the matter transformed from plasma to gas, and the universe became transparent to light) but especially anything to do with cosmic inflation or competing theories, or stuff about how it all started, or is cyclic, etc.

Unfortunately I don't know what lessons we can learn from cosmology. As far as I know they don't know how to deal with this either. I worked in this field during my PhD and as far as I could see, everyone just used their personal intuition. There was some spectacular disagreement about how to apply probability to world trajectories, but very few attempts to solve this issue. I did not see any meta discussions, or people with different intuitions trying to crux it out. But just because these things did not happen in front of me doesn't mean they don't happen.

The two-layer model of human values, and problems with synthesizing preferences

There is a mismatch in saying cortex=character and subcortex=player.

If I understand the player-character model right, then unconscious coping strategies would be player-level tactics. But these are learned behaviours, and would therefore be part of the cortex.

In Kaj's example, the idea that cheesecake will make the bad go away exists in the cortex's world model.

According to Steven's model of how the brain works (which I think is probably true), the subcortex is part of the game the player is playing. Specifically, the subcortex provides the reward signal, and some other important game stats (stamina level, hit-points, etc). The subcortex is also sort of like a tutorial, drawing your attention to things that the game creator (evolution) thinks might be useful, and occasional cut scenes (acting out pre-programmed behaviour).

ML comparison:
* The character is the pre-trained neural net
* The player is the backprop
* The cortex is the neural net and backprop
* Subcortex is the reward signal and sometimes supervisory signal.

Also, I don't like the player-character model much. Like all models it is at best a simplification, and it does catch some of what is going on, but I think it is more wrong than right and I think something like a multi-agent model is much better. I.e. there are coping mechanisms and other less conscious strategies living in your brain side by side with who you think you are. But I don't think these are completely invisible the way the player is invisible to the character. They are predictive models (e.g. "cheesecake will make me safe"), and it is possible to query them for predictions. And almost all of these models are in the cortex.

The Commitment Races problem

Imagine your life as a tree (as in data structure). Every observation which (from your point of view of prior knowledge) could have been different, and every decision which (from your point of view) could have been different, is a node in this tree.

Ideally you would want to pre-analyse the entire tree, and decide the optimal pre-commitment for each situation. This is too much work.

So instead you wait and see which branch you find yourself in, only then make the calculations needed to figure out what you would do in that situation, given a complete analysis of the tree (including logical constraints, e.g. people predicting what you would have done, etc). This is UDT. In theory, I see no drawbacks with UDT. Except in practice UDT is also too much work.

What you actually do, as you say, is to rely on experience-based heuristics. Experience-based heuristics are far superior for computational efficiency, and will give you a leg up in raw power. But you will slide away from optimal DT, which will give you a negotiating disadvantage. Given that I think raw power is more important than negotiating advantage, I think this is a good trade-off.

The only situation where you want to rely more on DT principles, is in super important one-off situations, and you basically only get those in weird acausal trade situations. Like, you could frame us building a friendly AI as acausal trade, like Critch said, but that framing does not add anything useful.

And then there are things like this and this and this, which I don't know how to think about. I suspect it breaks somehow, but I'm not sure how. And if I'm wrong, getting DT right might be the most important thing.

But in any normal situation, you will either have repeated games among several equals, where some coordination mechanism is just uncomplicatedly in everyone's interest, or you're in a situation where one person just has much more power over the other one.

The Commitment Races problem

(This is some of what I tried to say yesterday, but I was very tired and not sure I said it well)

Hm, the way I understand UDT, is that you give yourself the power to travel back in logical time. This means that you don't need to actually make commitments early in your life when you are less smart.

If you are faced with blackmail or transparent Newcomb's problem, or something like that, where you realise that if you had thought of the possibility of this sort of situation before it happened (but with your current intelligence), you would have pre-committed to something, then you should now do as you would have pre-committed to.

This means that a UDT agent doesn't have to do tons of pre-commitments. It can figure things out as it goes, and still get the benefit of early pre-committing. Though as I said when we talked, it does lose some transparency, which might be very costly in some situations. Though I do think that you lose transparency in general by being smart, and that it is generally worth it.

(Now something I did not say)

However, there is one commitment that you (maybe?[1]) have to make to get the benefit of UDT if you are not already UDT, which is to commit to become UDT. And I get that you are wary of commitments.

Though more concretely, I don't see how UDT can lead to worse behaviours. Can you give an example? Or do you just mean that UDT gets into commitment races at all, which is bad? But I don't know any DT that avoids this, other than always giving in to blackmail and bullies, which I already know you don't, given one of the stories in the blogpost.

[1] Or maybe not. Is there a principled difference between never giving in to blackmail because you pre-committed, and just never giving in to blackmail without any binding pre-commitment? I suspect not really, which means you are UDT as long as you act UDT, and no pre-commitment is needed, other than for your own sake.