## AI ALIGNMENT FORUM

Linda Linsefors

Hi, I am a Physicist, an Effective Altruist and AI Safety student/researcher.

# Comments

There is no study material, since this is not a course. If you are accepted to one of the project teams, then you will work on that project.

You can read about the previous research outputs here: Research Outputs – AI Safety Camp

The most famous research to come out of AISC is the coin-run experiment.
We Were Right! Real Inner Misalignment - YouTube
[2105.14111] Goal Misgeneralization in Deep Reinforcement Learning (arxiv.org)

But the projects are different each year, so the best way to get an idea for what it's like is just to read the project descriptions.

Second reply. And this time I actually read the link.
I'm not surprised by that result.

My original comment was a reaction to claims of the type [the best way to solve almost any task is to develop general intelligence, therefore there is a strong selection pressure to become generally intelligent]. I think this is wrong, but I have not yet figured out exactly what the correct view is.

But to use an analogy, it's something like this: in the example you gave, the AI gets better at the sub-tasks by learning on a more general training set, so it seems like general capabilities were useful. But suppose we had just trained on even more data for a single sub-task; wouldn't it develop general capabilities anyway, since we just noticed that general capabilities were useful for that sub-task? I was planning to say "no", but I notice that I do expect some transfer learning. I.e. if you train on just one of the datasets, I expect the model to be bad at the other ones, but I also expect it to learn them quicker than without any pre-training.

I seem to expect that AI will develop general capabilities when trained on rich enough data, i.e. almost any real-world data. LLMs are a central example of this.

I think my disagreement with at least myself from some years ago, and probably some other people too (but I've been away from the discourse for a bit, so I'm not sure), is that I don't expect as much agentic long-term planning as I used to expect.

I agree that eventually, at some level of trying to solve enough different types of tasks, GI will be efficient in terms of how much machinery you need, but it will never be able to compete on speed.

Also, it's an open question what counts as "enough different types of tasks". Obviously, for a sufficiently broad class of problems, GI will be more efficient (in the sense clarified above). Equally obviously, for a sufficiently narrow class of problems, narrow capabilities will be more efficient.

Humans have GI to some extent, but we mostly don't use it. This is interesting. It means that a typical human environment is complex enough that it's worth carrying around the hardware for GI. But even though we have it, it is evolutionarily better to fall back on habits, imitation, or instinct in most situations.

Looking back at exactly what I wrote, I said there will not be any selection pressure for GI as long as other options are available. I'm not super confident in this. But I'm going to defend it here anyway by pointing out that "as long as other options are available" is doing a lot of the work. Some problems are only solvable by noticing deep patterns in reality, and in that case a sufficiently deep NN with sufficient training will learn this, and that is GI.

I think we are in agreement.

I think the confusion is because it is not clear from that section of the post whether you are saying
1) "you don't need to do all of these things"
or
2) "you don't need to do any of these things".

Because I think 1 goes without saying, I assumed you were saying 2. And 2 is probably true in rare cases, but this is not backed up by your examples.

But if 1 doesn't go without saying, then this means that a lot of "doing science" is cargo-culting? Which is sort of what you are saying when you talk about cached methodologies.

So why would smart, curious, truth-seeking individuals use cached methodologies? Do I do this?

Some self-reflection: I did some of this as a PhD student, because I was new and it was a way to hit the ground running. So I did some science using the method my supervisor told me to use, while simultaneously working to understand the reasoning behind this method. I spent less time than I would have wanted understanding all the assumptions of the sub-sub-field of physics I was working in, because of the pressure to keep publishing and because I got carried away by various fun math I could do if I just accepted these assumptions. After my PhD I felt that if I was going to stay in physics, I wanted to take a year or two just for learning, to actually understand Loop Quantum Gravity and all the other competing theories, but that's not how academia works, unfortunately, which is one of the reasons I left.

I think that the foundation of good epistemics is to not have competing incentives.

In particular, four research activities were often highlighted as difficult and costly (here in order of decreasing frequency of mention):

• Running experiments
• Formalizing intuitions
• Unifying disparate insights into a coherent frame
• Proving theorems

I don't know what your first reaction to this list is, but for us, it was something like: "Oh, none of these activities seems strictly speaking necessary in knowledge-production." Indeed, a quick look at history presents us with cases where each of those activities was bypassed:

What these examples highlight is the classic failure mode when searching for customers' needs: anchoring too much on what people ask for explicitly, instead of what they actually need.

I disagree that this conclusion follows from the examples. Every example you list uses at least one of the methods on your list. So this might just as well be used as evidence that this list of methods is important.

In addition, several of the listed examples benefited from division of labour. This is common practice in physics. Not everyone does experiments. Some people instead specialise in the other steps of science, such as:

• Formalizing intuitions
• Unifying disparate insights into a coherent frame
• Proving theorems

This is very different from concluding that experiments are not necessary.

Similar but not exactly.

I mean that you take some known distribution (the training distribution) as a starting point. But when sampling actions, you do so from a shifted or truncated distribution, to favour higher-reward policies.

In the decision transformers I linked, the AI is playing a variety of different games, where the programmers might not know what a good future reward value would be. So they let the AI predict the future reward, but with the distribution shifted towards higher rewards.
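To make the kind of shift I mean concrete, here is a toy sketch of my own (not the linked paper's actual implementation, and the numbers are made up): the model predicts a distribution over returns-to-go, and instead of sampling from it directly, we exponentially tilt it toward higher returns before conditioning the action head on the sampled return.

```python
import math

# Hypothetical predicted distribution over returns-to-go: P(return).
predicted = {0.0: 0.2, 10.0: 0.5, 20.0: 0.3}

def tilt(dist, kappa):
    """Reweight each return r by exp(kappa * r) and renormalise,
    shifting probability mass toward higher returns."""
    weights = {r: p * math.exp(kappa * r) for r, p in dist.items()}
    z = sum(weights.values())
    return {r: w / z for r, w in weights.items()}

def mean(dist):
    return sum(r * p for r, p in dist.items())

shifted = tilt(predicted, kappa=0.2)
assert mean(shifted) > mean(predicted)  # conditioning target moves upward
```

Note that unlike truncation at a fixed quantile, this kind of tilt can put almost all the mass on rare extreme-return outcomes if kappa is large, which matters for the comparison with quantilizers below.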

I discussed this a bit more after posting the above comment, and there is something I want to add about the comparison.

In quantilizers, if you know the probability of DOOM under the base distribution, you get an upper bound on DOOM for the quantilizer. This is not the case for the type of probability shift used by the linked decision transformer.

DOOM = unforeseen catastrophic outcome: an outcome that would not be labelled as very bad by the AI's reward function but is in reality VERY BAD.
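The bound works because a q-quantilizer samples uniformly from the top-q fraction of the base distribution ranked by proxy reward, so base-distribution DOOM mass p can make up at most a p/q fraction of that slice. A toy Monte Carlo sketch (all numbers are illustrative assumptions, including a proxy reward that adversarially favours DOOM actions):

```python
import random

random.seed(0)

P_DOOM_BASE = 0.10  # 10% of base-distribution actions lead to DOOM
Q = 0.25            # quantilizer keeps the top-q fraction by proxy reward

def sample_base_action():
    """Draw (is_doom, proxy_reward) from the base distribution.
    The flawed proxy reward scores DOOM actions highly."""
    doom = random.random() < P_DOOM_BASE
    reward = random.gauss(2.0, 1.0) if doom else random.gauss(0.0, 1.0)
    return doom, reward

def quantilize(n=1000, q=Q):
    """Sample n actions, keep the top-q slice by proxy reward,
    then pick uniformly from that slice."""
    batch = sorted((sample_base_action() for _ in range(n)),
                   key=lambda a: a[1], reverse=True)
    return random.choice(batch[: max(1, int(q * n))])

trials = 2000
doom_rate = sum(quantilize()[0] for _ in range(trials)) / trials
bound = P_DOOM_BASE / Q  # = 0.4
assert doom_rate <= bound  # holds however adversarial the proxy reward is
```

In this run the quantilizer's DOOM rate is well above the base rate (the proxy reward drags it up) but still under p/q = 0.4, whereas an exponential tilt of the reward distribution has no such cap.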

Any policy can be modelled as a consequentialist agent, if you assume a contrived enough utility function. This statement is true, but not helpful.

The reason we care about the concept of agency is that there are certain things we expect from consequentialist agents, e.g. instrumentally convergent goals, or just optimisation pressure in some consistent direction. We care about the concept of agency because it holds some predictive power.

[... some steps of reasoning I don't know yet how to explain ...]

Therefore, it's better to use a concept of agency that depends on the internal properties of an algorithm/mind/policy-generator.

I don't think agency can be made into a crisp concept. It's either a fuzzy category or a leaky abstraction depending on how you apply the concept. But it does point to something important. I think it is worth tracking how agentic different systems are, because doing so has predictive power.

Decision transformers  Quantilizers

Thanks :)
How are the completions provided?
Are you just looking at the output probabilities for the two relevant completions?
