## AI ALIGNMENT FORUM

Koen Holtman

Computing scientist and systems architect. Currently doing self-funded AGI safety research.

# Sequences

Counterfactual Planning

# Wiki Contributions

Why I'm co-founding Aligned AI

To do this, we'll start by offering alignment as a service for more limited AIs.

Interesting move! Will be interesting to see how you will end up packaging and positioning this alignment as a service, compared to the services offered by more general IT consulting companies. Good luck!

Thoughts on AGI safety from the top

I like your section 2. As you are asking for feedback on your plans in section 3:

By default I plan to continue looking into the directions in section 3.1, namely transparency of current models and its (potential) intersection with developments in deep learning theory. [...] Since this is what I plan to do, it'd be useful for me to know if it seems totally misguided

I see two ways to improve AI transparency in the face of opaque learned models:

1. try to make the learned models less opaque -- this is your direction

2. try to find ways to build more transparent systems that use potentially opaque learned models as building blocks. This is a research direction that your picture of a "human-like ML model" points to. Creating this type of transparency is also one of the main thoughts behind Drexler's CAIS. You can also find this approach of 'more aligned architectures built out of opaque learned models' in my work, e.g. here.

Now, I am doing alignment research in part because of plain intellectual curiosity.

But an argument could be made that, if you want to be maximally effective in AI alignment and minimising x-risk, you need to do either technical work to improve systems of type 2, or policy work on banning systems which are completely opaque inside, banning their use in any type of high-impact application. Part of that argument would also be that mainstream ML research is already plenty interested in improving the transparency of current generation neural nets, but without really getting there yet.

Instrumental Convergence For Realistic Agent Objectives

instrumental convergence basically disappears for agents with utility functions over action-observation histories.

Wait, I am puzzled. Have you just completely changed your mind about the preconditions needed to get a power-seeking agent? The way the above reads is: just add some observation of actions to your realistic utility function, and your instrumental convergence problem is solved.

1. u-AOH (utility functions over action-observation histories): No IC

2. u-OH (utility functions over observation histories): Strong IC

There are many utility functions in u-AOH that simply ignore the A part of the history, so these would then have Strong IC because they are u-OH functions. So are you making a subtle mathematical point about how these will average away to zero (given various properties of infinite sets), or am I missing something?
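The subset relation in this argument can be made concrete with a minimal Python sketch. The names `lift_to_aoh` and `count_reward`, and the representation of a history as a list of (action, observation) pairs, are assumptions made for illustration only; they do not come from the post under discussion, and the sketch says nothing about how such functions behave when averaged over.

```python
from typing import Callable, Sequence, Tuple

Action = str
Observation = str
AOHistory = Sequence[Tuple[Action, Observation]]  # action-observation history
OHistory = Sequence[Observation]                  # observation-only history


def lift_to_aoh(u_oh: Callable[[OHistory], float]) -> Callable[[AOHistory], float]:
    """Wrap a u-OH utility as a u-AOH utility that ignores the actions."""
    def u_aoh(history: AOHistory) -> float:
        # Discard the A part of each (action, observation) pair.
        return u_oh([obs for _action, obs in history])
    return u_aoh


def count_reward(observations: OHistory) -> float:
    """Hypothetical u-OH utility: number of 'reward' observations."""
    return float(sum(1 for o in observations if o == "reward"))


u = lift_to_aoh(count_reward)

# Two histories with different actions but identical observations.
h1 = [("left", "reward"), ("left", "nothing")]
h2 = [("right", "reward"), ("up", "nothing")]

# The lifted utility is formally a u-AOH function, yet it cannot
# distinguish h1 from h2 because it never looks at the actions.
assert u(h1) == u(h2)
```

Any u-OH function can be wrapped this way, which is why u-OH sits inside u-AOH as the set of functions that are constant across histories differing only in their actions.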

Challenges with Breaking into MIRI-Style Research

Any thoughts on how to encourage a healthier dynamic?

I have no easy solution to offer, except for the obvious comment that the world is bigger than this forum.

My own stance is to treat the over-production of posts of type 1 above as just one of these inevitable things that will happen in the modern media landscape. There is some value to these posts, but after you have read about 20 of them, you can be pretty sure about how the next one will go.

So I try to focus my energy, as a reader and writer, on work of type 2 instead. I treat arXiv as my main publication venue, but I do spend some energy cross-posting my work of type 2 here. I hope that it will inspire others, or at least counter-balance some of the type 1 work.

Challenges with Breaking into MIRI-Style Research

I like your summary of the situation:

Most people doing MIRI-style research think most other people doing MIRI-style research are going about it all wrong.

This has also been my experience, at least on this forum. Much less so in academic-style papers about alignment. This has certain consequences for the problem of breaking into preparadigmatic alignment research.

Here are two ways to do preparadigmatic research:

1. Find something that is all wrong with somebody else's paradigm, then write about it.

MIRI-style preparadigmatic research, to the extent that it is published, read, and discussed on this forum, is almost all about the first of the above. Even on a forum as generally polite and thoughtful as this one, social media dynamics promote and reward the first activity much more than the second.

In science and engineering, people will usually try very hard to make progress by standing on the shoulders of others. The discourse on this forum, on the other hand, more often resembles that of a bunch of crabs in a bucket.

My conclusion is of course that if you want to break into preparadigmatic research, then you are going about it all wrong if your approach is to try to engage more with MIRI, or to maximise engagement scores on this forum.

My Overview of the AI Alignment Landscape: A Bird's Eye View

Thanks, yes that new phrasing is better.

Bit surprised that you can think of no researchers to associate with Corrigibility. MIRI have written concrete work about it and so has Christiano. It is a major theme in Bostrom's Superintelligence, and it also appears under the phrasing 'problem of control' in Russell's Human Compatible.

In terms of the history of ideas of the field, I think that corrigibility is a key motivating concept for newcomers to be aware of. See this writeup on corrigibility, which I wrote in part for newcomers, for links to broader work on corrigibility.

I've only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems.

My current reading of the field is that Christiano believes that corrigibility will appear as an emergent property as a result of building an aligned AGI according to his agenda, while MIRI on the other hand (or at least 2021 Yudkowsky) have abandoned the MIRI 2015 plans/agenda to produce corrigibility, and now despair about anybody else ever producing corrigibility either. The CIRL method discussed by Russell produces a type of corrigibility, but as Russell and Hadfield-Menell point out, this type decays as the agent learns more, so it is not a full solution.

I have written a few papers which have the most fully fledged plans that I am aware of, when it comes to producing (a pretty useful and stable version of) AGI corrigibility. This sequence is probably the most accessible introduction to these papers.

My Overview of the AI Alignment Landscape: A Bird's Eye View

Thanks for posting this writeup, overall this reads very well, and it should be useful to newcomers. The threat models section is both compact and fairly comprehensive.

I have a comment on the 'agendas to build safe AGI' section, however. In that section you write:

I focus on three agendas I consider most prominent

When I finished reading the list of three agendas in it, my first thought was 'Why does this not mention other prominent agendas like corrigibility? This list is hardly a bird's-eye overview mentioning all prominent agendas to build safe AI.'

The three proposals I discuss here are just the three I know the most about, have seen the most work on and, in my subjective judgement, the ones it is most worth newcomers to the field learning about.

which is quite different. My feeling is that your Google document description of what you are doing here in scoping is much more accurate and helpful to the reader than the 'most prominent' you use above.

\$1000 USD prize - Circular Dependency of Counterfactuals

Not aware of which part would be a Wittgensteinian quote. It has been a long time since I read Wittgenstein, and I read him in German. In any case, I remain confused about what you mean by 'circular'.

\$1000 USD prize - Circular Dependency of Counterfactuals

Wait, I was under the impression from the quoted text that you make a distinction between 'circular epistemology' and 'other types of epistemology that will hit a point where we can provide no justification at all'. i.e. these other types are not circular because they are ultimately defined as a set of axioms, rewriting rules, and observational protocols for which no further justification is being attempted.

So I think I am still struggling to see what flavour of philosophical thought you want people to engage with, when you mention 'circular'.

Mind you, I see 'hitting a point where we provide no justification at all' as a positive thing in a mathematical system, a physical theory, or an entire epistemology, as long as these points are clearly identified.

\$1000 USD prize - Circular Dependency of Counterfactuals