Koen Holtman

Computing scientist and Systems architect. Currently doing self-funded AGI safety research.


Interesting. Some high-level thoughts:

When reading your definition of concept extrapolation as it appears here:

Concept extrapolation is the skill of taking a concept, a feature, or a goal that is defined in a narrow training situation... and extrapolating it safely to a more general situation.

this reads to me like the problem of Robustness to Distributional Change from Concrete Problems. This problem is also often known as out-of-distribution robustness, but note that Concrete Problems also considers solutions like the AI detecting that it is outside its training distribution and then asking for supervisory input. I think you are also considering such approaches within the broader scope of your work.

To me, the above benchmark no longer smells like an out-of-distribution problem; it reminds me more of the problem of unsupervised learning, specifically the problem of clustering unlabelled data into distinct groups.

One (general but naive) way to compute the two desired classifiers would be to first take the unlabelled dataset and use unsupervised learning to classify it into 4 distinct clusters. Then, use the labelled data to single out the two clusters that also appear in the labelled dataset, or at least the two clusters that appear most often. Then, construct the two classifiers as follows. Say that the two clusters also present in the labelled data are cluster A, whose members mostly have the label happy, and cluster B, whose members mostly have the label sad. Call the remaining clusters C and D. Then the two classifiers are (A and C = happy, B and D = sad) and (A and D = happy, B and C = sad). Note that this approach will not likely win any benchmark contest, as the initial clustering step fails to use some information that is available in the labelled dataset. I mention it mostly because it highlights a certain viewpoint on the problem.

For better benchmark results, you need a more specialised clustering algorithm (this type is usually called Semi-Supervised Clustering, I believe) that can exploit the fact that the labelled dataset gives you some prior information on the shapes of two of the clusters you want.

One might also argue that, if the above general unsupervised clustering based method does not give good benchmark results, then this is a sign that, to be prepared for every possible model split, you will need more than just two classifiers.

I generally agree with you on the principle Tackle the Hamming Problems, Don't Avoid Them.

That being said, some of the Hamming problems I see being avoided most on this forum, and in the AI alignment community, are

  1. Do something that will affect policy in a positive way

  2. Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions

Having read the original post and many of the comments made so far, I'll add an epistemological observation that I have not yet seen others make quite so forcefully. From the original post:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable [...]

I want to highlight that many of the different 'true things' on the long numbered list in the OP are in fact purely speculative claims about the probable nature of future AGI technology, a technology nobody has seen yet.

The claimed truth of several of these 'true things' is often backed up by nothing more than Eliezer's best-guess informed-gut-feeling predictions about what future AGI must necessarily be like. These predictions often directly contradict the best-guess informed-gut-feeling predictions of others, as is admirably demonstrated in the 2021 MIRI conversations.

Some of Eliezer's best guesses also directly contradict my own best-guess informed-gut-feeling predictions. I rank the credibility of my own informed guesses far above those of Eliezer.

So overall, based on my own best guesses here, I am much more optimistic about avoiding AGI ruin than Eliezer is. I am also much less dissatisfied about how much progress has been made so far.

I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.

Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555, if you ever had AGI technology, and what you can do with that in terms of safety.

I can honestly say however that the project of writing that thing, in a way that makes the math somewhat accessible, was not easy.

If you’re interested in conceptual work on agency and the intersection of complex systems and AI alignment

I'm interested in this agenda, and I have been working on this kind of thing myself, but I am not interested at this time in moving to Prague. I figure that you are looking for people interested in moving to Prague, but if you are issuing a broad call for collaborators in general, or are thinking about setting up a much more distributed group, please clarify.

A more technical question about your approach:

What we’re looking for is more like a vertical game theory.

I'm not sure if you are interested in developing very generic kinds of vertical game theory, or in very specific acts of vertical mechanism design.

I feel that vertical mechanism design where some of the players are AIs is deeply interesting and relevant to alignment, much more so than generic game theory. For some examples of the kind of mechanism design I am talking about, see my post and related paper here. I am not sure if my interests make me a nearest neighbour of your research agenda, or just a very distant neighbour.

There are some good thoughts here; I like this enough that I am going to comment on the effective strategies angle. You state that

The wider AI research community is an almost-optimal engine of apocalypse.


AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace.

I have to observe that, even though certain people on this forum definitely do believe the above two statements, even on this forum this extreme level of pessimism is a minority opinion. Personally, I have been quite pleased with the pace of progress in alignment research.

This level of disagreement, which is almost inevitable as it involves estimates about the future, has important implications for the problem of convincing people:

As per above, we'd be fighting an uphill battle here. Researchers and managers are knowledgeable on the subject, have undoubtedly heard about AI risk already, and weren't convinced.

I'd say that you would indeed be facing an uphill battle, if you'd want to convince most researchers and managers that the recent late-stage Yudkowsky estimates about the inevitability of an AI apocalypse are correct.

The effective framing you are looking for, even if you believe yourself that Yudkowsky is fully correct, is that more work is needed on reducing long-term AI risks. Researchers and managers in the AI industry might agree with you on that, even if they disagree with you and Yudkowsky about other things.

Whether these researchers and managers will change their whole career just because they agree with you is a different matter. Most will not. This is a separate problem, and should be treated as such. Trying to solve both problems at once by making people deeply afraid about the AI apocalypse is a losing strategy.

To do this, we'll start by offering alignment as a service for more limited AIs.

Interesting move! It will be interesting to see how you end up packaging and positioning this alignment as a service, compared to the services offered by more general IT consulting companies. Good luck!

I like your section 2. As you are asking for feedback on your plans in section 3:

By default I plan to continue looking into the directions in section 3.1, namely transparency of current models and its (potential) intersection with developments in deep learning theory. [...] Since this is what I plan to do, it'd be useful for me to know if it seems totally misguided

I see two ways to improve AI transparency in the face of opaque learned models:

  1. try to make the learned models less opaque -- this is your direction

  2. try to find ways to build more transparent systems that use potentially opaque learned models as building blocks. This is a research direction that your picture of a "human-like ML model" points to. Creating this type of transparency is also one of the main thoughts behind Drexler's CAIS. You can also find this approach of 'more aligned architectures built out of opaque learned models' in my work, e.g. here.

Now, I am doing alignment research in part because of plain intellectual curiosity.

But an argument could be made that, if you want to be maximally effective at AI alignment and minimising x-risk, you need to do either technical work to improve systems of type 2, or policy work towards banning the use of completely opaque systems in any type of high-impact application. Part of that argument would be that mainstream ML research is already plenty interested in improving the transparency of current-generation neural nets, but without really getting there yet.

instrumental convergence basically disappears for agents with utility functions over action-observation histories.

Wait, I am puzzled. Have you just completely changed your mind about the preconditions needed to get a power-seeking agent? The way the above reads is: just add some observation of actions to your realistic utility function, and your instrumental convergence problem is solved.

  1. u-AOH (utility functions over action-observation histories): No IC

  2. u-OH (utility functions over observation histories): Strong IC

There are many utility functions in u-AOH that simply ignore the A part of the history, and these would then have Strong IC because they are effectively u-OH functions. So are you making a subtle mathematical point about how these will average away to zero (given various properties of infinite sets), or am I missing something?
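To make the counting intuition behind my question concrete, here is a small sketch of my own (not from the post), which discretises utilities to {0, 1} values in a two-step setting with two actions and two observations. The u-OH-like functions, those that ignore the action part, then form a vanishing fraction of all u-AOH functions:

```python
from itertools import product
from fractions import Fraction

actions = ['a0', 'a1']
observations = ['o0', 'o1']
horizon = 2

# Action-observation histories vs. plain observation histories.
aohs = list(product(product(actions, observations), repeat=horizon))
ohs = list(product(observations, repeat=horizon))

# With utilities discretised to {0, 1}, a utility function is just an
# assignment of a value to every possible history.
n_u_aoh = 2 ** len(aohs)  # functions over action-observation histories
n_u_oh = 2 ** len(ohs)    # functions that ignore the action part

fraction = Fraction(n_u_oh, n_u_aoh)
print(len(aohs), len(ohs), fraction)  # 16 4 1/4096
```

The fraction shrinks further as the horizon or the action set grows, which is the kind of "these functions average away" reading I am asking about.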

Any thoughts on how to encourage a healthier dynamic.

I have no easy solution to offer, except for the obvious comment that the world is bigger than this forum.

My own stance is to treat the over-production of posts of type 1 above as just one of these inevitable things that will happen in the modern media landscape. There is some value to these posts, but after you have read about 20 of them, you can be pretty sure about how the next one will go.

So I try to focus my energy, as a reader and writer, on work of type 2 instead. I treat arXiv as my main publication venue, but I do spend some energy cross-posting my work of type 2 here. I hope that it will inspire others, or at least counter-balance some of the type 1 work.
