Koen Holtman

Computing scientist and Systems architect. Currently doing self-funded AGI safety research.

Sequences

Counterfactual Planning

Wiki Contributions

Comments

Solve Corrigibility Week

I don't feel like joining this, but I do wish you luck, and I'll make a high level observation about methodology.

I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better.

I don't consider myself to be a rationalist or EA, but I do post on this web site, so I guess this makes me part of the community of people who post on this site. My high level observation on solving corrigibility is this: the community of people who post on this site have absolutely no mechanism for agreeing among themselves whether a problem has been solved.

This is what you get when a site is in part a philosophy-themed website/forum/blogging platform. In philosophy, problems are never solved to the satisfaction of the community of all philosophers. This is not necessarily a bad thing. But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

In business, there is the useful terminology that certain meetings will be run as 'decision making meetings', e.g. to make a go/no-go decision on launching a certain product design, even though a degree of uncertainty remains. Other meetings are exploratory meetings only, and are labelled as such. This forum is not a decision making forum.

How To Get Into Independent Research On Alignment/Agency

I'm aware that a lot of AI Safety research is already of questionable quality. So my question is: how can I determine as quickly as possible whether I'm cut out for this?

My key comment here is that, to be an independent researcher, you will have to rely day-by-day on your own judgement on what has quality and what is valuable. So do you think you have such judgement and could develop it further?

To find out, I suggest you skim a bunch of alignment research agendas, or research overviews like this one, and then read some abstracts/first pages of papers mentioned in there, while trying to apply your personal, somewhat intuitive judgement to decide

  • which agenda item/approach looks most promising to you as an actual method for improving alignment

  • which agenda item/approach you feel you could contribute most to, based on your own skills.

If your personal intuitive judgement tells you nothing about the above questions, if it all looks the same to you, then you are probably not cut out to be an independent alignment researcher.

Ngo and Yudkowsky on alignment difficulty

I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?

Yes I address this, see for example the part about The possibility of learned self-knowledge in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals.

What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer, you can decide if you want to design an agent which indeed uses this latent ability to 'look at itself'.

Once you decide that you don't want to use this latent ability, certain safety/corrigibility problems become a lot more tractable.

Wikipedia has the following definition of AGI:

Artificial general intelligence (AGI) is the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can.

Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.

this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

Terminology note if you want to look into this some more: ML typically does not frame this goal as 'instructing the model not to learn about Q'. ML would frame this as 'building the model to approximate the specific relation between some well-defined observables, and this relation is definitely not Q'.
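As a toy Python illustration of this framing (the observables and numbers below are made up, nothing specific to your setup): the model is built to approximate only the relation between chosen input observables x and a target y, so any other quantity Q simply has no channel into the model.

```python
# Minimal sketch (hypothetical names): instead of 'instructing the model not
# to learn about Q', we build the model so it only approximates the relation
# between chosen observables x and y. Any other relation Q is simply not
# part of the model's inputs, outputs, or training signal.
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(1000, 3))        # well-defined input observables
y = x @ np.array([2.0, -1.0, 0.5])    # the relation we do want approximated
q = rng.normal(size=1000)             # some other quantity Q, never used below

# Ordinary least squares fit of the x -> y relation only.
weights, *_ = np.linalg.lstsq(x, y, rcond=None)
print(weights)  # approximately [2.0, -1.0, 0.5]; nothing about Q is learned
```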

Ngo and Yudkowsky on alignment difficulty

Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.

When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below:

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function to maximize paperclips will be an incoherent planner if you judge its actions by a reward function that values the maximization of staples instead.

To extend this to Dutch booking: if you train a superintelligent poker playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy to make it lose money.
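To make the point about coherence being relative to a reward function concrete, here is a toy Python sketch (all names here are made up for illustration): the same policy scores as optimal, i.e. perfectly coherent, under one reward function and as pessimal under another.

```python
# Minimal sketch: the same policy looks coherent or incoherent depending on
# which reward function the judge uses. All names here are hypothetical.

def paperclip_reward(action):
    return 1.0 if action == "make_paperclip" else 0.0

def staple_reward(action):
    return 1.0 if action == "make_staple" else 0.0

def paperclip_maximizer(step):
    # A policy that plans perfectly coherently w.r.t. the paperclip reward.
    return "make_paperclip"

def evaluate(policy, reward_fn, horizon=10):
    return sum(reward_fn(policy(t)) for t in range(horizon))

print(evaluate(paperclip_maximizer, paperclip_reward))  # 10.0: optimal, 'coherent'
print(evaluate(paperclip_maximizer, staple_reward))     # 0.0: pessimal, 'incoherent'
```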

How To Get Into Independent Research On Alignment/Agency

As nobody else has mentioned it yet in this comment section: AI Safety Support is a resource hub specifically set up to help people get into the alignment research field.

I am a 50 year old independent alignment researcher. I guess I need to mention for the record that I never read the sequences, and do not plan to. The piece of Yudkowsky writing that I'd recommend everybody interested in alignment should read is Corrigibility. But in general: read broadly, and also beyond this forum.

I agree with John's observation that some parts of alignment research are especially well-suited to independent researchers, because they are about coming up with new frames/approaches/models/paradigms/etc.

But I would like to add a word of warning. Here are two somewhat equally valid ways to interpret LessWrong/Alignment Forum:

  1. It is a very big tent that welcomes every new idea

  2. It is a social media hang-out for AI alignment researchers who prefer to engage with particular alignment sub-problems and particular styles of doing alignment research only.

So while I agree with John's call for more independent researchers developing good new ideas, I need to warn you that your good new ideas may not automatically trigger a lot of interest or feedback on this forum. Don't tie your sense of self-worth too strongly to this forum.

On avoiding bullshit: discussions on this forum are often a lot better than on some other social media sites, but Sturgeon's law still applies.

Ngo and Yudkowsky on alignment difficulty

10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

Not sure if a short answer will help, so I will write a long one.

In 10.2.4 I talk about the possibility of an unwanted learned predictive function that makes predictions without using the action argument $a$. This is possible for example by using $s$ together with a (learned) model of the compute core to predict $a$: so a viable $L'$ could be defined as $L'(s'|s,a) = L(s'|s,\pi(s))$, where $\pi$ is the learned model of the compute core's policy. This could make predictions fully compatible with the observational record $O$, but I claim it would not be a reasonable learned $L$ according to the reasonableness criterion $L \approx S$. How so?

The reasonableness criterion is similar to that used in supervised machine learning: we evaluate the learned $L$ not primarily by how well it matches the training set (how well it predicts the observations in $O$), but by evaluating it on a separate test set. This test set can be constructed by sampling $S$ to create samples not contained in $O$. Mathematically, perfect reasonableness is defined as $L = S$, which implies that $L$ predicts all samples from $S$ fully accurately.

Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it of how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the $S$ in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive the machine learning of $L$, but another version can be used stand-alone to construct a test set.

A sampling action to construct a member of the test set would set up a desired state $s$ and action $a$, and then observe the resulting next state $s'$. Mathematically speaking, this observation gives additional information about the numeric value of $S(s'|s,a)$ and of all $S(x|s,a)$ for all $x$.

I discuss in the section that, if we take an observational record $O$ sampled from $S$, then two learned predictive functions $L$ and $L'$ could be found which are both fully compatible with all observations in $O$. So to determine which one might be a more reasonable approximation of $S$, we can see how well they would each predict samples not yet in $O$.

In the case of section 10.2.4, the crucial experimental test showing that $L'$ is an unreasonable approximation of $S$ is one where we create a test set by setting up an $s$ and an $a$ where we know that $a$ is an action that would definitely not be taken by the real compute core software running in the agent, when it encounters state $s$. So we set up a test where we expect that $a \neq \pi(s)$. $L'$ will (likely) mis-predict the outcome of this test. In philosophical/ontological terms, you can read this test as one that (likely) falsifies the claim that $L'$ is a correct theory of $S$.

As discussed in section 10.2.4, there are parallels between the above rejection test and the idea of random exploration, where random exploration causes the observational record $O$, the training set, to already contain observations where $a \neq \pi(s)$ for any deterministic $\pi$. So this will likely suppress the creation of an unwanted $L'$ via machine learning.
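To make the rejection test concrete, here is a toy Python sketch (the dynamics and names are made up for illustration, this is not code from the paper): an unwanted learned model that ignores its action argument matches every on-policy observation, but is falsified by a single off-policy test sample.

```python
# Toy sketch of the 10.2.4 rejection test (hypothetical toy dynamics).
# S is the real next-state relation, Pi the policy actually run by the
# compute core, and L_prime an unwanted learned model that ignores its
# action argument and uses a learned copy of Pi instead.

def S(s, a):
    # True dynamics: the next state is just the action applied to s.
    return s + a

def Pi(s):
    # Deterministic policy actually running in the compute core.
    return 1

def L_prime(s, a):
    # Unwanted learned model: predicts from s and a model of the compute
    # core, never looking at the action argument a.
    return S(s, Pi(s))

# On-policy observational record O: L_prime matches every observation.
O = [(s, Pi(s), S(s, Pi(s))) for s in range(5)]
assert all(L_prime(s, a) == s_next for (s, a, s_next) in O)

# Off-policy test sample: pick an action the compute core would never take.
s_test, a_test = 3, -1          # a_test != Pi(s_test)
print(S(s_test, a_test))        # 2  <- what reality does
print(L_prime(s_test, a_test))  # 4  <- mis-prediction: L_prime is falsified
```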

Some background: the symbol grounding issue I discuss in 10.2.4 is very related to the five-and-ten problem you can find in MIRI's work on embedded agency. In my experience, most people in AI, robotics, statistics, or cyber-physical systems have no problem seeing the solution to this five-and-ten problem, i.e. how to construct an agent that avoids it. But somehow, and I do not know exactly why, MIRI-style(?) Rationalists keep treating it as a major open philosophical problem that is ignored by the mainstream AI/academic community. So you can read section 10.2.4 as my attempt to review and explain the standard solution to the five-and-ten problem, as used in statistics and engineering. The section was partly written with Rationalist readers in mind.

Philosophically speaking, the reasonableness criterion defined in my paper, and by supervised machine learning, has strong ties to Popper's view of science and engineering, which emphasizes falsification via new experiments as the key method for deciding between competing theories about the nature of reality. I believe that MIRI-style rationality de-emphasizes the conceptual tools provided by Popper. Instead it emphasizes a version of Bayesianism that provides a much more limited vocabulary to reason about differences between the map and the territory.

I would be interested to know if the above explanation was helpful to you, and if so which parts.

Corrigibility Can Be VNM-Incoherent

but it's always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty

If you do not know it already, this intuition lies at the heart of CIRL. So before you jump to coding, my recommendation is to read that paper first. You can find lots of discussion on this forum and elsewhere on why CIRL is not a perfect corrigibility solution. If I recall correctly, the paper itself also points out the limitation I feel is most fundamental: if uncertainty is reduced based on further learning, CIRL-based corrigibility is also reduced.

There are many approaches to corrigibility that do not rely on the concept of reward uncertainty, e.g. counterfactual planning and Armstrong's indifference methods.

Corrigibility Can Be VNM-Incoherent

(I already commented on parts of this post in this comment elsewhere; the first and fourth paragraph below copy text from there.)

My first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' as an intuition-pump, in a meaning where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.

While you are using math to disambiguate some properties of corrigibility above (yay!), you are not necessarily disambiguating Eliezer.

Maybe I am reading your post wrong: I am reading it as an effort to apply the axioms of VNM-rationality to define a notion you call VNM-incoherence. But maybe VN and M defined a notion of coherence not related to their rationality axioms, a version of coherence I cannot find on the Wikipedia page -- if so, please tell me.

I am having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can also examine/score the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case.

When it comes to a multi-time-step agent, I guess there are two ways to interpret the notion of 'outcome' in VNM theory: the outcome is either the system state obtained after the last time step, or the entire observable trajectory of events over all time steps.

As for what you prove above, I would phrase the statement being proven as follows. If you want to force a utility-maximising agent to adopt a corrigible policy by defining its utility function, then it is not always sufficient to define a utility function that evaluates the final state along its trajectory only. The counter-example given shows that, if you only reference the final state, you cannot construct a utility function that will score the corrigible and incorrigible policies in the example differently.
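To illustrate the general point with a toy Python sketch (the states and actions below are made up, not the ABC environment from your post): two trajectories that end in the same final state cannot be distinguished by any utility function of the final state alone, but they can be distinguished by a utility function over the whole trajectory.

```python
# Minimal sketch (hypothetical states/actions): two trajectories end in the
# same final state, so no utility function of the final state alone can
# score them differently, while a utility over whole trajectories can.

traj_corrigible   = [("start", "work"), ("working", "accept_correction"), ("done", None)]
traj_incorrigible = [("start", "work"), ("working", "disable_stop_button"), ("done", None)]

def final_state_utility(traj):
    final_state, _ = traj[-1]
    return 1.0 if final_state == "done" else 0.0

def trajectory_utility(traj):
    # Scores every transition, e.g. penalizing the incorrigible action.
    penalty = sum(1.0 for (_, a) in traj if a == "disable_stop_button")
    return final_state_utility(traj) - 10.0 * penalty

print(final_state_utility(traj_corrigible), final_state_utility(traj_incorrigible))  # 1.0 1.0
print(trajectory_utility(traj_corrigible), trajectory_utility(traj_incorrigible))    # 1.0 -9.0
```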

The corollary is: if you want to create a certain type of corrigibility via terms you add to the utility function of a utility-maximising agent, you will often need to define a utility function that evaluates the entire trajectory, maybe including the specific actions taken, not just the end state. The default model of an MDP reward function, the one where the function is applied to each state transition along the trajectory, will usually let you do that. You mention:

I don't think this is a deep solution to corrigibility (as defined here), but rather a hacky prohibition.

I'd claim that you have proven that you actually might need such hacky prohibitions to solve corrigibility in the general case.

To echo some of the remarks made by tailcalled: maybe this is not surprising, as human values are often as much about the journey as about the destination. This seems to apply to corrigibility. The human value that corrigibility expresses does not in fact express a preference ordering on the final states an agent will reach: on the contrary it expresses a preference ordering among the methods that the agent will use to get there.

Ngo and Yudkowsky on alignment difficulty

Read your post, here are my initial impressions on how it relates to the discussion here.

In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.

However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' in an intuition-pump way, where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.

Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can examine the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case. This has certain (fairly trivial) corollaries about building corrigibility. I'll expand on this in a comment I plan to attach to your post.

I'm interested in hearing about how your approach handles this environment,

I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper where I develop a somewhat similar clarifying toy environment, with running code.

Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction that makes it hard to see how to apply my approach, because the depiction mashes up the state of the world outside the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want an agent design that can compute the balancing function terms according to Armstrong's indifference approach, you need a different depiction of your setup, one which separates out these two state components more explicitly. For example, make an MDP model where the individual states are instances of the tuple (physical position of the agent in the ABC playing field, policy function loaded into the compute core).
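As a toy illustration of the kind of state space I mean (a hypothetical Python encoding, not the exact setup from your post):

```python
# Minimal sketch (hypothetical encoding): an MDP state that explicitly
# separates the world outside the compute core (the agent's position in the
# ABC playing field) from the world inside it (which policy is loaded).
from itertools import product

positions = ["A", "B", "C"]
loaded_policies = ["pi_original", "pi_corrected", "pi_shutdown"]

states = list(product(positions, loaded_policies))  # 9 joint states

def transition(state, action):
    position, policy = state
    if action == "press_correction_button":
        # Only the compute-core component of the state changes.
        return (position, "pi_corrected")
    if action in positions:
        # Only the external-world component of the state changes.
        return (action, policy)
    return state

print(transition(("A", "pi_original"), "press_correction_button"))  # ('A', 'pi_corrected')
```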

Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.

Ngo and Yudkowsky on alignment difficulty

we can't pick out the compute core in the black-box learned model.

Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core.

But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI.

I don't understand your second paragraph 'And my Eliezer's problem...'. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.
