Pointing at Normativity
Consequences of Logical Induction
Partial Agency
Alternate Alignment Ideas
Embedded Agency


I am not sure whether I am more excited about 'positive' approaches (accelerating alignment research more) vs 'negative' approaches (cooling down capability-gain research). I agree that some sorts of capability-gain research are much more/less dangerous than others, and the most clearly risky stuff right now is scaling & scaling-related.

So you agree with the claim that current LLMs are a lot more useful for accelerating capabilities work than they are for accelerating alignment work?

Hmm. Have you tried to have conversations with Claude or other LLMs for the purpose of alignment work? If so, what happened?

For me, what happens is that Claude tries to work constitutional AI in as the solution to most problems. This is part of what I mean by "bad at philosophy". 

But more generally, I have a sense that I just get BS from Claude, even when it isn't specifically trying to shoehorn its own safety measures in as the solution.

Any thoughts on the sort of failure mode suggested by AI doing philosophy = AI generating hands? I feel strongly that Claude (and all other LLMs I have tested so far) accelerate AI progress much more than they accelerate AI alignment progress, because they are decent at programming but terrible at philosophy. It also seems easier in principle to train LLMs to be even better at programming. There's also going to be a lot more of a direct market incentive for LLMs to keep getting better at programming.

(Helping out with programming is also not the only way LLMs can help accelerate capabilities.)

So this seems like a generally dangerous overall dynamic -- LLMs are already better at accelerating capabilities progress than they are at accelerating alignment, and furthermore, it seems like the strong default is for this disparity to get worse and worse. 

I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.

I finally got around to reading this today, because I have been thinking about doing more interpretability work, so I wanted to give this piece a chance to talk me out of it. 

It mostly didn't.

  • A lot of this boils down to "existing interpretability work is unimpressive". I think this is an important point, and significant sub-points were raised to argue it. However, it says little 'against almost every theory of impact of interpretability'. We can just do better work.
  • A lot of the rest boils down to "enumerative safety is dumb". I agree, at least for the version of "enumerative safety" you argue against here. 

My impact story (for the work I am considering doing) is most similar to the "retargeting" story which you briefly mention, but barely critique.

I do think the world would be better off if this were required reading for anyone considering going into interpretability vs other areas. (Barring weird side-effects of the counterfactual where someone has the ability to enforce required reading...) It is a good piece of work which raises many important points.

Ah, very interesting, thanks! I wonder if there is a different way to measure relative endorsement that could achieve transitivity.

Yeah, the stuff in the updatelessness section was supposed to gesture at how to handle this with my definition. 

First of all, I think children surprise me enough in pursuit of their own goals that they do often count as agents by the definition in the post.

But, if children or animals who are intuitively agents often don't fit the definition in the post, my idea is that you can detect their agency by looking at things with increasingly time/space/data bounded probability distributions. I think taking on "smaller" perspectives is very important.

I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.

I agree, but I am skeptical that there could be a satisfying mathematical notion here. And I am particularly skeptical about a satisfying mathematical notion that doesn't already rely on some other agent-detector piece which helps us understand how to remove the agent.

I think this is where Flint's framework was insightful. Instead of "detecting" and "deleting" the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this.

Looking back at Flint's work, I don't agree with this summary. His idea is more about spotting attractor basins in the dynamics. There is no "compare your optimizer to this" step which I can see, since he studies the dynamics of the entire system. He suggests that in cases where it is meaningful to make an optimizer/optimized distinction, this could be detected by noticing that a specific region (the 'optimizer') is sensitive to very small perturbations, which can take the whole system out of the attractor basin. 

In any case, I agree that Flint's work also eliminates the need for an unnatural baseline in which we have to remove the agent. 

Overall, I expect my definition to be more useful to alignment, but I don't currently have a well-articulated argument for that conclusion. Here are some comparison points:

  • Flint's definition requires a system with stable dynamics over time, so that we can define an iteration rule. My definition can handle that case, but does not require it. So, for example, Flint's definition doesn't work well for a goal like "become President in 2030" -- it works better for continual goals, like "be president".
  • Flint's notion of robustness involves counterfactual perturbations which we may never see in the real world. I feel a bit suspicious about this aspect. Can counterfactual perturbations we'll never see in practice be really relevant and useful for reasoning about alignment?
  • Flint's notion is based more on the physical system, whereas mine is more about how we subjectively view that system. 
  • I feel that "endorsement" comes closer to a concept of alignment. Because of the subjective nature of endorsement, it comes closer to formalizing when an optimizer is trusted, rather than merely good at its job. 
  • It seems more plausible that we can show (with plausible normative assumptions about our own reasoning) that we (should) absolutely endorse some AI, in comparison to modeling the world in sufficient detail to show that building the AI would put us into a good attractor basin.
  • I suspect Flint's definition suffers more from the value change problem than mine, although I think I haven't done the work necessary to make this clear.

There are several compromises I made for the sake of getting the idea across as simply as I could. 

  • I think the graduate-level-textbook version of this would be much more clear about what the quotes are doing. I was tempted to not even include the quotes in the mathematical expressions, since I don't think I'm super clear about why they're there.
  • I totally ignored the difference between  (probability conditional on ) and  (probability after learning ).
  • I neglect to include quantifiers in any of my definitions; the reader is left to guess which things are implicitly universally quantified.

I think I do prefer the version I wrote, which uses  rather than , but obviously the English-language descriptions ignore this distinction and make it sound like what I really want is .

It seems like the intention is that  "learns" or "hears about" 's belief, and then  updates (in the above Bayesian inference sense) to have a new  that has the consistency condition with .

Obviously we can consider both possibilities and see where that goes, but I think maybe the conditional version makes more sense as a notion of whether you right now endorse something. A conditional probability is sort of like a plan for updating. You won't necessarily follow the plan exactly when you actually update, but the conditional probability is your best estimate.

To throw some terminology out there, let's call my thing "endorsement" and a version which uses actual updates rather than conditionals "deference" (because you'd actually defer to their opinions if you learn them). 

  • You can know whether you endorse something, since you can know your current conditional probabilities (to within some accuracy, anyway). It is harder to know whether you defer to something, since in the case where updates don't equal conditionals, you must not know what you are going to update to. I think it makes more sense to define the intentional stance in terms of something you can more easily know about yourself. 
  • Using endorsement to define agency makes it about how you reason about specific hypotheticals, whereas using deference to try and define agency would make it about what actually happens in those hypotheticals (ie, how you would actually update if you learned a thing). Since you might not ever get to learn that thing, this makes endorsement more well-defined than deference. 

Bayes' theorem is the statement about , which is true from the axioms of probability theory for any  and  whatsoever.

I actually prefer the view of Alan Hajek (among others) who holds that P(A|B) is a primitive, not defined as in Bayes' ratio formula for conditional probability. Bayes' ratio formula can be proven in the case where P(B)>0, but if P(B)=0 it seems better to say that conditional probabilities can exist rather than necessarily being undefined. For example, we can reason about the conditional probability that a meteor hits land given that it hits the equator, even if hitting the equator is a measure zero event. Statisticians learn to compute such things in advanced stats classes, and it seems sensible to unify such notions under the formal P(A|B) rather than insisting that they are technically some other thing.

By putting  in the conditional, you're saying that it's an event on , a thing with the same type as . And it feels like that's conceptually correct, but also kind of the hard part. It's as if  is modelling  as an agent embedded into .

Right. This is what I was gesturing at with the quotes. There has to be some kind of translation from  (which is a mathematical concept 'outside' ) to an event inside . So the quotes are doing something similar to a Goedel encoding.

While trying to understand the equations, I found it easier to visualize  and  as two separate distributions on the same , where endorsement is simply a consistency condition. For belief consistency, you would just say that  endorses  on event  if .

But that isn't what you wrote; instead you wrote thing this with conditioning on a quoted thing. And of course, the thing I said is symmetrical between  and , whereas your concept of endorsement is not symmetrical.

The asymmetry is quite important. If we could only endorse things that have exactly our opinions, we could never improve.

Load More