So you agree with the claim that current LLMs are a lot more useful for accelerating capabilities work than they are for accelerating alignment work?
Hmm. Have you tried to have conversations with Claude or other LLMs for the purpose of alignment work? If so, what happened?
For me, what happens is that Claude tries to work constitutional AI in as the solution to most problems. This is part of what I mean by "bad at philosophy".
But more generally, I have a sense that I just get BS from Claude, even when it isn't specifically trying to shoehorn its own safety measures in as the solution.
Any thoughts on the sort of failure mode suggested by AI doing philosophy = AI generating hands? I feel strongly that Claude (like all other LLMs I have tested so far) accelerates AI progress much more than it accelerates AI alignment progress, because these models are decent at programming but terrible at philosophy. It also seems easier in principle to train LLMs to be even better at programming. There's also going to be a lot more of a direct market incentive for LLMs to keep getting better at programming.
(Helping out with programming is also not the only way LLMs can help accelerate capabilities.)
So this seems like a generally dangerous overall dynamic -- LLMs are already better at accelerating capabilities progress than they are at accelerating alignment, and furthermore, it seems like the strong default is for this disparity to get worse and worse.
I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.
I finally got around to reading this today, because I have been thinking about doing more interpretability work, so I wanted to give this piece a chance to talk me out of it.
It mostly didn't.
My impact story (for the work I am considering doing) is most similar to the "retargeting" story which you briefly mention, but barely critique.
I do think the world would be better off if this were required reading for anyone considering going into interpretability vs other areas. (Barring weird side-effects of the counterfactual where someone has the ability to enforce required reading...) It is a good piece of work which raises many important points.
Ah, very interesting, thanks! I wonder if there is a different way to measure relative endorsement that could achieve transitivity.
Yeah, the stuff in the updatelessness section was supposed to gesture at how to handle this with my definition.
First of all, I think children surprise me enough in pursuit of their own goals that they do often count as agents by the definition in the post.
But, if children or animals who are intuitively agents often don't fit the definition in the post, my idea is that you can detect their agency by looking at things with increasingly time/space/data bounded probability distributions. I think taking on "smaller" perspectives is very important.
I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.
I agree, but I am skeptical that there could be a satisfying mathematical notion here. And I am particularly skeptical about a satisfying mathematical notion that doesn't already rely on some other agent-detector piece which helps us understand how to remove the agent.
I think this is where Flint's framework was insightful. Instead of "detecting" and "deleting" the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this.
Looking back at Flint's work, I don't agree with this summary. His idea is more about spotting attractor basins in the dynamics. There is no "compare your optimizer to this" step which I can see, since he studies the dynamics of the entire system. He suggests that in cases where it is meaningful to make an optimizer/optimized distinction, this could be detected by noticing that a specific region (the 'optimizer') is sensitive to very small perturbations, which can take the whole system out of the attractor basin.
In any case, I agree that Flint's work also eliminates the need for an unnatural baseline in which we have to remove the agent.
Overall, I expect my definition to be more useful to alignment, but I don't currently have a well-articulated argument for that conclusion. Here are some comparison points:
There are several compromises I made for the sake of getting the idea across as simply as I could.
I think I do prefer the version I wrote, which uses rather than , but obviously the English-language descriptions ignore this distinction and make it sound like what I really want is .
It seems like the intention is that "learns" or "hears about" 's belief, and then updates (in the above Bayesian inference sense) to have a new that has the consistency condition with .
Obviously we can consider both possibilities and see where that goes, but I think maybe the conditional version makes more sense as a notion of whether you right now endorse something. A conditional probability is sort of like a plan for updating. You won't necessarily follow the plan exactly when you actually update, but the conditional probability is your best estimate.
To throw some terminology out there, let's call my thing "endorsement" and a version which uses actual updates rather than conditionals "deference" (because you'd actually defer to their opinions if you learn them).
Bayes' theorem is the statement about P(A|B), which is true from the axioms of probability theory for any A and B whatsoever.
I actually prefer the view of Alan Hájek (among others) who holds that P(A|B) is a primitive, not defined by the ratio formula for conditional probability. The ratio formula can be proven in the case where P(B)>0, but if P(B)=0 it seems better to say that conditional probabilities can exist rather than necessarily being undefined. For example, we can reason about the conditional probability that a meteor hits land given that it hits the equator, even if hitting the equator is a measure zero event. Statisticians learn to compute such things in advanced stats classes, and it seems sensible to unify such notions under the formal P(A|B) rather than insisting that they are technically some other thing.
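To make the meteor example concrete, here is a rough Python sketch of one common way such a conditional probability gets computed: condition on a thin band around the equator (which has positive probability) and watch what happens as the band shrinks. Everything here is an illustrative assumption rather than anything from the discussion above: the impact point is taken to be uniform over the sphere, and `is_land` is a hypothetical stand-in for a real land mask (it just calls one quarter of all longitudes "land"). Note that the answer can depend on how you shrink toward the measure-zero event (the Borel-Kolmogorov paradox), which is part of why treating P(A|B) as a primitive is attractive.

```python
import math
import random

def sample_uniform_sphere(rng):
    """Return (latitude, longitude) in radians, uniform over the sphere's surface."""
    z = rng.uniform(-1.0, 1.0)            # uniform height gives uniform surface area
    lon = rng.uniform(-math.pi, math.pi)
    lat = math.asin(z)
    return lat, lon

def is_land(lat, lon):
    """Hypothetical land mask (an assumption): a quarter of all longitudes is 'land'."""
    return 0.0 <= lon < math.pi / 2

def p_land_given_near_equator(eps, n=200_000, seed=0):
    """Monte Carlo estimate of P(land | |latitude| < eps), a positive-probability event."""
    rng = random.Random(seed)
    near = 0
    land = 0
    for _ in range(n):
        lat, lon = sample_uniform_sphere(rng)
        if abs(lat) < eps:
            near += 1
            if is_land(lat, lon):
                land += 1
    return land / near

# Shrink the band around the equator and see whether the estimate settles down.
estimates = {eps: p_land_given_near_equator(eps) for eps in (0.2, 0.1, 0.05)}
for eps, value in estimates.items():
    print(f"band half-width {eps}: P(land | near equator) is roughly {value:.3f}")
```

With this toy land mask the estimates should hover near one quarter for every band width, which is the value one would naturally assign to P(land | equator) even though P(equator) = 0.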
By putting in the conditional, you're saying that it's an event on , a thing with the same type as . And it feels like that's conceptually correct, but also kind of the hard part. It's as if is modelling as an agent embedded into .
Right. This is what I was gesturing at with the quotes. There has to be some kind of translation from (which is a mathematical concept 'outside' ) to an event inside . So the quotes are doing something similar to a Goedel encoding.
While trying to understand the equations, I found it easier to visualize and as two separate distributions on the same , where endorsement is simply a consistency condition. For belief consistency, you would just say that endorses on event if .
But that isn't what you wrote; instead you wrote this thing with conditioning on a quoted thing. And of course, the thing I said is symmetrical between and , whereas your concept of endorsement is not symmetrical.
The asymmetry is quite important. If we could only endorse things that have exactly our opinions, we could never improve.
I am not sure whether I am more excited about 'positive' approaches (accelerating alignment research more) vs 'negative' approaches (cooling down capability-gain research). I agree that some sorts of capability-gain research are much more/less dangerous than others, and the most clearly risky stuff right now is scaling & scaling-related.