Abram Demski





Refactoring Alignment (attempt #2)

Great! I feel like we're making progress on these basic definitions.

Re-Define Intent Alignment?

InfraBayes doesn't look for the regularity in reality that NNs are taking advantage of, agreed. But InfraBayes is exactly about "what kind of regularity assumptions can we realistically make about reality?" You can think of it as a reaction to the unrealistic nature of the regularity assumptions which Solomonoff induction makes. So it offers an answer to the question "what useful+realistic regularity assumptions could we make?"

The InfraBayesian answer is "partial models". IE, the idea that even if reality cannot be completely described by usable models, perhaps we can aim to partially describe it. This is an assumption about the world -- not all worlds can be usefully described by partial models. However, it's a weaker assumption about the world than usual. So it may not have presented itself as an assumption about the world in your mind, since perhaps you were thinking more of stronger assumptions.
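One loose way to picture a partial model is as a set of probability distributions that agree on the features the model describes and disagree freely on everything else. The sketch below is only in that spirit: it is not the InfraBayesian formalism itself, and the toy world, the coarse grid, and all function names are assumptions made for illustration. Actions are scored by their worst-case expected utility over the set.

```python
from itertools import product

# Hypothetical toy world with two binary features: (rain, traffic).
WORLDS = list(product([0, 1], repeat=2))

def candidate_distributions(p_rain, grid=(0.0, 0.5, 1.0)):
    """Enumerate distributions consistent with a partial model.

    The partial model fixes only P(rain) = p_rain. The conditional
    probabilities of traffic are left unconstrained, so we sample them
    on a coarse grid to get a finite set of candidate distributions.
    """
    dists = []
    for t_given_rain, t_given_dry in product(grid, repeat=2):
        dist = {}
        for rain, traffic in WORLDS:
            p_r = p_rain if rain else 1.0 - p_rain
            p_t1 = t_given_rain if rain else t_given_dry
            p_t = p_t1 if traffic else 1.0 - p_t1
            dist[(rain, traffic)] = p_r * p_t
        dists.append(dist)
    return dists

def worst_case_value(utility, dists):
    """Minimum expected utility over every distribution in the set."""
    return min(sum(p * utility(w) for w, p in d.items()) for d in dists)

def umbrella(world):
    # Utility depends only on the feature the partial model describes.
    rain, _ = world
    return 1.0 if rain else 0.5

def drive(world):
    # Utility depends on the feature the partial model says nothing about.
    _, traffic = world
    return 0.0 if traffic else 1.0

dists = candidate_distributions(p_rain=0.3)
umbrella_value = worst_case_value(umbrella, dists)
drive_value = worst_case_value(drive, dists)
```

In this toy, the partial model gives a definite value for the action that depends only on rain, and a pessimistic value for the action that depends on the undescribed feature. That is the sense in which a partial model is still a substantive (if weak) assumption about the world.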

If it's a good answer, it's at least plausible that NNs work well for related reasons.

But I think it also makes sense to try to get at the useful+realistic regularity assumptions from scratch, rather than necessarily making it all about NNs.

Refactoring Alignment (attempt #2)

I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that's done.

But it seems to me that there's something missing in terms of acceptability.

The definition of "objective robustness" I used says "aligns with the base objective" (including off-distribution). But I think this isn't an appropriate representation of your approach. Rather, "objective robustness" has to be defined something like "generalizes acceptably". Then, ideas like adversarial training and checks and balances make sense as a part of the story.

WRT your suggestions, I think there's a spectrum from "clean" to "not clean", and the ideas you propose could fall at multiple points on that spectrum (depending on how they are implemented, how much theory backs them up, etc). So, yeah, I favor "cleaner" ideas than you do, but that doesn't rule out this path for me.

Re-Define Intent Alignment?

All of that made perfect sense once I thought through it, and I tend to agree with most of it. I think my biggest disagreement with you is that (in your talk) you said you don't expect formal learning theory work to be relevant. I agree with your points about classical learning theory, but the alignment community has been developing basically-classical-learning-theory tools which go beyond those limitations. I'm optimistic that stuff like Vanessa's InfraBayes could help here.

Granted, there's a big question of whether that kind of thing can be competitive. (Although there could potentially be a hybrid approach.)

Re-Define Intent Alignment?

I've watched your talk at SERI now.

One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. In your talk, you mention looking at why the model does what it does, in addition to just looking at what it does. This makes sense to me (I talk about similar things), but, it seems just about as fraught as the notion of mesa-objective:

  1. It requires approximately the same "magic transparency tech" as we need to extract mesa-objectives.
  2. Even with magical transparency tech, it requires additional insight as to which reasoning is acceptable vs unacceptable. 

If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason? More generally, what do you think "acceptability" might look like?

(By no means do I mean to say your view is crazy; I am just looking for your explanation.)

Re-Define Intent Alignment?

(Meta: was this meant to be a question?)

I originally conceived of it as such, but in hindsight, it doesn't seem right.

In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.

I don't think this is actually a con of the generalization-focused approach.

By no means did I intend it to be a con. I'll try to edit to clarify. I think it is a real pro of the generalization-focused approach that it does not rely on models having mesa-objectives (putting it in Evan's terms, there is a real possibility of addressing objective robustness without directly addressing inner alignment). So, focusing on objective robustness seems like a potential advantage -- it opens up more avenues of attack. Plus, the generalization-focused approach requires a much weaker notion of "outer alignment", which may be easier to achieve as well.

But, of course, it may also turn out that the only way to achieve objective robustness is to directly tackle inner alignment. And it may turn out that the weaker notion of outer alignment is insufficient in reality.

Are you the historical origin of the robustness-centric approach? I noticed that Evan's post has the modified robustness-centric diagram in it, but I don't know if it was edited to include that. The "Objective Robustness and Inner Alignment Terminology" post attributes it to you (at least, attributes a version of it to you). (I didn't look at the references there yet.)

Discussion: Objective Robustness and Inner Alignment Terminology

If there were a "curated posts" system on the alignment forum, I would nominate this for curation. I think it's a great post.

My Current Take on Counterfactuals

All of which I really should have remembered, since it's all stuff I have known in the past, but I am a doofus. My apologies.

(But my error wasn't being too mired in EDT, or at least I don't think it was; I think EDT is wrong. My error was having the term "counterfactual" too strongly tied in my head to what you call linguistic counterfactuals. Plus not thinking clearly about any of the actual decision theory.)

I'm glad I pointed out the difference between linguistic and DT counterfactuals, then!

It still feels to me as if your proof-based agents are unrealistically narrow. Sure, they can incorporate whatever beliefs they have about the real world as axioms for their proofs -- but only if those axioms end up being consistent, which means having perfectly consistent beliefs. The beliefs may of course be probabilistic, but then that means that all those beliefs have to have perfectly consistent probabilities assigned to them. Do you really think it's plausible that an agent capable of doing real things in the real world can have perfectly consistent beliefs in this fashion?

I'm not at all suggesting that we use proof-based DT in this way. It's just a model. I claim that it's a pretty good model -- that we can often carry over results to other, more complex, decision theories.

However, if we wanted to, then yes, I think we could... I agree that if we add beliefs as axioms, the axioms have to be perfectly consistent. But if we use probabilistic beliefs, those probabilities don't have to be perfectly consistent; just the axioms saying which probabilities we have. So, for example, I could use a proof-based agent to approximate a logical-induction-based agent, by looking for proofs about what the market expectations are. This would be kind of convoluted, though.

My Current Take on Counterfactuals

It's obvious how ordinary conditionals are important for planning and acting (you design a bridge so that it won't fall down if someone drives a heavy lorry across it; you don't cross a bridge because you think the troll underneath will eat you if you cross), but counterfactuals? I mean, obviously you can put them in to a particular problem

All the various reasoning behind a decision could involve material conditionals, probabilistic conditionals, logical implication, linguistic conditionals (whatever those are), linguistic counterfactuals, decision-theoretic counterfactuals (if those are indeed different as I claim), etc etc etc. I'm not trying to make the broad claim that counterfactuals are somehow involved.

The claim is about the decision algorithm itself. The claim is that the way we choose an action is by evaluating a counterfactual ("what happens if I take this action?"). Or, to be a little more psychologically realistic, the cached values which determine which actions we take are estimated counterfactual values.

What is the content of this claim?

A decision procedure is going to have (cached-or-calculated) value estimates which it uses to make decisions. (At least, most decision procedures work that way.) So the content of the claim is about the nature of these values.

If the values act like Bayesian conditional expectations, then the claim that we need counterfactuals to make decisions is considered false. This is the claim of evidential decision theory (EDT).

If the values are still well-defined for known-false actions, then they're counterfactual. So, a fundamental reason why MIRI-type decision theory uses counterfactuals is to deal with the case of known-false actions.
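To make the known-false case concrete, here is a minimal sketch of an EDT-style value computed as a Bayesian conditional expectation. The joint distribution and the action names are hypothetical toy numbers, not anything from the post. The point is only that the conditional expectation has no value for an action the agent is certain it will not take.

```python
def conditional_expected_utility(joint, action):
    """E[U | A = action] under a joint distribution over (action, utility).

    `joint` maps (action, utility) pairs to probabilities. Returns None
    when P(A = action) = 0, because the conditional expectation is then
    undefined (it would require dividing by zero).
    """
    p_action = sum(p for (a, _), p in joint.items() if a == action)
    if p_action == 0:
        return None
    weighted = sum(p * u for (a, u), p in joint.items() if a == action)
    return weighted / p_action

# Hypothetical beliefs of an agent that is certain it will not cross.
joint = {
    ("stay", 0.0): 1.0,
    ("cross", 10.0): 0.0,
    ("cross", -10.0): 0.0,
}

stay_value = conditional_expected_utility(joint, "stay")
cross_value = conditional_expected_utility(joint, "cross")
```

Here `stay_value` is an ordinary number, while `cross_value` comes back as `None`. A counterfactual value estimate, by contrast, would have to supply some number for "cross" even though the agent believes it will not cross.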

However, academic decision theorists have used (causal) counterfactuals for completely different reasons (IE because they supposedly give better answers). This is the claim of causal decision theory (CDT).

My claim in the post, of course, is that the estimated values used to make decisions should match the EDT expected values almost all of the time, but, should not be responsive to the same kinds of reasoning which the EDT values are responsive to, so should not actually be evidential.

Could you give a couple of examples where counterfactuals are relevant to planning and acting without having been artificially inserted?

It sounds like you've kept a really strong assumption of EDT in your head; so strong that you couldn't even imagine why non-evidential reasoning might be part of an agent's decision procedure. My example is the troll bridge: conditional reasoning (whether proof-based or expectation-based) ends up not crossing the bridge, where counterfactual reasoning can cross (if we get the counterfactuals right).

The thing you call "proof-based decision theory" involves trying to prove things of the form "if I do X, I will get at least Y utility" but those look like ordinary conditionals rather than counterfactuals to me too.

Right. In the post, I argue that using proofs like this is more like a form of EDT rather than CDT, so, I'm more comfortable calling this "conditional reasoning" (lumping it in with probabilistic conditionals).
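For readers who have not seen it, the procedure under discussion can be sketched roughly as follows. This is an illustration under stated assumptions: a real proof-based agent would search for proofs in some formal theory, whereas here `can_prove` is a hypothetical stand-in backed by a lookup table, and the actions and utility levels are made up.

```python
def proof_based_choice(actions, utility_levels, can_prove, default):
    """Pick the action with the best provable utility guarantee.

    Scans utility levels from highest to lowest and returns the first
    action for which `can_prove(action, level)` holds, i.e. for which
    "if I take `action`, I get at least `level` utility" has a proof.
    Falls back to `default` if no guarantee is provable.
    """
    for level in sorted(utility_levels, reverse=True):
        for action in actions:
            if can_prove(action, level):
                return action, level
    return default, None

# Stand-in for a theorem prover: a hypothetical table of provable guarantees.
PROVABLE = {("stay", 0), ("cross", -10)}

def toy_prover(action, level):
    return (action, level) in PROVABLE

choice = proof_based_choice(
    actions=["cross", "stay"],
    utility_levels=[10, 0, -10],
    can_prove=toy_prover,
    default="stay",
)
```

With this table the agent stays, since the best guarantee it can establish is "stay yields at least 0". Note that every statement the agent consults is a conditional about its own action, which is why it makes sense to lump this in with conditional reasoning.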

The Troll Bridge is supposed to show a flaw in this kind of reasoning, suggesting that we need counterfactual reasoning instead (at least, if "counterfactual" is broadly understood to be anything other than conditional reasoning -- a simplification which mostly makes sense in practice).

though this is pure prejudice and maybe there are better reasons for it than I can currently imagine: we want agents that can act in the actual world, about which one can generally prove precisely nothing of interest

Oh, yeah, proof-based agents can technically do anything which regular expectation-based agents can do. Just take the probabilistic model the expectation-based agents are using, and then have the proof-based agent take the action for which it can prove the highest expectation. This isn't totally sleight of hand; the proof-based agent will still display some interesting behavior if it is playing games with other proof-based agents, dealing with Omega, etc.

At any rate, right now "passing Troll Bridge" looks to me like a problem applicable only to a very specific kind of decision-making agent, one I don't see any particular reason to think has any prospect of ever being relevant to decision-making in the actual world -- but I am extremely aware that this may be purely a reflection of my own ignorance.

Even if proof-based decision theory didn't generalize to handle uncertain reasoning, the troll bridge would also apply to expectation-based reasoners if their expectations respect logic. So the "narrow" class of agents for whom it makes sense to ask "does this agent pass the troll bridge" is basically agents who use logic at all, not just agents who are restricted to pure logic with no probabilistic belief.
