Abram Demski


Pointing at Normativity
Consequences of Logical Induction
Partial Agency
Alternate Alignment Ideas
Embedded Agency


I'm a bit uncomfortable with the "extreme adversarial threats aren't credible; players are only considering them because they know you'll capitulate" line of reasoning because it is a very updateful line of reasoning. It makes perfect sense for UDT and functional decision theory to reason in this way. 

I find the chicken example somewhat compelling, but I can also easily give the "UDT / FDT retort": since agents are free to choose their policy however they like, one of their options should absolutely be to just go straight. And arguably, the agent should choose that, conditional on bargaining breaking down (precisely because this choice maximizes the utility obtained in fact -- ie, the only sort of reasoning which moves UDT/FDT). Therefore, the coco line of reasoning isn't relying on an absurd hypothetical. 

Another argument for this perspective: if we set the disagreement point via Nash equilibrium, then the agents have an extra incentive to change their preferences before bargaining, so that the Nash equilibrium is closer to the optimal disagreement point (IE the competition point from coco). This isn't a very strong argument, however, because (as far as I know) the whole scheme doesn't incentivize honest reporting in any case. So agents may be incentivized to modify their preferences one way or another.
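For concreteness, here is a minimal sketch of the coco computation on a chicken-style game. The payoff numbers are made up for illustration, and the helper names (`pure_saddle_value`, `coco_value`) are hypothetical. It uses the standard decomposition of a two-player game into a shared "team" part (A+B)/2 and a zero-sum "competition" part (A-B)/2. It also assumes the competition part has a pure-strategy saddle point; a general solver would need a linear program to handle mixed strategies.

```python
import numpy as np

def pure_saddle_value(z):
    """Value of a zero-sum game where the row player maximizes z.

    Returns None when there is no pure-strategy saddle point
    (this sketch does not handle mixed strategies).
    """
    z = np.asarray(z, dtype=float)
    maximin = z.min(axis=1).max()  # best guaranteed payoff for the row player
    minimax = z.max(axis=0).min()  # best guaranteed cap for the column player
    return float(maximin) if np.isclose(maximin, minimax) else None

def coco_value(a, b):
    """Coco value of the bimatrix game (a, b), assuming a pure saddle point."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    team = float(((a + b) / 2).max())           # cooperative part: best shared payoff
    advantage = pure_saddle_value((a - b) / 2)  # competitive part: zero-sum value
    if advantage is None:
        raise ValueError("no pure saddle point; a mixed-strategy solver is needed")
    return team + advantage, team - advantage

# Illustrative chicken payoffs (assumed numbers). Rows/columns: swerve, straight.
a = [[0, -1], [1, -10]]   # row player's payoffs
b = [[0, 1], [-1, -10]]   # column player's payoffs
print(coco_value(a, b))   # symmetric game: neither side has an advantage

# Same game, but the column player loses less in a crash.
b_tough = [[0, 1], [-1, -2]]
print(coco_value(a, b_tough))
```

With the symmetric payoffs, the zero-sum part is settled at (straight, straight) with value 0, so the coco value is (0, 0). With the asymmetric payoffs, the player who loses less in a crash gains an advantage of 1, and the coco value becomes (-1, 1). This is the sense in which the competition point rests on what each player could do to the other if bargaining broke down.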

Reflect Reality?

One simple idea: the disagreement point should reflect whatever really happens when bargaining breaks down. This helps ensure that players are happy to use the coco equilibrium instead of something else, in cases where "something else" implies the breakdown of negotiations. (Because the coco point is always a pareto-improvement over the disagreement point, if possible -- so choosing a realistic disagreement point helps ensure that the coco point is realistically an improvement over alternatives.)

However, in reality, the outcomes of conflicts we avoid remain unknown. The realistic disagreement point is difficult to define or measure if agreement is in fact achieved.

So perhaps we should suppose that agreement cannot always be reached, and base our disagreement point on the observed consequences of bargaining failure. 

The agent's own generative model also depends on (adapts to, is learned from, etc.) the agent's environment. This last bit comes from "Discovering Agents".

"Having own generative model" is the shakiest part.

What it means for the agent to "have a generative model" is that the agent systematically corrects this model based on its experience (to within some tolerable competence!).

It probably means that storage, computation, and maintenance (updates, learning) of the model all happen within the agent's boundaries: if not, the agent's boundaries shall be widened,

A model/belief/representation depends on reference maintenance, but in general, the machinery of reference maintenance can and usually should extend far beyond the representation itself.

For example, an important book will tend to get edition updates, but the complex machinery which results in such an update extends far beyond the book's author.

A telescope produces a representation of far-away space, but the empty space between the telescope and the stars is also instrumental in maintaining the reference (eg, it must remain clear of obstacles).

A student does a lot of work "within their own boundaries" to maintain their knowledge, but they also use notebooks, computers, etc. The student's teachers are also heavily involved in the reference-maintenance. 

My current favourite notion of agency, primarily based on Active Inference

I'm not a big fan of active inference. It strikes me as, basically, a not-particularly-great scheme for injecting randomness into actions to encourage exploration.

I think the main problem is that expected utility theory is in many ways our most well-developed framework for understanding agency, but it makes no empirical predictions, and in particular does not tie agency to other important notions of optimization we can come up with (and which, in fact, seem like they should be closely tied to agency).

I'm identifying one possible source of this disconnect.

The problem feels similar to trying to understand physical entropy without any uncertainty. So it's like: we understand balloons at the atomic level, and we notice that how inflated they are seems to depend on the temperature of the air, but temperature is totally divorced from the atomic level (because we can't understand entropy and thermodynamics without using any notion of uncertainty). So we have this concept of balloons and this separate concept of inflatedness, which really really should relate to each other, but we can't bridge the gap because we're not thinking about uncertainty in the right way.

I think Bob still doesn't really need a two-part strategy in this case. Bob knows that Alice believes "time and space are relative", so Bob believes this proposition, even though Bob doesn't know the meaning of it. Bob doesn't need any special-case rule to predict Alice. The best thing Bob can do in this case still seems like, predict Alice based off of Bob's own beliefs.

(Perhaps you are arguing that Bob can't believe something without knowing what that thing means? But to me this requires bringing in extra complexity which we don't know how to handle anyway, since we don't have a bayesian definition of "definition" to distinguish "Bob thinks X is true but doesn't know what X means" from a mere "Bob thinks X is true".)

A similar example would be an auto mechanic. You expect the mechanic to do things like pop the hood, get underneath the vehicle, grab a wrench, etc. However, you cannot predict which specific actions are useful for a given situation.

We could try to use a two-part model as you suggest, where we (1) maintain an incoherent-but-useful model of car-specific beliefs mechanics have, such as "wrenches are often needed"; (2) use the best of our own beliefs where that model doesn't apply. 

However, this doesn't seem like it's ever really necessary or like it saves processing power for bounded reasoners, because we also believe that "wrenches are sometimes useful". This belief isn't specific enough that we could reproduce the mechanic's actions by acting on it; but that's fine, that's just because we don't know enough.

(Perhaps you have in mind a picture where we can't let incoherent beliefs into our world-model -- our limited understanding of Alice's physics, or of the mechanic's work, means that we want to maintain a separate, fully coherent world-model, and apply our limited understanding of expert knowledge only as a patch. If this is what you are getting at, this seems reasonable, so long as we can still count the whole resulting thing as "my beliefs" -- my beliefs, as a bounded agent, aren't required to be one big coherent model.)

But, it does seem like there might be an example close to the one you spelled out. Perhaps when Alice says "X is relative", Alice often starts doing an unfamiliar sort of math on the whiteboard. Bob has no idea how to interpret any of it as propositions -- he can't even properly divide it up into equations, to pay lip service to equations in the "X is true, but I don't know what it means" sense I used above.

Then, it seems like Bob has to model Alice with a special-case "Alice starts writing the crazy math" model. Bob has some very basic beliefs about the math Alice is writing, such as "writing the letter Beta seems to be involved", but these are clearly object-level beliefs about Alice's behaviors, which Bob has to keep track of specifically. So in this situation it seems like Bob's best model of Alice's behavior doesn't just follow from Bob's own best model of what to do?

(So I end this comment on a somewhat uncertain note)

Another example of this happening comes when thinking about utilitarian morality, which by default doesn't treat other agents as moral actors (as I discuss here).

Interesting point! 

Maintain a model of Alice's beliefs which contains the specific things Alice is known to believe, and use that to predict Alice's actions in domains closely related to those beliefs.

It sounds to me like you're thinking of cases on my spectrum, somewhere between Alice>Bob and Bob>Alice. If Bob thinks Alice knows strictly more than Bob, then Bob can just use Bob's own beliefs, even when specific-things-Bob-knows-Alice-believes are relevant -- because Bob also already believes those things, by hypothesis. So it's only in intermediate cases that Bob might get a benefit from a split strategy like the one you describe.

I've often repeated scenarios like this, or like the paperclip scenario.

My intention was never to state that the specific scenario was plausible or default or expected, but rather that we do not know how to rule it out, and because of that, something similarly bad (but unexpected and hard to predict) might happen.

The structure of the argument we eventually want is one which could (probabilistically, and of course under some assumptions) rule out this outcome. So to me, pointing it out as a possible outcome is a way of pointing to the inadequacy of our current ability to analyze the situation, not as part of a proto-model in which we are conjecturing that we will be able to predict "the AI will make paperclips" or "the AI will literally try to make you smile".

If opens are thought of as propositions, and specialization order as a kind of ("logical") time, 

Up to here made sense.

with stronger points being in the future of weaker points, then this says that propositions must be valid with respect to time (that is, we want to only allow propositions that don't get invalidated).

After here I was lost. Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that?

This setting motivates thinking of points not as objects of study, but as partial observations of objects of study, their shadows that develop according to specialization preorder. [...] The best we can capture objects of study is with neighborhood filters, [...] start with a poset of finite observations (about computations, the elusive objects of study),

You're saying a lot about what the "objects of study" are and aren't, but not very concretely, and I'm not getting the intuition for why this is important. I'm used to the idea that the points aren't really the objects of study in topology; the opens are the more central structure. 

But the important question for a proposed modeling language is how well it models what we're after. 

This is mostly a keyword dump, pointing to standard theory that offers a way of making sense of logical time.

It seems like you are trying to do something similar to what cartesian frames and finite factored sets are doing, when they reconstruct time-like relationships from other (purportedly more basic) terms. Would you care to compare the reconstructions of time you're gesturing at to those provided by cartesian frames and/or finite factored sets?

As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.

I largely agree. In retrospect, a large part of the point of this post for me is that it's practical to think of decision-theoretic agents as having expected value estimates for everything without having a utility function anywhere, which the expected values are "expectations of". 

A utility function is a gadget for turning probability distributions into expected values. This object makes sense in a context like VNM, where you are asking agents to judge between arbitrary gambles. In the Jeffrey-Bolker setting, you instead only ask agents to choose between events, not gambles. This allows us to directly derive coherence constraints on expectations without introducing a function they're expectations "of".
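To make the contrast concrete, here is a rough sketch in symbols. The notation ($u$, $p$, $V$, $P$, $A$, $B$) is mine, not from the discussion above. In a VNM-style setting, a utility function $u$ over worlds turns any gamble $p$ into an expected value. In the Jeffrey-Bolker setting, the coherence constraint is stated directly on the expected values $V$ of events: the value of a disjoint union is the probability-weighted average of the values of its parts.

```latex
% VNM-style: a utility function u maps an arbitrary gamble p to an expected value
\mathbb{E}_p[u] = \sum_{w} p(w)\, u(w)

% Jeffrey-Bolker: a constraint directly on event values V, for disjoint events A, B
% with P(A \cup B) > 0; no function over individual worlds is needed
V(A \cup B) = \frac{P(A)\, V(A) + P(B)\, V(B)}{P(A) + P(B)}
```

The second equation mentions only events that the agent assigns probability to, which fits the point that no underlying function of whole world-descriptions has to be introduced.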

For me, this fits better with the way humans seem to think; it's relatively easy to compare events to each other, but nigh impossible to take entire world-descriptions and compare them (which is what a utility function does).

The rotation comes into play because looking at preferences this way is much more 'situated': you are only required to have preferences relating to your current beliefs, rather than relating to arbitrary probability distributions (as in VNM). We can intuit from our experience that there is some wiggle room between probability vs preference when representing situations in the real world. VNM doesn't model this, because probabilities are simply given to us in the VNM setting, and we're to take them as gospel truth. 

So Jeffrey-Bolker seems to do a better job of representing the subjective nature of probability, and the vector rotations illustrate this.

On the other hand, I think there is a real advantage to the 2D vector representation of a preference structure. For agents with identical beliefs (the "common prior assumption"), Harsanyi showed that cooperative preference structures can be represented by simple linear mixtures (Harsanyi's utilitarian theorem). However, Critch showed that combining preferences in general is not so simple. You can't separately average two agents' beliefs and their utility functions; you have to dynamically change the weights of the utility-function averaging based on how bayesian updates shift the weights of the probability mixture.

Averaging the vector-valued measures together works fine, though, I believe. (I haven't worked it out in detail.) If true, this makes vector-valued measures an easier way to think about coalitions of cooperating agents who merge preferences in order to select a pareto-optimal joint policy.
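As a small numeric sanity check (not a proof, and not a worked-out version of Critch's result), here is a sketch under my own assumptions: a finite set of worlds, and each agent represented by a per-world pair (probability mass, probability times utility). The numbers and the helper names `vector_measure` and `event_value` are hypothetical. The check is that the expected value of an event under the averaged measure equals a mixture of the two agents' event values, with mixture weights proportional to how much probability each agent assigned to that event.

```python
import numpy as np

def vector_measure(p, u):
    """Per-world vector-valued measure.

    Column 0 holds probability mass, column 1 holds probability * utility.
    """
    p = np.asarray(p, dtype=float)
    u = np.asarray(u, dtype=float)
    return np.stack([p, p * u], axis=1)

def event_value(m, event):
    """Expected value of an event (a list of world indices): sum(p*u) / sum(p)."""
    total = m[event].sum(axis=0)
    return float(total[1] / total[0])

# Two agents with different beliefs and utilities over three worlds (made-up numbers).
m1 = vector_measure([0.5, 0.3, 0.2], [1.0, 0.0, 4.0])
m2 = vector_measure([0.1, 0.1, 0.8], [0.0, 2.0, 1.0])

alpha = 0.5                              # fixed weight used to average the measures
merged = alpha * m1 + (1 - alpha) * m2   # plain linear average of the two measures

event = [0, 1]                           # an event both agents give positive probability

# Weights implied by the merge: the fixed weight times each agent's probability of the event.
w1 = alpha * m1[event, 0].sum()
w2 = (1 - alpha) * m2[event, 0].sum()
mixture_value = (w1 * event_value(m1, event) + w2 * event_value(m2, event)) / (w1 + w2)

merged_value = event_value(merged, event)
print(merged_value, mixture_value)       # the two numbers agree
print(w1 / (w1 + w2))                    # effective weight on agent 1 for this event
```

Here the fixed averaging weight is 0.5, but the effective weight on agent 1 for this event is 0.8, because agent 1 considered the event much more likely. So the plain average of measures already shifts the preference weights with probability, at least in this toy case.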

Not to disagree hugely, but I have heard one religious conversion (an enlightenment type experience) described in a way that fits with "takeover without holding power over someone". Specifically this person described enlightenment in terms close to "I was ready to pack my things and leave. But the poison was already in me. My self died soon after that."

It's possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).

Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by "guessing true names". I think the approach makes sense, but my argument for why this is the case does differ from John's arguments here.
