

Background and Core Concepts

I operationalised "strong coherence" as:

Informally: a system has immutable terminal goals.

Semi-formally: a system's decision making is well described as an approximation of argmax over actions (or higher level mappings thereof) to maximise the expected value of a single fixed utility function over states.
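
A minimal sketch of that semi-formal picture (illustrative only; it assumes a finite action set and an explicit probabilistic world model, and every name here is hypothetical rather than a claim about how any real system is implemented):

```python
# Illustrative sketch only: one fixed utility function over states, and action
# choice as argmax over expected utility under a probabilistic world model.

from typing import Callable, Dict, Hashable, Iterable

State = Hashable
Action = Hashable

def strongly_coherent_choice(
    actions: Iterable[Action],
    world_model: Callable[[Action], Dict[State, float]],  # action -> P(state | action)
    utility: Callable[[State], float],                     # fixed, never updated
) -> Action:
    """Pick the action maximising the expected utility of the resulting state."""
    def expected_utility(action: Action) -> float:
        return sum(p * utility(state) for state, p in world_model(action).items())
    return max(actions, key=expected_utility)
```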

 

And I contended that humans, animals (and learning based agents more generally?) seem instead to have values ("contextual influences on decision making").

The shard theory account of value formation in learning based agents is something like:

  • Value shards are learned computational/cognitive heuristics causally downstream of similar historical reinforcement events
  • Value shards activate more strongly in contexts similar to those where they were historically reinforced

 

And I think this hypothesis of how values form in intelligent systems could be generalised out of an RL context to arbitrary constructive optimisation processes[1]. The generalisation may be something like:

Decision making in intelligent systems is best described as "executing computations/cognition that historically correlated with higher performance on the objective functions a system was selected for performance on"[2].
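
As a toy illustration of this generalised account (and of the shard bullets above), here is a sketch in which learned heuristics are gated by contextual activation rather than scored against a single fixed utility function. The context representation and similarity measure are invented for illustration, not part of shard theory's actual formalism:

```python
# Toy sketch (illustrative only) of "contextual influences on decision making":
# each learned heuristic ("shard") bids on actions, weighted by how similar the
# current context is to the contexts where it was historically reinforced.

from dataclasses import dataclass
from typing import Dict, Hashable, List

Context = Dict[str, float]  # hypothetical feature representation of the situation
Action = Hashable

@dataclass
class Shard:
    reinforced_context: Context       # where this heuristic was historically reinforced
    action_bids: Dict[Action, float]  # how strongly it pushes for each action

    def activation(self, context: Context) -> float:
        # crude similarity: overlap between current and reinforced context features
        keys = set(self.reinforced_context) | set(context)
        return sum(
            min(self.reinforced_context.get(k, 0.0), context.get(k, 0.0)) for k in keys
        )

def choose_action(shards: List[Shard], context: Context, actions: List[Action]) -> Action:
    """Pick the action with the largest activation-weighted sum of shard bids."""
    def score(action: Action) -> float:
        return sum(s.activation(context) * s.action_bids.get(action, 0.0) for s in shards)
    return max(actions, key=score)
```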

 

This seems to be an importantly different type of decision making from expected utility maximisation[3]. For succinctness, I'd refer to systems of the above type as "systems with malleable values".


The Argument

In my earlier post I speculated that "strong coherence is anti-natural". To operationalise that speculation:

  • Premise 1: The generalised account of value formation is broadly accurate
    • At least intelligent systems in the real world form "contextually activated cognitive heuristics that influence decision making" as opposed to "immutable terminal goals"
    • Humans can program algorithms with immutable terminal goals in simplified virtual environments, but we don't actually know how to construct sophisticated intelligent systems via design; we can only construct them as the product of search like optimisation processes[4]
      • And intelligent systems constructed by search like optimisation processes form malleable values instead of immutable terminal goals
    • I.e. real world intelligent systems form malleable values
  • Premise 2: Systems with malleable values do not self modify to have immutable terminal goals
    • Would you take a pill that would make you an expected utility maximiser[3]? I most emphatically would not.
      • If you accept the complexity and fragility of value theses, then self modifying to become strongly coherent just destroys most of what the current you values.
    • For systems with malleable values, becoming "strongly coherent" is grossly suboptimal by their current values
    • A similar argument might extend to whether such systems would construct expected utility maximisers, were they given the option to do so
  • Conclusion 1: Intelligent systems in the real world do not converge towards strong coherence
    • Strong coherence is not the limit of effective agency in the real world
    • Idealised agency does not look like "(immutable) terminal goals" or "expected utility maximisation"
  • Conclusion 2: "strong coherence" does not naturally manifest in sophisticated real world intelligent systems
    • Sophisticated intelligent systems in the real world are the product of search like optimisation processes 
    • Such optimisation processes do not produce intelligent systems that are strongly coherent
    • And those systems do not converge towards becoming strongly coherent as they are subjected to more selection pressure, "scaled up", or otherwise amplified
  1. ^

    E.g.:

    * Stochastic gradient descent

    * Natural selection/other evolutionary processes

  2. ^
  3. ^

    Of a single fixed utility function over states.

  4. ^

    E.g. I'm under the impression that humans can't explicitly design an algorithm to achieve AlexNet accuracy on the ImageNet dataset.

    I think the self-supervised learning that underpins neocortical cognition is a much harder learning task.

    I believe that learning is the only way to create capable intelligent systems that operate in the real world, given our laws of physics.


4 Answers

There needs to be some process which, given a context, specifies what value shards should be created (or removed/edited) to better work in that context. Not clear we can't think of this as constituting the system's immutable goal in some sense, especially as it gets more powerful. That said it would probably not be strongly coherent by your semi-formal definition.

I think you are onto something, with the implication that building a highly intelligent, learning entity with strong coherence in this sense is unlikely, and hence, getting it morally aligned in this fashion is also unlikely. Which isn't that bad, insofar as plans for aligning it that way honestly did not look particularly promising.

Which is why I have been advocating for instead learning from how we teach morals to existing complex intelligent agents - namely, through ethical, rewarding interactions in a controlled environment slowly allowing more freedom. 

We know how to do this; it does not require us to somehow define the core of ethics mathematically. We know it works. We know what setbacks look like, and how to tackle them. We know how to do this with human interactions the average person can do/train, rather than with code. It seems easier, more doable, and more promising in so many ways.

That doesn't mean it will be easy, or risk free, and it still comes with a hell of a lot of problems stemming from the fact that AIs, even machine learning ones, are quite simply not human: they are not inherently social, they do not inherently have altruistic urges, and they do not inherently have empathic abilities. But I see a clearer path to dealing with that than to directly encoding an abstract ethics into an intelligent, flexible actor.

EDIT: I found out my answer is quite similar to this other one you probably read already.

I think not.

Imagine such a malleable agent's mind as made of parts. Each part of the mind does something. There's some arrangement of the things each part does, and how many parts do each kind of thing. We won't ask right now where this organization comes from, but take it for given.

Imagine that---be it by chance or design---some parts were cooperating, while some were not. "Cooperation" means taking actions that bring about a consequence in a somewhat stable way, so something towards being coherent and consequentialist, although not perfectly so by any measure. The other parts would oftentimes work at cross purposes, treading on each other's toes. "Working at cross purposes", again, in other words means not being consequentialist and coherent; from the point of view of the parts, there may not even be a notion of "cross purposes" if there is no purpose.

By the nature of coherence, the ensemble of coherent and aligned parts would get to their purpose much more efficiently than the other parts are not-getting to that purpose and being a hindrance, assuming the purpose was reachable enough. This means that coherent agents are not just reflectively consistent, but also stable: once there's some seed of coherence, it can win over the non-coherent parts.

Conclusion 1: Intelligent systems in the real world do not converge towards strong coherence

It seems to me that humans are more coherent and consequentialist than other animals. Humans are not perfectly coherent, but the direction is towards more coherence. Actually, I'd expect that any sufficiently sophisticated bounded agent would not introspectively look coherent to itself if it spent enough time to think about it. Would the trend break after us?

Would you take a pill that would make you an expected utility maximiser?

Would you take a pill that made you a bit less coherent? Would you take a pill that made you a bit more coherent? (Not rhetorical questions.)

By the nature of coherence, the ensemble of coherent and aligned parts would get to their purpose much more efficiently than the other parts are not-getting to that purpose and being a hindrance, assuming the purpose was reachable enough. This means that coherent agents are not just reflectively consistent, but also stable: once there's some seed of coherence, it can win over the non-coherent parts.

I think this fails to adequately engage with the hypothesis that values are inherently contextual.

Alternatively, the kind of cooperation you describe where ... (read more)

rotatingpaguro (1y)
I agree it's unrealistic in some sense. That's why I qualified "assuming the purpose was reachable enough". In this "evolutionary" interpretation of coherence, there's a compromise between attainability of the goal and the cooperation needed to achieve it. Some goals are easier. So in my framework, where I consider humans the pinnacle of known coherence, I do not consider it valid to say that a rock is more coherent because it is very good at just being a rock.

About realism, I consider humans very unlikely a priori (we seem to be alone), but once there are humans around, the important low probability thing already happened.

In this part of your answer, I am not sure whether you are saying "emerging coherence is forbidden in shard theory" or "I think emerging coherence is false in the real world".

Answering to "emerging coherence is forbidden": I'm not sure because I don't know shard theory beyond what you are saying here, but "values are inherently contextual" does not mean your system is not flexible enough to allow implementing coherent values within it, even if they do not correspond to the things you labeled "values" when defining the system. It can be unlikely, which leads back to the previous item, which leads back to the disagreement about humans being coherent.

Answering to "I think emerging coherence is false in the real world": this leads back again to the disagreement about humans being coherent.

The crux! I said that purely out of intuition. I find this difficult to argue because, for any specific example I think of where I say "humans are more coherent and consequentialist than the cat here", I imagine you replying "No, humans are more intelligent than the cat, and so can deploy more effective strategies for their goals, but these goals and strategies are still all sharded, maybe even more than in the cat". Maybe the best argument I can make is: it seems to me humans have more of a conscious outer loop than other animals, with more power over t
anonymousaisafety (1y)
This isn't a universally held view. Someone wrote a fairly compelling argument against it here: https://sohl-dickstein.github.io/2023/03/09/coherence.html
rotatingpaguro (1y)
For context: the linked post presents a well-designed survey of experts about the intelligence and coherence of various entities. The answers show a clear coherence-intelligence anti-correlation. The questions they ask the experts are: Intelligence: [...] Coherence: [...]

Of course there's the problem of what people's judgements of "coherence" are measuring. In considering possible ways of making the definition more clear, the post says: [...]

It seems to me the kind of measure proposed for machine learning systems is at odds with the one for living beings. For ML, it's "robustness to environmental changes". For animals, it's "spending all resources on survival". For organizations, "spending all resources on the stated mission". By the for-ML definition, humans, I'd say, win: they are the best entity at adapting, whatever their goal. By the for-animals definition, humans would lose completely. So these are strongly inconsistent definitions.

I think the problem is fixing the goal a priori: you don't get to ask "what is the entity pursuing, actually?", but proclaim "the entity is pursuing survival and reproduction", "the organization is pursuing what it says on paper". Even though they are only speculative definitions, not used in the survey, I think they are evidence of confusion in the mind of whoever wrote them, and potentially in the survey respondents (alternative hypothesis: sloppiness, "survival+reproduction" was intended for most animals but not humans).

So, what did the experts read in the question? Take two entities at opposite ends in the figure: the "single ant" (judged most coherent) and a human (judged least coherent).

SINGLE ANT vs. HUMAN

ANT: A great heap, sir! I have a simple and clear utility function! Feed my mother the queen!

HUMAN: Wait, wait, wait. I bet you would stop feeding your queen as soon as I put you somewhere else. It's not utility, it's just learned patterns of behavior.

ANT: Ohi, that's not valid sir! That's cheating! You can do t

(A somewhat theologically inspired answer:)

Outside the dichotomy of values (in the shard-theory sense) vs. immutable goals, we could also talk about valuing something that is in some sense fixed, but "too big" to fit inside your mind. Maybe a very abstract thing. So your understanding of it is always partial, though you can keep learning more and more about it (and you might shift around, feeling out different parts of the elephant). And your acted-on values would appear mutable, but there would actually be a, perhaps non-obvious, coherence to them.

It's possible this is already sort of a consequence of shard theory? In the way learned values would have coherences to accord with (perhaps very abstract or complex) invariant structure in the environment?

My claim is mostly that real world intelligent systems do not have values that can be well described by a single fixed utility function over agent states.

I do not see this answer as engaging with that claim at all.

If you define utility functions over agent histories, then everything is an expected utility maximiser for the function that assigns positive utility to whatever action the agent actually took and zero utility to every other action.
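
A throwaway rendering of that degenerate construction (not a model anyone proposes; it just makes the triviality explicit, with all names hypothetical):

```python
# Throwaway construction: any behaviour trace "maximises expected utility" for a
# history-level utility function that rewards exactly the actions actually taken.

from typing import Hashable, List

Action = Hashable

def rationalising_utility(observed_actions: List[Action]):
    """Utility over action histories: 1 point per step where the history matches
    what the agent actually did, 0 for any other choice at that step."""
    def utility(history: List[Action]) -> float:
        return sum(
            1.0 for taken, candidate in zip(observed_actions, history) if taken == candidate
        )
    return utility
```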

I think such a definition of utility function is useless.

If however you define utility functions over agent states, ... (read more)

PaulK (1y)
Sorry, I guess I didn't make the connection to your post clear. I substantially agree with you that utility functions over agent-states aren't rich enough to model real behavior. (Except, maybe, at a very abstract level, a la predictive processing? (which I don't understand well enough to make the connection precise).)

Utility functions over world-states -- which is what I thought you meant by 'states' at first -- are in some sense richer, but I still think inadequate. And I agree that utility functions over agent histories are too flexible.

I was sort of jumping off to a different way to look at value, which might have some of the desirable coherence of the utility-function-over-states framing, but without its rigidity. And this way is something like: viewing 'what you value' or 'what is good' as something abstract, something to be inferred, out of the many partial glimpses of it we have in the form of our extant values.

Oh, huh, this post was on the LW front page, and dated as posted today, so I assumed it was fresh, but the replies' dates are actually from a month ago.

the gears to ascension (1y)
lesswrong has a bug that allows people to restore their posts to "new" status on the frontpage by moving them to draft and then back.
TekhneMakre (1y)
Uh, this seems bad and anti-social? This should either be made an explicit feature, or it's a bug and using it is defecting. @Ruby
DragonGod (1y)
I mean I think it's fine. I have not experienced the feature being abused. In this case I didn't get any answers the last time I posted it and ended up needing answers so I'm reposting. Better than posting the entire post again as a new post and losing the previous conversation (which is what would happen if not for this feature). Like what's the argument that it's defecting? There are just legitimate reasons to repost stuff and you can't really stop users from reposting stuff. FWIW, it was a mod that informed me of this feature.
TekhneMakre (1y)
If it's a mod telling you with the implication that it's fine, then yeah, it's not defecting and is good. In that case I think it should be an explicit feature in some way!
DragonGod (1y)
I mean I think it can be abused, and the use case where I was informed of it was a different use case (making substantial edits to a post). I do not know that they necessary approve of republishing for this particular use case. But the alternative to republishing for this particular use case is just reposting the question as an entirely new post which seems strictly worse.
TekhneMakre (1y)
Of course there is also the alternative of not reposting the question. What's possibly defecty is that maybe lots of people want their thing to have more attention, so it's potentially a tragedy of the commons. Saying "well, just have those people who most want to repost their thing, repost their thing" could in theory work, but it seems wrong in practice, like you're just opening up to people who don't value others' attention enough. One could also ask specific people to comment on something, if LW didn't pick it up.
DragonGod (1y)
A lot of LessWrong actually relies on just trusting users not to abuse the site/features. I make judgment calls on when to repost keeping said trust in mind. And if reposts were a nuisance people could just mass downvote reposts. But in general, I think it's misguided to try and impose a top down moderation solution given that the site already relies heavily on user trust/judgment calls. This repost hasn't actually been a problem and is only being an issue because we're discussing whether it's a problem or not.
DragonGod (1y)
Reposted it because I didn't get any good answers last time, and I'm working on a post that's a successor to this one currently and would really appreciate the good answers I did not get.
6 Comments

Systems with malleable values do not self modify to have (immutable) terminal goals

Consider the alternative framing where agents with malleable values don't modify themselves, but still build separate optimizers with immutable terminal goals.

These two kinds of systems could then play different roles. For example, strong optimizers with immutable goals could play the role of laws of nature, making the most efficient use of underlying physical substrate to implement many abstract worlds where everything else lives. The immutable laws of nature in each world could specify how and to what extent the within-world misalignment catastrophes get averted, and what other value-optimizing interventions are allowed outside of what the people who live there do themselves.

Here, strong optimizers are instruments of value, they are not themselves optimized to be valuable content. And the agents with malleable values are the valuable content from the point of view of the strong optimizers, but they don't need to be very good at optimizing things for anything in particular. The goals of the strong optimizers could be referring to an equilibrium of what the people end up valuing, over the vast archipelago of civilizations that grow up with many different value-laden laws of nature, anticipating how the worlds develop given these values, and what values the people living there end up expressing as a result.

But this is a moral argument, and misalignment doesn't respect moral arguments. Even if it's a terrible idea for systems with malleable values to either self modify into strong immutable optimizers or build them, that doesn't prevent the outcome where they do that regardless and perish as a result, losing everything of value. Moloch is the most natural force in a disorganized society that's not governed by humane laws of nature. Only nothingness above.

To get to coherence, you need a method that accepts incoherence and spits out coherence. In the context of preferences, two datapoints:

  • You can compute the Hodge decomposition of a weakly connected directed edge-weighted graph in polynomial time, and the algorithm is AFAIK feasible in practice, but directed edge-weighted graphs can't represent typical incoherent preferences such as the Allais paradox.
  • Computing the set of acyclic tournaments with the smallest graph-edit distance to a given directed graph seems to be at least NP-hard, and the best algorithm I have for it is factorial in the number of nodes (a brute-force sketch follows below).

So it looks like computing the coherent version of incoherent preferences is computationally difficult. I don't know about approximations, or how this applies to the Helmholtz decomposition (though vector fields also can't represent all the known incoherence).
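
For the second bullet, here is a brute-force sketch of the factorial-time computation, assuming preferences arrive as a directed graph of pairwise "a preferred to b" edges and using the number of observed edges an ordering reverses as a stand-in for graph-edit distance:

```python
# Brute-force sketch: try every ordering of the alternatives and count how many
# observed pairwise preferences disagree with it (factorial in the node count).

from itertools import permutations
from typing import Hashable, List, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]  # (a, b) means "a preferred to b"

def closest_coherent_orderings(nodes: List[Node], prefs: Set[Edge]) -> List[List[Node]]:
    """Return the total orders that disagree with the fewest observed preferences."""
    best_cost, best = None, []
    for order in permutations(nodes):
        rank = {n: i for i, n in enumerate(order)}
        # an edge (a, b) disagrees with the order if b is ranked above a
        cost = sum(1 for a, b in prefs if rank[a] > rank[b])
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, [list(order)]
        elif cost == best_cost:
            best.append(list(order))
    return best

# e.g. the cyclic preference a>b, b>c, c>a has three equally close coherent orderings
print(closest_coherent_orderings(["a", "b", "c"], {("a", "b"), ("b", "c"), ("c", "a")}))
```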

Informally: a system has (immutable) terminal goals. Semiformally: a system's decision making is well described as (an approximation) of argmax over actions (or higher level mappings thereof) to maximise (the expected value of) a simple unitary utility function.

Are the (parenthesized) words part of your operationalization or not? If so, I would recommend removing the parentheses, to make it clear that they are not optional.

Also, what do you mean by "a simple unitary utility function"? I suspect other people will also be confused/thrown off by that description.

The "or higher mappings thereof" is to accommodate agents that choose state —> action policies directly, and agent that choose policies over ... over policies, so I'll keep it.

 

I don't actually know if my critique applies well to systems that have non-immutable terminal goals.

I guess if you have sufficiently malleable terminal goals, you get values near exactly.

Are the (parenthesized) words part of your operationalization or not? If so, I would recommend removing the parentheses, to make it clear that they are not optional.

Will do.

 

Also, what do you mean by "a simple unitary utility function"? I suspect other people will also be confused/thrown off by that description.

If you define your utility function in a sufficiently convoluted manner, then everything is a utility maximiser.

Less contrived, I was thinking of stuff like Wentworth's subagents that identifies decision making with pareto optimality over a set of utility functions.

I think subagents comes very close to being an ideal model of agency and could probably be adapted to be a complete model.

I don't want to include subagents in my critique at this point.
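
For reference, a toy rendering of what "Pareto optimality over a set of utility functions" means here (my own sketch of the idea, not Wentworth's actual formalism):

```python
# Toy sketch: an option is admissible if no alternative is at least as good for
# every utility function and strictly better for at least one.

from typing import Callable, Hashable, List

Option = Hashable
Utility = Callable[[Option], float]

def pareto_optimal(options: List[Option], utilities: List[Utility]) -> List[Option]:
    """Return the options that are not Pareto-dominated by any other option."""
    def dominates(a: Option, b: Option) -> bool:
        return all(u(a) >= u(b) for u in utilities) and any(u(a) > u(b) for u in utilities)
    return [o for o in options if not any(dominates(other, o) for other in options)]
```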


I think what you want might be "a single fixed utility function over states" or something similar. That captures that you're excluding from critique:

  • Agents with multiple internal "utility functions" (subagents)
  • Agents whose "utility function" is malleably defined
  • Agents that have trivial utility functions, like over universe-histories