Abram Demski



An Orthodox Case Against Utility Functions

(I don't follow it all; for instance, I don't recall why it's important that the former view assumes that utility is computable.)

Partly because the "reductive utility" view is made a bit more extreme than it absolutely had to be. Partly because I think it's extremely natural, in the "LessWrong circa 2014 view", to say sentences like "I don't even know what it would mean for humans to have uncomputable utility functions -- unless you think the brain is uncomputable". (I think there is, or at least was, a big overlap between the LW crowd and the set of people who like to assume things are computable.) Partly because the post was directly inspired by another alignment researcher saying words similar to those, around 2019.

Without this assumption, the core of the "reductive utility" view would be that it treats utility functions as actual functions from actual world-states to real numbers. These functions wouldn't have to be computable, but since they're a basic part of the ontology of agency, it's natural to suppose they are -- in exactly the same way it's natural to suppose that an agent's beliefs should be computable, and in a similar way to how it seems natural to suppose that physical laws should be computable.

Ah, I guess you could say that I shoved the computability assumption into the reductive view because I secretly wanted to make 3 different points:

  1. We can define beliefs directly on events, rather than needing "worlds", and this view seems more general and flexible (and closer to actual reasoning).
  2. We can define utility directly on events, rather than "worlds", too, and there seem to be similar advantages here.
  3. In particular, uncomputable utility functions seem pretty strange if you think utility is a function on worlds; but if you think it's defined as a coherent expectation on events, then it's more natural to suppose that the underlying function on worlds (that would justify the event expectations) isn't computable.

Rather than make these three points separately, I set up a false dichotomy for illustration.

Also worth highlighting that, like my post Radical Probabilism, this post is mostly communicating insights that it seems Richard Jeffrey had several decades ago. 

Are limited-horizon agents a good heuristic for the off-switch problem?

I think we could get a GPT-like model to do this if we inserted other random sequences, in the same way, in the training data; it should learn a pattern like "non-word-like sequences that repeat at least twice tend to repeat a few more times" or something like that.
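A minimal sketch of what generating such training data might look like (all parameters are made up; the point is just that the same arbitrary token block recurs several times in a stream of otherwise ordinary data):

```python
import random

def make_example(vocab_size=1000, base_len=200, seq_len=8, n_repeats=4, seed=0):
    """Splice several copies of the same random 'non-word-like' block
    into an ordinary token stream (all parameters hypothetical)."""
    rng = random.Random(seed)
    base = [rng.randrange(vocab_size) for _ in range(base_len)]
    novel = [rng.randrange(vocab_size) for _ in range(seq_len)]
    # cut the base stream at random points and insert a copy of the
    # novel block at each cut, so the block repeats n_repeats times
    cuts = sorted(rng.sample(range(base_len), n_repeats))
    out, prev = [], 0
    for c in cuts:
        out += base[prev:c] + novel
        prev = c
    out += base[prev:]
    return out, novel
```

A model trained on many such examples could, in principle, pick up the "repeated blocks tend to repeat again" pattern without ever seeing any particular novel block before.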

GPT-3 itself may or may not get the idea, since it does have some significant breadth of getting-the-idea-of-local-patterns-it's-never-seen-before.

So I don't currently see what your experiment has to do with the planning-ahead question.

I would say that the GPT training process has no "inherent" pressure toward Bellman-like behavior, but the data provides such pressure, because humans are doing something more Bellman-like when producing strings. A more obvious example would be if you trained a GPT-like system to predict the chess moves of a tree-search planning agent.

There is essentially one best-validated theory of cognition.

I think maybe our disagreement is about how good/useful of an overarching model ACT-R is? It's definitely not like in physics, where some overarching theories are widely accepted (e.g. the standard model) even by people working on much more narrow topics -- and many of the ones that aren't (e.g. string theory) are still widely known about and commonly taught. The situation in cog sci (in my view, and I think in many people's views?) is much more that we don't have an overarching model of the mind in anywhere close to the level of detail/mechanistic specificity that ACT-R posits, and that any such attempt would be premature/foolish/not useful right now.

Makes some sense to me! This is part of why my post's conclusion said stuff like this doesn't mean you should believe in ACT-R. But yeah, I also think we have a disagreement somewhere around here.

I was trained in the cognitive architecture tradition, which tends to find this situation unfortunate. I have heard strong opinions, which I respect and generally believe, of the "we just don't know enough" variety which you also espouse. However, I also buy Allen Newell's famous argument in "you can't play 20 questions with nature and win", where he argues that we may never get there without focusing on that goal. From this perspective, it makes (some) sense to try to track a big picture anyway.

In some sense the grand goal of cognitive architecture is that it should eventually be seen as standard (almost required) for individual works of experimental psychology to contribute to a big picture in some way. Imagine for a moment if every paper had a section relating to ACT-R (or some other overarching model), either pointing out how it fits in (agreeing with and extending the overarching model) or pointing out how it doesn't (revising the overarching model). 

With the current state of things, it's very unclear (as you highlighted in your original comment) what the status of overarching models like ACT-R even is. Is it an artifact from the 90s which is long-irrelevant? Is it the state of the art big-picture? Nobody knows and few care? Wouldn't it be better if it were otherwise?

On the other hand, working with cognitive architectures like ACT-R can be frustrating and time-consuming. In theory, they could be a time-saving tool (you start with all the power of ACT-R and can move forward from that!). In practice, my personal observation at least is that they add time and reduce other kinds of progress you can make. To caricature, a cog arch phd student spends their first 2 years learning the cognitive architecture they'll work with, while a non-cog-arch cogsci student can hit the ground running instead. (This isn't totally true of course; I've heard people say that most phd students are not really productive for their first year or two of grad school.) So I do not want to gloss over the downsides to a cog arch focus.

One big problem is what I'll call the "task integration problem". Let's say you have 100 research psychologists who each spend a chunk of time doing "X in ACT-R" for many different values of X. Now you have lots of ACT-R models of lots of different cognitive phenomena. Can you mash them all together into one big model which does all 100 things?

I'm not totally sure about ACT-R, but I've heard that for most cognitive architectures, the answer is "no". Despite existing in one cognitive architecture, the individual "X" models are sorta like standalone programs which don't know how to talk to each other. 

This undermines the premise of cog arch as helping us fit everything into one coherent picture. So, this is a hurdle which cog arch would have to get past in order to play the kind of role it wants to play.

There is essentially one best-validated theory of cognition.

I think my post (at least the title!) is essentially wrong if there are other overarching theories of cognition out there which have similar track records of matching data. Are there?

By "overarching theory" I mean a theory which is roughly as comprehensive as ACT-R in terms of breadth of brain regions and breadth of cognitive phenomena.

As someone who has also done grad school in cog-sci research (but in a computer science department, not a psychology department, so my knowledge is more AI focused), my impression is that most psychology research isn't about such overarching theories. To be more precise:

  • There are cognitive architecture people, who work on overarching theories of cognition. However, ACT-R stands out amongst these as having extensive experimental validation. The rest have relatively minimal direct comparisons to human data, or none.
  • There are "bayesian brain" and other sorta overarching theories, but (to my limited knowledge!) these ideas don't have such a fleshed-out computational model of the brain. EG, you might apply bayesian-brain ideas to create a model of (say) emotional processing, but it isn't really part of one big model in quite the way ACT-R allows.
  • There's a lot of more isolated work on specific subsystems of the brain, some of which is obviously going to be highly experimentally validated, but which just isn't trying to be an overarching model at all.

So my claim is that ACT-R occupies a unique position in terms of (a) taking an experimental-psych approach, while (b) trying to provide a model of everything and how it fits together. Do you think I'm wrong about that?

I think it's a bit like physics: outsiders hear about these big overarching theories (GUTs, TOEs, strings, ...), and to an extent it makes sense for outsiders to focus on the big picture in that way. Working physicists, on the other hand, can work on all sorts of specialized things (the physics of crystal growth, say) without necessarily worrying about how it fits into the big picture. Not everyone works on the big-picture questions.

OTOH, I also feel like it's unfortunate that more work isn't integrated into overarching models.

This paper gives what I think is a much more contemporary overview of overarching theories of human cognition.

I've only skimmed it, but it seems to me more like a prospectus which speculates about building a totally new architecture (combining the strengths of deep learning with several handpicked ideas from psychology), naming specific challenges and possible routes forward for such a thing.

(Also, this is a small thing, but "fitting human reaction times" is not impressive -- that's a basic feature of many, many models.)

I said "down to reaction times" mostly because I think this gives readers a good sense of the level of detail, and because I know reaction times are something ACT-R puts effort into, as opposed to because I think reaction times is the big advantage ACT-R has over other models; but, in retrospect this may have been misleading.

I guess it comes down to my AI-centric background. For example, GPT-3 is in some sense a very impressive model of human linguistic behavior; but, it makes absolutely no attempt to match human reaction times. It's very rare for ML people to be interested in that sort of thing. This also relates to the internal design of ACT-R. An AI/ML programmer isn't usually interested in purposefully slowing down operations to match human performance. So this would be one of the most alien things about the ACT-R codebase for a lot of people.

There is essentially one best-validated theory of cognition.

This lines up fairly well with how I've seen psychology people geek out over ACT-R. That is: I had a psychology professor who was enamored with the ability to line up programming stuff with neuroanatomy. (She didn't use it in class or anything, she just talked about it like it was the most mind-blowing stuff she ever saw as a research psychologist, since normally you just get these isolated little theories about specific things.)

And, yeah, important to view it as a programming language which can model a bunch of stuff, but requires fairly extensive user input to do so. One way I've seen this framed is that ACT-R lacks domain knowledge (since it is not in fact an adult human), so you can think of the programming as mostly being about hypothesizing what domain knowledge people invoke to solve a task.

The first of your two images looks broken in my browser.

There is essentially one best-validated theory of cognition.

I think that's not quite fair. ACT-R has a lot to say about what kinds of processing are happening, as well. Although, for example, it does not have a theory of vision (to my limited understanding anyway), or of how the full motor control stack works, etc. So in that sense I think you are right.

What it does have more to say about is how the working memory associated with each modality works: how you process information in the various working memories, including various important cognitive mechanisms that you might not otherwise think about. In this sense, it's not just about interconnection like you said.

Are limited-horizon agents a good heuristic for the off-switch problem?

We also know how to implement it today. 

I would argue that inner alignment problems mean we do not know how to do this today. We know how to limit the planning horizon for parts of a system which are doing explicit planning, but this doesn't bar other parts of the system from doing planning. For example, GPT-3 has a time horizon of effectively one token (it is only trying to predict one token at a time). However, it probably learns to internally plan ahead anyway, just because thinking about the rest of the current sentence (at least) is useful for thinking about the next token.

So, a big part of the challenge of creating myopic systems is making darn sure they're as myopic as you think they are.
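For concreteness, here is the sense in which GPT's explicit objective is one-step-myopic: the training loss at each position scores only the single next token. (An illustrative sketch, not the actual training code; `probs_fn` stands in for the model.)

```python
import math

def next_token_nll(probs_fn, tokens):
    """Average negative log-likelihood of each next token given its
    prefix. The objective never scores more than one step ahead; any
    lookahead the model does is internal, not in the loss."""
    losses = [-math.log(probs_fn(tokens[:i])[tokens[i]])
              for i in range(1, len(tokens))]
    return sum(losses) / len(losses)

# a stand-in 'model': uniform distribution over a 4-token vocabulary
uniform = lambda prefix: [0.25, 0.25, 0.25, 0.25]
```

Nothing in this objective prevents the learned model from internally representing plans about later tokens, so long as doing so helps with the very next one.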

Are limited-horizon agents a good heuristic for the off-switch problem?

Imagine a spectrum of time horizons (and/or discounting rates), from very long to very short.

Now, if the agent is aligned, things are best with an infinite time horizon (or, really, the convergently-endorsed human discounting function; or if that's not a well-defined thing, whatever theoretical object replaces it in a better alignment theory). As you reduce the time horizon, things get worse and worse: the AGI willingly destroys lots of resources for short-term prosperity.

At some point, this trend starts to turn itself around: the AGI becomes so shortsighted that it can't be too destructive, and becomes relatively easy to control.

But where is the turnaround point? It depends hugely on the AGI's capabilities. An uber-capable AI might be capable of doing a lot of damage within hours. Even setting the time horizon to seconds seems basically risky; do you want to bet everything on the assumption that such a shortsighted AI will do minimal damage and be easy to control?
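A toy version of this spectrum, with made-up reward streams: a destructive action beats a patient one under a short horizon and loses under a long one.

```python
def finite_horizon_return(rewards, horizon, discount=1.0):
    """Discounted return truncated at a finite horizon."""
    return sum(discount**t * r for t, r in enumerate(rewards[:horizon]))

# Hypothetical reward streams for two plans:
burn_resources = [10, 0, 0, 0, 0, 0]   # big payoff now, nothing later
invest         = [0, 0, 5, 5, 5, 5]    # payoff arrives later
```

With horizon 2, `burn_resources` dominates; with horizon 6, `invest` does. Where the crossover sits depends entirely on the numbers, which is the analogue of it depending on the AGI's capabilities.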

This is why some people, such as Evan H, have been thinking about extreme forms of myopia, where the system is supposed to think only of doing the specific thing it was asked to do, with no thoughts of future consequences at all.

Now, there are (as I see it) two basic questions about this.

  1. How do we make sure that the system is actually as limited as we think it is?
  2. How do we use such a limited system to do anything useful?

Question #1 is incredibly difficult and I won't try to address it here.

Question #2 is also challenging, but I'll say some words.

Getting useful work out of extremely myopic systems.

As you scale down the time horizon (or scale up the temporal discounting, or do other similar things), you can also change the reward function. (Or utility function, or whatever the equivalent object is in your formalism.) We don't want something that spasmodically tries to maximize the human fulfillment experienced in the next three seconds. We actually want something that approximates the behavior of a fully-aligned long-horizon AGI. We just want to decrease the time horizon to make it easier to trust, easier to control, etc.

The strawman version of this is: choose the reward function for the totally myopic system to approximate the value function which the long-time-horizon aligned AGI would have.

If you do this perfectly right, you get 100% outer-aligned AI. But that's only because you get a system that's 100% equivalent to the not-at-all-myopic aligned AI system we started with. This certainly doesn't help us build safe systems; it's only aligned by hypothesis.
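The equivalence is just the standard RL fact that acting greedily on the optimal value function reproduces the optimal policy. A toy deterministic MDP (dynamics and rewards made up) to illustrate:

```python
N = 5  # states 0..4; reaching state 4 is the long-term 'aligned' goal

def step(s, a):          # deterministic dynamics, a in (-1, +1)
    return min(max(s + a, 0), N - 1)

def reward(s, a):        # the 'aligned' reward: 1 for reaching the goal
    return 1.0 if step(s, a) == N - 1 else 0.0

def value_function(gamma=0.9, iters=100):
    # value iteration, playing the role of the long-horizon aligned agent
    V = [0.0] * N
    for _ in range(iters):
        V = [max(reward(s, a) + gamma * V[step(s, a)] for a in (-1, 1))
             for s in range(N)]
    return V

V = value_function()

def myopic_action(s):
    # horizon-1 agent whose 'reward' is the aligned value of the next
    # state: it never looks ahead, yet picks the far-sighted move
    return max((-1, 1), key=lambda a: V[step(s, a)])
```

The myopic agent moves right from every non-goal state, exactly as the planner would. That's the sense in which this version is "aligned by hypothesis": all the hard work was smuggled into `V`.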

Where things get interesting is if we approximate that value function in a way we trust. An AGI RL system with a supposedly aligned reward function calculates its value function by looking far into the future and coming up with plans to maximize reward. But, we might not trust all the steps in this process enough to trust the result. For example, we think small mistakes in the reward function tend to be amplified to large errors in the value function.

In contrast, we might approximate the value function by having humans look at possible actions and assign values to them. You can think of this as deontological: kicking puppies looks bad, curing cancer looks good. You can try to use machine learning to fit these human judgement patterns. This is the basic idea of approval-directed agents. Hopefully, this creates a myopic system which is incapable of treacherous turns, because it just tries to do what is "good" in the moment rather than doing any planning ahead. (One complication with this is inner alignment problems. It's very plausible that to imitate human judgements, a system has to learn to plan ahead internally. But then you're back to trying to outsmart a system that can possibly plan ahead of you; IE, you've lost the myopia.)
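A minimal sketch of the approval-directed pattern (the states, actions, and ratings are entirely made up; a real system would fit a learned model to many human judgements rather than a lookup table):

```python
# Hypothetical human approval scores for (situation, action) pairs.
ratings = {
    ("puppy_nearby", "kick_puppy"): -1.0,
    ("puppy_nearby", "pet_puppy"): 0.8,
    ("in_the_lab", "work_on_cure"): 0.9,
}

def approval_model(state, action):
    # stand-in for an ML model fit to human judgement data
    return ratings.get((state, action), 0.0)

def approval_directed_choice(state, actions):
    # myopic by construction: rank actions by predicted approval of the
    # act itself, with no rollout of future consequences
    return max(actions, key=lambda a: approval_model(state, a))
```

The hope is that treacherous turns can't arise here because nothing in the explicit decision rule evaluates futures; the inner alignment worry is that the learned `approval_model` might do so internally anyway.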

There may also be many other ways to try to approximate the value function in more trustable ways.

Knowledge is not just mutual information

Recently I have been thinking that we should in fact use "really basic" definitions, EG "knowledge is just mutual information", and also other things with a general theme of "don't make agency so complicated".  The hope is to eventually be able to build up to complicated types of knowledge (such as the definition you seek here), but starting with really basic forms. Let me see if I can explain.

First, an ontology is just an agent's way of organizing information about the world. These can take lots of forms and I'm not going to constrain it to any particular formalization (but doing so could be part of the research program I'm suggesting).

Second, a third-person perspective is a "view from nowhere" which has the capacity to be rooted at specific locations, recovering first-person perspectives. In other words, I relate subjective and objective in the following way: objectivity is just a mapping from specific "locations" (within the objective perspective) to subjective views "from that location".

Note that an objective view is not supposed to be necessarily correct; it is just a hypothesis about what the universe looks like, described from the 3rd person perspective.

Notice what I'm doing: I'm defining a 3rd person perspective as a solution to the mind-body problem. Why?

Well, what's a 3rd-person perspective good for? Why do we invent such things in the first place?

It's good for communication. I speak of the world in objective terms largely because this is one of the best ways to communicate with others. Rather than having a different word for the front of a car, the side of a car, etc. (all the ways I can experience a car), I have "car", so that I can refer to a car in an experience-agnostic way. Other people understand this language and can translate it to their own experience appropriately.

Similarly, if I say something is "to my left" rather than "left", other people know how to translate that to their own personal coordinate system.

So far so good.

Now, a reasonable project would be to create as useful a 3rd person perspective as possible. One thing this means is that it should help translate between as many perspectives as possible.

I don't claim to have a systematic grasp of what that implies, but one obvious thing people do is qualify statements: eg, "I believe X" rather than just stating "X" outright. "I want X" rather than "X should happen". This communicates information that a broad variety of listeners can accept.

Now, a controversial step. A notion of objectivity needs to decide what counts as a "conscious experiencer" or "potential viewpoint". That's because the whole point of a notion of objectivity is to be an ontology which can be mapped into a set of 1st-person viewpoints.

So I propose that we make this as broad as possible. In particular, we should be able to consider the viewpoint of any physical object. (At least.)

This is little baby panpsychism. I'm not claiming that "all physical objects have conscious experiences" in any meaningful sense, but I do want my notion of conscious experience to extend to all physical objects, just because that's a pretty big boundary I can draw, so that I'm sure I'm not excluding anyone actually important with my definition.

For an object to "experience" an event is for it to, like, "feel some shockwaves" from the event -- ie, for there to be mutual information there. On the other hand, for an object to "directly experience" an event could be defined as being contained within the physical space of the event, or perhaps to intersect that physical space, or something along those lines.

For an object to "know about something" in this broad sense is just for there to be mutual information.

For me to think there is knowledge is for my objective model to say that there is mutual information.
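For reference, the "knowledge is just mutual information" baseline here is the standard information-theoretic quantity, computable directly from a joint distribution:

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits, given a joint distribution {(x, y): p(x, y)}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# a perfectly correlated bit carries 1 bit of 'knowledge' about the
# other; an independent one carries none
correlated  = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
```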

These definitions obviously have some problems. 

Let's look at a different type of knowledge, which I will call tacit knowledge -- stuff like being able to ride a bike (aka "know-how"). I think this can be defined (following my "very basic" theme) in terms of an object's ability to participate successfully in patterns. A screw "knows how" to fit in threaded holes of the correct size. It "knows how" to go further in when rotated in one way, and come further out when rotated the other way. Etc.

Now, an object has some kind of learning if it can increase its tacit knowledge (in some sense) through experience. Perhaps we could say something like, it learns for a specified goal predicate if it has a tendency to increase the measure of situations in which it satisfies this goal predicate, through experience? (Mathematically this is a bit vague, sorry.)

Now we can start to think about measuring the extent to which mutual information contributes to learning of tacit knowledge. Something happens to our object. It gains some mutual information w/ external stuff. If this mutual information increases its ability to pursue some goal predicate, we can say that the information is accessible wrt that goal predicate. We can imagine the goal predicate being "active" in the agent, and having a "translation system" whereby it unpacks the mutual information into what it needs.

On the other hand, if I undergo an experience while I'm sleeping, and the mutual information I have with that event is just some small rearrangements of cellular structure which I never notice, then the mutual information is not accessible to any significant goal predicates which my learning tracks.

I don't think this solves all the problems you want to solve, but it seems to me like a fruitful way of trying to come up with definitions -- start with really basic forms of "knowledge" and related things, and try to stack them up to get to the more complex notions.
