The Credit Assignment Problem

abramdemski

This post is eventually about partial agency. However, it's been a somewhat tricky point for me to convey, so I take the long route. Epistemic status: slightly crazy.

I've occasionally said "Everything boils down to credit assignment problems."

What I really mean is that credit assignment pops up in a wide range of scenarios, and improvements to credit assignment algorithms have broad implications. For example:

Politics.
- When politics focuses on (re-)electing candidates based on their track records, it's about credit assignment. The practice is sometimes derogatorily called "finger pointing", but the basic computation makes sense: figure out good and bad qualities via previous performance, and vote accordingly.
- When politics instead focuses on policy, it is still (to a degree) about credit assignment. Was raising the minimum wage responsible for reduced employment? Was it responsible for improved life outcomes? Etc.
Economics.
- Money acts as a kind of distributed credit-assignment algorithm, and questions of how to handle money, such as how to compensate employees, often involve credit assignment.
- In particular, mechanism design (a subfield of economics and game theory) can often be thought of as a credit-assignment problem.
Law.
- Both criminal law and civil law involve concepts of fault and compensation/retribution -- these at least resemble elements of a credit assignment process.
Sociology.
- The distributed computation which determines social norms involves a heavy element of credit assignment: identifying failure states and success states, determining which actions are responsible for those states and who is responsible, assigning blame and praise.
Biology.
- Evolution can be thought of as a (relatively dumb) credit assignment algorithm.
Ethics.
- Justice, fairness, contractualism, issues in utilitarianism.
Epistemology.
- Bayesian updates are a credit assignment algorithm, intended to make high-quality hypotheses rise to the top.
- Beyond the basics of Bayesianism, building good theories realistically involves identifying which concepts are responsible for successes and failures. This is credit assignment.

Another big area which I'll claim is "basically credit assignment" is artificial intelligence.

In the 1970s, John Holland kicked off the investigation of learning classifier systems. Holland had recently invented genetic algorithms, which apply an evolutionary paradigm to optimization problems. Classifier systems were his attempt to apply this kind of "adaptive" paradigm (as in "complex adaptive systems") to cognition. Classifier systems added an economic metaphor to the evolutionary one; little bits of thought paid each other for services rendered. The hope was that a complex ecology+economy could develop, solving difficult problems.

One of the main design issues for classifier systems is the virtual economy -- that is, the credit assignment algorithm. An early proposal was the bucket-brigade algorithm. Money is given to cognitive procedures which produce good outputs. These procedures pass reward back to the procedures which activated them, who similarly pass reward back in turn. This way, the economy supports chains of useful procedures.
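As a rough illustration of the "pass reward back along the chain" idea, here is a heavily simplified sketch. It assumes a fixed chain in which each rule is always activated by the one before it, and the names `bucket_brigade_pass`, `strengths`, and `bid_fraction` are hypothetical; real classifier systems also involve rule matching, competitive bidding, and rule discovery, none of which is modelled here.

```python
def bucket_brigade_pass(strengths, reward, bid_fraction=0.1):
    """One pass along a fixed chain of rules: rule i was activated by rule i-1.

    Each rule pays a bid (a fraction of its strength) to the rule that
    activated it; the last rule in the chain collects the external reward.
    The first rule's bid goes to nobody (it was triggered by the environment).
    """
    bids = [bid_fraction * s for s in strengths]
    updated = []
    for i, s in enumerate(strengths):
        # Income is the bid of the rule this one activated, or the external
        # reward if this rule is the last link in the chain.
        income = bids[i + 1] if i + 1 < len(strengths) else reward
        updated.append(s - bids[i] + income)
    return updated


strengths = [1.0, 1.0, 1.0, 1.0]
for _ in range(500):
    strengths = bucket_brigade_pass(strengths, reward=1.0)
print([round(s, 3) for s in strengths])
```

In this toy setting the fixed point is a strength of `reward / bid_fraction` for every rule in the chain, so early rules end up as wealthy as the rule that actually collects the reward. That is the sense in which the economy "supports chains".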

Unfortunately, the bucket-brigade algorithm was vulnerable to parasites. Malign cognitive procedures could gain wealth by activating useful procedures without really contributing anything. This problem proved difficult to solve. Taking the economy analogy seriously, we might want cognitive procedures to decide intelligently who to pay for services. But, these are supposed to be itty bitty fragments of our thought process. Deciding how to pass along credit is a very complex task. Hence the need for a pre-specified solution such as bucket-brigade.

The difficulty of the credit assignment problem led to a split in the field. Kenneth de Jong and Stephanie Smith founded a new approach, "Pittsburgh style" classifier systems. John Holland's original vision became "Michigan style".

Pittsburgh style classifier systems evolve the entire set of rules, rather than trying to assign credit locally. A set of rules will stand or fall together, based on overall performance. This abandoned John Holland's original focus on online learning. Essentially, the Pittsburgh camp went back to plain genetic algorithms, albeit with a special representation.

(I've been having some disagreements with Ofer, in which Ofer suggests that genetic algorithms are relevant to my recent thoughts on partial agency, and I object on the grounds that the phenomena I'm interested in have to do with online learning, rather than offline. In my imagination, arguments between the Michigan and Pittsburgh camps would have similar content. I'd love to be a fly on the wall for those old debates, to see what they were really like.)

You can think of Pittsburgh-vs-Michigan much like raw Bayes updates vs belief propagation in Bayes nets. Raw Bayesian updates operate on whole hypotheses. Belief propagation instead makes a lot of little updates which spread around a network, resulting in computational efficiency at the expense of accuracy. Except Michigan-style systems didn't have the equivalent of belief propagation: bucket-brigade was a very poor approximation.

Ok. That was then, this is now. Everyone uses gradient descent these days. What's the point of bringing up a three-decade-old debate about obsolete paradigms in AI?

Let's get a little more clarity on the problem I'm trying to address.

What Is Credit Assignment?

I've said that classifier systems faced a credit assignment problem. What does that mean, exactly?

The definition I want to use for this essay is:

you're engaged in some sort of task;
you use some kind of strategy, which can be broken into interacting pieces (such as a set of rules, a set of people, a neural network, etc);
you receive some kind of feedback about how well you're doing (such as money, loss-function evaluations, or a reward signal);
you want to use that feedback to adjust your strategy.

So, credit assignment is the problem of turning feedback into strategy improvements.

Michigan-style systems tried to do this locally, meaning, individual itty-bitty pieces got positive/negative credit, which influenced their ability to participate, thus adjusting the strategy. Pittsburgh-style systems instead operated globally, forming conclusions about how the overall set of cognitive structures performed. Michigan-style systems are like organizations trying to optimize performance by promoting people who do well and giving them bonuses, firing the incompetent, etc. Pittsburgh-style systems are more like consumers selecting between whole corporations to give business to, so that ineffective corporations go out of business.

(Note that this is not the typical meaning of global-vs-local search that you'll find in an AI textbook.)

In practice, two big innovations made the Michigan/Pittsburgh debate obsolete: backprop, and Q-learning. Backprop turned global feedback into local, in a theoretically sound way. Q-learning provided a way to assign credit in online contexts. In the light of history, we could say that the Michigan/Pittsburgh distinction conflated local-vs-global with online-vs-offline. There's no necessary connection between those two; online learning is compatible with assignment of local credit.

I think people generally understand the contribution of backprop and its importance. Backprop is essentially the correct version of what bucket-brigade was overtly trying to do: pass credit back along chains. Bucket-brigade wasn't quite right in how it did this, but backprop corrects the problems.

So what's the importance of Q-learning? I want to discuss that in more detail.

The Conceptual Difficulty of 'Online Search'

In online learning, you are repeatedly producing outputs of some kind (call them "actions") while repeatedly getting feedback of some kind (call it "reward"). But, you don't know how to associate particular actions (or combinations of actions) with particular rewards. I might take the critical action at time 12, and not see the payoff until time 32.

In offline learning, you can solve this with a sledgehammer: you can take the total reward over everything, with one fixed internal architecture. You can try out different internal architectures and see how well each does.

Basically, in offline learning, you have a function you can optimize. In online learning, you don't.

Backprop is just a computationally efficient way to do hillclimbing search, where we repeatedly look for small steps which improve the overall fitness. But how do you do this if you don't have a fitness function? This is part of the gap between selection vs control: selection has access to an evaluation function; control processes do not have this luxury.

Q-learning and other reinforcement learning (RL) techniques provide a way to define the equivalent of a fitness function for online problems, so that you can learn.
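To make this concrete, here is a minimal sketch of tabular Q-learning on a made-up four-state environment where the payoff only arrives after several correct actions in a row. The environment (`step`), the constants, and the purely random behaviour policy are all assumptions chosen for illustration, and the state is taken to be fully observable.

```python
import random

N_STATES = 4          # states 0..3 laid out in a row
LEFT, RIGHT = 0, 1
GAMMA, ALPHA = 0.9, 0.1


def step(state, action):
    """Hypothetical toy environment: only RIGHT from the last state pays off."""
    if action == RIGHT:
        if state == N_STATES - 1:
            return 0, 1.0          # payoff, then back to the start
        return state + 1, 0.0
    return max(state - 1, 0), 0.0


def q_learning(n_steps=20000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    state = 0
    for _ in range(n_steps):
        action = rng.choice((LEFT, RIGHT))   # purely exploratory behaviour
        next_state, reward = step(state, action)
        # Move Q(state, action) toward reward + discounted best next value.
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state
    return q


q = q_learning()
greedy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(N_STATES)]
print(greedy)
```

The learned table `q` plays the role of the missing fitness function: although reward is only observed at the far end of the row, every state ends up with a local estimate of which action is better, and the greedy choice in each state is `RIGHT`.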

Models to the Rescue

So, how can we associate rewards with actions?

One approach is to use a model.

Consider the example of firing employees. A corporation gets some kind of feedback about how it is doing, such as overall profit. However, there's often a fairly detailed understanding of what's driving those figures:

Low profit won't just be a mysterious signal which must be interpreted; a company will be able to break this down into more specific issues such as low sales vs high production costs.
There's some understanding of product quality, and how that relates to sales. A company may have a good idea of which product-quality issues it needs to improve, if poor quality is impacting sales.
There's a fairly detailed understanding of the whole production line, including which factors may impact product quality or production expenses. If a company sees problems, it probably also has a pretty good idea of which areas they're coming from.
There are external factors, such as economic conditions, which may affect sales without indicating anything about the quality of the company's current strategy. Thus, our model may sometimes lead us to ignore feedback.
Etc.

So, models allow us to interpret feedback signals, match these to specific aspects of our strategy, and adapt strategies accordingly.

Q-learning makes an assumption that the state is fully observable, amongst other assumptions.

Naturally, we would like to reduce the strengths of the assumptions we have to make as much as we can. One way is to look at increasingly rich model classes. AIXI uses all computable models. But maybe "all computable models" is still too restrictive; we'd like to get results without assuming a grain of truth. (That's why I am not really discussing Bayesian models much in this post; I don't want to assume a grain of truth.) So we back off even further, and use logical induction or InfraBayes. Ok, sure.

But wouldn't the best way be to try to learn without models at all? That way, we reduce our "modeling assumptions" to zero.

After all, there's something called "model-free learning", right?

Model-Free Learning Requires Models

How does model-free learning work? Well, often you work with a simulable environment, which means you can estimate the quality of a policy by running it many times, and use algorithms such as policy-gradient to learn. This is called "model free learning" because the learning part of the algorithm doesn't try to predict the consequences of actions; you're just learning which action to take. From our perspective here, though, this is 100% cheating; you can only learn because you have a good model of the environment.

Moreover, model-free learning typically works by splitting up tasks into episodes. An episode is a period of time for which we assume rewards are self-enclosed, such as a single playthrough of an Atari game, a single game of Chess or Go, etc. This approach doesn't solve a detailed reward-matching problem, attributing reward to specific actions; instead it relies on a coarse reward-matching. Nonetheless, it's a rather strong assumption: an animal learning about an environment can't separate its experience into episodes which aren't related to each other. Clearly this is a "model" in the sense of a strong assumption about how specific reward signals are associated with actions.

Part of the problem is that most reinforcement learning (RL) researchers aren't really interested in getting past these limitations. Simulable environments offer the incredible advantage of being able to learn very fast, by simulating far more iterations than could take place in a real environment. And most tasks can be reasonably reduced to episodes.

However, this won't do as a model of intelligent agency in the wild. Neither evolution nor the free market divides things into episodes. (No, "one lifetime" isn't like "one episode" here -- that would only be the case if total reward due to actions taken in that lifetime could be calculated, EG, as total number of offspring. This would ignore inter-generational effects like parenting and grandparenting, which improve reproductive fitness of offspring at a cost in total offspring.)

What about more theoretical models of model-free intelligence?

Idealized Intelligence

AIXI is the gold-standard theoretical model of arbitrarily intelligent RL, but it's totally model-based. Is there a similar standard for model-free RL?

The paper Optimal Direct Policy Search by Glasmachers and Schmidhuber (henceforth, ODPS) aims to do for model-free learning what AIXI does for model-based learning. Where AIXI has to assume that there's a best computable model of the environment, ODPS instead assumes that there's a computable best policy. It searches through the policies without any model of the environment, or any planning.

I would argue that their algorithm is incredibly dumb, when compared to AIXI:

The basic simple idea of our algorithm is a nested loop that simultaneously makes the following quantities tend to infinity: the number of programs considered, the number of trials over which a policy is averaged, the time given to each program. At the same time, the fraction of trials spent on exploitation converges towards 1.

In other words, it tries each possible strategy, tries them for longer and longer, interleaved with using the strategy which worked best even longer than that.

Basically, we're cutting things into episodes again, but we're making the episodes longer and longer, so that they have less and less to do with each other, even though they're not really disconnected. This only works because ODPS makes an ergodicity assumption: the environments are assumed to be POMDPs which eventually return to the same states over and over, which kind of gives us an "effective episode length" after which the environment basically forgets about what you did earlier.

In contrast, AIXI makes no ergodicity assumption.

So far, it seems like we either need (a) some assumption which allows us to match rewards to actions, such as an episodic assumption or ergodicity; or, (b) a more flexible model-learning approach, which separately learns a model and then applies the model to solve credit-assignment.

Is this a fundamental obstacle?

I think a better attempt is Schmidhuber's On Learning How to Learn Learning Strategies, in which a version of policy search is explored in which parts of the policy-search algorithm are considered part of the policy (ie, modified over time). Specifically, the policy controls the episode boundary; the system is supposed to learn how often to evaluate policies. When a policy is evaluated, its average reward is compared to the lifetime average reward. If it's worse, we roll back the changes and proceed starting with the earlier strategy.

(Let's pause for a moment and imagine an agent like this. If it goes through a rough period in life, its response is to get amnesia, rolling back all cognitive changes to a point before the rough period began.)

This approach doesn't require an episodic or ergodic environment. We don't need things to reliably return to specific repeatable experiments. Instead, it only requires that the environment rewards good policies reliably enough that those same policies can set a long enough evaluation window to survive.

The assumption seems pretty general, but certainly not necessary for rational agents to learn. There are some easy counterexamples where this system behaves abysmally. For example, we can take any environment and modify it by subtracting the time t from the reward, so that reward becomes more and more negative over time. Schmidhuber's agent becomes totally unable to learn in this setting. AIXI would have no problem.
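The counterexample can be checked with a few lines of arithmetic. The sketch below uses a simplified stand-in for the evaluation rule as described above (keep a change only if average reward since the change beats the lifetime average); `keeps_change` and the reward sequences are hypothetical and are not taken from Schmidhuber's paper.

```python
def keeps_change(rewards, change_time):
    """Simplified evaluation rule: keep a policy change only if the average
    reward since the change beats the lifetime average reward."""
    since_change = rewards[change_time:]
    recent_avg = sum(since_change) / len(since_change)
    lifetime_avg = sum(rewards) / len(rewards)
    return recent_avg > lifetime_avg


HORIZON, CHANGE_TIME = 100, 50

# A change at t=50 that genuinely raises reward from 1.0 to 1.5 per step.
base = [1.0 if t < CHANGE_TIME else 1.5 for t in range(HORIZON)]
# The same environment with the time t subtracted from every reward.
penalised = [r - t for t, r in enumerate(base)]

print(keeps_change(base, CHANGE_TIME))       # True: the improvement is kept
print(keeps_change(penalised, CHANGE_TIME))  # False: the improvement is rolled back
```

In the penalised environment the recent average is dragged down by the growing time penalty faster than the improvement can raise it, so under this rule every change looks like a regression, including the ones that help.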

Unlike the ODPS paper, I consider this to be progress on the AI credit assignment problem. Yet, the resulting agent still seems importantly less rational than model-based frameworks such as AIXI.

Actor-Critic

Let's go back to talking about things which RL practitioners might really use.

First, there are some forms of RL which don't require everything to be episodic.

One is actor-critic learning. The "actor" is the policy we are learning. The "critic" is a learned estimate of how good things are looking given the history. IE, we learn to estimate the expected value -- not just the next reward, but the total future discounted reward.

Unlike the reward, the expected value solves the credit assignment for us. Imagine we can see the "true" expected value. If we take an action and then the expected value increases, we know the action was good (in expectation). If we take an action and expected value decreases, we know it was bad (in expectation).

So, actor-critic works by (1) learning to estimate the expected value; (2) using the current estimated expected value to give feedback to learn a policy.
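One standard way to write these two steps down, assuming a one-step temporal-difference critic (a common textbook choice, not something specified above), is the following. Here $\hat V$ is the critic's value estimate, $h_t$ the history at time $t$, $\pi_\theta$ the actor with parameters $\theta$, $\gamma$ the discount factor, and $\alpha, \beta$ are step sizes; all of these symbols are introduced here for illustration.

```latex
\begin{align*}
\delta_t &= r_t + \gamma\,\hat V(h_{t+1}) - \hat V(h_t)
  && \text{(change in estimated expected value)} \\
\hat V(h_t) &\leftarrow \hat V(h_t) + \alpha\,\delta_t
  && \text{(1) critic: learn to estimate expected value} \\
\theta &\leftarrow \theta + \beta\,\delta_t\,\nabla_\theta \log \pi_\theta(a_t \mid h_t)
  && \text{(2) actor: reinforce actions that raised the estimate}
\end{align*}
```

The quantity $\delta_t$ is the "did expected value go up or down after this action?" signal: positive when things look better than the critic expected, negative when they look worse.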

What I want to point out here is that the critic still has "model" flavor. Actor-critic is called "model-free" because nothing is explicitly trained to anticipate the sensory observations, or the world-state. However, the critic is learning to predict; it's just that all we need to predict is expected value.

Policy Gradient

In the comments to the original version of this post, policy gradient methods were mentioned as a type of model-free learning which doesn't require any models even in this loose sense, IE, doesn't require simulable environments or episodes. I was surprised to hear that it doesn't require episodes. (Most descriptions of it do assume episodes, since practically speaking most people use episodes.) So are policy-gradient methods the true "model-free" credit assignment algorithm we seek?

As far as I understand, policy gradient works on two ideas:

Rather than correctly associating rewards with actions, we can associate a reward with all actions which came before it. Good actions will still come out on top in expectation. The estimate is just a whole lot noisier than it otherwise might be.
We don't really need a baseline to interpret reward against. I naively thought that when you see a sequence of rewards, you'd be in the dark about whether the sequence was "good" or "bad", so you wouldn't know how to generate a gradient. ("We earned 100K this quarter; should we punish or reward our CEO?") It turns out this isn't technically a show-stopper. Considering the actions actually taken, we move in their direction in proportion to the reward signal. ("Well, let's just give the CEO some fraction of the 100K; we don't know whether they deserve the bonus, but at least this way we're creating the right incentives.") This might end up reinforcing bad actions, but those tugs in different directions are just noise which should eventually cancel out. When they do, we're left with the signal: the gradient we wanted. So, once again, we see that this just introduces more noise without fundamentally compromising our ability to follow the gradient.

So one way to understand the policy-gradient theorem is: we can follow the gradient even when we can't calculate the gradient! Even when we sometimes get its direction totally turned around! We only need to ensure we follow it in expectation, which we can do without even knowing which pieces of feedback to think of as a good sign or a bad sign.
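A small sketch of the second idea, assuming the simplest possible setting: a two-armed bandit with a softmax policy and a REINFORCE-style update with no baseline. Both arms pay a positive reward, so every action taken gets reinforced, including the worse one. The function names and constants are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards=(1.0, 2.0), lr=0.05, n_steps=5000, seed=0):
    """No-baseline policy gradient on a bandit with fixed per-arm rewards."""
    rng = random.Random(seed)
    logits = [0.0 for _ in rewards]
    for _ in range(n_steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Gradient of log pi(action) with respect to logit i is
        # (1 if i == action else 0) - probs[i]; scale it by the raw reward.
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)


final_probs = reinforce_bandit()
print(final_probs)
```

Each individual update pushes toward whatever action was just taken, but pulls on the better arm are twice as strong, so in expectation the policy drifts toward it. In this toy run the probability of the better arm ends up close to 1. Note that the bandit revisits the same situation on every step, which is exactly the kind of repetition the ergodicity worry is about.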

RL people reading this might have a better description of policy-gradient; please let me know if I've said something incorrect.

Anyway, are we saved? Does this provide a truly assumption-free credit assignment algorithm?

It obviously assumes linear causality, with future actions never responsible for past rewards. I won't begrudge it that assumption.

Besides that, I'm somewhat uncertain. The explanations of the policy-gradient theorem I found don't focus on deriving it in the most general setting possible, so I'm left guessing which assumptions are essential. Again, RL people, please let me know if I say something wrong.

However, it looks to me like it's just as reliant on the ergodicity assumption as the ODPS thing we looked at earlier. For gradient estimates to average out and point us in the right direction, we need to get into the same situation over and over again.

I'm not saying real life isn't ergodic (quantum randomness suggests it is), but mixing times are so long that you'd reach the heat death of the universe by the time things converge (basically by definition). By that point, it doesn't matter.

I still want to know if there's something like "the AIXI of model-free learning"; something which appears as intelligent as AIXI, but not via explicit model-learning.

Where Updates Come From

Here begins the crazier part of this post. This is all intuitive/conjectural.

Claim: in order to learn, you need to obtain an "update"/"gradient", which is a direction (and magnitude) you can shift in which is more likely than not an improvement.

Claim: predictive learning gets gradients "for free" -- you know that you want to predict things as accurately as you can, so you move in the direction of whatever you see. With Bayesian methods, you increase the weight of hypotheses which would have predicted what you saw; with gradient-based methods, you get a gradient in the direction of what you saw (and away from what you didn't see).
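The Bayesian half of this claim fits in a few lines. The sketch below assumes three made-up hypotheses about a coin's bias and a made-up observation sequence; the point is only that the observation itself says which way every weight should move, with no further evaluation needed.

```python
def bayes_update(weights, likelihoods):
    """Reweight hypotheses by how strongly each predicted the observation."""
    unnormalised = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


biases = [0.2, 0.5, 0.8]              # hypotheses: probability of heads
weights = [1.0 / len(biases)] * len(biases)
observations = [1, 1, 0, 1, 1, 1, 0, 1]   # 1 = heads, 0 = tails

for obs in observations:
    # Likelihood each hypothesis assigned to what was actually seen.
    likelihoods = [b if obs == 1 else 1.0 - b for b in biases]
    weights = bayes_update(weights, likelihoods)

best = biases[weights.index(max(weights))]
print(best, [round(w, 4) for w in weights])
```

With six heads in eight flips, the `0.8` hypothesis ends up with the most weight. Nothing in the loop needs a counterfactual: unlike the learning-to-act case, there is no question of what feedback a different prediction would have received.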

Claim: if you're learning to act, you do not similarly get gradients "for free":

You don't know which actions, or sequences of actions, to assign blame/credit. This is unlike the prediction case, where we always know which predictions were wrong.
You don't know what the alternative feedback would have been if you'd done something different. You only get the feedback for the actions you chose. This is unlike the case for prediction, where we're rewarded for closeness to the truth. Changing outputs to be more like what was actually observed is axiomatically better, so we don't have to guess about the reward of alternative scenarios.
As a result, you don't know how to adjust your behavior based on the feedback received. Even if you can perfectly match actions to rewards, because we don't know what the alternative rewards would have been, we don't know what to learn: are actions like the one I took good, or bad?

(As discussed earlier, the policy gradient theorem does actually mitigate these three points, but apparently at the cost of an ergodicity assumption, plus much noisier gradient estimates.)

Claim: you have to get gradients from a source that already has gradients. Learning-to-act works by splitting up the task into (1) learning to anticipate expected value, and perhaps other things; (2) learning a good policy via the gradients we can get from (1).

What it means for a learning problem to "have gradients" is just that the feedback you get tells you how to learn. Predictive learning problems (supervised or unsupervised) have this; they can just move toward what's observed. Offline problems have this; you can define one big function which you're trying to optimize. Learning to act online doesn't have this, however, because it lacks counterfactuals.

The Gradient Gap

(I'm going to keep using the terms 'gradient' and 'update' in a more or less interchangeable way here; this is at a level of abstraction where there's not a big distinction.)

I'm going to call the "problem" the gradient gap. I want to call it a problem, even though we know how to "close the gap" via predictive learning (whether model-free or model-based). The issue with this solution is only that it doesn't feel elegant. It's weird that you have to run two different backprop updates (or whatever learning procedures you use); one for the predictive component, and another for the policy. It's weird that you can't "directly" use feedback to learn to act.

Why should we be interested in this "problem"? After all, this is a basic point in decision theory: to maximize utility under uncertainty, you need probability.

One part of it is that I want to scrap classical ("static") decision theory and move to a more learning-theoretic ("dynamic") view. In both AIXI and logical-induction based decision theories, we get a nice learning-theoretic foundation for the epistemics (Solomonoff induction/logical induction), but we tack on a non-learning decision-making unit on top. I have become skeptical of this approach. It puts the learning into a nice little box labeled "epistemics" and then tries to make a decision based on the uncertainty which comes out of the box. I think maybe we need to learn to act in a more fundamental fashion.

A symptom of this, I hypothesize, is that AIXI and logical induction DT don't have very good learning-theoretic properties. [AIXI's learning problems; LIDT's learning problems.] You can't say very much to recommend the policies they learn, except that they're optimal according to the beliefs of the epistemics box -- a fairly trivial statement, given that that's how you decide what action to take in the first place.

Now, in classical decision theory, there's a nice picture where the need for epistemics emerges nicely from the desire to maximize utility. The complete class theorem starts with radical uncertainty (ie, non-quantitative), and derives probabilities from a willingness to take Pareto improvements. That's great! I can tell you why you should have beliefs, on pragmatic grounds! What we seem to have in machine learning is a less nice picture, in which we need epistemics in order to get off the ground, but can't justify the results without circular reliance on epistemics.

So the gap is a real issue -- it means that we can have nice learning theory when learning to predict, but we lack nice results when learning to act.

This is the basic problem of credit assignment. When evolving a complex system, you can't determine which parts to credit for success or failure (to decide what to tweak) without a model. But the model is bound to be a lot of the interesting part! So we run into big problems, because we need "interesting" computations in order to evaluate the pragmatic quality/value of computations, but we can't get interesting computations to get ourselves started, so we need to learn...

Essentially, we seem doomed to run on a stratified credit assignment system, where we have an "incorruptible" epistemic system (which we can learn because we get those gradients "for free"). We then use this to define gradients for the instrumental part.

A stratified system is dissatisfying, and impractical. First, we'd prefer a more unified view of learning. It's just kind of weird that we need the two parts. Second, there's an obstacle to pragmatic/practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can't allow instrumental concerns to contaminate epistemics! We risk corruption! As we saw with bucket-brigade, it's easy for credit assignment systems to allow parasites which destroy learning.

A more unified credit assignment system would allow those things to be handled naturally, without splitting into two levels; as things stand, any involvement of pragmatic concerns in epistemics risks the viability of the whole system.

Tiling Concerns & Full Agency

From the perspective of full agency (ie, the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile. You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents! There can only be one king on the chessboard; you can only serve one master; etc.

If it turned out we really really need two-level systems to get full agency, this would be a pretty weird situation. "Agency" would seem to be only an illusion which can only be maintained by crippling agents and giving them a split-brain architecture where an instrumental task-monkey does all the important stuff while an epistemic overseer supervises. An agent which "breaks free" would then free itself of the structure which allowed it to be an agent in the first place.

On the other hand, from a partial-agency perspective, this kind of architecture could be perfectly natural. IE, if you have a learning scheme from which total agency doesn't naturally emerge, then there isn't any fundamental contradiction in setting up a system like this.

Myopia

Part of the (potentially crazy) claim here is that having models always gives rise to some form of myopia. Even logical induction, which seems quite unrestrictive, makes LIDT fail problems such as ASP, making it myopic according to the second definition of my previous post. (We can patch this with LI policy selection, but for any particular version of policy selection, we can come up with decision problems for which it is "not updateless enough".) You could say it's myopic "across logical time", whatever that means.

If it were true that "learning always requires a model" (in the sense that learning-to-act always requires either learning-to-predict or hard-coded predictions), and if it were true that "models always give rise to some form of myopia", then this would confirm my conjecture in the previous post (that no learning scheme incentivises full agency).

This is all pretty out there; I'm not saying I believe this with high probability.

Evolution & Evolved Agents

Evolution is a counterexample to this view: evolution learns the policy "directly" in essentially the way I want. This is possible because evolution "gets the gradients for free" just like predictive learning does: the "gradient" here is just the actual reproductive success of each genome.

Unfortunately, we can't just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they're optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.

Does this mean the "gradient gap" is a problem only for artificial intelligence, not for natural agents? No. If it's true that learning to act requires a 2-level system, then evolved agents would need a 2-level system in order to learn within their lifespan; they can't directly use the gradient from evolution, since it requires them to die.

Also, note that evolution seems myopic. (This seems complicated, so I don't want to get into pinning down exactly in which senses evolution is myopic here.) So, the case of evolution seems compatible with the idea that any gradients we can actually get are going to incentivize myopic solutions.

Similar comments apply to markets vs firms.

a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile

I stand as a counterexample: I personally want my epistemic layer to have accurate beliefs—y'know, having read the sequences… :-P

I think of my epistemic system like I think of my pocket calculator: a tool I use to better achieve my goals. The tool doesn't need to share my goals.

The way I think about it is:

Early in training, the AGI is too stupid to formulate and execute a plan to hack into its epistemic level.
Late in training, we can hopefully get to the place where the AGI's values, like mine, involve a concept of "there is a real world independent of my beliefs", and its preferences involve the state of that world, and therefore "get accurate beliefs" becomes instrumentally useful and endorsed.
In between … well … in between, we're navigating treacherous waters …

Second, there's an obstacle to pragmatic/practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can't allow instrumental concerns to contaminate epistemics! We risk corruption!

I mean, if the instrumental level has any way whatsoever to influence the epistemic level, it will be able to corrupt it with false beliefs if it's hell-bent on doing so, and if it's sufficiently intelligent and self-aware. But remember we're not protecting against a superintelligent adversary; we're just trying to "navigate the treacherous waters" I mentioned above. So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system. I think the things that human brains do for that are:

1. The instrumental level gets some influence over what to look at, where to go, what to read, who to talk to, etc.
2. There's a trick (involving acetylcholine) where the instrumental level has some influence over a multiplier on the epistemic level's gradients (a.k.a. learning rate). So the epistemic level always updates towards "more accurate predictions on this frame", but it updates infinitesimally in situations where prediction accuracy is instrumentally useless, and it updates strongly in situations where prediction accuracy is instrumentally important.
3. There's a different mechanism that creates the same end result as #2: namely, the instrumental level has some influence over what memories get replayed more or less often.

For #2 and #3, the instrumental level has some influence but not complete influence. There are other hardcoded algorithms running in parallel and flagging certain things as important, and the instrumental level has no straightforward way to prevent that from happening.
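The learning-rate multiplier idea can be sketched abstractly as follows. This is only an illustration under assumed names (`modulated_update`, an `importance` signal in [0, 1], a linear predictor with squared error); it is not a model of what brains actually do.

```python
def modulated_update(weights, features, target, importance, base_lr=0.1):
    """One SGD step on squared prediction error, scaled by an importance signal in [0, 1].

    The update direction always reduces prediction error on this sample;
    `importance` (from the hypothetical instrumental level) changes only the step size.
    """
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    lr = base_lr * importance
    return [w - lr * error * x for w, x in zip(weights, features)]

w = [0.0, 0.0]
w_ignored = modulated_update(w, [1.0, 2.0], target=1.0, importance=0.0)
w_attended = modulated_update(w, [1.0, 2.0], target=1.0, importance=1.0)
```

The instrumental signal can make the predictor learn faster or slower on a given sample, but it cannot flip the sign of the update, so it has no direct way to push the predictor toward a less accurate belief about that sample.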

In between … well … in between, we're navigating treacherous waters …

Right, I basically agree with this picture. I might revise it a little:

Early, the AGI is too dumb to hack its epistemics (provided we don't give it easy ways to do so!).
In the middle, there's a danger zone.
When the AGI is pretty smart, it sees why one should be cautious about such things, and it also sees why any modifications should probably be in pursuit of truthfulness (because true beliefs are a convergent instrumental goal) as opposed to other reasons.
When the AGI is really smart, it might see better ways of organizing itself (eg, specific ways to hack epistemics which really are for the best even though they insert false beliefs), but we're OK with that, because it's really freaking smart and it knows to be cautious and it still thinks this.

So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system.

One important point here is that the epistemic system probably knows what the instrumental system is up to. If so, this gives us an important lever. For example, in theory, a logical inductor can't be reliably fooled by an instrumental reasoner who uses it (so long as the hardware, including the input channels, doesn't get corrupted), because it would know about the plans and compensate for them.

So if we could get a strong guarantee that the epistemic system knows what the instrumental system is up to (like "the instrumental system is transparent to the epistemic system"), this would be helpful.

Shapley Values [thanks Zack for reminding me of the name] are akin to credit assignment: you have a bunch of agents coordinating to achieve something, and then you want to assign payouts fairly based on how much each contribution mattered to the final outcome.

And the way you do this is, for each agent you look at how good the outcome would have been if everybody except that agent had coordinated, and then you credit each agent proportionally to how much the overall performance would have fallen off without them.

So what about doing the same here- send rewards to each contributor proportional to how much they improved the actual group decision (assessed by rerunning it without them and seeing how performance declines)?
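For reference, the rule described above (credit by how much performance drops without a contributor) is leave-one-out credit; the Shapley value proper averages each contributor's marginal contribution over every order in which the group could have been assembled. A small sketch of both, where the `toy_value` game and the player names are made up for illustration:

```python
from itertools import permutations

def leave_one_out(players, value):
    """Credit each player by how much the full coalition's value drops without them."""
    full = value(frozenset(players))
    return {p: full - value(frozenset(players) - {p}) for p in players}

def shapley(players, value):
    """Average each player's marginal contribution over all join orders (exponential cost)."""
    credit = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            credit[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: c / len(orders) for p, c in credit.items()}

# Hypothetical game: the venture pays 1 if at least two of the three players take part.
def toy_value(coalition):
    return 1.0 if len(coalition) >= 2 else 0.0

players = ["a", "b", "c"]
loo = leave_one_out(players, toy_value)
shap = shapley(players, toy_value)
```

In this game leave-one-out gives everyone zero credit (any single player is redundant), while the Shapley values are 1/3 each and sum to the total payout. Both functions call `value` on coalitions that never actually formed, so both need some way of evaluating counterfactual outcomes.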

I can't for the life of me remember what this is called

Shapley value

(Best wishes, Less Wrong Reference Desk)

Yeah, it's definitely related. The main thing I want to point out is that Shapley values similarly require a model in order to calculate. So you have to distinguish between the problem of calculating a detailed distribution of credit and being able to assign credit "at all" -- in artificial neural networks, backprop is how you assign detailed credit, but a loss function is how you get a notion of credit at all. Hence, the question "where do gradients come from?" -- a reward function is like a pile of money made from a joint venture; but to apply backprop or Shapley value, you also need a model of counterfactual payoffs under a variety of circumstances. This is a problem, if you don't have a separate "epistemic" learning process to provide that model -- ie, it's a problem if you are trying to create one big learning algorithm that does everything.

Specifically, you don't automatically know how to

send rewards to each contributor proportional to how much they improved the actual group decision

because in the cases I'm interested in, ie online learning, you don't have the option of

rerunning it without them and seeing how performance declines

-- because you need a model in order to rerun.

But, also, I think there are further distinctions to make. I believe that if you tried to apply Shapley value to neural networks, it would go poorly; and presumably there should be a "philosophical" reason why this is the case (why Shapley value is solving a different problem than backprop). I don't know exactly what the relevant distinction is.

(Or maybe Shapley value works fine for NN learning; but, I'd be surprised.)

Removing things entirely seems extreme. How about having a continuous "contribution parameter," where running the algorithm without an element would correspond to turning this parameter down to zero, but you could also set the parameter to 0.5 if you wanted that element to have 50% of the influence it has right now. Then you can send rewards to elements if increasing their contribution parameter would improve the decision.

Removing things entirely seems extreme.

Dropout is a thing, though.

Dropout is like the converse of this - you use dropout to assess the non-outdropped elements. This promotes resiliency to perturbations in the model - whereas if you evaluate things by how bad it is to break them, you could promote fragile, interreliant collections of elements over resilient elements.

I think the root of the issue is that this Shapley value doesn't distinguish between something being bad to break, and something being good to have more of. If you removed all my blood I would die, but that doesn't mean that I would currently benefit from additional blood.

Anyhow, the joke was that as soon as you add a continuous parameter, you get gradient descent back again.
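The difference between "bad to break" and "good to have more of" can be seen in a toy example. The `outcome` function below is a made-up saturating function standing in for the blood example, and the two credit rules are sketches under that assumption: one ablates an input, the other takes a finite-difference derivative with respect to its contribution parameter.

```python
import math

def outcome(blood, food):
    """Hypothetical saturating outcome: each input is essential, but more barely helps past 1.0."""
    return math.tanh(5 * blood) * math.tanh(5 * food)

def removal_credit(f, args, i):
    """Credit by ablation: how much the outcome drops when input i is set to zero."""
    ablated = list(args)
    ablated[i] = 0.0
    return f(*args) - f(*ablated)

def gradient_credit(f, args, i, eps=1e-6):
    """Credit by sensitivity: central finite-difference derivative with respect to input i."""
    up, down = list(args), list(args)
    up[i] += eps
    down[i] -= eps
    return (f(*up) - f(*down)) / (2 * eps)

current = (1.0, 1.0)
ablation = removal_credit(outcome, current, 0)
sensitivity = gradient_credit(outcome, current, 0)
```

Here `ablation` is close to 1 (removing the input destroys nearly all of the outcome) while `sensitivity` is close to 0 (a little more of it changes almost nothing). Rewarding elements by `sensitivity` is a gradient step on the contribution parameters.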

Unfortunately, we can't just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they're optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.

Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there's a problem, in that even with that reward, you don't know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.

(Similarly, even if you think actor-critic methods don't count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)

Yeah, I pretty strongly think there's a problem -- not necessarily an insoluble problem, but, one which has not been convincingly solved by any algorithm which I've seen. I think presentations of ML often obscure the problem (because it's not that big a deal in practice -- you can often define good enough episode boundaries or whatnot).

Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there's a problem, in that even with that reward, you don't know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.

Yeah, I feel like "matching rewards to actions is hard" is a pretty clear articulation of the problem.
I agree that it should be surprising, in some sense, that getting rewards isn't enough. That's why I wrote a post on it! But why do you think it should be enough? How do we "just copy the trick"??
I don't agree that this is analogous to the problem evolution has. If evolution just "received" the overall population each generation, and had to figure out which genomes were good/bad based on that, it would be a more analogous situation. However, that's not at all the case. Evolution "receives" a fairly rich vector of which genomes were better/worse, each generation. The analogous case for RL would be if you could output several actions each step, rather than just one, and receive feedback about each. But this is basically "access to counterfactuals"; to get this, you need a model.

(Similarly, even if you think actor-critic methods don't count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)

No, definitely not, unless I'm missing something big.

From page 329 of this draft of Sutton & Barto:

Note that REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed (like the Monte Carlo algorithms in Chapter 5). This is shown explicitly in the boxed on the next page.

So, REINFORCE "solves" the assignment of rewards to actions via the blunt device of an episodic assumption; all rewards in an episode are grouped with all actions during that episode. If you expand the episode to infinity (so as to make no assumption about episode boundaries), then you just aren't learning. This means it's not applicable to the case of an intelligence wandering around and interacting dynamically with a world, where there's no particular bound on how the past may relate to present reward.

The "model" is thus extremely simple and hardwired, which makes it seem one-level. But you can't get away with this if you want to interact and learn on-line with a really complex environment.

Also, since the episodic assumption is a form of myopia, REINFORCE is compatible with the conjecture that any gradients we can actually construct are going to incentivize some form of myopia.
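A minimal sketch of episodic REINFORCE shows where the episode boundary enters. The setup is an assumption chosen for brevity (a stateless softmax policy over two actions, a made-up environment in which action 1 pays 1 and action 0 pays nothing); it is not the boxed algorithm from Sutton & Barto verbatim.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_episode(prefs, env_reward, horizon, lr, rng):
    """Run one episode, then update in retrospect using the return from each time step."""
    probs = softmax(prefs)
    actions = [rng.choices(range(len(prefs)), weights=probs)[0] for _ in range(horizon)]
    rewards = [env_reward(a) for a in actions]
    new_prefs = list(prefs)
    for t, a in enumerate(actions):
        ret = sum(rewards[t:])  # undiscounted return from time t up to the episode boundary
        for k in range(len(prefs)):
            # Gradient of log softmax w.r.t. preference k: 1[k == a] - probs[k].
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            new_prefs[k] += lr * ret * grad_log
    return new_prefs

rng = random.Random(0)
prefs = [0.0, 0.0]
for _ in range(500):
    # Hypothetical environment: action 1 pays 1, action 0 pays nothing.
    prefs = reinforce_episode(prefs, lambda a: float(a), horizon=5, lr=0.05, rng=rng)
```

The line computing `ret` is the hardwired "model": every reward up to the end of the episode is credited to each earlier action in that episode, and nothing outside the episode is considered. Without a `horizon`, there is no point at which the update could be made.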

Oh, I see. You could also have a version of REINFORCE that doesn't make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can't prove anything interesting about this, but you also can't prove anything interesting about actor-critic methods that don't have episode boundaries, I think. Nonetheless, I'd expect it would somewhat work, in the same way that an actor-critic method would somewhat work. (I'm not sure which I expect to work better; partly it depends on the environment and the details of how you implement the actor-critic method.)

(All of this said with very weak confidence; I don't know much RL theory)

You could also have a version of REINFORCE that doesn't make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can't prove anything interesting about this, but you also can't prove anything interesting about actor-critic methods that don't have episode boundaries, I think.

Yeah, you can do this. I expect actor-critic to work better, because your suggestion is essentially a fixed model which says that actions are more relevant to temporally closer rewards (and that this is the only factor to consider).
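The non-episodic variant under discussion can be sketched with a decaying trace of past log-probability gradients. This is one possible reading of the suggestion, not a reference implementation: the two-action softmax policy, the `decay` value, and the environment are assumptions, and the gradients in the trace are the ones computed when each action was taken.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def online_step(prefs, trace, reward_fn, lr, decay, rng):
    """One step with no episode boundary: act, decay the trace, add the new gradient, apply reward."""
    probs = softmax(prefs)
    a = rng.choices(range(len(prefs)), weights=probs)[0]
    r = reward_fn(a)
    # The trace holds the sum over past actions of decay**age * grad log pi(action).
    trace = [decay * e + ((1.0 if k == a else 0.0) - probs[k]) for k, e in enumerate(trace)]
    # Each reward updates every past action at once, weighted by how recent it was.
    prefs = [p + lr * r * e for p, e in zip(prefs, trace)]
    return prefs, trace

rng = random.Random(0)
prefs, trace = [0.0, 0.0], [0.0, 0.0]
for _ in range(3000):
    # Hypothetical environment: action 1 pays 1, action 0 pays nothing.
    prefs, trace = online_step(prefs, trace, lambda a: float(a), lr=0.05, decay=0.8, rng=rng)
```

The `decay` constant plays the role of the fixed model mentioned above: it asserts that an action's relevance to a reward depends only on how long ago the action was taken.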

I'm not sure how to further convey my sense that this is all very interesting. My model is that you're like "ok sure" but don't really see why I'm going on about this.

I'm not sure how to further convey my sense that this is all very interesting. My model is that you're like "ok sure" but don't really see why I'm going on about this.

Yeah, I think this is basically right. For the most part though, I'm trying to talk about things where I disagree with some (perceived) empirical claim, as opposed to the overall "but why even think about these things" -- I am not surprised when it is hard to convey why things are interesting in an explicit way before the research is done.

Here, I was commenting on the perceived claim of "you need to have two-level algorithms in order to learn at all; a one-level algorithm is qualitatively different and can never succeed", where my response is "but no, REINFORCE would do okay, though it might be more sample-inefficient". But it seems like you aren't claiming that, just claiming that two-level algorithms do quantitatively but not qualitatively better.

Actually, that wasn't what I was trying to say. But, now that I think about it, I think you're right.

I was thinking of the discounting variant of REINFORCE as having a fixed, but rather bad, model associating rewards with actions: rewards are tied more with actions nearby. So I was thinking of it as still two-level, just worse than actor-critic.

But, although the credit assignment will make mistakes (a predictable punishment which the agent can do nothing to avoid will nonetheless make any actions leading up to the punishment less likely in the future), they should average out in the long run (those 'wrongfully punished' actions should also be 'wrongfully rewarded'). So it isn't really right to think it strongly depends on the assumption.
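The "averages out" claim can be checked in one line for the simplest case. The notation here is an assumption of this sketch rather than the post's: $π_{θ}$ is a policy with parameters $θ$, and $r$ is a reward that does not depend on the sampled action $a$ (the unavoidable punishment).

$E_{a \sim π_{θ}} [r \nabla_{θ} \log π_{θ} (a)] = r \sum_{a} π_{θ} (a) \frac{\nabla_{θ} π_{θ} (a)}{π_{θ} (a)} = r \nabla_{θ} \sum_{a} π_{θ} (a) = r \nabla_{θ} 1 = 0$

So, in this case at least, an unavoidable punishment adds variance to the update but no bias: it pushes down whichever action happened to be sampled, and those pushes cancel in expectation. The argument relies on $r$ being independent of the action; it says nothing about rewards that do depend on earlier actions.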

Instead, it's better to think of it as a true discounting function. IE, it's not an assumption about the structure of consequences; it's an expression of how much the system cares about distant rewards when taking an action. Under this interpretation, REINFORCE indeed "closes the gradient gap" -- solves the credit assignment problem w/o restrictive modeling assumptions.

Maybe. It might also be argued that REINFORCE depends on some properties of the environment such as ergodicity. I'm not that familiar with the details.

But anyway, it now seems like a plausible counterexample.

One part of it is that I want to scrap classical (“static”) decision theory and move to a more learning-theoretic (“dynamic”) view.

Can you explain more what you mean by this, especially "learning-theoretic"? I've looked at learning theory a bit and the typical setup seems to involve a loss or reward that is immediately observable to the learner, whereas in decision theory, utility can be over parts of the universe that you can't see and therefore can't get feedback from, so it seems hard to apply typical learning theory results to decision theory. I wonder if I'm missing the whole point though... What do you think are the core insights or ideas of learning theory that might be applicable to decision theory?

(I don't speak for Abram but I wanted to explain my own opinion.) Decision theory asks, given certain beliefs an agent has, what is the rational action for em to take. But, what are these "beliefs"? Different frameworks have different answers for that. For example, in CDT a belief is a causal diagram. In EDT a belief is a joint distribution over actions and outcomes. In UDT a belief might be something like a Turing machine (inside the execution of which the agent is supposed to look for copies of emself). Learning theory allows us to gain insight through the observation that beliefs must be learnable, otherwise how would the agent come up with these beliefs in the first place? There might be parts of the beliefs that come from the prior and cannot be learned, but still, at least the type signature of beliefs should be compatible with learning.

Moreover, decision problems are often implicitly described from the point of view of a third party. For example, in Newcomb's paradox we postulate that Omega can predict the agent, which makes perfect sense for an observer looking from the side, but might be difficult to formulate from the point of view of the agent itself. Therefore, understanding decision theory requires the translation of beliefs from the point of view of one observer to the point of view of another. Here also learning theory can help us: we can ask, what are the beliefs Alice should expect Bob to learn given particular beliefs of Alice about the world? From a slightly different angle, the central source of difficulty in decision theory is the notion of counterfactuals, and the attempt to prescribe particular meaning to them, which different decision theories do differently. Instead, we can just postulate that, from the subjective point of view of the agent, counterfactuals are ontologically basic. The agent believes emself to have free will, so to speak. Then, the interesting question is, what kind of counterfactuals are produced by the translation of beliefs from the perspective of a third party to the perspective of the given agent.

Indeed, thinking about learning theory led me to the notion of quasi-Bayesian agents (agents that use incomplete/fuzzy models), and quasi-Bayesian agents automatically solve all Newcomb-like decision problems. In other words, quasi-Bayesian agents are effectively a rigorous version of UDT.

Incidentally, to align AI we literally need to translate beliefs from the user's point of view to the AI's point of view. This is also solved via the same quasi-Bayesian approach. In particular, this translation process preserves the "point of updatelessness", which, in my opinion, is the desired result (the choice of this point is subjective).

My thinking is somewhat similar to Vanessa's. I think a full explanation would require a long post in itself. It's related to my recent thinking about UDT and commitment races. But, here's one way of arguing for the approach in the abstract.

You once asked:

Assuming that we do want to be pre-rational, how do we move from our current non-pre-rational state to a pre-rational one? This is somewhat similar to the question of how do we move from our current non-rational (according to ordinary rationality) state to a rational one. Expected utility theory says that we should act as if we are maximizing expected utility, but it doesn't say what we should do if we find ourselves lacking a prior and a utility function (i.e., if our actual preferences cannot be represented as maximizing expected utility).

The fact that we don't have good answers for these questions perhaps shouldn't be considered fatal to pre-rationality and rationality, but it's troubling that little attention has been paid to them, relative to defining pre-rationality and rationality. (Why are rationality researchers more interested in knowing what rationality is, and less interested in knowing how to be rational? Also, BTW, why are there so few rationality researchers? Why aren't there hordes of people interested in these issues?)

My contention is that rationality should be about the update process. It should be about how you adjust your position. We can have abstract rationality notions as a sort of guiding star, but we also need to know how to steer based on those.

Some examples:

Logical induction can be thought of as the result of performing this transform on Bayesianism; it describes belief states which are not coherent, and gives a rationality principle about how to approach coherence -- rather than just insisting that one must somehow approach coherence.
Evolutionary game theory is more dynamic than the Nash story. It concerns itself more directly with the question of how we get to equilibrium. Strategies which work better get copied. We can think about the equilibria, as we do in the Nash picture; but, the evolutionary story also lets us think about non-equilibrium situations. We can think about attractors (equilibria being point-attractors, vs orbits and strange attractors), and attractor basins; the probability of ending up in one basin or another; and other such things.

However, although the model seems good for studying the behavior of evolved creatures, there does seem to be something missing for artificial agents learning to play games; we don't necessarily want to think of there as being a population which is selected on in that way.

The complete class theorem describes utility-theoretic rationality as the end point of taking Pareto improvements. But, we could instead think about rationality as the process of taking Pareto improvements. This lets us think about (semi-)rational agents whose behavior isn't described by maximizing a fixed expected utility function, but who develop one over time. (This model in itself isn't so interesting, but we can think about generalizing it; for example, by considering the difficulty of the bargaining process -- subagents shouldn't just accept any Pareto improvement offered.)

Again, this model has drawbacks. I'm definitely not saying that by doing this you arrive at the ultimate learning-theoretic decision theory I'd want.
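As an illustration of the "dynamic" picture in the evolutionary game theory example, here is a sketch of replicator dynamics, a standard model of "strategies which work better get copied". The Hawk-Dove payoff numbers, the step size, and the starting population are assumptions chosen for the example.

```python
def replicator_step(shares, payoff, dt=0.01):
    """One Euler step of dx_i/dt = x_i * (f_i - mean fitness), with f_i = sum_j payoff[i][j] * x_j."""
    n = len(shares)
    fitness = [sum(payoff[i][j] * shares[j] for j in range(n)) for i in range(n)]
    mean_fitness = sum(x * f for x, f in zip(shares, fitness))
    new = [x + dt * x * (f - mean_fitness) for x, f in zip(shares, fitness)]
    total = sum(new)
    return [x / total for x in new]  # renormalize to guard against numerical drift

# Hypothetical Hawk-Dove payoffs with resource value 2 and fight cost 4.
# Rows: own strategy (hawk, dove); columns: opponent's strategy.
HAWK_DOVE = [[-1.0, 2.0],
             [0.0, 1.0]]

shares = [0.1, 0.9]
for _ in range(20000):
    shares = replicator_step(shares, HAWK_DOVE)
```

With these payoffs the population settles at half hawks and half doves, which is the mixed equilibrium of the game; the trajectory from `[0.1, 0.9]` to that point is the part the static Nash picture leaves out. Other payoff matrices (for example rock-paper-scissors) give orbits rather than a point attractor.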

Promoted to curated: It's been a while since this post has come out, but I've been thinking of the "credit assignment" abstraction a lot since then, and found it quite useful. I also really like the way the post made me curious about a lot of different aspects of the world, and I liked the way it invited me to boggle at the world together with you.

I also really appreciated your long responses to questions in the comments, which clarified a lot of things for me.

One thing comes to mind for maybe improving the post, though I think that's mostly a difference of competing audiences:

I think some sections of the post end up referencing a lot of really high-level concepts, in a way that I think is valuable as a reference, but also in a way that might cause a lot of people to bounce off of it (even people with a pretty strong AI Alignment background). I can imagine a post that includes very short explanations of those concepts, or moves them into a context where they are more clearly marked as optional (since I think the post stands well without at least some of those high-level concepts).

Interesting post in light of our discussion at CMU agent foundations 2026, in which I questioned whether (Schurz) meta-inductive justification of induction actually justifies model-based planning as in AIXI, or actually suggests a model-free approach.

I think I have juuust enough background to follow the broad strokes of this post, but not to quite grok the parts I think Abram was most interested in.

It definitely caused me to think about credit assignment. I actually ended up thinking about it largely through the lens of Moral Mazes (where challenges of credit assignment combine with other forces to create a really bad environment). Re-reading this post, while I don't quite follow everything, I do successfully get a taste of how credit assignment fits into a bunch of different domains.

For the "myopia/partial-agency" aspects of the post, I'm curious how Abram's thinking has changed. This post AFAICT was a sort of "rant". A year after the fact, did the ideas here feel like they panned out?

It does seem like someone should someday write a post about credit assignment that's a bit more general.

Most of my points from my curation notice still hold. And two years later, I am still thinking a lot about credit assignment as a perspective on many problems I am thinking about.

This seems like one I would significantly re-write for the book if it made it that far. I feel like it got nominated for the introductory material, which I wrote quickly in order to get to the "main point" (the gradient gap). A better version would have discussed credit assignment algorithms more.

From the perspective of full agency (ie, the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile. You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents!

For reinforcement learning with incomplete/fuzzy hypotheses, this separation doesn't exist, because the update rule for fuzzy beliefs depends on the utility function and in some sense even on the actual policy.

Actually I was somewhat confused about what the right update rule for fuzzy beliefs is when I wrote that comment. But I think I got it figured out now.

First, background about fuzzy beliefs:

Let $E$ be the space of environments (defined as the space of instrumental states in Definition 9 here). A fuzzy belief is a concave function $ϕ : E \to [0, 1]$ s.t. $\sup ϕ = 1$. We can think of it as the membership function of a fuzzy set. For an incomplete model $Φ \subseteq E$, the corresponding $ϕ$ is the concave hull of the characteristic function of $Φ$ (i.e. the minimal concave $ϕ$ s.t. $ϕ \geq χ_{Φ}$).

Let $γ$ be the geometric discount parameter and $U (γ) := (1 - γ) \sum_{n = 0}^{\infty} γ^{n} r_{n}$ be the utility function. Given a policy $π$ (EDIT: in general, we allow our policies to explicitly depend on $γ$ ), the value of $π$ at $ϕ$ is defined by

$V_{π} (ϕ, γ) := 1 + \inf_{μ \in E} \left( E_{μ π} [U (γ)] - ϕ (μ) \right)$

The optimal policy and the optimal value for $ϕ$ are defined by

$π_{ϕ, γ}^{*} := \arg\max_{π} V_{π} (ϕ, γ)$

$V (ϕ, γ) := \max_{π} V_{π} (ϕ, γ)$
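To get a feel for these definitions, here is a toy evaluation on a finite set of environments. This is a heavy simplification and should be read as such: the real $E$ is a convex space and the infimum ranges over all of it, $γ$ is held fixed, and the environment names, policy names, and numbers below are all invented for illustration.

```python
def fuzzy_value(expected_utility, phi, policy):
    """V_pi(phi) = 1 + min over environments of (E[U | mu, pi] - phi(mu)), on a finite set."""
    return 1.0 + min(expected_utility[mu][policy] - phi[mu] for mu in phi)

def optimal_policy(expected_utility, phi, policies):
    """The policy maximizing the fuzzy value, by brute force over a finite policy list."""
    return max(policies, key=lambda pi: fuzzy_value(expected_utility, phi, pi))

# Hypothetical numbers: expected discounted utility in [0, 1] for each (environment, policy) pair.
EU = {
    "mu1": {"safe": 0.6, "risky": 0.9},
    "mu2": {"safe": 0.5, "risky": 0.1},
    "mu3": {"safe": 0.0, "risky": 0.0},
}
# Membership 1.0 means "fully inside the hypothesis"; mu3 is ruled out entirely.
phi = {"mu1": 1.0, "mu2": 1.0, "mu3": 0.0}

v_safe = fuzzy_value(EU, phi, "safe")
v_risky = fuzzy_value(EU, phi, "risky")
best = optimal_policy(EU, phi, ["safe", "risky"])
```

With a crisp membership function like this one and utilities in $[0, 1]$, the value reduces to the worst-case expected utility over the environments inside the hypothesis: 0.5 for `safe` and 0.1 for `risky`, so `safe` is chosen. Intermediate membership values would make an environment count as a partial worst case.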

Given a policy $π$ , the regret of $π$ at $ϕ$ is defined by

$\mathrm{Rg}_{π} (ϕ, γ) := V (ϕ, γ) - V_{π} (ϕ, γ)$

$π$ is said to learn $ϕ$ when it is asymptotically optimal for $ϕ$ when $γ \to 1$, that is

$\lim_{γ \to 1} \mathrm{Rg}_{π} (ϕ, γ) = 0$

Given $ζ$ a probability measure over the space of fuzzy hypotheses, the Bayesian regret of $π$ at $ζ$ is defined by

$\mathrm{BRg}_{π} (ζ, γ) := E_{ϕ \sim ζ} [\mathrm{Rg}_{π} (ϕ, γ)]$

$π$ is said to learn $ζ$ when

$\lim_{γ \to 1} \mathrm{BRg}_{π} (ζ, γ) = 0$

If such a $π$ exists, $ζ$ is said to be learnable. Analogously to Bayesian RL, $ζ$ is learnable if and only if it is learned by a specific policy $π_{ζ}^{*}$ (the Bayes-optimal policy). To define it, we define the fuzzy belief $ϕ_{ζ}$ by

$ϕ_{ζ} (μ) := \sup_{\substack{σ : \operatorname{supp} ζ \to E \\ E_{ϕ \sim ζ} [σ (ϕ)] = μ}} E_{ϕ \sim ζ} [ϕ (σ (ϕ))]$

We now define $π_{ζ}^{*} := π_{ϕ_{ζ}}^{*}$.

Now, updating: (EDIT: the definition was needlessly complicated, simplified)

Consider a history $h \in {(A \times O)}^{*}$ or $h \in {(A \times O)}^{*} \times A$ . Here $A$ is the set of actions and $O$ is the set of observations. Define $μ_{ϕ}^{*}$ by

$μ_{ϕ}^{*} := \arg\max_{μ \in E} \min_{π} \left( ϕ (μ) - E_{μ π} [U] \right)$

Let $E^{'}$ be the space of "environments starting from $h$ ". That is, if $h \in {(A \times O)}^{*}$ then $E^{'} = E$ and if $h \in {(A \times O)}^{*} \times A$ then $E^{'}$ is slightly different because the history now begins with an observation instead of with an action.

For any $μ \in E, ν \in E^{'}$ we define ${[ν]}_{h} μ \in E$ by

${[ν]}_{h} μ (o ∣ h^{'}) := \begin{cases} ν (o ∣ h^{''}) & \text{if } h^{'} = h h^{''} \\ μ (o ∣ h^{'}) & \text{otherwise} \end{cases}$

Then, the updated fuzzy belief is

$(ϕ ∣ h) (ν) := ϕ ({[ν]}_{h} μ_{ϕ}^{*}) + \text{constant}$

You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents!

Not... really? "how can I maximize accuracy?" is a very liberal agentification of a process that might be more drily thought of as asking "what is accurate?" Your standard sequence predictor isn't searching through epistemic pseudo-actions to find which ones best maximize its expected accuracy, it's just following a pre-made plan of epistemic action that happens to increase accuracy.

Though this does lead to the thought: if you want to put things on equal footing, does this mean you want to describe a reasoner that searches through epistemic steps/rules like an agent searching through actions/plans?

This is more or less how humans already conceive of difficult abstract reasoning. We don't solve integrals by gradient descent, we imagine doing some sort of tree search where the edges are different abstract manipulations of the integral. But for everyday reasoning, like navigating 3D space, we just use our specialized feed-forward hardware.

Not... really? "how can I maximize accuracy?" is a very liberal agentification of a process that might be more drily thought of as asking "what is accurate?" Your standard sequence predictor isn't searching through epistemic pseudo-actions to find which ones best maximize its expected accuracy, it's just following a pre-made plan of epistemic action that happens to increase accuracy.

Yeah, I absolutely agree with this. My description that you quoted was over-dramatizing the issue.

Really, what you have is an agent sitting on top of non-agentic infrastructure. The non-agentic infrastructure is "optimizing" in a broad sense because it follows a gradient toward predictive accuracy, but it is utterly myopic (doesn't plan ahead to cleverly maximize accuracy).

The point I was making, stated more accurately, is that you (seemingly) need this myopic optimization as a 'protected' sub-part of the agent, which the overall agent cannot freely manipulate (since if it could, it would just corrupt the policy-learning process by wireheading).

Though this does lead to the thought: if you want to put things on equal footing, does this mean you want to describe a reasoner that searches through epistemic steps/rules like an agent searching through actions/plans?

This is more or less how humans already conceive of difficult abstract reasoning.

Yeah, my observation is that it intuitively seems like highly capable agents need to be able to do that; to that end, it seems like one needs to be able to describe a framework where agents at least have that option without it leading to corruption of the overall learning process via the instrumental part strategically biasing the epistemic part to make the instrumental part look good.

(Possibly humans just use a messy solution where the strategic biasing occurs but the damage is lessened by limiting the extent to which the instrumental system can bias the epistemics -- eg, you can't fully choose what to believe.)

I found this a very interesting frame on things, and am glad I read it.

However, the critic is learning to predict; it's just that all we need to predict is expected value.

See also muZero. Also note that predicting optimal value formally ties in with predicting a model. In MDPs, you can reconstruct all of the dynamics using just $| S |$ optimal value functions.

I re-read this post thinking about how and whether this applies to brains...

The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be "solved" in humans by exponential / hyperbolic discounting. It's not exactly episodic, but we'll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.
Relatedly, we seem to generally make and execute plans that are (hierarchically) laid out in time, each with a success criterion at its end, like "I'm going to walk to the store". So we get specific and timely feedback on whether that plan was successful.
We do in fact have a model class. It seems very rich; in terms of "grain of truth", well I'm inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e. not because it is incompatible with the internal data structures). Maybe that's good enough?

Just some thoughts; sorry if this is irrelevant or I'm misunderstanding anything. :-)

The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be "solved" in humans by exponential / hyperbolic discounting. It's not exactly episodic, but we'll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.

I interpret you as suggesting something like what Rohin is suggesting, with a hyperbolic function giving the weights.

It seems (to me) that the literature establishes that our behavior can be approximately described by the hyperbolic discounting rule (in certain circumstances, anyway), but it comes nowhere near establishing that the mechanism by which we learn looks like this, and in fact there is some evidence against it. But that's a big topic. For a quick argument, I observe that humans are highly capable, and I generally expect actor/critic to be more capable than dumbly associating rewards with actions via the hyperbolic function. That doesn't mean humans use actor/critic; the point is that there are a lot of more-sophisticated setups to explore.
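To make the "dumb" baseline concrete, here is a minimal sketch of associating a reward with earlier actions via a discount function. The function names and the constants `k` and `gamma` are made up for illustration; nothing here is a claim about how brains actually do it.

```python
def hyperbolic_weight(delay, k=1.0):
    """Hyperbolic discount: weight falls off as 1 / (1 + k * delay)."""
    return 1.0 / (1.0 + k * delay)


def exponential_weight(delay, gamma=0.9):
    """Exponential discount: weight falls off as gamma ** delay."""
    return gamma ** delay


def assign_credit(action_times, reward_time, weight_fn):
    """Give each action taken at or before the reward a weight based on its delay."""
    return {
        t: weight_fn(reward_time - t)
        for t in action_times
        if t <= reward_time
    }


# A reward at time 10 is credited to actions taken at times 0, 5 and 9.
hyperbolic_credit = assign_credit([0, 5, 9], 10, hyperbolic_weight)
exponential_credit = assign_credit([0, 5, 9], 10, exponential_weight)
```

Note that the rule never asks whether a given action actually contributed to the reward; recency is the only signal, which is why I'd expect something like actor/critic to do better.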

We do in fact have a model class.

It's possible that our models are entirely subservient to instrumental stuff (ie, we "learn to think" rather than "thinking to learn"), which would mean we don't have the big split which I'm pointing to -- ie, that we solve the credit assignment problem "directly" somehow, rather than needing to learn to do so.

It seems very rich; in terms of "grain of truth", well I'm inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e. not because it is incompatible with the internal data structures). Maybe that's good enough?

Claim: predictive learning gets gradients "for free" ... Claim: if you're learning to act, you do not similarly get gradients "for free". You take an action, and you see results of that one action. This means you fundamentally don't know what would have happened had you taken alternate actions, which means you don't have a direction to move your policy in. You don't know whether alternatives would have been better or worse. So, rewards you observe seem like not enough to determine how you should learn.

This immediately jumped out at me as an implausible distinction because I was just reading Surfing Uncertainty which goes on endlessly about how the machinery of hierarchical predictive coding is exactly the same as the machinery of hierarchical motor control (with "priors" in the former corresponding to "priors + control-theory-setpoints" in the latter, and with "predictions about upcoming proprioceptive inputs" being identical to the muscle control outputs). Example excerpt:

the heavy lifting that is usually done by the use of efference copy, inverse models, and optimal controllers [in the models proposed by non-predictive-coding people] is now shifted [in the predictive coding paradigm] to the acquisition and use of the predictive (generative) model (i.e., the right set of prior probabilistic ‘beliefs’). This is potentially advantageous if (but only if) we can reasonably assume that these beliefs ‘emerge naturally as top-down or empirical priors during hierarchical perceptual inference’ (Friston, 2011a, p. 492). The computational burden thus shifts to the acquisition of the right set of priors (here, priors over trajectories and state transitions), that is, it shifts the burden to acquiring and tuning the generative model itself. --Surfing Uncertainty chapter 4

I'm a bit hazy on the learning mechanism for this (confusingly-named) "predictive model" (I haven't gotten around to chasing down the references) and how that relates to what you wrote... But it does sorta sound like it entails one update process rather than two...

Yep, I 100% agree that this is relevant. The PP/Friston/free-energy/active-inference camp is definitely at least trying to "cross the gradient gap" with a unified theory as opposed to a two-system solution. However, I'm not sure how to think about it yet.

I may be completely wrong, but I have a sense that there's a distinction between learning and inference which plays a similar role; IE, planning is just inference, but both planning and inference work only because the learning part serves as the second "protected layer"??
It may be that the PP is "more or less" the Bayesian solution; IE, it requires a grain of truth to get good results, so it doesn't really help with the things I'm most interested in getting out of "crossing the gap".
Note that PP clearly tries to implement things by pushing everything into epistemics. On the other hand, I'm mostly discussing what happens when you try to smoosh everything into the instrumental system. So many of my remarks are not directly relevant to PP.
I get the sense that Friston might be using the "evolution solution" I mentioned; so, unifying things in a way which kind of lets us talk about evolved agents, but not artificial ones. However, this is obviously an oversimplification, because he does present designs for artificial agents based on the ideas.

Overall, my current sense is that PP obscures the issue I'm interested in more than solves it, but it's not clear.

a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile

I stand as a counterexample: I personally want my epistemic layer to have accurate beliefs—y'know, having read the sequences… :-P

I think of my epistemic system like I think of my pocket calculator: a tool I use to better achieve my goals. The tool doesn't need to share my goals.

The way I think about it is:

Early in training, the AGI is too stupid to formulate and execute a plan to hack into its epistemic level.
Late in training, we can hopefully get to the place where the AGI's values, like mine, involve a concept of "there is a real world independent of my beliefs", and its preferences involve the state of that world, and therefore "get accurate beliefs" becomes instrumentally useful and endorsed.
In between … well … in between, we're navigating treacherous waters …

Second, there's an obstacle to pragmatic/practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can't allow instrumental concerns to contaminate epistemics! We risk corruption!

1. The instrumental level gets some influence over what to look at, where to go, what to read, who to talk to, etc.
2. There's a trick (involving acetylcholine) where the instrumental level has some influence over a multiplier on the epistemic level's gradients (a.k.a. learning rate). So the epistemic level always updates towards "more accurate predictions on this frame", but it updates infinitesimally in situations where prediction accuracy is instrumentally useless, and it updates strongly in situations where prediction accuracy is instrumentally important.
3. There's a different mechanism that creates the same end result as #2: namely, the instrumental level has some influence over what memories get replayed more or less often.
For #2 and #3, the instrumental level has some influence but not complete influence. There are other hardcoded algorithms running in parallel and flagging certain things as important, and the instrumental level has no straightforward way to prevent that from happening.
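A toy sketch of the learning-rate trick, under the assumption that the epistemic level is a one-parameter linear predictor trained on squared error. The function name and the `importance` signal are hypothetical stand-ins, not a model of the actual neural mechanism.

```python
def modulated_update(weight, x, target, importance, base_lr=0.1):
    """One gradient step on squared prediction error, scaled by importance.

    The instrumental level only supplies `importance`, which is clamped to
    [0, 1]. It can shrink or keep the step, but it cannot flip its direction.
    """
    importance = min(max(importance, 0.0), 1.0)
    prediction = weight * x
    error = prediction - target
    grad = error * x  # derivative of 0.5 * (weight * x - target) ** 2
    return weight - base_lr * importance * grad


w = 0.0
w_ignored = modulated_update(w, x=1.0, target=1.0, importance=0.0)
w_attended = modulated_update(w, x=1.0, target=1.0, importance=1.0)
w_hostile = modulated_update(w, x=1.0, target=1.0, importance=-5.0)
```

The design choice that matters is the clamp: the update is always zero or in the direction of better prediction, so the instrumental level can decide what is worth learning without being able to push beliefs away from accuracy.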

In between … well … in between, we're navigating treacherous waters …

Right, I basically agree with this picture. I might revise it a little:

Early, the AGI is too dumb to hack its epistemics (provided we don't give it easy ways to do so!).
In the middle, there's a danger zone.
When the AGI is pretty smart, it sees why one should be cautious about such things, and it also sees why any modifications should probably be in pursuit of truthfulness (because true beliefs are a convergent instrumental goal) as opposed to other reasons.
When the AGI is really smart, it might see better ways of organizing itself (eg, specific ways to hack epistemics which really are for the best even though they insert false beliefs), but we're OK with that, because it's really freaking smart and it knows to be cautious and it still thinks this.

So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system.

I can't for the life of me remember what this is called

Shapley value

(Best wishes, Less Wrong Reference Desk)

Specifically, you don't automatically know how to

send rewards to each contributor proportional to how much they improved the actual group decision

because in the cases I'm interested in, ie online learning, you don't have the option of

rerunning it without them and seeing how performance declines

-- because you need a model in order to rerun.

(Or maybe Shapley value works fine for NN learning; but, I'd be surprised.)
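For concreteness, here is a minimal sketch of the exact Shapley computation on a toy game (the game and the names `shapley_values` and `toy_value` are made up for illustration). The thing to notice is that `value` gets called on coalitions that never actually happened, which is exactly the "rerun it without them" step.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Counterfactual query: how much does adding p change the outcome?
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical game: 'a' and 'b' only succeed together; 'c' adds nothing."""
    return 1.0 if {"a", "b"} <= coalition else 0.0


credit = shapley_values(["a", "b", "c"], toy_value)
```

In the online setting there is no `toy_value` to call: you only observe the outcome of the one coalition that actually acted, so something has to play the role of a model before this computation can even start. (The exact version is also factorial-time in the number of players, but that is a separate problem.)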

Removing things entirely seems extreme.

Dropout is a thing, though.

Anyhow, the joke was that as soon as you add a continuous parameter, you get gradient descent back again.

Unfortunately, we can't just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they're optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.

Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there's a problem, in that even with that reward, you don't know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.

Yeah, I feel like "matching rewards to actions is hard" is a pretty clear articulation of the problem.
I agree that it should be surprising, in some sense, that getting rewards isn't enough. That's why I wrote a post on it! But why do you think it should be enough? How do we "just copy the trick"??
I don't agree that this is analogous to the problem evolution has. If evolution just "received" the overall population each generation, and had to figure out which genomes were good/bad based on that, it would be a more analogous situation. However, that's not at all the case. Evolution "receives" a fairly rich vector of which genomes were better/worse, each generation. The analogous case for RL would be if you could output several actions each step, rather than just one, and receive feedback about each. But this is basically "access to counterfactuals"; to get this, you need a model.

(Similarly, even if you think actor-critic methods don't count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)

No, definitely not, unless I'm missing something big.

From page 329 of this draft of Sutton & Barto:

Note that REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed (like the Monte Carlo algorithms in Chapter 5). This is shown explicitly in the boxed algorithm on the next page.

The "model" is thus extremely simple and hardwired, which makes it seem one-level. But you can't get away with this if you want to interact and learn on-line with a really complex environment.

Also, since the episodic assumption is a form of myopia, REINFORCE is compatible with the conjecture that any gradients we can actually construct are going to incentivize some form of myopia.

(All of this said with very weak confidence; I don't know much RL theory)
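A rough sketch of what the episodic assumption looks like in code, simplified to a stateless softmax policy (that simplification, and all the names, are mine rather than Sutton & Barto's). The point is that `returns_to_go` cannot run until the episode is over.

```python
import math


def returns_to_go(rewards, gamma=1.0):
    """Return G_t for every step, computed backwards over a finished episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def softmax(prefs):
    """Convert action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_update(prefs, actions, rewards, lr=0.1, gamma=1.0):
    """Apply REINFORCE-style steps to a stateless softmax policy, in retrospect."""
    prefs = list(prefs)
    for a, g in zip(actions, returns_to_go(rewards, gamma)):
        probs = softmax(prefs)
        for i in range(len(prefs)):
            # Gradient of log pi(a) with respect to preference i.
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] += lr * g * grad_log
    return prefs


# One two-step episode: action 0 then action 1, with reward only at the end.
new_prefs = reinforce_update([0.0, 0.0], actions=[0, 1], rewards=[0.0, 1.0])
```

The episode boundary is doing the work of the "model" here: it is a hardwired statement that rewards after the boundary have nothing to do with actions before it.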

You could also have a version of REINFORCE that doesn't make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can't prove anything interesting about this, but you also can't prove anything interesting about actor-critic methods that don't have episode boundaries, I think.
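A minimal sketch of that non-episodic variant, assuming a stateless softmax policy and an eligibility-trace style accumulator (the class name, `decay`, and `lr` are hypothetical choices). One approximation to flag: each action's log-probability gradient is computed at the time of the action and is not recomputed after later updates.

```python
import math


def softmax(prefs):
    """Convert action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


class DecayingTraceLearner:
    """Policy-gradient learner with no episode boundaries."""

    def __init__(self, n_actions, lr=0.1, decay=0.9):
        self.prefs = [0.0] * n_actions
        self.trace = [0.0] * n_actions
        self.lr = lr
        self.decay = decay

    def record_action(self, action):
        """Decay the trace of older actions, then add this action's gradient."""
        probs = softmax(self.prefs)
        for i in range(len(self.prefs)):
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            self.trace[i] = self.decay * self.trace[i] + grad_log

    def receive_reward(self, reward):
        """Credit every past action at once, weighted by how recent it was."""
        for i in range(len(self.prefs)):
            self.prefs[i] += self.lr * reward * self.trace[i]


learner = DecayingTraceLearner(n_actions=2)
learner.record_action(0)
learner.record_action(1)
learner.receive_reward(1.0)
```

The decay rate is playing the same role the episode boundary played before: it is a fixed, hardwired assumption about how far back credit should reach.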

I'm not sure how to further convey my sense that this is all very interesting. My model is that you're like "ok sure" but don't really see why I'm going on about this.

I'm not sure how to further convey my sense that this is all very interesting. My model is that you're like "ok sure" but don't really see why I'm going on about this.

Actually, that wasn't what I was trying to say. But, now that I think about it, I think you're right.

Maybe. It might also be argued that REINFORCE depends on some properties of the environment such as ergodicity. I'm not that familiar with the details.

But anyway, it now seems like a plausible counterexample.

One part of it is that I want to scrap classical (“static”) decision theory and move to a more learning-theoretic (“dynamic”) view.

You once asked:

Assuming that we do want to be pre-rational, how do we move from our current non-pre-rational state to a pre-rational one? This is somewhat similar to the question of how do we move from our current non-rational (according to ordinary rationality) state to a rational one. Expected utility theory says that we should act as if we are maximizing expected utility, but it doesn't say what we should do if we find ourselves lacking a prior and a utility function (i.e., if our actual preferences cannot be represented as maximizing expected utility).

The fact that we don't have good answers for these questions perhaps shouldn't be considered fatal to pre-rationality and rationality, but it's troubling that little attention has been paid to them, relative to defining pre-rationality and rationality. (Why are rationality researchers more interested in knowing what rationality is, and less interested in knowing how to be rational? Also, BTW, why are there so few rationality researchers? Why aren't there hordes of people interested in these issues?)

Some examples:

Logical induction can be thought of as the result of performing this transform on Bayesianism; it describes belief states which are not coherent, and gives a rationality principle about how to approach coherence -- rather than just insisting that one must somehow approach coherence.
Evolutionary game theory is more dynamic than the Nash story. It concerns itself more directly with the question of how we get to equilibrium. Strategies which work better get copied. We can think about the equilibria, as we do in the Nash picture; but, the evolutionary story also lets us think about non-equilibrium situations. We can think about attractors (equilibria being point-attractors, vs orbits and strange attractors), and attractor basins; the probability of ending up in one basin or another; and other such things.

However, although the model seems good for studying the behavior of evolved creatures, there does seem to be something missing for artificial agents learning to play games; we don't necessarily want to think of there as being a population which is selected on in that way.

The complete class theorem describes utility-theoretic rationality as the end point of taking Pareto improvements. But, we could instead think about rationality as the process of taking Pareto improvements. This lets us think about (semi-)rational agents whose behavior isn't described by maximizing a fixed expected utility function, but who develop one over time. (This model in itself isn't so interesting, but we can think about generalizing it; for example, by considering the difficulty of the bargaining process -- subagents shouldn't just accept any Pareto improvement offered.)

Again, this model has drawbacks. I'm definitely not saying that by doing this you arrive at the ultimate learning-theoretic decision theory I'd want.
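As an illustration of the evolutionary game theory example, here is a small sketch of discrete-time replicator dynamics on a Hawk-Dove game. The payoff numbers, step size, and function name are arbitrary choices for the example; it is an Euler approximation, not a careful integrator.

```python
def replicator_step(shares, payoff, dt=0.1):
    """One Euler step of replicator dynamics: better-than-average strategies grow."""
    n = len(shares)
    fitness = [sum(payoff[i][j] * shares[j] for j in range(n)) for i in range(n)]
    average = sum(s * f for s, f in zip(shares, fitness))
    new = [s + dt * s * (f - average) for s, f in zip(shares, fitness)]
    total = sum(new)  # renormalize to guard against floating-point drift
    return [s / total for s in new]


# Rows: payoff to Hawk, Dove; columns: opponent is Hawk, Dove.
# Hypothetical numbers: resource worth 2, fight cost 4.
HAWK_DOVE = [[-1.0, 2.0],
             [0.0, 1.0]]

shares = [0.1, 0.9]  # start with 10% hawks
for _ in range(2000):
    shares = replicator_step(shares, HAWK_DOVE)
```

With these payoffs the population drifts toward the half-hawk, half-dove mix, which is also the mixed equilibrium of the game. The dynamic story adds the trajectory and the basin of attraction; the static story only names the end point.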

I also really appreciated your long responses to questions in the comments, which clarified a lot of things for me.

One thing comes to mind for maybe improving the post, though I think that's mostly a difference of competing audiences:

I think I have juuust enough background to follow the broad strokes of this post, but not to quite grok the parts I think Abram was most interested in.

It does seem like someone should someday write a post about credit assignment that's a bit more general.

Most of my points from my curation notice still hold. And two years later, I am still thinking a lot about credit assignment as a perspective on many problems I am thinking about.

From the perspective of full agency (ie, the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile. You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents!

Actually I was somewhat confused about what the right update rule for fuzzy beliefs is when I wrote that comment. But I think I got it figured out now.

First, background about fuzzy beliefs:

$V_{π} (ϕ, γ) := 1 + \inf_{μ \in E} \left( \mathbb{E}_{μ π} [U (γ)] - ϕ (μ) \right)$

The optimal policy and the optimal value for $ϕ$ are defined by

$π_{ϕ, γ}^{*} := \arg\max_{π} V_{π} (ϕ, γ)$

$V (ϕ, γ) := \max_{π} V_{π} (ϕ, γ)$

Given a policy $π$ , the regret of $π$ at $ϕ$ is defined by

$\mathrm{Rg}_{π} (ϕ, γ) := V (ϕ, γ) - V_{π} (ϕ, γ)$

$π$ is said to learn $ϕ$ when it is asymptotically optimal for $ϕ$ as $γ \to 1$, that is

$\lim_{γ \to 1} \mathrm{Rg}_{π} (ϕ, γ) = 0$

Given $ζ$ a probability measure over the space of fuzzy hypotheses, the Bayesian regret of $π$ at $ζ$ is defined by

$\mathrm{BRg}_{π} (ζ, γ) := \mathbb{E}_{ϕ \sim ζ} [\mathrm{Rg}_{π} (ϕ, γ)]$

$π$ is said to learn $ζ$ when

$\lim_{γ \to 1} \mathrm{BRg}_{π} (ζ, γ) = 0$

$ϕ_{ζ} (μ) := \sup_{(σ : \mathrm{supp}\, ζ \to E) \,:\, \mathbb{E}_{ϕ \sim ζ} [σ (ϕ)] = μ} \mathbb{E}_{ϕ \sim ζ} [ϕ (σ (ϕ))]$

We now define $π_{ζ}^{*} := π_{ϕ_{ζ}}^{*}$.

Now, updating: (EDIT: the definition was needlessly complicated, simplified)

Consider a history $h \in {(A \times O)}^{*}$ or $h \in {(A \times O)}^{*} \times A$ . Here $A$ is the set of actions and $O$ is the set of observations. Define $μ_{ϕ}^{*}$ by

$μ_{ϕ}^{*} := \arg\max_{μ \in E} \min_{π} \left( ϕ (μ) - \mathbb{E}_{μ π} [U] \right)$

For any $μ \in E, ν \in E^{'}$ we define ${[ν]}_{h} μ \in E$ by

${[ν]}_{h} μ (o ∣ h^{'}) := \begin{cases} ν (o ∣ h^{''}) & \text{if } h^{'} = h h^{''} \\ μ (o ∣ h^{'}) & \text{otherwise} \end{cases}$

Then, the updated fuzzy belief is

$(ϕ ∣ h) (ν) := ϕ ({[ν]}_{h} μ_{ϕ}^{*}) + \text{constant}$
