Actually, that wasn't what I was trying to say. But, now that I think about it, I think you're right.
I was thinking of the discounting variant of REINFORCE as having a fixed, but rather bad, model associating rewards with actions: rewards are tied more with actions nearby. So I was thinking of it as still two-level, just worse than actor-critic.
But, although the credit assignment will make mistakes (a predictable punishment which the agent can do nothing to avoid will nonetheless make any actions leading up to the punishment less likely in the future), they should average out in the long run (those 'wrongfully punished' actions should also be 'wrongfully rewarded'). So it isn't really right to think it strongly depends on the assumption.
Instead, it's better to think of it as a true discounting function. IE, it's not an assumption about the structure of consequences; it's an expression of how much the system cares about distant rewards when taking an action. Under this interpretation, REINFORCE indeed "closes the gradient gap" -- solves the credit assignment problem w/o restrictive modeling assumptions.
Maybe. It might also be argued that REINFORCE depends on some properties of the environment such as ergodicity. I'm not that familiar with the details.
But anyway, it now seems like a plausible counterexample.
The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be "solved" in humans by exponential / hyperbolic discounting. It's not exactly episodic, but we'll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.
I interpret you as suggesting something like what Rohin is suggesting, with a hyperbolic function giving the weights.
It seems (to me) the literature establishes that our behavior can be approximately described by the hyperbolic discounting rule (in certain circumstances anyway), but it comes nowhere near establishing that the mechanism by which we learn looks like this, and in fact there is some evidence against it. But that's a big topic. For a quick argument, I observe that humans are highly capable, and I generally expect actor/critic to be more capable than dumbly associating rewards with actions via the hyperbolic function. That doesn't mean humans use actor/critic; the point is that there are a lot of more-sophisticated setups to explore.
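To make the two discounting shapes concrete, here is a minimal sketch of the weight each one gives to a reward that arrives `delay` steps after an action. The function names and the default parameters `gamma` and `k` are my own illustrative choices, not anything from the thread; the formulas are just the standard textbook forms.

```python
def exponential_weight(delay, gamma=0.9):
    """Exponential discounting: the weight shrinks by a constant factor per step."""
    return gamma ** delay


def hyperbolic_weight(delay, k=1.0):
    """Hyperbolic discounting: the weight falls off as 1 / (1 + k * delay)."""
    return 1.0 / (1.0 + k * delay)


# Compare how quickly each rule forgets a reward as the delay grows.
for delay in (0, 1, 10, 100):
    print(delay, exponential_weight(delay), hyperbolic_weight(delay))
```

Either function could supply the weights for "dumbly associating rewards with actions"; the sketch says nothing about which mechanism, if either, humans actually use.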
We do in fact have a model class.
It's possible that our models are entirely subservient to instrumental stuff (ie, we "learn to think" rather than "thinking to learn"), which would mean we don't have the big split which I'm pointing to -- ie, that we solve the credit assignment problem "directly" somehow, rather than needing to learn to do so.
It seems very rich; in terms of "grain of truth", well I'm inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e. not because they are incompatible with the internal data structures). Maybe that's good enough?
Not... really? "how can I maximize accuracy?" is a very liberal agentification of a process that might be more drily thought of as asking "what is accurate?" Your standard sequence predictor isn't searching through epistemic pseudo-actions to find which ones best maximize its expected accuracy, it's just following a pre-made plan of epistemic action that happens to increase accuracy.
Yeah, I absolutely agree with this. My description that you quoted was over-dramatizing the issue.
Really, what you have is an agent sitting on top of non-agentic infrastructure. The non-agentic infrastructure is "optimizing" in a broad sense because it follows a gradient toward predictive accuracy, but it is utterly myopic (doesn't plan ahead to cleverly maximize accuracy).
The point I was making, stated more accurately, is that you (seemingly) need this myopic optimization as a 'protected' sub-part of the agent, which the overall agent cannot freely manipulate (since if it could, it would just corrupt the policy-learning process by wireheading).
Though this does lead to the thought: if you want to put things on equal footing, does this mean you want to describe a reasoner that searches through epistemic steps/rules like an agent searching through actions/plans?
This is more or less how humans already conceive of difficult abstract reasoning.
Yeah, my observation is that it intuitively seems like highly capable agents need to be able to do that; to that end, it seems like one needs to be able to describe a framework where agents at least have that option without it leading to corruption of the overall learning process via the instrumental part strategically biasing the epistemic part to make the instrumental part look good.
(Possibly humans just use a messy solution where the strategic biasing occurs but the damage is lessened by limiting the extent to which the instrumental system can bias the epistemics -- eg, you can't fully choose what to believe.)
How does that work?
My thinking is somewhat similar to Vanessa's. I think a full explanation would require a long post in itself. It's related to my recent thinking about UDT and commitment races. But, here's one way of arguing for the approach in the abstract.
You once asked:
Assuming that we do want to be pre-rational, how do we move from our current non-pre-rational state to a pre-rational one? This is somewhat similar to the question of how do we move from our current non-rational (according to ordinary rationality) state to a rational one. Expected utility theory says that we should act as if we are maximizing expected utility, but it doesn't say what we should do if we find ourselves lacking a prior and a utility function (i.e., if our actual preferences cannot be represented as maximizing expected utility).
The fact that we don't have good answers for these questions perhaps shouldn't be considered fatal to pre-rationality and rationality, but it's troubling that little attention has been paid to them, relative to defining pre-rationality and rationality. (Why are rationality researchers more interested in knowing what rationality is, and less interested in knowing how to be rational? Also, BTW, why are there so few rationality researchers? Why aren't there hordes of people interested in these issues?)
My contention is that rationality should be about the update process. It should be about how you adjust your position. We can have abstract rationality notions as a sort of guiding star, but we also need to know how to steer based on those.
You could also have a version of REINFORCE that doesn't make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can't prove anything interesting about this, but you also can't prove anything interesting about actor-critic methods that don't have episode boundaries, I think.
Yeah, you can do this. I expect actor-critic to work better, because your suggestion is essentially a fixed model which says that actions are more relevant to temporally closer rewards (and that this is the only factor to consider).
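For concreteness, here is a rough sketch of that non-episodic variant, assuming a softmax policy over a handful of actions with one logit per action. All names here (`decayed_credit`, `update_trace`, `run`, and the parameters `alpha` and `decay`) are hypothetical. `decayed_credit` is the literal description (weight each past action's gradient by `decay ** age`), and `update_trace` is an incremental running sum that gives the same vector without storing the history.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(probs, action):
    """Gradient of log pi(action) with respect to the logits."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def decayed_credit(past_grads, decay):
    """Explicit form: sum the gradients of all past actions, weighted by decay**age."""
    total = [0.0] * len(past_grads[0])
    for age, grad in enumerate(reversed(past_grads)):
        weight = decay ** age
        for i, g in enumerate(grad):
            total[i] += weight * g
    return total


def update_trace(trace, grad, decay):
    """Incremental form: the same sum, maintained as a running trace."""
    return [decay * t + g for t, g in zip(trace, grad)]


def run(reward_fn, n_actions=2, steps=1000, alpha=0.01, decay=0.9, seed=0):
    """No episode boundaries: every reward updates the policy via the trace."""
    rng = random.Random(seed)
    theta = [0.0] * n_actions
    trace = [0.0] * n_actions
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(n_actions), weights=probs)[0]
        trace = update_trace(trace, grad_log_softmax(probs, action), decay)
        reward = reward_fn(action)
        theta = [th + alpha * reward * e for th, e in zip(theta, trace)]
    return softmax(theta)
```

The only "model" of which action caused which reward is the single number `decay`, which is the sense in which this is a fixed model based purely on temporal closeness. The sketch makes no claim about convergence.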
I'm not sure how to further convey my sense that this is all very interesting. My model is that you're like "ok sure" but don't really see why I'm going on about this.
Yeah, it's definitely related. The main thing I want to point out is that Shapley values similarly require a model in order to calculate. So you have to distinguish between the problem of calculating a detailed distribution of credit and being able to assign credit "at all" -- in artificial neural networks, backprop is how you assign detailed credit, but a loss function is how you get a notion of credit at all. Hence, the question "where do gradients come from?" -- a reward function is like a pile of money made from a joint venture; but to apply backprop or Shapley value, you also need a model of counterfactual payoffs under a variety of circumstances. This is a problem, if you don't have a separate "epistemic" learning process to provide that model -- ie, it's a problem if you are trying to create one big learning algorithm that does everything.
Specifically, you don't automatically know how to
send rewards to each contributor proportional to how much they improved the actual group decision
because in the cases I'm interested in, ie online learning, you don't have the option of
rerunning it without them and seeing how performance declines
-- because you need a model in order to rerun.
But, also, I think there are further distinctions to make. I believe that if you tried to apply Shapley value to neural networks, it would go poorly; and presumably there should be a "philosophical" reason why this is the case (why Shapley value is solving a different problem than backprop). I don't know exactly what the relevant distinction is.
(Or maybe Shapley value works fine for NN learning; but, I'd be surprised.)
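To illustrate the dependence on a model, here is a small sketch of an exact Shapley computation. The function `shapley_values` and the toy payoff `toy_value` are hypothetical names made up for this example. The thing to notice is the `value` argument: it has to return a payoff for every counterfactual coalition, including ones that never actually occurred, and that callable is the model.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders.

    `value` must answer "what would this coalition have earned?" for every
    subset of players, observed or not.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(coalition):
    """Assumed toy payoff: the venture pays off only if both 'a' and 'b' take part."""
    return 1.0 if {"a", "b"} <= coalition else 0.0


print(shapley_values(["a", "b", "c"], toy_value))
```

In the online setting you only ever observe the payoff of the one coalition that actually acted, so there is nothing to pass in as `value` unless some other process has learned it.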
What you call floor for AlphaGo, i.e. the move evaluations, are not even boundaries (in the sense nostalgebraist defines it), that would just be the object level (no meta at all) policy.
I think in general the idea of the object level policy with no meta isn't well-defined, if the agent at least does a little meta all the time. In AlphaGo, it works fine to shut off the meta; but you could imagine a system where shutting off the meta would put it in such an abnormal state (like it's on drugs) that the observed behavior wouldn't mean very much in terms of its usual operation. Maybe this is the point you are making about humans not having a good floor/ceiling distinction.
But, I think we can conceive of the "floor" more generally. If the ceiling is the fixed structure, e.g. the update for the weights, the "floor" is the lowest-level content -- e.g. the weights themselves. Whether thinking at some meta-level or not, these weights determine the fast heuristics by which a system reasons.
I still think some of what nostalgebraist said about boundaries seems more like the floor than the ceiling.
The space "between" the floor and the ceiling involves constructed meta levels, which are larger computations (ie not just a single application of a heuristic function), but which are not fixed. This way we can think of the floor/ceiling spectrum as small-to-large: the floor is what happens in a very small amount of time; the ceiling is the whole entire process of the algorithm (learning and interacting with the world); the "interior" is anything in-between.
Of course, this makes it sort of trivial, in that you could apply the concept to anything at all. But the main interesting thing is how an agent's subjective experience seems to interact with floors and ceilings. IE, we can't access floors very well because they happen "too quickly", and besides, they're the thing that we do everything with (it's difficult to imagine what it would mean for a consciousness to have subjective "access to" its neurons/transistors). But we can observe the consequences very immediately, and reflect on that. And the fast operations can be adjusted relatively easily (e.g. updating neural weights). Intermediate-sized computational phenomena can be reasoned about, and accessed interactively, "from the outside" by the rest of the system. But the whole computation can be "reasoned about but not updated" in a sense, and becomes difficult to observe again (not "from the outside" the way smaller sub-computations can be observed).
Sorry for taking so long to respond to this one.
I don't get the last step in your argument:
In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ8 happens to be larger are expected to outperform the other models, in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ8 (over the model population).
Why do those models outperform? I think you must be imagining a different setup, but I'm interpreting your setup as:
In other words, many members of the population can swoop in and reap the benefits caused by high-θ8 members. So high-θ8 carriers do not specifically benefit.
Yeah, I pretty strongly think there's a problem -- not necessarily an insoluble problem, but, one which has not been convincingly solved by any algorithm which I've seen. I think presentations of ML often obscure the problem (because it's not that big a deal in practice -- you can often define good enough episode boundaries or whatnot).
Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there's a problem, in that even with that reward, you don't know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.
(Similarly, even if you think actor-critic methods don't count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)
No, definitely not, unless I'm missing something big.
From page 329 of this draft of Sutton & Barto:
Note that REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed (like the Monte Carlo algorithms in Chapter 5). This is shown explicitly in the boxed algorithm on the next page.
So, REINFORCE "solves" the assignment of rewards to actions via the blunt device of an episodic assumption; all rewards in an episode are grouped with all actions during that episode. If you expand the episode to infinity (so as to make no assumption about episode boundaries), then you just aren't learning. This means it's not applicable to the case of an intelligence wandering around and interacting dynamically with a world, where there's no particular bound on how the past may relate to present reward.
The "model" is thus extremely simple and hardwired, which makes it seem one-level. But you can't get away with this if you want to interact and learn on-line with a really complex environment.
Also, since the episodic assumption is a form of myopia, REINFORCE is compatible with the conjecture that any gradients we can actually construct are going to incentivize some form of myopia.
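As a small illustration of the episodic dependence, here is a sketch of the return computation that the quoted passage describes. The function name `returns_to_go` and the parameter `gamma` are my own labels. The function needs the complete reward list for the episode before it can produce anything, which is the "updates made in retrospect" part.

```python
def returns_to_go(rewards, gamma=1.0):
    """Complete return from each time step t to the end of the episode.

    Works backwards from the final reward, so it cannot be called until
    the episode has actually ended.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A single reward at the end of the episode is credited to every earlier action.
print(returns_to_go([0.0, 0.0, 1.0]))
```

With no episode boundary there is no last index to start the backward pass from, which is one way to see why stretching the episode to infinity means no update ever happens.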