Epistemic status: Pretty sure about core, not about edges
A while ago, I noticed a possible bias in how I evaluated reinforcement learning agents. It tended to make me lower my estimate of their intelligence after I watched a video of them in action.
I've seen other people fall into what I believe to be the same error. So I'm writing this to correct myself if I am wrong and to alert others if I am right.
Many reinforcement learning agents have "jitters." They alternate actions quickly, looking nearly palsied, apparently nullifying the effects of earlier actions with later ones. This is true across a wide variety of reinforcement learning agents.
Many people see these jitters as evidence of the relatively primitive nature of these agents. These actions look clearly stupid and sub-optimal.
For instance, consider the original Deep Q Network paper. Even after training for some time on Breakout, the agent still moves the paddle erratically back and forth when the ball is not near it. One person mentions that it makes "erratic jerky movements that obviously could not in principle be optimal," which was once my impression as well.
Similarly, and much more recently, consider DeepMind's work on generally capable agents. In the show reel, the movement of the agents often looks erratic. Conversation around LessWrong sometimes pointed to these erratic movements as evidence against the intelligence of the agents.
Evolved intelligence on earth has energy conservation as a fundamental part of its optimization function.
Unnecessary movements spend energy. Spent energy must be recovered, at the cost of reproductive fitness. So generally only sick animals, insane animals, and so on, have the shakes or tremble continuously. Energy conservation applies to every animal on earth, which is probably why we feel intuitively confident applying this rule across the broad variety of animals.
Additionally, extremely erratic movements can result in injury to the animal which is making them. So this is another reason why, for creatures that are a result of evolution, erratic movements are a sign of insanity or injury.
Reinforcement learning agents are not energy-constrained. They do not draw on a finite store of glucose when acting. Nor do they have any possibility of injuring themselves. As a result, the policies resulting from reinforcement learning algorithms will not be strongly constrained to limit jitters in the way that policies resulting from evolution will be constrained.
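A toy sketch may make this concrete. The environment, the function name `episode_return`, and the `energy_cost` parameter below are all hypothetical, invented for illustration. The sketch assumes a reward that depends only on where the paddle ends up, which is roughly how many game rewards behave.

```python
def episode_return(actions, energy_cost=0.0):
    """Return for a toy episode where only the final paddle position matters.

    actions: list of -1 (left), 0 (stay), +1 (right).
    energy_cost: hypothetical per-move penalty; 0.0 mimics a reward
    that does not charge the agent for moving.
    """
    position = sum(actions)
    task_reward = 1.0 if position == 0 else 0.0  # the ball lands at position 0
    moves = sum(1 for a in actions if a != 0)
    return task_reward - energy_cost * moves

still = [0] * 10       # never moves
jitter = [1, -1] * 5   # moves every step, each move undoing the last

# With no energy cost, the reward cannot tell the two policies apart.
print(episode_return(still), episode_return(jitter))

# With an evolution-style energy cost, stillness is strictly better.
print(episode_return(still, 0.01), episode_return(jitter, 0.01))
```

Under these assumptions, nothing in the first case pushes the learning algorithm away from the jittery policy. Only the added penalty in the second case does.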
You can go further than this. Given the way that most reinforcement learning agents are set up, they have no way even to distinguish action from non-action, and thus rest from non-rest.
That is, consider a reinforcement learning agent which takes one of fifteen different categorical actions in each time-step, like those in OpenAI's ProcGen. For an agent controlling a side-scrolling avatar, for instance, one action would be moving right; another action would be jumping; another action would be doing nothing; etc. Each of these is distinguished from the others only by its index in a one-hot action encoding -- i.e., moving right could be [1,0,0,0...], jumping could be [0,1,0,0...], doing nothing could be [0,0,1,0...], and so on.
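As a minimal sketch of that encoding, here is a shortened, hypothetical action set (four actions rather than fifteen; the names and their order are assumptions for illustration, not ProcGen's actual layout):

```python
ACTIONS = ["right", "jump", "noop", "left"]  # hypothetical, shortened action set

def one_hot(index, size):
    """Return a list with a single 1 at `index` and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(size)]

encodings = {name: one_hot(i, len(ACTIONS)) for i, name in enumerate(ACTIONS)}

for name, vec in encodings.items():
    print(f"{name:>5}: {vec}")

# Every encoding has the same structure: exactly one 1, the rest 0.
# Nothing in the representation marks "noop" as cheaper or more
# restful than "jump"; it is just index 2.
```

Reordering `ACTIONS` would give "noop" a different index without changing anything about what the agent could learn, which is the sense in which the encoding carries no notion of rest.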
For a human controlling such a side-scrolling avatar, "doing nothing" stands out from all the other actions. If you put yourself in a situation where you are allowed to do nothing, you can rest your hands by not pressing any buttons. You can consider a more global strategy, and focus on the kind of strategy you will use when you resume acting. It also allows you to rest your mind, because humans can think harder or less hard. Doing nothing gives you an opportunity for reflection and meta-optimization in a way that no other alternative does.
None of this applies to a reinforcement learning agent. "Doing nothing" is one one-hot encoding just like all the other encodings. It cannot rest itself by doing nothing. It cannot focus on preparing for things further away in time; the vast majority of reinforcement learning agents must do a constant amount of thought in each time-step, about precisely the same things. So rest is not a preferred location in action-space that allows meta-optimization for these agents, as it is for evolved agents. They have no way to distinguish rest from non-rest, and thus no reason to pursue rest.
The above should also apply, mutatis mutandis, to reinforcement learning agents acting in a continuous rather than a discrete space.
This is a more speculative point.
When I act, I often trade off between low-probability-of-success actions, with little thought put into them, and high-probability-of-success actions, with a lot of thought put into them. Put more simply, where attempted action is very cheap, I am willing to try a lot of times.
Battery doesn't fit? I'll wiggle it around. Command in the terminal doesn't work? I'll try changing a parameter. Pill bottle not opening? I'll cycle through different axes of twist and pressure. Generally, I'll start to apply thought more determinedly where there are no low-cost actions available with any reasonable probability of success.
Again, this makes sense from an evolutionary standpoint. Trying things takes energy. Thinking about things also takes energy. Along the boundary where each alternative has equal potential reward and equal probability of success, we would expect ourselves to be indifferent to trying things out versus thinking about things. Only where trying becomes more expensive than thinking about things would we expect that we would feel inclined to think about things rather than try things.
But again, this is not a trade-off that reinforcement learning agents are able to make. They must always put precisely the same amount of thought into each step. This means that exposing themselves to a greater surface area of possible reward, in areas of phase-space where actions are not overdetermined, might generally be the ideal strategy. Jittering around could be the optimal solution.
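To illustrate the trade-off in the loosest way, here is a sketch with made-up numbers. The function `net_value` and every probability and cost below are assumptions chosen for illustration, not measurements of anything.

```python
def net_value(p_success, reward, cost):
    """Expected reward of one attempt minus what the attempt costs."""
    return p_success * reward - cost

# Hypothetical evolved agent: fiddling is cheap but unreliable,
# thinking first is reliable but expensive.
fiddle = net_value(p_success=0.2, reward=1.0, cost=0.05)
think = net_value(p_success=0.8, reward=1.0, cost=0.70)
print("evolved agent prefers:", "fiddle" if fiddle > think else "think")

# Hypothetical RL agent: compute per step is fixed, so no option costs
# extra and there is no "think harder" option to buy. Any action with
# some chance of reward is at least as good as an action with none.
rl_fiddle = net_value(p_success=0.2, reward=1.0, cost=0.0)
rl_idle = net_value(p_success=0.0, reward=1.0, cost=0.0)
print("RL agent prefers:", "fiddle" if rl_fiddle > rl_idle else "idle")
```

Raising the cost of fiddling in the first case flips the evolved agent's preference toward thinking. In the second case there is no such cost to raise.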
Again, I'm less sure about this section.
When I see a video of a reinforcement learning agent acting erratically, some part of me still says that it looks kind of stupid. But I currently believe, for the reasons given above, that it's best not to listen to this part of myself.