

One reason that I doubt this story is that "try new things in case they're good" is itself the sort of thing that should be reinforced during training on a complicated environment, and would push towards some sort of obfuscated manipulation of humans (similar to how if you read about enough social hacks you'll probably be a bit scammy even though you like people and don't want to scam them). In general, this motivation will push RL agents towards reward-optimal behaviour on the distribution of states they know how to reach and handle.

Actually I'm being silly, you don't need ring signatures, just signatures that are associated with identities and also used for financial transfers.

Note that for this to work you need a strong disincentive against people sharing their private keys. One way to do this would be if the keys were also used for the purpose of holding cryptocurrency.

Here's one way you can do it: Suppose we're doing public key cryptography, and every person is associated with one public key. Then when you write things online you could use a linkable ring signature. That means that you prove that you're using a private key that corresponds to one of the known public keys, and you also produce a hash of your keypair, such that (a) the world can tell you're one of the known public keys but not which public key you are, and (b) the world can tell that the key hash you used corresponds to the public key you 'committed' to when writing the proof.

Relevant quote I just found in the paper "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents":

The primary measure of an agent’s performance is the score achieved during an episode, namely the undiscounted sum of rewards for that episode. While this performance measure is quite natural, it is important to realize that score, in and of itself, is not necessarily an indicator of AI progress. In some games, agents can maximize their score by “getting stuck” in a loop of “small” rewards, ignoring what human players would consider to be the game’s main goal. Nevertheless, score is currently the most common measure of agent performance so we focus on it here.

Here's a project idea that I wish someone would pick up (written as a shortform rather than as a post because that's much easier for me):

  • It would be nice to study competent misgeneralization empirically, to give examples and maybe help us develop theory around it.
  • Problem: how do you measure 'competence' without reference to a goal??
  • Prior work has used the 'agents vs devices' framework, where you have a distribution over all reward functions, some likelihood distribution over what 'real agents' would do given a certain reward function, and do Bayesian inference on that vs choosing actions randomly. If conditioned on your behaviour you're probably an agent rather than a random actor, then you're competent.
  • I don't like this:
    • Crucially relies on knowing the space of reward functions that the learner in question might have.
    • Crucially relies on knowing how agents act given certain motivations.
    • A priori it's not so obvious why we care about this metric.
  • Here's another option: throw out 'competence' and talk about 'consequential'.
    • This has a name collision with 'consequentialist' that you'll probably have to fix but whatever.
  • The setup: you have your learner do stuff in a multi-agent environment. You use the AUP metric on every agent other than your learner. You say that your learner is 'consequential' if it strongly affects the attainable utility of other agents.
  • How good is this?
    • It still relies on having a space of reward functions, but there's some more wiggle-room: you probably don't need to get the space exactly right, just to have goals that are similar to yours.
      • Note that this would no longer be true if this were a metric you were optimizing over.
    • You still need to have some idea about how agents will act realistically, because if you only look at the utility attainable by optimal policies, that might elide the fact that it's suddenly gotten much computationally harder to achieve that utility.
      • That said, I still feel like this is going to degrade more gracefully, as long as you include models that are roughly right. I guess this is because this model is no longer a likelihood ratio where misspecification can just rule out the right answer.
    • It's more obvious why we care about this metric.
  • Bonus round: you can probably do some thinking about why various setups would tend to reduce other agents' attainable utility, prove some little theorems, etc., in the style of the power-seeking paper.
    • Ideally you could even show a relation between this and the agents vs devices framing.
  • I think this is the sort of project a first-year PhD student could fruitfully make progress on.
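A minimal sketch of what measuring 'consequential' could look like in a toy setting. Everything here is an assumption made for illustration: the graph world, the goal set (one "reach this node" reward function per node), the function names, and the simplification that attainable utility is the optimal discounted value for each goal. It is not the AUP penalty as defined in the AUP work, and because it only looks at optimal policies it has exactly the weakness noted above about computational difficulty.

```python
from collections import deque


def shortest_distances(edges, start):
    """Breadth-first distances from `start` in a directed graph {node: [neighbours]}."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def attainable_utilities(edges, start, goals, gamma=0.9):
    """Optimal discounted value of each 'reach goal g' reward function (hypothetical goal space)."""
    dist = shortest_distances(edges, start)
    # Reward 1 on first arrival at the goal, discounted by path length; 0 if unreachable.
    return {g: gamma ** dist[g] if g in dist else 0.0 for g in goals}


def consequentiality(edges_before, edges_after, start, goals, gamma=0.9):
    """Mean absolute change in another agent's attainable utility across the goal space."""
    before = attainable_utilities(edges_before, start, goals, gamma)
    after = attainable_utilities(edges_after, start, goals, gamma)
    return sum(abs(after[g] - before[g]) for g in goals) / len(goals)


# Toy corridor A - B - C; the other agent starts at A.
corridor = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
# Hypothetical learner action: block the door between B and C.
blocked = {"A": ["B"], "B": ["A"], "C": []}
goals = ["A", "B", "C"]

score_block = consequentiality(corridor, blocked, "A", goals)
score_noop = consequentiality(corridor, corridor, "A", goals)
```

Blocking the door scores higher than doing nothing because one of the three goals becomes unreachable. A more realistic version would swap the optimal-value computation for values achieved by some model of realistic agents.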

Here is an example story I wrote (lightly edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. Just to be clear, this is not his proposal for why everything is fine, nor is it necessarily an accurate representation of my views, just a plausible-to-TurnTrout story for how agents won't end up wanting to game human approval:

  • Agent gets trained on a reward function that's 1 if it gets human approval, 0 otherwise (or something).
  • During an intermediate amount of training, the agent's honest and nice computations get reinforced by reward events.
  • That means it develops a motivation to act honestly and behave nicely etc., and no similarly strong motivation to gain human approval at all costs.
  • The agent then becomes able to tell that if it tricked the human, that would be reinforced.
  • It then decides to not get close in action-space to tricking the human, so that it doesn't get reinforced into wanting to gain human approval by tricking the human.
  • This works because:
    • it's enough action hops away and/or a small enough part of the space that epsilon-greedy strategies would be very unlikely to push it into the deception mode.
    • smarter exploration strategies will depend on the agent's value function to know which states are more or less promising to explore (e.g. something like Thompson sampling), and the agent really disvalues deceiving the human, so that doesn't get reinforced.
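To put rough numbers on the epsilon-greedy point, here is a back-of-the-envelope sketch. It assumes exploration picks uniformly among all actions with probability epsilon, that reaching the deceptive behaviour needs a specific sequence of `hops` actions in a row, and that none of those actions is ever the greedy one. The function names and the example numbers are made up for illustration.

```python
import math


def p_reach_per_attempt(epsilon, n_actions, hops):
    """Chance that `hops` specific non-greedy actions are all picked by random exploration."""
    # Each step: explore with probability epsilon, then pick the one required action
    # uniformly out of n_actions.
    return (epsilon / n_actions) ** hops


def p_reach_ever(epsilon, n_actions, hops, attempts):
    """Chance that at least one of `attempts` independent tries reaches the target."""
    p = p_reach_per_attempt(epsilon, n_actions, hops)
    if p >= 1.0:
        return 1.0
    # 1 - (1 - p)^attempts, computed stably for very small p.
    return -math.expm1(attempts * math.log1p(-p))


# Hypothetical numbers: 10% exploration, 10 actions, deception is 5 hops away.
per_attempt = p_reach_per_attempt(epsilon=0.1, n_actions=10, hops=5)
over_training = p_reach_ever(epsilon=0.1, n_actions=10, hops=5, attempts=1_000_000)
```

Under these assumptions the per-attempt probability shrinks geometrically in the number of hops, which is the sense in which "enough action hops away" keeps undirected exploration out of the deception mode.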

dopamine or RPE or that-which-gets-discounted-and-summed-to-produce-the-return

Those are three pretty different things - the first is a chemical, the second I guess stands for 'reward prediction error', and the third is a mathematical quantity! Like, you also can't talk about the expected sum of dopamine, because dopamine is a chemical, not a number!

Here's how I interpret the paper: stuff in the world is associated with 'rewards', which are real numbers that represent how good the stuff is. Then the 'return' of some period of time is the discounted sum of rewards. Rewards represent 'utilities' of individual bits of time, but the return function is the actual utility function over trajectories. 'Predictions of reward' means predictions of stuff like bits of cheese that are associated with reward. I do think the authors do a bit of equivocation between the numbers and the things that the numbers represent (which IMO is typical for non-mathematicians, see also how physicists constantly conflate quantities like velocity with the functions that take other physical quantities and return the velocity of something), but AFAICT my interpretation accounts for the uses of 'reward' in that paper (and in the intro). That said, there are a bunch of them, and as a fallible human I'm probably not good at finding the uses that undermine my theory, so if you have a quote or two in mind that makes more sense under the interpretation that 'reward' refers to some function of a brain state rather than some function of cheese consumption or whatever, I'd appreciate you pointing them out to me.
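For concreteness, a minimal sketch of the reward/return distinction as I'm using it. The function name, the discount value, and the cheese numbers are all hypothetical; the only thing taken from the above is that the return is the discounted sum of per-step rewards.

```python
def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma to later rewards.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical trajectory: one unit of reward per bit of cheese eaten at each step.
cheese_rewards = [1.0, 1.0, 1.0]
g_discounted = discounted_return(cheese_rewards, gamma=0.5)   # 1 + 0.5 + 0.25
g_undiscounted = discounted_return(cheese_rewards, gamma=1.0)  # plain sum
```

The rewards are numbers attached to individual time steps, while the return is a single number attached to the whole trajectory. Setting gamma to 1 gives the undiscounted sum.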

(see also this shortform, which makes a rudimentary version of the arguments in the first two subsections)

Here's my general view on this topic:

  • Agents are reinforced by some reward function.
  • They then get more likely to do stuff that the reward function rewards.
  • This process, iterated a bunch, produces agents that are 'on-distribution optimal'.
  • In particular, in states that are 'easily reached' during training, the agent will do things that approximately maximize reward.
  • Some states aren't 'easily reached', e.g. states where there's a valid bitcoin blockchain of length 20,000,000 (current length as I write is 748,728), or states where you have messed around with your own internals while not intelligent enough to know how they work.
  • Other states are 'easily reached', e.g. states where you intervene on some cause-and-effect relationships in the 'external world' that don't impinge on your general training scheme. For example, if you're being reinforced to be approved of by people, lying to gain approval is easily reached.
  • Agents will probably have to be good at means-ends reasoning to approximately locally maximize a tricky reward function.
  • Agents' goals may not generalize to states that are not easily reached.
  • Agents' motivations likely will generalize to states that are easily reached.
  • Agents' motivations will likely be pretty coherent in states that are easily reached.
  • When I talk about 'the reward function', I mean a mathematical function from (state, action, next state) tuples to reals, that is implemented in a computer.
  • When I talk about 'reward', I mean values of this function, and sometimes by extension tuples that achieve high values of the function.
  • When other people talk about 'reward', I think they sometimes mean "the value contained in the antecedent-computation-reinforcer register" and sometimes mean "the value of the mathematical object called 'the reward function'", and sometimes I can't tell what they mean. This is bad, because in edge cases these have pretty different properties (e.g. they disagree on how 'valuable' it is to permanently set the ACR register to contain MAX_INT).
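A toy sketch of the last few points, to show where the two meanings of 'reward' come apart. All names here are made up for illustration (the states, the `RewardRegister` class, the approval-based reward function), and the value of `MAX_INT` is an assumption, since I haven't specified a register width.

```python
from typing import Callable, Tuple

State = str
Action = str
Transition = Tuple[State, Action, State]
RewardFn = Callable[[State, Action, State], float]

MAX_INT = 2**31 - 1  # assumed 32-bit signed register, purely for illustration


def reward_fn(state: State, action: Action, next_state: State) -> float:
    """'The reward function': a fixed map from (state, action, next state) to a real."""
    return 1.0 if next_state == "human_approves" else 0.0


class RewardRegister:
    """The physical slot whose contents get used to reinforce antecedent computations."""

    def __init__(self) -> None:
        self.value = 0.0

    def write(self, value: float) -> None:
        self.value = value


register = RewardRegister()

# Normal operation: the register just stores the function's output.
honest: Transition = ("talking", "be_helpful", "human_approves")
register.write(reward_fn(*honest))
honest_agrees = register.value == reward_fn(*honest)

# Edge case: the agent overwrites the register directly.
tamper: Transition = ("talking", "overwrite_register", "register_hacked")
register.write(MAX_INT)
fn_says = reward_fn(*tamper)   # the mathematical function assigns this transition 0
register_says = register.value  # the register now holds MAX_INT
```

In ordinary transitions the two notions agree, so it doesn't matter which one people mean. On the tampering transition they give very different answers.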