DanielFilan

Comments

Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

(see also this shortform, which makes a rudimentary version of the arguments in the first two subsections)

Reward is not the optimization target

Here's my general view on this topic:

  • Agents are reinforced by some reward function.
  • They then get more likely to do stuff that the reward function rewards.
  • This process, iterated a bunch, produces agents that are 'on-distribution optimal'.
  • In particular, in states that are 'easily reached' during training, the agent will do things that approximately maximize reward.
  • Some states aren't 'easily reached', e.g. states where there's a valid bitcoin blockchain of length 20,000,000 (current length as I write is 748,728), or states where you have messed around with your own internals while not intelligent enough to know how they work.
  • Other states are 'easily reached', e.g. states where you intervene on some cause-and-effect relationships in the 'external world' that don't impinge on your general training scheme. For example, if you're being reinforced to be approved of by people, lying to gain approval is easily reached.
  • Agents will probably have to be good at means-ends reasoning to approximately locally maximize a tricky reward function.
  • Agents' goals may not generalize to states that are not easily reached.
  • Agents' motivations will likely generalize to states that are easily reached.
  • Agents' motivations will likely be pretty coherent in states that are easily reached.
  • When I talk about 'the reward function', I mean a mathematical function from (state, action, next state) tuples to reals, that is implemented in a computer.
  • When I talk about 'reward', I mean values of this function, and sometimes by extension tuples that achieve high values of the function.
  • When other people talk about 'reward', I think they sometimes mean "the value contained in the antecedent-computation-reinforcer register" and sometimes mean "the value of the mathematical object called 'the reward function'", and sometimes I can't tell what they mean. This is bad, because in edge cases these have pretty different properties (e.g. they disagree on how 'valuable' it is to permanently set the ACR register to contain MAX_INT). See the sketch after this list.
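Here's a minimal sketch of that distinction (toy Python, all names made up for illustration, not any real RL library's API):

```python
# The reward *function* is a fixed map from (state, action, next_state)
# tuples to reals; the antecedent-computation-reinforcer (ACR) 'register' is
# just wherever the training loop stores the most recently computed value.

def reward_function(state: int, action: int, next_state: int) -> float:
    """The mathematical object: (state, action, next state) -> real."""
    return 1.0 if next_state == 3 else 0.0

class TrainingLoop:
    def __init__(self) -> None:
        self.acr_register: float = 0.0  # value most recently written by the loop

    def observe_transition(self, state: int, action: int, next_state: int) -> None:
        # Ordinarily the register just holds the reward function's output.
        self.acr_register = reward_function(state, action, next_state)

loop = TrainingLoop()
loop.observe_transition(state=2, action=1, next_state=3)
print(loop.acr_register)        # 1.0

# Tampering can set the register directly, without touching the reward
# function at all; this is the edge case where the two senses of 'reward'
# come apart.
loop.acr_register = float(2**31 - 1)   # 'permanently set the ACR register to MAX_INT'
print(reward_function(2, 1, 3))        # still 1.0: the function is unchanged
```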
Reward is not the optimization target

I'm not saying "These statements can make sense", I'm saying they do make sense and are correct under their most plain reading.

Re: a possible goal of animals being to optimize the expected sum of future rewards, in the cited paper "rewards" appears to refer to stuff like eating tasty food or mating, where it's assumed the animal can trade those off against each other consistently:

Decision-making environments are characterized by a few key concepts: a state space..., a set of actions..., and affectively important outcomes (finding cheese, obtaining water, and winning). Actions can move the decision-maker from one state to another (i.e. induce state transitions) and they can produce outcomes. The outcomes are assumed to have numerical (positive or negative) utilities, which can change according to the motivational state of the decision-maker (e.g. food is less valuable to a satiated animal) or direct experimental manipulation (e.g. poisoning)...

In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards[.]

It seems totally plausible to me that an animal could be motivated to optimize the expected sum of future rewards in this sense, given that 'reward' is basically defined as 'things they value'. It seems like the way this would be false would be if animals' rewards are super unstable, or if the animal doesn't coherently trade off the things they value. This could happen, but I don't see why I should see it as overwhelmingly likely.
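For concreteness, a standard way to write that objective (this is my gloss, with a discount factor $\gamma \in [0,1)$ added to keep the sum finite, which the quoted passage doesn't specify): writing $r_t$ for the net utility of the outcome at time $t$, the animal would be choosing actions to maximize

$$\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right].$$

This is only a single well-defined objective if cheese, water, winning, etc. can all be put on one numerical scale, which is exactly the consistent-trade-off assumption above.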

[EDIT: in other words, the reason the paper conflates 'rewards' with 'optimization target' is that that's how they're defining rewards]

Reward is not the optimization target

I think the quotes cited under "The field of RL thinks reward=optimization target" are all correct. One by one:

The agent's job is to find a policy… that maximizes some long-run measure of reinforcement.

Yes, that is the agent's job in RL, in the sense that if the training algorithm didn't do that, we'd get another training algorithm (if we thought it was feasible for another algorithm to maximize reward). Basically, the field of RL uses a separation of concerns: practitioners design a reward function to incentivize good behaviour, and the agent maximizes that function. I think this is sensible, because it's relatively easier to think "what reward function represents what I want out of this agent?" than "how do I achieve this difficult task?".
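As a toy illustration of that separation of concerns (the chain environment and all names below are made up for this sketch, not from any RL library): the practitioner only writes `reward_fn`, and a generic learning algorithm (tabular Q-learning here) handles 'achieve the task' by adjusting its policy towards whatever `reward_fn` rewards.

```python
import random

N_STATES, N_ACTIONS = 5, 2   # toy chain MDP; actions: 0 = left, 1 = right

def reward_fn(state, action, next_state):
    """The practitioner's side of the split: say what you want
    (+1 for reaching the rightmost state)."""
    return 1.0 if next_state == N_STATES - 1 else 0.0

def step(state, action):
    """Toy deterministic chain dynamics."""
    return max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))

# The learner's side of the split: generic tabular Q-learning, which only
# ever sees reward values, never the practitioner's intent.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    s = 0
    for _ in range(20):
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda i: Q[s][i])
        s2 = step(s, a)
        r = reward_fn(s, a, s2)   # the only channel between the two concerns
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# The learned greedy policy should head right, towards the rewarded state.
print([max(range(N_ACTIONS), key=lambda i: Q[s][i]) for s in range(N_STATES)])
```

The point of the sketch is just that swapping in a different `reward_fn` changes what behaviour gets reinforced without touching the learning code, which is the sense in which the two concerns are separated.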

In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards.

This describes some possible goals, and I don't see why you think the goals listed are impossible (I don't think they are).

We hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward.

This makes sense. RL selects agents that approximately maximize reward. Intelligence uncontroversially helps agents do that. When agents do smart thinking, they probably get reinforced (at least for the right kinds of smart thinking).

Law-Following AI 4: Don't Rely on Vicarious Liability

It looks like this is the 4th post in a sequence - any chance you can link to the earlier posts? (Or perhaps use LW's sequence feature)

Coherence arguments do not entail goal-directed behavior

I have no idea why I responded 'low' to 2. Does anybody think that's reasonable and fits in with what I wrote here, or did I just mean 'high'?

Project Intro: Selection Theorems for Modularity

The method that is normally used for this in the biological literature (including the Kashtan & Alon paper mentioned above), and in papers by e.g. CHAI dealing with identifying modularity in deep modern networks, is taken from graph theory. It involves the measure Q, which is defined as follows:
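For reference, the measure usually denoted $Q$ in this literature is Newman's modularity (I'm giving the standard network-theory definition here, which may differ in details from the post's): for a partition of the graph's nodes into modules,

$$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j),$$

where $A$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the total number of edges, and $\delta(c_i, c_j)$ is 1 when nodes $i$ and $j$ are in the same module and 0 otherwise.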

FWIW I do not use this measure in my papers, but instead use a different graph-theoretic measure. (I also get the sense that Q is more of a network theory thing than a graph theory thing)

rohinmshah's Shortform

I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem

It's also not super clear what you would do algorithmically instead: words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document.

rohinmshah's Shortform

One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.

I think this is way more worrying in the case where you're implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower.

Though [the claim that slightly wrong observation model => doom] isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!

I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem. In those cases, the way you know how 'human preferences' rank future hyperslavery vs wireheaded rat tiling vs humane utopia is by how human actions affect the likelihood of those possible worlds. But that's probably not well-modelled by Boltzmann rationality (e.g. the thing I'm most likely to do today is not to write a short computer program that implements humane utopia), and it seems like your inference is going to be very sensitive to plausible variations in the observation model.
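To spell out the observation model I have in mind (this is the standard Boltzmann-rationality likelihood, stated for concreteness rather than as anything specific to the post): given candidate reward parameters $\theta$, the human is modelled as picking action $a$ in state $s$ with probability

$$P(a \mid s, \theta) \propto \exp\bigl(\beta \, Q^{*}_{\theta}(s, a)\bigr),$$

where $Q^{*}_{\theta}$ is the optimal action-value function under $\theta$ and $\beta$ is an 'inverse temperature' rationality parameter. The posterior over $\theta$ inherits whatever is wrong with this likelihood, which is the sensitivity I'm worried about.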

Job Offering: Help Communicate Infrabayesianism

A future episode might include a brief distillation of that episode ;)
