



Looking for adversarial collaborators to test our Debate protocol

If you would be interested in participating conditional on us offering pay or prizes, that's also useful to know.

Do you want this feedback at the same address?

[AN #106]: Evaluating generalization ability of learned reward models
The authors prove that EPIC is a pseudometric, that is, it behaves like a distance function, except that it is possible for EPIC(R1, R2) to be zero even if R1 and R2 are different. This is desirable, since if R1 and R2 differ by a potential shaping function, then their optimal policies are guaranteed to be the same regardless of transition dynamics, and so we should report the “distance” between them to be zero.

If EPIC(R1, R2) is thought of as a composition f(g(R1), g(R2)), where g returns the optimal policy of its input and f is a distance function on optimal policies, is f(OptimalPolicy1, OptimalPolicy2) then a metric?
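Out of curiosity, here's a quick toy check of the shaping claim (my own sketch, not the paper's code; the chain MDP and all names are made up): rewards differing by a potential shaping term R'(s, a) = R(s, a) + γΦ(s') − Φ(s) induce the same optimal policy, so a reward "distance" should arguably call them equivalent.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# Deterministic chain: action 0 moves left, action 1 moves right.
def step(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)

R = rng.normal(size=(n_states, n_actions))   # base reward R(s, a)
phi = rng.normal(size=n_states)              # arbitrary potential function
R_shaped = np.array([[R[s, a] + gamma * phi[step(s, a)] - phi[s]
                      for a in range(n_actions)] for s in range(n_states)])

def greedy_policy(reward):
    # Plain value iteration, then greedy action per state.
    Q = np.zeros((n_states, n_actions))
    for _ in range(500):
        V = Q.max(axis=1)
        Q = np.array([[reward[s, a] + gamma * V[step(s, a)]
                       for a in range(n_actions)] for s in range(n_states)])
    return Q.argmax(axis=1)

# Different reward functions, same optimal policy.
assert (greedy_policy(R) == greedy_policy(R_shaped)).all()
```

The shaped Q-function works out to Q'(s, a) = Q(s, a) − Φ(s), so the argmax per state is unchanged, which is why it makes sense to report the "distance" as zero.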

One nice thing is that, roughly speaking, rewards are judged to be equivalent if they would generalize to any possible transition function that is consistent with DT. This means that by designing DT appropriately, we can capture how much generalization we want to evaluate.

Can more than one DT be used, so there's more than one measure?

This is a useful knob to have: if we used the maximally large DT, the task would be far too difficult, as it would be expected to generalize far more than even humans can.

There's a maximum?

The ground of optimization
the exact same answer it would have output without the perturbation.

It always gives the same answer for the last digit?

Corrigibility as outside view

(The object which is not the object:)

So you just don't do it, even though it feels like a good idea.

More likely people don't do it because they can't, or a similar reason. (The point of saying "My life would be better if I was in charge of the world" is not to serve as a hypothesis, to be falsified.)

(The object:)

Beliefs intervene on action. (Not success, but choice.)

We are biased and corrupted. By taking the outside view on how our own algorithm performs in a given situation, we can adjust accordingly.

The piece seems biased towards the negative.

Calibrate yourself on the flaws of your own algorithm, and repair or minimize them.

Something like 'performance' seems more key than "flaws". Flaws can be improved, but so can working parts.

And the AI knows its own algorithm.

An interesting premise. Arguably, if human brains are NGI (natural general intelligence), this would be a difference between AGI and NGI, which might require justification.

If I'm about to wipe my boss's computer because I'm so super duper sure that my boss wants me to do it, I can consult OutsideView
and realize that I'm usually horribly wrong about what my boss wants in this situation. I don't do it.
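Reading OutsideView the way footnote [1] below suggests — as a summary statistic over past episodes rather than a recursive call — the move could be sketched like this (all names here, `past_episodes`, `was_wrong`, the 0.9 threshold, are made up for illustration):

```python
def outside_view_error_rate(past_episodes, situation_tag):
    """Fraction of similar past situations where my confident judgment
    later turned out to be wrong: a summary of the past, not a
    recursive self-model."""
    similar = [ep for ep in past_episodes if ep["tag"] == situation_tag]
    if not similar:
        return 0.5  # no data: stay maximally uncertain
    return sum(ep["was_wrong"] for ep in similar) / len(similar)

past = [{"tag": "guess_boss_intent", "was_wrong": True},
        {"tag": "guess_boss_intent", "was_wrong": True},
        {"tag": "guess_boss_intent", "was_wrong": False}]

inside_view_confidence = 0.99  # "super duper sure"
error_rate = outside_view_error_rate(past, "guess_boss_intent")
# Act only if confidence survives the base-rate correction.
if inside_view_confidence * (1 - error_rate) < 0.9:
    print("don't wipe the computer")
```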

The premise of "inadequacy" saturates this post.* At best this post characterizes the idea that "not doing bad things" stems from "recognizing them as bad" - probabilistically, via past experience with policies (phrased in language suggestive of priors), etc. This sweeps the problem under the rug in favor of "experience" and 'recognizing similar situations'. [1]

In particular, calibrated deference would avoid the problem of fully updated deference.

"Irreversibility" seems relevant to making sure mistakes can be fixed, as does 'experience' in less high stake situations. Returning to the beginning of the post:

You run a country.

Hopefully you are "qualified"/experienced/etc. This is a high stakes situation.**

[1] OutsideView seems like it should be a (function of a) summary of the past, rather than a recursive call.

While reading this post...

  • From an LW standpoint I wished it had more clarity.
  • From an AF (Alignment Forum) view I appreciated its direction. (It seems like it might be pointed somewhere important.)

*In contrast to the usual calls for 'maximizing' "expected value". While this point has been argued before, it seems to reflect an idea about how the world works (like a prior, or something learned).

**Ignoring the question of "what does it mean to run a country if you don't set all the rules", because that seems unrelated to this essay.

What is the alternative to intent alignment called?
What term do people use for the definition of alignment in which A is trying to achieve H's goals?

Sounds like it should be called goal alignment, whatever its name happens to be.

[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement
The thing about Montezuma's revenge and similar hard exploration tasks is that there's only one trajectory you need to learn; and if you forget any part of it you fail drastically; I would by default expect this to be better than adversarial dynamics / populations at ensuring that the agent doesn't forget things.

But is it easier to remember things if there's more than one way to do them?

Attainable Utility Preservation: Empirical Results
Bumping into the human makes them disappear, reducing the agent's control over what the future looks like. This is penalized.

Decreases or increases?

AUP_starting-state fails here, but AUP_stepwise does not.


1. Is "Model-free AUP" the same as "AUP stepwise"?

2. Why does "Model-free AUP" wait for the pallet to reach the human before moving, while the "Vanilla" agent does not?

There is one weird thing that's been pointed out, where stepwise inaction while driving a car leads to not-crashing being penalized at each time step. I think this is because you need to use an appropriate inaction rollout policy, not because stepwise itself is wrong. ↩︎

That might lead to interesting behavior in a game of chicken.

One interpretation is that AUP is approximately preserving access to states.

I wonder how this interacts with environments where access to states is always closing off. (StarCraft, Go, Chess, etc. - though it's harder to think of how state/agent are 'contained' in these games.)
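A minimal sketch of the stepwise penalty as I understand it (my reading, not the paper's implementation; the array shapes and the penalty coefficient are made up): each action is compared against inaction from the current state, over a set of auxiliary attainable-utility Q-functions, so only the marginal change in attainable utility is penalized.

```python
import numpy as np

def aup_penalty(q_aux, state, action, noop):
    """q_aux: array [n_aux, n_states, n_actions] of Q-functions for
    auxiliary reward functions. Penalize the mean absolute shift in
    attainable utility relative to doing nothing this step."""
    diffs = np.abs(q_aux[:, state, action] - q_aux[:, state, noop])
    return diffs.mean()

def penalized_reward(r, q_aux, state, action, noop, lam=1.0):
    # lam scales how strongly impact is penalized relative to reward.
    return r - lam * aup_penalty(q_aux, state, action, noop)

q = np.random.default_rng(0).normal(size=(3, 5, 4))
# The no-op itself is never penalized under the stepwise baseline:
assert aup_penalty(q, state=0, action=2, noop=2) == 0.0
```

This is also where the car-crash footnote bites: if "inaction" is a literal single-step no-op rather than an inaction rollout policy, then not-crashing can look like a penalized deviation at every step.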

To be frank, this is crazy. I'm not aware of any existing theory explaining these results, which is why I proved a bajillion theorems last summer to start to get a formal understanding (some of which became the results on instrumental convergence and power-seeking).

Is the code for the SafeLife PPO-AUP stuff you did on github?

Attainable Utility Preservation: Concepts
CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking". So its contrapositive is indeed as stated.

That makes sense. One of the things I like about this approach is that it isn't immediately clear what else could be a problem, and what remains might just be implementation details or parameters: corrigibility from limited power only works if we make sure that power is low enough that we can turn the agent off; if the agent will acquire power when that's the only way to achieve its goal, rather than stopping at or before some limit, then it might still acquire power and be catastrophic*; etc.

*Unless power seeking behavior is the cause of catastrophe, rather than having power.

Sorry for the ambiguity.

It wasn't ambiguous, I meant to gesture at stuff like 'astronomical waste' (and waste on smaller scales) - areas where we do want resources to be used. This was addressed at the end of your post already:

So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.

-but I wanted to highlight the area where we might want powerful aligned agents, rather than AUP agents that don't seek power.

What do you mean by "AUP map"? The AU landscape?

That is what I meant originally, though upon reflection a small distinction could be made:

Territory: AU landscape*

Map: AUP map (an AUP agent's model of the landscape)

*Whether or not this is thought of as 'Territory' or a 'map', conceptually AUP agents will navigate (and/or create) a map of the AU landscape. (If the AU landscape is a map, then AUP agents may navigate a map of a map. There also might be better ways this distinction could be made - e.g. the AU landscape is a style/type of map, just as there are maps of elevation and topography.)

The idea is it only penalizes expected power gain.

Gurkenglas previously commented that they didn't think that AUP solved 'agent learns how to convince people/agents to do things'. While it's not immediately clear how an agent could happen to find out how to convince humans of anything (the super-intelligent persuader), if an agent obtained that power, its continuing to operate could constitute a risk. (Though further up this comment I brought up the possibility that "power seeking behavior is the cause of catastrophe, rather than having power." This doesn't seem likely in its entirety, but seems possible in part - that is, powerful but not power-seeking might not be as dangerous as powerful and power-seeking.)

Attainable Utility Preservation: Concepts

I liked this post, and look forward to the next one.

More specific and critical commentary follows (it seems easier to notice surprise than agreement):

(With embedded footnotes)


If the CCC is right, then if power gain is disincentivised, the agent isn't incentivised to overfit and disrupt our AU landscape.

(The CCC didn't make reference to overfitting.)


If A is true then B will be true.


If A is false B will be false.

The conclusion doesn't follow from the premise.
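For concreteness, the invalid step here ("denying the antecedent") has an explicit counterexample: a quick truth-table enumeration finds a row where A → B holds and A is false, yet B is still true.

```python
from itertools import product

# Rows where the premises hold (A -> B, and not A) but B is true anyway.
counterexamples = [(a, b) for a, b in product([False, True], repeat=2)
                   if ((not a) or b)   # A -> B
                   and (not a)         # A is false
                   and b]              # ...yet B is true
assert counterexamples == [(False, True)]
```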


Without even knowing who we are or what we want, the agent's actions preserve our attainable utilities.

Note that preserving our attainable utilities isn't a good thing, it's just not a bad thing.

Issues: Attainable utilities indefinitely 'preserved' are wasted.

Possible issues: If an AI just happened to discover a cure for cancer, we'd probably want to know the cure. But if an AI didn't know what we wanted, and just focused on preserving utility*, then (perhaps as a side effect of considering both that we might want to know the cure, and might not want to know the cure) it might not tell us because that preserves utility. (The AI might operate on a framework that distinguishes between action and inaction, in a way that means it doesn't do things that might be bad, at the cost of not doing things that might be good.)

*If we are going to calculate something and a reliable source (which has already done the calculation) tells us the result, we can save on energy (and preserve resources that can be converted into utility) by not doing the calculation. In theory this could include not only arithmetic, but simulations of different drugs or cancer treatments to come up with better options.


We can tell it:

Is this a metaphor for making an 'agent' with that goal, or actually creating an agent that we can give different commands to and switch out/modify/add to its goals? (Why ask it to 'make paperclips' if that's dangerous, when we can ask it to 'make 100 paperclips'?)


Narrowly improve paperclip production efficiency <- This is the kind of policy AUP_conceptual is designed to encourage and allow. We don't know if this is the optimal policy, but by CCC, the optimal policy won't be catastrophic.

Addressed in 1.


Imagine I take over a bunch of forever inaccessible stars and jumble them up. This is a huge change in state, but it doesn't matter to us.

It does a little bit.

It means we can't observe them for astronomical purposes. But this isn't the same as losing a telescope looking at them - it's (probably) permanent, and maybe we learn something different from it. We learn that stars can be jumbled up. This may have physics/stellar engineering consequences, etc.


AUP_conceptual solves this "locality" problem by regularizing the agent's impact on the nearby AU landscape.

Nearby from its perspective? (From a practical standpoint, if you're close to an airport you're close to a lot of places on earth, that you aren't from a 'space' perspective.)


For past-impact measures, it's not clear that their conceptual thrusts are well-aimed, even if we could formalize everything correctly. Past approaches focus either on minimizing physical change to some aspect of the world or on maintaining ability to reach many world states.

If there's a limited amount of energy, then using energy limits the ability to reach many world states - perhaps in a different sense than above. If there's a machine that can turn all pebbles into something else (obsidian, precious stones, etc.) but it takes a lot of energy, then using up energy limits the number of times it can be used. (This might seem quantifiable: moving the world* from containing 101 units of energy to 99 units has a clear effect on how many times the machine can be used, if it requires 100, or 10, units per use. But this isn't robust against random factors decreasing energy (or increasing it), or future improvements in the energy efficiency of the machine - if the cost is brought down to 1 unit of energy, then using up 2 units prevents it from being used twice.)
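Spelling out the arithmetic of that parenthetical (toy numbers from the text; the "machine" and its costs are of course hypothetical):

```python
def uses_left(energy, cost_per_use):
    # How many times the pebble-machine can still run.
    return energy // cost_per_use

# Machine costs 100 units: spending 2 forecloses the only possible use.
assert uses_left(101, 100) == 1 and uses_left(99, 100) == 0
# Machine costs 10 units: the same spend loses only one use.
assert uses_left(101, 10) == 10 and uses_left(99, 10) == 9
# Efficiency later improves to 1 unit: the old spend now cost two uses.
assert uses_left(101, 1) - uses_left(99, 1) == 2
```

The point being that the "impact" of spending a fixed amount of energy isn't a fixed number; it depends on the (changeable) cost per use.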

*Properly formalizing this should take a lot of other things into account, like 'distant' and notions of inaccessible regions of space, etc.

Also the agent might be concerned with flows rather than actions.* We have an intuitive notion that 'building factories increases power', but what about redirecting a river/stream/etc. with dams or digging new paths for water to flow? What does the agent do if it unexpectedly gains power by some means, or realizes its paperclip machines can be used to move strawberries/make a copy of itself which is weaker but less constrained? Can the agent make a machine that makes paperclips/make making paperclips easier?

*As a consequence of this being a more effective approach - it makes certain improvements obvious. If you have a really long commute to work, you might wish you lived closer to your work. (You might also be aware that houses closer to your work are more expensive.) Humans are good at picking up on this kind of low hanging fruit, so a capable agent that thinks about process, seeing 'opportunities to gain power', is of some general concern. In this case because an agent that tries to minimize reducing/affecting** other agents' attainable utility, without knowing/needing to know about other agents, is somewhat counterintuitive.

**It's not clear if increasing shows up on the AUP map, or how that's handled.


Therefore, I consider AUP to conceptually be a solution to impact measurement.
Wait! Let's not get ahead of ourselves! I don't think we've fully bridged the concept/execution gap.
However for AUP, it seems possible - more on that later.

I appreciate this distinction being made. A post that explains the intuitions behind an approach is very useful, and my questions about the approach may largely relate to implementation details.


AUP aims to prevent catastrophes by stopping bad agents from gaining power to do bad things, but it symmetrically impedes otherwise-good agents.

A number of my comments above were anticipated then.

Bayesian Evolving-to-Extinction
we can think of Bayes' Law as myopically optimizing per-hypothesis, uncaring of overall harm to predictive accuracy.

Or just bad implementations do this - predict-o-matic as described sounds like a bad idea, and like it doesn't contain hypotheses, so much as "players"*. (And the reason there'd be a "side channel" is to understand theories - the point of which is transparency, which, if accomplished, would likely prevent manipulation.)

We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.

This seems a strange thing to imagine - how can fighting occur, especially on a training set? (I can almost imagine neurons passing on bad input, but a) it seems like gradient descent would get rid of that, and b) it's not clear where the "tickets" are.)

*I don't have a link to the claim, but it's been said before that 'the math behind Bayes' theorem requires each hypothesis to talk about all of the universe, as opposed to human models that can be domain limited.'
