How do you try to discourage all "deliberate mistakes"?
1. Make something that has a goal. Does AlphaGo make deliberate mistakes at Go? Or does it try to win, and always make the best move* (with the possible limitation that it might not be as good at playing from positions it wouldn't play itself into)?
*This may be different from 'maximize score, or wins long term'. If you try to avoid teaching your opponent how to play better while seeking out wins, there can be a 'try to meta game' approach, though this might require games to have the right structure, especially in training, to create a tournament focus rather than a game focus. And I would guess it is game focused, rather than tournament focused.
Why do you suppose it's not an agent? Isn't that essentially the question of inner optimizers? IE, does it get its own goals? Is it just trying to predict?
A fair point. Dealing with this at the level of 'does it have goals' is a question worth asking. I think that it, like AlphaGo, isn't engaging in particularly deliberate action, because I don't think it exists in a way that lets it do that, or learn to do that.
You think of the spelling errors as deception. Another way of characterizing it might be 'trying to speak the lingo'. For example, we might think of it as an agent that, if it chatted with you for a while and you don't use words like 'aint' a lot, might shift to not using words like that around you. (Is an agent that "knows its audience" deceptive? Maybe yes, maybe no.)
You think that there is a correct way to spell words. GPT might be more agnostic. For example (it's weird to not put this in terms of prediction), if another version of GPT (GPT-Speller) somehow 'ignored context', or 'factored it "better"', then we might imagine Speller would spell words right with some probability. You and I understand that 'words are spelled (mostly) one way'. But Speller might come up with words as probability distributions over strings, spelling things right most of the time (if the dataset has them spelled that way most of the time), but always getting them wrong sometimes because it:
**Maybe some new (or existing) methods might be required to fix this? The issue of 'imperfect feedback' sounds like something that's (probably) been an issue before, and not just in conjunction with the word 'Goodhart'.
I also lean towards 'this thing was created, and given something like a goal, and it's going to keep doing that goal-like thing'. If it 'spells things wrong to fit in', that's because it was trained as a predictor, not a writer. If we want something to write, yeah, figuring out how to train that might be hard. If you want something out of GPT that differs from the objective 'predict', then maybe GPT needs to be modified, if prompting it correctly doesn't work. Given the way it 'can respond to prompts', characterizing it as 'deceptive' might make sense under some circumstances*, but if you're going to look at it that way, training something to do 'prediction' (of original text) and then having it 'write' is systematically going to result in 'deception', because it has been trained to be a chameleon. To blend in. To say what whoever wrote the string it is being tested against at the moment would say. Its abilities are shocking and it's easy to see them in an 'action framework'. However, if it developed a model of the world, and it was possible to factor that out from the goal, then pulling the model out and getting 'the truth' is possible. But the two might not be separable. If trained on, say, "a flat earther dataset", will it say "the earth is round"? Can it actually achieve insight?
If you want a good writer, train a good writer. I'm guessing garbage in, garbage out, is an AI rule as much as straight up programming.*** If we give something the wrong rewards, the system will be gamed (absent a system (successfully) designed and deployed to not do that).
*i.e., it might have a mind, but it also might not. Rather it might just be that
***More because the AI has to 'figure out' what it is that you want, from scratch.
If GPT tells us truthfully when asked 'is this spelled correctly: [string]', then as deception, that's probably not an issue. As far as deception goes...arguably it's 'deceiving' everyone all the time that it is a human (assuming most text in its corpus is written by humans, and most prompts match that), or trying to. If it thinks it's supposed to play the part of someone who is bad at spelling, it might be hard to read.
(I haven't heard of it making any new scientific discoveries*. Though if it hasn't read a lot of papers, it could be trained...)
*This would be surprising, and might change the way I look at it. If a predictor can do that, what else can it do, and is the distinction between an agent and a predictor a meaningful one? Maybe not. Though pre-registration might be key here. If most of the time it just produces awful or mediocre papers, then maybe it's just a 'monkey at a typewriter'.
The most useful definition of "mesa-optimizer" doesn't require them to perform explicit search, contrary to the current standard.
And presumably, the extent to which search takes place isn't important as a measure of risk or of optimization. (In other words, it's not a part of the definition, and it shouldn't be a part of the definition.)
Some of the reasons we expect mesa-search also apply to mesa-control more broadly.
expect mesa-search might be a problem?
Highly knowledge-based strategies, such as calculus, which find solutions "directly" with no iteration -- but which still involve meaningful computation.
This explains 'search might not be the only problem' rather well (even if it isn't the only alternative).
Dumb lookup tables.
Hm. Based on earlier:
Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.
It sounds like there's also a risk of smart lookup tables. That might not be the right terminology, but 'look up tables which contain really effective things', even if the tables themselves just execute and don't change, seems worth pointing out somehow.
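To make the distinction concrete, here is a minimal sketch (my own toy, not anything from the post) of two controllers that steer a temperature toward a target without doing any search: a thermostat-style rule and a precomputed lookup table. The names `thermostat_controller`, `lookup_controller`, `TABLE`, and `simulate`, and the one-degree-per-step dynamics, are all assumptions made up for illustration.

```python
# Toy sketch: two "dumb" controllers that still steer effectively.
# All names and the dynamics below are hypothetical illustrations.

TARGET = 20

def thermostat_controller(temperature):
    """Thermostat-like rule: compare against the target and react."""
    if temperature < TARGET:
        return "heat"
    if temperature > TARGET:
        return "cool"
    return "idle"

# A lookup table written ahead of time by some designer. The table does no
# computation of its own; whatever "effectiveness" it has was put there.
TABLE = {t: ("heat" if t < TARGET else "cool")
         for t in range(0, 41) if t != TARGET}

def lookup_controller(temperature):
    """Memorized interventions: look the state up, default to doing nothing."""
    return TABLE.get(round(temperature), "idle")

def simulate(controller, temperature, steps):
    """Assumed toy dynamics: each action moves the temperature by one degree."""
    effect = {"heat": 1.0, "cool": -1.0, "idle": 0.0}
    for _ in range(steps):
        temperature += effect[controller(temperature)]
    return temperature

print(simulate(thermostat_controller, 15.0, 10))
print(simulate(lookup_controller, 25.0, 10))
```

Neither controller searches or updates, yet both reach the target in this toy setting. The lookup table only covers states its author anticipated, which is one way to picture its 'intelligence' as belonging to whoever built it.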
I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn't been trained on?
The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don't search are truly less concerning, this just means it's an easier case to guard against. That's not an argument against including them in the definition of the inner alignment problem.
A controller, mesa- or otherwise, may be a tool another agent creates or employs to obtain their objectives. (For instance, someone might create malware that hacks your thermostat to build a bigger botnet (yay Internet of Things!). It might be better to think of the 'intelligence/power/effectiveness of an object for reaching a goal' (even for a rock) as a function of the system, rather than of the parts.)
If you used your chess experience to create a lookup table that could beat me at chess, its 'intelligence' would be an expression of your intelligence/optimization.
For non-search strategies, it's even more important that the goal actually simplify the problem as opposed to merely reiterate it; so there's even more reason to think that mesa-controllers of this type wouldn't be aligned with the outer goal.
How does a goal simplify a problem?
My model is that GPT-3 almost certainly is "hiding its intelligence" at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will 'intentionally' continue with more spelling mistakes in what it generates.
Yeah, because its goal is prediction. Within prediction there isn't a right way to write a sentence. It's not a spelling mistake, it's a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of 'seeing through the noise'. You could try going further, and reinforce a particular style, or 'this word is better than that word'.)
Train a model to predict upvotes on Quora, StackExchange, and similar question-answering websites. This serves as a function recognizing "intelligent and helpful responses".
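A toy illustration of 'spelling prediction', under heavy assumptions: the "model" below is just a table of conditional frequencies, the two context labels ("careful", "sloppy") and the tiny corpus are invented, and nothing here reflects how GPT actually works internally. It only shows that a maximum-likelihood predictor reproduces whatever spelling is most common in the kind of text it is conditioned on.

```python
from collections import Counter, defaultdict

def fit(corpus):
    """Count how often each spelling appears under each context label."""
    counts = defaultdict(Counter)
    for context, spelling in corpus:
        counts[context][spelling] += 1
    return counts

def predict(counts, context):
    """Return the empirical distribution over spellings for a context."""
    seen = counts[context]
    total = sum(seen.values())
    return {spelling: n / total for spelling, n in seen.items()}

# Hypothetical corpus: careful writers mostly spell the word one way,
# sloppy writers mostly spell it another way.
corpus = (
    [("careful", "definitely")] * 9 + [("careful", "definately")] * 1
    + [("sloppy", "definitely")] * 3 + [("sloppy", "definately")] * 7
)

model = fit(corpus)
print(predict(model, "careful"))
print(predict(model, "sloppy"))
```

From the predictor's side, "definately" after a sloppy prompt is simply the likelier continuation. Changing that behaviour means changing the target it is trained against (for example, always scoring against the corrected spelling), not asking it to 'try harder'.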
Uh, that's not what I'd expect it to do. If you're worried about deception now, why don't you think that'd make it worse? (If nothing else, are you trying to create GPT-Flattery?)
If this procedure works exceedingly well, causing GPT to "wake up" and be a human-level conversation partner or greater, we should be very worried indeed. (Since we wouldn't then know the alignment of the resulting system, and could be virtually sure that it was an inner optimizer of significant power.)
It's not an agent. It's a predictor. (It doesn't want to make paperclips.)
I think you're anthropomorphizing it.
If you would be interested in participating conditional on us offering pay or prizes, that's also useful to know.
Do you want this feedback at the same address?
The authors prove that EPIC is a pseudometric, that is, it behaves like a distance function, except that it is possible for EPIC(R1, R2) to be zero even if R1 and R2 are different. This is desirable, since if R1 and R2 differ by a potential shaping function, then their optimal policies are guaranteed to be the same regardless of transition dynamics, and so we should report the “distance” between them to be zero.
If EPIC(R1, R2) is thought of as two functions f(g(R1), g(R2)), where g returns the optimal policy of its input, and f is a distance function for optimal policies, then f(OptimalPolicy1, OptimalPolicy2) is a metric?
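For reference, here is what 'pseudometric' amounts to, written out with a generic distance d standing in for EPIC. The first two blocks restate standard definitions and the quoted claim about potential shaping (with Φ the potential and γ the discount). The last line is the standard fact that any pseudometric becomes a true metric once points at distance zero are identified. Whether those equivalence classes line up exactly with 'same optimal policy' is not something the quoted summary settles, so treat that reading as an assumption.

```latex
% Pseudometric axioms for a distance d on reward functions
d(R, R) = 0, \qquad
d(R_1, R_2) = d(R_2, R_1), \qquad
d(R_1, R_3) \le d(R_1, R_2) + d(R_2, R_3)

% Unlike a metric, d(R_1, R_2) = 0 does not force R_1 = R_2.
% Quoted example: potential shaping is reported as distance zero
R_2(s, a, s') = R_1(s, a, s') + \gamma \Phi(s') - \Phi(s)
\;\Longrightarrow\; d(R_1, R_2) = 0

% Identifying R_1 \sim R_2 \iff d(R_1, R_2) = 0 yields a metric on classes
\bar{d}([R_1], [R_2]) = d(R_1, R_2)
```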
One nice thing is that, roughly speaking, rewards are judged to be equivalent if they would generalize to any possible transition function that is consistent with DT. This means that by designing DT appropriately, we can capture how much generalization we want to evaluate.
Can more than one DT be used, so there's more than one measure?
This is a useful knob to have: if we used the maximally large DT, the task would be far too difficult, as it would be expected to generalize far more than even humans can.
There's a maximum?
the exact same answer it would have output without the perturbation.
It always gives the same answer for the last digit?
(The object which is not the object:)
So you just don't do it, even though it feels like a good idea.
More likely people don't do it because they can't, or a similar reason. (The point of saying "My life would be better if I was in charge of the world" is not to serve as a hypothesis, to be falsified.)
Beliefs intervene on action. (Not success, but choice.)
We are biased and corrupted. By taking the outside view on how our own algorithm performs in a given situation, we can adjust accordingly.
The piece seems biased towards the negative.
Calibrate yourself on the flaws of your own algorithm, and repair or minimize them.
Something like 'performance' seems more key than "flaws". Flaws can be improved, but so can working parts.
And the AI knows its own algorithm.
An interesting premise. Arguably, if human brains are NGI, this would be a difference between AGI and NGI, which might require justification.
If I'm about to wipe my boss's computer because I'm so super duper sure that my boss wants me to do it, I can consult OutsideView and realize that I'm usually horribly wrong about what my boss wants in this situation. I don't do it.
The premise of "inadequacy" saturates this post.* At best this post characterizes the idea that "not doing bad things" stems from "recognizing them as bad": probabilistically, via past experience, policy-wise (phrased in language suggestive of priors), etc. This sweeps the problem under the rug in favor of "experience" and 'recognizing similar situations'.
In particular, calibrated deference would avoid the problem of fully updated deference.
"Irreversibility" seems relevant to making sure mistakes can be fixed, as does 'experience' in less high stake situations. Returning to the beginning of the post:
You run a country.
Hopefully you are "qualified"/experienced/etc. This is a high stakes situation.**
OutsideView seems like it should be a (function of a) summary of the past, rather than a recursive call.
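A minimal sketch of what I mean, assuming outcomes can be logged per kind of situation. The class name `OutsideView` is borrowed from the post, but the methods, the smoothing prior, and the 0.9 threshold are all made up for illustration.

```python
class OutsideView:
    """A summary of past performance, looked up rather than recomputed."""

    def __init__(self):
        # situation -> [times I was right, times I was confident]
        self._record = {}

    def observe(self, situation, was_correct):
        """Log one past case where I felt sure, and whether I was right."""
        correct, total = self._record.get(situation, [0, 0])
        self._record[situation] = [correct + int(was_correct), total + 1]

    def success_rate(self, situation, prior_correct=1, prior_total=2):
        """Smoothed track record; unseen situations fall back to the prior."""
        correct, total = self._record.get(situation, [0, 0])
        return (correct + prior_correct) / (total + prior_total)

def should_act(view, situation, threshold=0.9):
    """Act only if the track record for this kind of situation is good."""
    return view.success_rate(situation) >= threshold

view = OutsideView()
for outcome in [True] + [False] * 9:      # right once in ten tries
    view.observe("wipe boss's computer", outcome)
for _ in range(20):                        # right every time
    view.observe("file weekly report", True)

print(should_act(view, "wipe boss's computer"))
print(should_act(view, "file weekly report"))
```

The point of the design is that `success_rate` never calls the decision procedure it is evaluating. It only reads a record, so there is no regress, but it is also only as good as the past cases it has and the way situations get bucketed as 'similar'.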
While reading this post...
*In contrast to the usual calls for 'maximizing' "expected value". While this point has been argued before, it seems to reflect an idea about how the world works (like a prior, or something learned).
**Ignoring the question of "what does it mean to run a country if you don't set all the rules", because that seems unrelated to this essay.
What term do people use for the definition of alignment in which A is trying to achieve H's goals
Sounds like it should be called goal alignment, whatever its name happens to be.
The thing about Montezuma's revenge and similar hard exploration tasks is that there's only one trajectory you need to learn; and if you forget any part of it you fail drastically; I would by default expect this to be better than adversarial dynamics / populations at ensuring that the agent doesn't forget things.
But is it easier to remember things if there's more than one way to do them?
Bumping into the human makes them disappear, reducing the agent's control over what the future looks like. This is penalized.
Decreases or increases?
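To make the question concrete, here is a sketch of one formulation of the AUP penalty as I understand it: compare attainable utility for a set of auxiliary goals after the action against after doing nothing, and penalize the absolute change. This is an assumption about the formulation, not something taken from the post's code; the function names, the averaging, and the Q-values are all hypothetical.

```python
def aup_penalty(q_action, q_noop):
    """Mean absolute change in attainable utility across auxiliary goals.

    q_action[i]: assumed value for auxiliary goal i after taking the action.
    q_noop[i]:   assumed value for auxiliary goal i after doing nothing.
    """
    if len(q_action) != len(q_noop):
        raise ValueError("need one value per auxiliary goal in both lists")
    diffs = [abs(a - n) for a, n in zip(q_action, q_noop)]
    return sum(diffs) / len(diffs)

def aup_reward(task_reward, q_action, q_noop, lam=1.0):
    """Task reward minus the scaled penalty (lam is a hypothetical weight)."""
    return task_reward - lam * aup_penalty(q_action, q_noop)

q_noop = [0.6, 0.5]
loses_control = [0.2, 0.5]   # action lowers attainable utility for goal 0
gains_control = [1.0, 0.5]   # action raises attainable utility for goal 0

print(aup_penalty(loses_control, q_noop))
print(aup_penalty(gains_control, q_noop))
```

With an absolute value, a decrease and an increase of the same size are penalized equally, so under this formulation the answer would be 'both'. A variant that only penalizes increases would replace `abs(a - n)` with `max(a - n, 0)`.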
AUP_starting state fails here, but AUP_stepwise does not.
1. Is "Model-free AUP" the same as "AUP stepwise"?
2. Why does "Model-free AUP" wait for the pallet to reach the human before moving, while the "Vanilla" agent does not?
There is one weird thing that's been pointed out, where stepwise inaction while driving a car leads to not-crashing being penalized at each time step. I think this is because you need to use an appropriate inaction rollout policy, not because stepwise itself is wrong.
That might lead to interesting behavior in a game of chicken.
One interpretation is that AUP is approximately preserving access to states.
I wonder how this interacts with environments where access to states is always closing off. (StarCraft, Go, Chess, etc. - though it's harder to think of how state/agent are 'contained' in these games.)
To be frank, this is crazy. I'm not aware of any existing theory explaining these results, which is why I proved a bajillion theorems last summer to start to get a formal understanding (some of which became the results on instrumental convergence and power-seeking).
Is the code for the SafeLife PPO-AUP stuff you did on GitHub?
CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking". So its contrapositive is indeed as stated.
That makes sense. One of the things I like about this approach is that it isn't immediately clear what else could be a problem, and that might just be implementation details or parameters: corrigibility from limited power only works if we make sure that power is low enough that we can turn the agent off; if the agent will acquire power when that's the only way to achieve its goal, rather than stopping at or before some limit, then it might still acquire power and be catastrophic*; etc.
*Unless power seeking behavior is the cause of catastrophe, rather than having power.
Sorry for the ambiguity.
It wasn't ambiguous; I meant to gesture at stuff like 'astronomical waste' (and waste on smaller scales): areas where we do want resources to be used. This was addressed at the end of your post already:
So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.
-but I wanted to highlight the area where we might want powerful aligned agents, rather than AUP agents that don't seek power.
What do you mean by "AUP map"? The AU landscape?
That is what I meant originally, though upon reflection a small distinction could be made:
Territory: AU landscape*
Map: AUP map (an AUP agent's model of the landscape)
*Whether or not this is thought of as 'Territory' or a 'map', conceptually AUP agents will navigate (and/or create) a map of the AU landscape. (If AU landscape is a map, then AUP agents may navigate a map of a map. There also might be better ways this distinction could be made, like AU landscape is a style/type of map, just like there are maps of elevation and topology.)
The idea is it only penalizes expected power gain.
Gurkenglas previously commented that they didn't think that AUP solved 'agent learns how to convince people/agents to do things'. While it's not immediately clear how an agent could happen to find out how to convince humans of anything (the super-intelligent persuader), if an agent obtained that power, its continuing to operate could constitute a risk. (Though further up this comment I brought up the possibility that "power seeking behavior is the cause of catastrophe, rather than having power." This doesn't seem likely in its entirety, but seems possible in part. That is, powerful and not power seeking might not be as dangerous as powerful and power seeking.)