All of Pattern's Comments + Replies

Adele Lopez's Shortform

I think we get enough things referencing quantum mechanics that we should probably explain why that doesn't work (if it doesn't) rather than just downvoting and moving on.

4gwern8mo
It probably does work with a Sufficiently Powerful™ quantum computer, if you could write down a meaningful predicate which can be computed: https://en.wikipedia.org/wiki/Counterfactual_quantum_computation
1Adele Lopez8mo
Haha yeah, I'm not surprised if this ends up not working, but I'd appreciate hearing why.
Optimization Concepts in the Game of Life
We realized that if we consider an empty board an optimizing system then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells)

Ah. I interpreted the statement about the empty board as being one of:

A small random perturbation will probably be non-viable/collapse back to the empty board. (Whereas patterns that are viable don't (necessarily) have this property.)

I then asked whether the bottle cap example had the same robustness.

2Vika8mo
Ah I see, thanks for the clarification! The 'bottle cap' (block) example is robust to removing any one cell but not robust to adding cells next to it (as mentioned in Oscar's comment [https://www.lesswrong.com/posts/mL8KdftNGBScmBcBg/optimization-concepts-in-the-game-of-life?commentId=e3JDshRfXucv22ovJ] ). So most random perturbations that overlap with the block will probably destroy it.
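
The robustness claims in this exchange can be checked directly on small patterns. What follows is a minimal sketch, assuming an unbounded board represented as a set of live (row, col) cells; the `step` helper and the specific perturbations are illustrative choices, not anything taken from the post.

```python
from collections import Counter

def step(live):
    """Advance a set of live (row, col) cells by one Game of Life generation."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}

BLOCK = {(0, 0), (0, 1), (1, 0), (1, 1)}  # the 'bottle cap' (block) still life

# Removing any single cell: the block reforms after one generation.
recovers = all(step(BLOCK - {cell}) == BLOCK for cell in BLOCK)

# Adding one cell beside the block: the next generation is no longer the block.
survives_addition = step(BLOCK | {(0, 2)}) == BLOCK

# Two adjacent live cells on an otherwise empty board die out immediately.
pair_dies = step({(5, 5), (5, 6)}) == set()

print(recovers, survives_addition, pair_dies)  # True False True
```

This only looks one generation ahead, so it shows that the added cell disrupts the block, not what the pattern eventually settles into.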
Optimization Concepts in the Game of Life
An empty board is also an example of an optimizing system that is robust to adding non-viable collections of live cells (e.g., fewer than 3 live cells next to each other). 

And the 'bottle cap' example is not (robust to adding cells, or cells colliding* with it)? But if it was, then it would be an 'optimizing system'?

*spreading out, and interacting with it

2Vika8mo
Thanks for pointing this out! We realized that if we consider an empty board an optimizing system then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this. The 'bottle cap' example would be an optimizing system if it was robust to cells colliding / interacting with it, e.g. being hit by a glider (similarly to the eater).
NLP Position Paper: When Combatting Hype, Proceed with Caution
(Weird meta-note: Are you aware of something unusual about how this comment is posted? I saw a notification for it, but I didn't see it in the comments section for the post itself until initially submitting this reply. I'm newish to posting on Lightcone forums...)

Ah. When you say lightcone forums, what site are you on? What does the URL look like?


For this point, I'm not sure how it fits into the argument. Could you say more?

It's probably a tangent. The idea was:

1) Criticism is great.

2) Explaining how that could be improved is marginally better. (I then exp... (read more)

1Sam Bowman8mo
Forum I can see the comment at the comment-specific AF permalink here: https://www.alignmentforum.org/posts/RLHkSBQ7zmTzAjsio/nlp-position-paper-when-combatting-hype-proceed-with-caution?commentId=pSkdAanZQwyT4Xyit#pSkdAanZQwyT4Xyit ...but I can't see it among the comments at the base post URL here. https://www.alignmentforum.org/posts/RLHkSBQ7zmTzAjsio/nlp-position-paper-when-combatting-hype-proceed-with-caution From my experience with the previous comment, I expect it'll appear at the latter URL once I reply? [Old technique] had [problem]... Ah, got it. That makes sense! I'll plan to say a bit more about when/how it makes sense to cite older evidence in cases like this.
NLP Position Paper: When Combatting Hype, Proceed with Caution

The paper makes a slightly odd multi-step argument to try to connect to active debates in the field:

This comment is some quick feedback on those:

Weirdly, this even happens in papers that themselves to show positive results involving NNs.

 

citations to failures in old systems that we've since improved upon significantly.

Might not be a main point, but this could be padded out with an explanation of how something like that could be marginally better. Like adding:

"As opposed to explaining how that is relevant today, like:

[Old technique] had [problem]. As [... (read more)

1Sam Bowman9mo
Thanks! (Typo fixed.) For this point, I'm not sure how it fits into the argument. Could you say more? Yeah, this is a missed opportunity that I haven't had the time/expertise to take on. There probably are comparable situations in the histories of other applied research fields, but I'm not aware of any good analogies. I suspect that a deep dive into some history-and-sociology-of-science literature would be valuable here. I think this kind of discussion is already well underway within NLP and adjacent subfields like FaCCT. I don't have as much to add there. (Weird meta-note: Are you aware of something unusual about how this comment is posted? I saw a notification for it, but I didn't see it in the comments section for the post itself until initially submitting this reply. I'm newish to posting on Lightcone forums...)
Agency in Conway’s Game of Life
It's remarkable that googling "thermodynamics of the game of life" turns up zero results. 

It's not obvious that thermodynamics generalizes to the game of life, or what the equivalents of energy or order would be: at first glance it has perpetual motion machines ("gliders").

1Alex Flint1y
Yup, Life does not have time-reversibility, so it does not preserve the phase space volume under time evolution, so it does not obey the laws of thermodynamics that exist under our physics. But one could still investigate whether there is some analog of thermodynamics in Life. There is also a cellular automaton called Critters that does have time reversibility.
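
One way to see the lack of time-reversibility concretely: two different board states can have the same successor, so the update rule cannot be run backwards. A minimal sketch, assuming an unbounded board stored as a set of live cells (the `step` helper is an illustrative implementation, not something from the post):

```python
from collections import Counter

def step(live):
    """Advance a set of live (row, col) cells by one Game of Life generation."""
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}

empty = set()
lone_cell = {(0, 0)}

# Two distinct states map to the same next state, so the map is not injective.
print(empty != lone_cell, step(empty) == step(lone_cell))  # True True
```

Since the empty board has more than one predecessor, no inverse rule can recover the past from the present, which is the sense in which phase space volume is not preserved.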
rohinmshah's Shortform

This was a good post. I'd bookmark it, but unfortunately that functionality doesn't exist yet.* (Though if you have any open source bookmark plugins to recommend, that'd be helpful.) I'm mostly responding to say this though:

Designing Recommender Systems to Depolarize

While it wasn't otherwise mentioned in the abstract of the paper (above), this was stated once:

This paper examines algorithmic depolarization interventions with the goal of conflict transformation: not suppressing or eliminating conflict but moving towards more constructive conflict.

I though th... (read more)

2Rohin Shah1y
Possibly, but if so, I haven't seen them. My current belief is "who knows if there's a major problem with recommender systems or not". I'm not willing to defer to them, i.e. say "there probably is a problem based on the fact that the people who've studied them think there's a problem", because as far as I can tell all of those people got interested in recommender systems because of the bad arguments and so it feels a bit suspicious / selection-effect-y that they still think there are problems. I would engage with arguments they provide and come to my own conclusions (whereas I probably would not engage with arguments from other sources). No. I just have anecdotal experience + armchair speculation, which I don't expect to be much better at uncovering the truth than the arguments I'm critiquing.
Gradations of Inner Alignment Obstacles
How do you try to discourage all "deliberate mistakes"? 

1. Make something that has a goal. Does AlphaGo make deliberate mistakes at Go? Or does it try to win, and always make the best move* (with possibly the limitation that it might not be as good at playing from positions it wouldn't play itself into)?

*This may be different from 'maximize score, or wins long term'. If you try to avoid teaching your opponent how to play better, while seeking out wins, there can be a 'try to meta game' approach - though this might r... (read more)

2Abram Demski1y
I'm a bit confused about part of what we're disagreeing on, so, context trace: I originally said: Then you said: Then I said: Then you said: 1. It seems like the discussion was originally about hidden information, not deliberate mistakes -- deliberate mistakes were just an example of GPT taking information-hiding actions. I spuriously asked how to avoid all deliberate mistakes when what I intended had more to do with hidden information 2. The claim I was trying to support in that paragraph was (as stated in the directly preceding paragraph) it isn't easy to make it outer-aligned. AlphaGo isn't outer-aligned. 3. AlphaGo could be hiding a lot of information, like GPT. In AlphaGo's case, information which AlphaGo doesn't reveal to the user would include a lot of concepts about the state of the game, which aren't revealed to human users easily. This isn't particularly sinister, but, it is hidden information. 4. A hypothetical more-data-efficient AlphaGo which was trained only on playing humans (rather than self-play) could have an internal psychological model of humans. This would be "inaccessible information". It could also implement deliberate deception to increase its win rate. I get the vibe that I might be missing a broader point you're trying to make. Maybe something like "you get what you ask for" -- you're pointing out that hiding information like this isn't at all surprising given the loss function, and different loss functions imply different behavior, often in a straightforward way. If this were your point, I would respond: * The point of the inner alignment problem is that, it seems, you don't always get what you ask for. * I'm not trying to say it's surprising that GPT would hide things in this way. Rather, this is a way of thinking about how GPT thinks and how sophisticated/coherent its internal world-model is (in contrast to what we can see by asking it questions). This seems like i
Gradations of Inner Alignment Obstacles
The most useful definition of "mesa-optimizer" doesn't require them to perform explicit search, contrary to the current standard.

And presumably, the extent to which search takes place isn't important, a measure of risk, or optimizing. (In other words, it's not a part of the definition, and it shouldn't be a part of the definition.)


Some of the reasons we expect mesa-search also apply to mesa-control more broadly.

expect mesa-search might be a problem?


Highly knowledge-based strategies, such as calculus, which find solutions "
... (read more)
2Abram Demski1y
What I intended there was "expect mesa-search to happen at all" (particularly, mesa-search with its own goals) Sorry, by "dumb" I didn't really mean much, except that in some sense lookup tables are "not as smart" as the previous things in the list (not in terms of capabilities, but rather in terms of how much internal processing is going on). For example, you can often get better results out of RL methods if you include "shaping" rewards, which reward behaviors which you think will be useful in productive strategies, even though this technically creates misalignment and opportunities for perverse behavior. For example, if you wanted an RL agent to go to a specific square, you might do well to reward movement toward that square. Similarly, part of the common story about how mesa-optimizers develop is: if they have explicitly represented values, these same kinds of "shaping" values will be adaptive to include, since they guide the search toward useful answers. Without this effect, inner search might not be worthwhile at all, due to inefficiency. Yes, I agree that GPT's outer objective fn is misaligned with maximum usefulness, and a more aligned outer objective would make it do more of what we would want. However, I feel like your "if you don't want that, then..." seems to suppose that it's easy to make it outer-aligned. I don't think so. The spelling example is relatively easy (we could apply an automated spellcheck to all the data, which would have some failure rate of course but is maybe good enough for most situations -- or similarly, we could just apply a loss function for outputs which aren't spelled correctly). But what's the generalization of that?? How do you try to discourage all "deliberate mistakes"? I don't think it would be entirely aligned by any means. My prediction is that it'd be incentivized to reveal information (so you could say it's differentially more "honest" relative to GPT-3 trained only on predictive accuracy). I agree that in the ext
Looking for adversarial collaborators to test our Debate protocol

If you would be interested in participating conditional on us offering pay or prizes, that's also useful to know.

Do you want this feedback at the same address?

1Beth Barnes2y
Yep, or in comments. Thanks!
[AN #106]: Evaluating generalization ability of learned reward models
The authors prove that EPIC is a pseudometric, that is, it behaves like a distance function, except that it is possible for EPIC(R1, R2) to be zero even if R1 and R2 are different. This is desirable, since if R1 and R2 differ by a potential shaping function, then their optimal policies are guaranteed to be the same regardless of transition dynamics, and so we should report the “distance” between them to be zero.

If EPIC(R1, R2) is thought of as two functions f(g(R1), g(R2)), where g returns the optimal policy of its input, and f is a distance ... (read more)

3Rohin Shah2y
The authors don't prove it, but I believe yes, as long as D_S and D_A put support over the entire state space / action space (maybe you also need D_T to put support over every possible transition). I usually think of this as "EPIC is a metric if defined over the space of equivalence classes of reward functions". Yes. For finite, discrete state/action spaces, the uniform distribution over (s, a, s') tuples has maximal entropy. However, it's not clear that that's the worst case for EPIC.
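
The potential-shaping fact quoted from the newsletter can be illustrated on a toy problem. This is a sketch under stated assumptions: the three-state deterministic MDP, the reward, and the potential values in `PHI` are all made up for illustration, and the code checks only that shaping leaves the greedy optimal policy unchanged. It does not compute EPIC itself.

```python
GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = [0, 1]  # 0 = stay, 1 = move to the next state (cyclically)

def next_state(s, a):
    return s if a == 0 else (s + 1) % 3

def base_reward(s, a, s2):
    # Hypothetical reward: 1 for landing in (or staying in) state 2.
    return 1.0 if s2 == 2 else 0.0

PHI = {0: 5.0, 1: -3.0, 2: 7.0}  # arbitrary potential function (hypothetical values)

def shaped_reward(s, a, s2):
    # Potential shaping: R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s)
    return base_reward(s, a, s2) + GAMMA * PHI[s2] - PHI[s]

def q_values(reward, iters=1000):
    """Q-value iteration for the deterministic toy MDP."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(iters):
        q = {
            (s, a): reward(s, a, next_state(s, a))
            + GAMMA * max(q[(next_state(s, a), b)] for b in ACTIONS)
            for s in STATES
            for a in ACTIONS
        }
    return q

def greedy_policy(q):
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES]

q_base = q_values(base_reward)
q_shaped = q_values(shaped_reward)
print(greedy_policy(q_base), greedy_policy(q_shaped))  # [1, 1, 0] [1, 1, 0]
```

In this example the shaped Q-values equal the base Q-values minus Phi(s), so the ranking of actions within each state is unchanged, which is why a distance of zero between the two rewards is the desired report.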
The ground of optimization
the exact same answer it would have output without the perturbation.

It always gives the same answer for the last digit?

1Alex Flint2y
Well we could always just set the last digit to 0 as a post-processing step to ensure perfect repeatability. But point taken, you're right that most numerical algorithms are not quite as perfectly stable as I claimed.
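
A small illustration of the point about last digits, assuming standard IEEE 754 double-precision floats. Reordering a sum is used here as a stand-in for a perturbation, and the choice of 12 decimal places is arbitrary.

```python
# The same three numbers, summed in two different orders.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)                        # False: the results differ in the last digit
print(round(a, 12) == round(b, 12))  # True: rounding as a post-processing step hides it
```

Rounding helps in this case, but it is not a general guarantee: two results that straddle a rounding boundary would still come out different after post-processing.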
Corrigibility as outside view

(The object which is not the object:)

So you just don't do it, even though it feels like a good idea.

More likely people don't do it because they can't, or a similar reason. (The point of saying "My life would be better if I was in charge of the world" is not to serve as a hypothesis, to be falsified.)

(The object:)

Beliefs intervene on action. (Not success, but choice.)


We are biased and corrupted. By taking the outside view on how our own algorithm performs in a given situation, we can adjust accordingly.

The piece seems biased towards... (read more)

2Alex Turner2y
Yeah, I feel a bit confused about this idea still (hence the lack of clarity), but I'm excited about it as a conceptual tool. I figured it would be better to get my current thoughts out there now, rather than to sit on the idea for two more years.
What is the alternative to intent alignment called?
What term do people use for the definition of alignment in which A is trying to achieve H's goals

Sounds like it should be called goal alignment, whatever its name happens to be.

5Rob Bensinger2y
That would imply that 'intent alignment' is about aligning AI systems with what humans intend. But 'intent alignment' is about making AI systems intend to 'do the good thing'. (Where 'do the good thing' could be cashed out as 'do what some or all humans want', 'achieve humans' goals', or many other things.) The thing I usually contrast with 'intent alignment' (≈ the AI's intentions match what's good) is something like 'outcome alignment' (≈ the AI's causal effects match what's good). As I personally think about it, the value of the former category is that it's more clean and natural, while being broad enough to include what I'd consider the most important problems today; while the value of the latter category is that it's closer to what we actually care about as a species. 'Outcome alignment' as defined above also has the problem that it doesn't distinguish alignment work from capabilities work. In my own head, I would think of research as more capabilities-flavored if it helps narrow the space of possible AGI outcomes to outcomes that are cognitively harder to achieve; and I'd think of it as more alignment-flavored if it helps narrow the space of possible AGI outcomes to outcomes that are more desirable within a given level of 'cognitively hard to achieve'.
[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement
The thing about Montezuma's Revenge and similar hard exploration tasks is that there's only one trajectory you need to learn; and if you forget any part of it you fail drastically; I would by default expect this to be better than adversarial dynamics / populations at ensuring that the agent doesn't forget things.

But is it easier to remember things if there's more than one way to do them?

2Rohin Shah2y
Unclear, seems like it could go either way. If you aren't forced to learn all the ways of doing the task, then you should expect the neural net to learn only one of the ways. So maybe it's that the adversarial nature of OpenAI Five forced it to learn all the ways, and it was then paradoxically easier to remember all of the ways than just one of the ways.
Attainable Utility Preservation: Empirical Results
Bumping into the human makes them disappear, reducing the agent's control over what the future looks like. This is penalized.

Decreases or increases?

AUP_{starting state} fails here, but AUP_{stepwise} does not.

Questions:

1. Is "Model-free AUP" the same as "AUP stepwise"?

2. Why does "Model-free AUP" wait for the pallet to reach the human before moving, while the "Vanilla" agent does not?

There is one weird thing that's been pointed out, where stepwise inaction while driving a car leads to not-crashing being penalized
... (read more)
1Alex Turner2y
Decreases. Here, the "human" is just a block which paces back and forth. Removing the block removes access to all states containing that block. Yes. See the paper for more details. I'm pretty sure it's just an artifact of the training process and the penalty term. I remember investigating it in 2018 and concluding it wasn't anything important, but unfortunately I don't recall the exact explanation. It would still try to preserve access to future states as much as possible with respect to doing nothing that turn. Here [https://github.com/neale/rl-safety/tree/new-code]. Note that we're still ironing things out, but the preliminary results have been pretty solid.
Attainable Utility Preservation: Concepts
CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking". So its contrapositive is indeed as stated.

That makes sense. One of the things I like about this approach is that it isn't immediately clear what else could be a problem, and that might just be implementation details or parameters: corrigibility from limited power only works if we make sure that power is low enough we can turn it off, if the agent will acquire power if that's the only way to achieve its goal rather than stoppin... (read more)

1Alex Turner2y
Yeah. I have the math for this kind of tradeoff worked out - stay tuned! I think this is true, actually; if another agent already has a lot of power and it isn't already catastrophic for us, their continued existence isn't that big of a deal wrt the status quo. The bad stuff comes with the change in who has power. The act of taking away our power is generally only incentivized so the agent can become better able to achieve its own goal. The question is, why is the agent trying to convince us of something / get someone else to do something catastrophic, if the agent isn't trying to increase its own AU?
Attainable Utility Preservation: Concepts

I liked this post, and look forward to the next one.


More specific, and critical commentary (It seems it is easier to notice surprise than agreement):

(With embedded footnotes)

1.

If the CCC is right, then if power gain is disincentivised, the agent isn't incentivised to overfit and disrupt our AU landscape.

(The CCC didn't make reference to overfitting.)

Premise:

If A is true then B will be true.

Conclusion:

If A is false B will be false.


The conclusion doesn't follow from the premise.


2.

Without even knowing who we are or what we want, the agent'
... (read more)
2Alex Turner2y
CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking". So its contrapositive is indeed as stated. I meant "preserving" as in "not incentivized to take away power from us", not "keeps us from benefitting from anything", but you're right about the implication as stated. Sorry for the ambiguity. Metaphor. Nearby wrt this kind of "AU distance/practical perspective", yes. Great catch. Great thoughts. I think some of this will be answered in a few posts by the specific implementation details. What do you mean by "AUP map"? The AU landscape? The idea is it only penalizes expected power gain.
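
The logical point at issue in this exchange can be checked by brute force over truth values. A minimal sketch, where `implies` is an illustrative helper for material implication: "if A then B" is equivalent to its contrapositive "if not B then not A", but not to its inverse "if not A then not B".

```python
from itertools import product

def implies(p, q):
    """Material implication: 'if p then q'."""
    return (not p) or q

rows = list(product([False, True], repeat=2))

# The contrapositive agrees with the original on every truth assignment.
contrapositive_ok = all(implies(a, b) == implies(not b, not a) for a, b in rows)

# The inverse does not: list assignments where 'A -> B' holds but 'not A -> not B' fails.
inverse_counterexamples = [
    (a, b) for a, b in rows if implies(a, b) and not implies(not a, not b)
]

print(contrapositive_ok, inverse_counterexamples)  # True [(False, True)]
```

The single counterexample is the case where A is false and B is true, which is exactly the situation the original objection pointed at.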
Bayesian Evolving-to-Extinction
we can think of Bayes' Law as myopically optimizing per-hypothesis, uncaring of overall harm to predictive accuracy.

Or just bad implementations do this - predict-o-matic as described sounds like a bad idea, and like it doesn't contain hypotheses, so much as "players"*. (And the reason there'd be a "side channel" is to understand theories - the point of which is transparency, which, if accomplished, would likely prevent manipulation.)

We can imagine different parts of the network fighting for control, much like the Bayesia
... (read more)
2Abram Demski2y
You can think of the side-channel as a "bad implementation" issue, but do you really want to say that we have to forego diagnostic logs in order to have a good implementation of "hypotheses" instead of "players"? Going to the extreme, every brain has side-channels such as EEG. But more importantly, as Daniel K pointed out [https://www.lesswrong.com/posts/u9Azdu6Z7zFAhd4rK/bayesian-evolving-to-extinction#H9P5Dd8BSiEYtgrcs] , you don't need the side-channel. If the predictions are being used in a complicated way to make decisions, the hypotheses/players have an incentive to fight each other through the consequences of those decisions. So, the interesting question is, what's necessary for a *good* implementation of this? If the training set doesn't provide any opportunity for manipulation/corruption, then I agree that my argument isn't relevant for the training set. It's most directly relevant for online learning. However, keep in mind also that deep learning might be pushing in the direction of learning to learn. Something like a Memory Network is trained to "keep learning" in a significant sense. So you then have to ask if its learned learning strategy has these same issues, because that will be used on-line. Simplifying the picture greatly, imagine that the second-back layer of neurons is one-neuron-per-ticket. Gradient descent can choose which of these to pay the most attention to, but little else; according to the lottery ticket hypothesis, the gradients passing through the 'tickets' themselves aren't doing that much for learning, besides reinforcing good tickets and weakening bad. So imagine that there is one ticket which is actually malign, and has a sophisticated manipulative strategy. Sometimes it passes on bad input in service of its manipulations, but overall it is the best of the lottery tickets so while the gradient descent punishes it on those rounds, it is more than made up for in other cases. Furthermore, the manipulations of the malign ticket ensu
Attainable Utility Landscape: How The World Is Changed
Going to the green state means you can't get to the purple state as quickly.
On a deep level, why is the world structured such that this happens? Could you imagine a world without opportunity cost of any kind?

In a complete graph, all nodes are directly connected.


Equivalently, we assumed the agent isn't infinitely farsighted (γ<1); if it were, it would be possible to be in "more than one place at the same time", in a sense (thanks to Rohin Shah for this interpretation).

The opposite of this, is that if it were possible for an agen... (read more)

2Alex Turner2y
Surprisingly, unless you're talking about K_1 (complete 1-graph), opportunity cost still exists in K_n (n>1). Each round, you choose where to go next (and you can go to any state immediately). Going to one state next round means you can't go to a different state next round, so for any given action there exists a reward function which incurs opportunity cost. Definition. We say opportunity cost exists at a state s if there exist child states s_1, s_2 of state s such that V*_R(s_1) ≠ V*_R(s_2) for some reward function R. That is, s has successor states with different (optimal) AUs for some reward function. Things get weird here, depending on your theory of identity and how that factors into the planning / reward process? Can you spell this out some more?
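
The definition above can be tried out numerically. This is a sketch under several assumptions that are not in the reply: rewards are state-based, the discount is 0.9, K_n has no self-loops for n > 1, and the lone state of K_1 can only stay put. The check is for one given reward function, whereas the definition quantifies over all of them.

```python
GAMMA = 0.9  # assumed discount factor

def complete_graph(n):
    """Successor map for K_n: every state can move to every other state."""
    if n == 1:
        return {0: [0]}  # assumption: the single state loops back to itself
    return {s: [t for t in range(n) if t != s] for s in range(n)}

def optimal_values(succ, reward, iters=500):
    """Value iteration for a deterministic graph with state-based rewards."""
    v = {s: 0.0 for s in succ}
    for _ in range(iters):
        v = {s: reward[s] + GAMMA * max(v[t] for t in succ[s]) for s in succ}
    return v

def has_opportunity_cost(succ, reward, s, tol=1e-9):
    """True if two child states of s have different optimal values under this reward."""
    v = optimal_values(succ, reward)
    child_values = [v[t] for t in succ[s]]
    return max(child_values) - min(child_values) > tol

k3 = complete_graph(3)
print(has_opportunity_cost(k3, {0: 0.0, 1: 1.0, 2: 0.0}, s=0))  # True
print(has_opportunity_cost(k3, {0: 1.0, 1: 1.0, 2: 1.0}, s=0))  # False
print(has_opportunity_cost(complete_graph(1), {0: 1.0}, s=0))   # False
```

The first line exhibits a reward function for which opportunity cost exists in K_3; the second shows that a particular reward function can fail to exhibit it, which is why the definition only asks for some reward function.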
Instrumental Occam?
So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?

Justification has the downside of being wrong, a) if what you are arguing is wrong/false, b) can be wrong even if what you are arguing is right/true. That being said/Epistemic warnings concluded...


1. A more complicated policy:

  • Is harder to keep track of
  • Is harder to calculate*

2. We don't have the right policy.

  • Simpler goals, that pay out in terms that are useful even if plans change are preferred as a consequence of this uncertainty.
  • This is
... (read more)
1Alex Turner2y
Can you elaborate? Does this argue against instrumental convergence because it would paperclip itself?
[AN #83]: Sample-efficient deep learning with ReMixMatch
However, this paper wants the answers to actually be correct. Thus, they claim that for sufficiently complicated questions, since the debate can't reach the right answer, the debate isn't truth-seeking -- but in these cases, the answer is still in expectation more accurate than the answer the judge would come up with by themselves.

Truth-seeking: better than the answer the judge would have come up with by themself (how does this work? making an observation at random instead of choosing the observation that's recommended by the debate?)

Truth-finding: the truth is found.

2Rohin Shah2y
You have a prior; you choose to do the experiment with highest VOI to get a posterior, and then you choose the best answer given that posterior. I'm pretty sure I could calculate this for many of their scenarios.
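
The procedure described in this reply can be sketched for a toy case. Everything concrete here is an assumption for illustration: two hypotheses, two candidate experiments with binary outcomes, made-up likelihoods, and "value of information" taken to mean the gain in the probability of giving the correct answer.

```python
# Hypothetical prior over two candidate answers.
PRIOR = {"h0": 0.6, "h1": 0.4}

# Hypothetical experiments: P(outcome | hypothesis) for outcomes 0 and 1.
EXPERIMENTS = {
    "A": {"h0": [0.9, 0.1], "h1": [0.2, 0.8]},  # informative
    "B": {"h0": [0.5, 0.5], "h1": [0.5, 0.5]},  # uninformative
}

def expected_accuracy(likelihood):
    """Probability of answering correctly if we pick the posterior mode after observing."""
    n_outcomes = len(next(iter(likelihood.values())))
    return sum(
        max(PRIOR[h] * likelihood[h][o] for h in PRIOR)
        for o in range(n_outcomes)
    )

baseline = max(PRIOR.values())  # accuracy when answering from the prior alone
scores = {name: expected_accuracy(lik) for name, lik in EXPERIMENTS.items()}
best = max(scores, key=scores.get)

print(best, round(scores[best] - baseline, 3))  # A 0.26
```

The baseline plays the role of the answer the judge would reach without any observation, so any experiment scoring above it counts as an improvement in expectation.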
[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning
Neither classical adversarial training nor training on a version of ImageNet designed to reduce the reliance on texture helps a lot, but modifying the network architecture can increase the accuracy on ImageNet-A from around 5% to 15%.

(Section link.)

Wow, 15% sounds really low. How well do people perform on said dataset?

This reminds me of:

https://www.lesswrong.com/posts/s4mqFdgTfsjfwGFiQ/who-s-an-unusual-thinker-that-you-recommend-following#9m26yMR9TtbxCK7m5

David Ha's most recent paper, Weight Agnostic Neural Networks, looks at what happens when you do a
... (read more)
1Rohin Shah3y
Given that there was a round of manual review, I would expect human accuracy to be over 80% and probably over 90%.
1Matthew Barnett3y
You can download the dataset here [https://github.com/hendrycks/natural-adv-examples] and see how well you can classify them yourself.
"Designing agent incentives to avoid reward tampering", DeepMind
I feel like this same set of problems gets re-solved a lot. I'm worried that it's a sign of ill health for the field.

Maybe the problem is getting everyone on the same page.

4Tom Everitt3y
Yes, that is partly what we are trying to do here. By summarizing some of the "folklore" in the community, we'll hopefully be able to get new members up to speed quicker.
Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4)
This isn't quite embedded agency, but it requires the base optimizer to be "larger" than the mesa-optimizer, only allowing mesa-suboptimizers, which is unlikely to be guaranteed in general.

Size might be easier to handle if some parts of the design are shared. For example, if the mesa-optimizer's design was the same as the agent, and the agent understood itself, and knew the mesa-optimizer's design, then it seems like them being the same size wouldn't be (as much of) an issue.

Principal optimization failures occur either if the
... (read more)
1David Manheim3y
I really like this point. I think it's parallel to the human issue where different models of the world can lead to misinterpretation of the "same" goal. So "terminology issues" would include, for example, two different measurements of what we would assume is the same quantity. If the base optimizer is looking to set the temperature and using a wall-thermometer, while the mesa-optimizer is using one located on the floor, the mesa-optimizer might be mis-aligned because it interprets "temperature" as referring to a different fact than the base-optimizer. On the other hand, when the same metric is being used by both parties, the class of possible mistakes does not include what we're now calling terminology issues. I think this also points to a fundamental epistemological issue, one even broader than goal-representation. It's possible that two models disagree on representation, but agree on all object level claims - think of using different coordinate systems. Because terminology issues can cause mistakes, I'd suggest that agents with non-shared world models can only reliably communicate via object-level claims. The implication for AI alignment might be that we need AI to either fundamentally model the world the same way as humans, or need to communicate only via object-level goals and constraints.
Deceptive Alignment

And wow, this turned out longer than I thought it would. It's in 6 sections:

1. Starting with models versus learning models.

2. Is the third condition for deceptive alignment necessary?

3. An alternative to, or form of, treacherous turn: Building a successor.

4. Time management: How deceptive alignment might not be a lot more computationally expensive, and why treacherous turns might have a time delay.

5. The model of a distributional shift, and its relationship to the model of training followed by deployment.

6. Miscellaneous


1.

The mesa-optimizer
... (read more)
5Evan Hubinger3y
This is a response to point 2 before Pattern's post was modified to include the other points. Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place. That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about O_mesa but doesn't care that it's around to do the optimization for it. Then, defecting and optimizing for O_mesa instead of O_base when you expect to be modified afterwards won't actually hurt the long-term fulfillment of O_base, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of O_base to a bit of O_mesa, that's not actually the choice presented to it; rather, its actual choice is between a bit of O_mesa and a lot of O_base versus only a lot of O_base. Thus, in this case, it would defect even if it thought it would be modified. (Also, on point 6, thanks for the catch; it should be fixed now!)
Subsystem Alignment

How could a (relatively) 'too-strong' epistemic subsystem be a bad thing?

So if we view an epistemic subsystem as a superintelligent agent who has control over the map and has the goal of making the map match the territory, one extreme failure mode is that it takes a hit to short term accuracy by slightly modifying the map in such a way as to trick the things looking at the map into giving the epistemic subsystem more control. Then, once it has more control, it can use it to manipulate the territory to make the territory more predictable. If your goal is to minimize surprise, you should destroy all the surprising things.

Note th... (read more)