# All of Gurkenglas's Comments + Replies

we fix the BOS token attention to 0.5

1.5 does conjunction without my dummy.

When at most one clause is true, 0.5 does disjunction instead.

Ah, yep, I think you're right -- it should be pretty easy to add support for and in selectors then.

(To show my idea compatible with Boolean attention.)

You're right that accounting for the softmax could be used to get around the argument in the appendix. We'll mention this when we update the paper. The scheme as you described it relies on an always-on dummy token, which would conflict with our implementation of default values in aggregates: we fix the BOS token attention to 0.5, so at low softmax temp it's attended to iff nothing else is; however with this scheme we'd want 0.5 to always round off to zero after softmax. This is plausibly surmountable but we put it out of scope for the pilot release since we didn't seem to need this feature, whereas default values come up pretty often. Also, while this construction will work for just and (with arbitrarily many conjuncts), I don't think it works for arbitrary compositions using and and or. (Of course it would be better than no composition at all.) In general, I expect there to be a number of potentially low-hanging improvements to be made to the compiler -- many of them we deliberately omitted and are mentioned in the limitations section, and many we haven't yet come up with. There are tons of features one could add, each of which takes time to think about and increases overall complexity, so we had to be judicious about which lines to pursue -- and even then, we barely had time to get into superposition experiments. We currently aren't prioritising further Tracr development until we see it used for research in practice, but I'd be excited to help anyone who's interested in working with it or contributing.

Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values {0,1000,2000}, the outputs will take only two values.
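As a rough illustration of the two-valued claim, here is a minimal sketch in plain Python. The `softargmax` helper is a hypothetical name chosen for this example and is not part of Tracr or RASP; it subtracts the maximum score before exponentiating so that large inputs do not overflow.

```python
import math

def softargmax(scores):
    """Numerically stable softmax: shift by the max before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores drawn from {0, 1000, 2000}, as in the comment above.
weights = softargmax([0, 1000, 2000, 1000, 2000])

# In floating point, exp(-1000) and exp(-2000) both underflow to 0.0,
# so the scores 0 and 1000 become indistinguishable after the softargmax.
distinct_values = set(weights)
```

In exact arithmetic the entries for 0 and 1000 are merely negligible rather than equal, which is the sense in which the outputs "take only two values".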

From the RASP paper: https://arxiv.org/pdf/2106.06981

The uniform weights dictated by our selectors reflect an attention pattern in which ‘unselected’ pairs are all given strongly negative scores, while the selected pairs all have higher, similar, scores.

Addition of such attention patterns corresponds to anding of such selectors.

1 Tom Lieberum 14d
(The quote refers to the usage of binary attention patterns in general, so I'm not sure why you're quoting it) I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries. iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup)

Isn't (select(a, b, ==) and select(c, d, ==)) just the sum of the two QK matrices? It'll produce NxN entries in {0,1,2}, and softargmax with low temp treats 1s like 0s in the presence of a dummy token's 2.
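The suggestion can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `select_eq`, `and_via_sum`, the dummy column at index 0 and the temperature value are all hypothetical choices for this example, not Tracr's or RASP's actual API.

```python
import math

def softargmax(scores, temperature):
    """Numerically stable softmax at the given temperature."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def select_eq(queries, keys):
    """0/1 score matrix: entry [i][j] is 1 iff queries[i] == keys[j]."""
    return [[1 if q == k else 0 for k in keys] for q in queries]

def and_via_sum(sel1, sel2, dummy_score=2, temperature=0.001):
    """Sum two 0/1 score matrices, prepend an always-on dummy column, softmax each row."""
    rows = []
    for r1, r2 in zip(sel1, sel2):
        summed = [x + y for x, y in zip(r1, r2)]  # entries in {0, 1, 2}
        rows.append(softargmax([dummy_score] + summed, temperature))
    return rows

a, b = [1, 2, 3], [1, 2, 2]
c, d = ["x", "y", "y"], ["x", "x", "y"]
sel_ab, sel_cd = select_eq(a, b), select_eq(c, d)

attn = and_via_sum(sel_ab, sel_cd)
# Drop the dummy column and ask which real positions receive any weight.
attended = [[w > 0 for w in row[1:]] for row in attn]
# Reference answer: elementwise Boolean AND of the two selectors.
expected = [[bool(x and y) for x, y in zip(r1, r2)] for r1, r2 in zip(sel_ab, sel_cd)]
```

In this toy run a real position gets nonzero weight exactly when both clauses hold; rows with no such position put all their weight on the dummy.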

1 Tom Lieberum 14d
I don't think that's right. Iiuc this is a logical and, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.

It seems important to clarify the difference between "From ⊢A, conclude ⊢□A" and "⊢□A→□□A". I don't feel like I get why we don't just set conclude := → and ⊢ := □.

1 Andrew Critch 1mo
Well, A→B is just short for ¬A∨B, i.e., "(not A) or B". By contrast, A⊢B means that there exists a sequence of (very mechanical) applications of modus ponens, starting from the axioms of Peano Arithmetic (PA) with A appended, ending in B. We tried hard to make the rules of ⊢ so that it would agree with → in a lot of cases (i.e., we tried to design ⊢ to make the deduction theorem true), but it took a lot of work in the design of Peano Arithmetic and can't be taken for granted. For instance, consider the statement ¬□(1=0). If you believe Peano Arithmetic is consistent, then you believe that ¬□(1=0), and therefore you also believe that □(1=0)→(2=3). But PA cannot prove that ¬□(1=0) (by Gödel's Theorem, or Löb's theorem with p=(1=0)), so we don't have ⊢□(1=0)→(2=3).

Similarly:

löb = □ (□ A → A) → □ A
□löb = □ (□ (□ A → A) → □ A)
□löb → löb:
löb premises □ (□ A → A).
By internal necessitation, □ (□ (□ A → A)).
By □löb, □ (□ A).
By löb's premise, □ A.
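Spelled out step by step, the argument above might look as follows. This is a sketch that assumes the usual distribution axiom K, □(X → Y) → (□X → □Y), alongside internal necessitation; the abbreviation P is introduced here for readability and is not in the original comment.

```latex
% Abbreviation (an assumption of this sketch): P := \Box A \to A,
% so that löb is \Box P \to \Box A and \Box löb is \Box(\Box P \to \Box A).
\begin{align*}
&1.\ \Box P && \text{premise of löb} \\
&2.\ \Box\Box P && \text{internal necessitation on 1} \\
&3.\ \Box(\Box P \to \Box A) && \Box\text{löb} \\
&4.\ \Box\Box P \to \Box\Box A && \text{distribution (K) on 3} \\
&5.\ \Box\Box A && \text{modus ponens, 2 and 4} \\
&6.\ \Box\Box A \to \Box A && \text{distribution (K) on 1, since } P = \Box A \to A \\
&7.\ \Box A && \text{modus ponens, 5 and 6}
\end{align*}
```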
1 Andrew Critch 2mo
Noice :)

The recursive self-improvement isn't necessarily human-out-of-the-loop: If an AGI comes up with simpler math, everything gets easier.

1 Donald Hobson 3mo
I have strong doubts this is actually any help. Ok, maybe the AI comes up with new simpler math. It's still new math that will take the humans a while to get used to. There will probably be lots of options to add performance and complexity. So either we are banning the extra complexity, using our advanced understanding, and just using the simpler math at the cost of performance, or the AI soon gets complex again before humans have understood the simple math.

It wouldn't just guess the next line of the proof, it'd guess the next conjecture. It could translate our math papers into proofs that compile, then write libraries that reduce code duplication. I expect canonical solutions to our confusions to fall out of a good enough such library.

1 Donald Hobson 6mo
I would expect such libraries to be a million lines of unenlightening trivialities. And the "libraries to reduce code duplication", to mostly contain theorems that were proved in multiple places.

Oh, I wasn't expecting you to have addressed the issue! 10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

You're right on all counts in your last paragraph.

1 Koen Holtman 1y
Not sure if a short answer will help, so I will write a long one. In 10.2.4 I talk about the possibility of an unwanted learned predictive function L−(s′,s,a) that makes predictions without using the argument a. This is possible for example by using s′ together with a (learned) model πl of the compute core to predict a: so a viable L− could be defined as L−(s′,s,a)=S(s′,s,πl(s)). This L− could make predictions fully compatible with the observational record o, but I claim it would not be a reasonable learned L according to the reasonableness criterion L≈S. How so? The reasonableness criterion L≈S is similar to that used in supervised machine learning: we evaluate the learned L not primarily by how it matches the training set (how well it predicts the observations in o), but by evaluating it on a separate test set. This test set can be constructed by sampling S to create samples not contained in o. Mathematically, perfect reasonableness is defined as L=S, which implies that L predicts all samples from S fully accurately. Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it of how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the S in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive machine learning of L, but another version can be used stand-alone to construct a test set. A sampling action to construct a member of the test set would set up a desired state s and action a, and then observe the resulting s′. Mathematically speaking, this observation gives additional information about the numeric value of S(s′,s,a) and of all S(s′′,s,a) for all s′′≠s′. I discuss in the section that, if we take an observational record o sampled from S, then two le

it is very interpretable to humans

Misunderstanding: I expect we can't construct a counterfactual planner because we can't pick out the compute core in the black-box learned model.

And my Eliezer's problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.

3 Koen Holtman 1y
Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core. But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI. I don't understand your second paragraph 'And my Eliezer's problem...'. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.

If you don't wish to reply to Eliezer, I'm an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world - my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.

2 Koen Holtman 1y
See above for my reply to Eliezer. Indeed, a counterfactual planner [https://www.lesswrong.com/posts/7EnZgaepSBwaZXA5y/counterfactual-planning-in-agi-systems] will plan coherently inside its planning world. In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand. An agent that plans coherently given a reward function Rp to maximize paperclips will be an incoherent planner if you judge its actions by a reward function Rs that values the maximization of staples instead. In section 6.3 of the paper [https://arxiv.org/abs/2102.00834] I show that you can perfectly well interpret a counterfactual planner as an agent that plans coherently even inside its learning world (inside the real world), as long as you are willing to evaluate its coherency according to the somewhat strange reward function Rπ. Armstrong's indifference methods use this approach to create corrigibility without losing coherency: they construct an equivalent somewhat strange reward function by including balancing terms. One thing I like about counterfactual planning is that, in my view, it is very interpretable to humans. Humans are very good at predicting what other humans will do, when these other humans are planning coherently inside a specifically incorrect world model, for example in a world model where global warming is a hoax. The same skill can also be applied to interpreting and anticipating the actions of AIs which are counterfactual planners. But maybe I am misunderstanding your concern about interpretability.

Someone needs to check if we can use ML to guess activations in one set of neurons from activations in another set of neurons. The losses would give straightforward estimates of such statistical quantities as mutual information. Generating inputs that have the same activations in a set of neurons illustrates what the set of neurons does. I might do this myself if nobody else does.
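A toy version of the loss-to-mutual-information step, as a hedged sketch: it replaces the ML model with ordinary least squares on two scalar "activations" and assumes they are jointly Gaussian, in which case I(X;Y) = ½·ln(Var(Y)/MSE) in nats for the best linear predictor. The function names and the synthetic data are assumptions made for this example; for real network activations the same formula would only be a rough proxy.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gaussian_mi_estimate(xs, ys):
    """Estimate I(X;Y) in nats from the regression loss, assuming joint Gaussianity."""
    slope, intercept = fit_line(xs, ys)
    n = len(ys)
    mean_y = sum(ys) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    mse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n
    return 0.5 * math.log(var_y / mse)

# Synthetic stand-ins for two neurons: y is x plus independent unit noise.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(20000)]
ys = [x + rng.gauss(0, 1) for x in xs]

mi_nats = gaussian_mi_estimate(xs, ys)
analytic_mi = 0.5 * math.log(2)  # exact value for this synthetic setup
```

The estimate here is computed in-sample; with a flexible predictor one would want a held-out set before reading the loss as an information estimate.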

2 Paul Christiano 1y
I'm not clear on what you'd do with the results of that exercise. Suppose that on a certain distribution of texts you can explain 40% of the variance in half of layer 7 by using the other half of layer 7 (and the % gradually increases as you make the activation-predicting-model bigger, perhaps you guess it's approaching 55% in the limit). What's the upshot of models being that predictable rather than more or less, or the use of the actual predictor that you learned? Given an input x, generating other inputs that "look the same as x" to part of the model but not other parts seems like it reveals something about what that part of the model does. As a component of interpretability research that seems pretty similar to doing feature visualization or selecting input examples that activate a given neuron, and I'd guess it would fit in the same way into the project of doing interpretability. I'd mostly be excited about people developing these techniques as part of a focused project to understand what models are thinking. I'm not really sure what to make of them in isolation.

Suppose the bridge is safe iff there's a proof that the bridge is safe. Then you would forbid the reasoning "Suppose I cross. I must have proven it's safe. Then it's safe, and I get 10. Let's cross.", which seems sane enough in the face of Löb.

Translating to a tree of natural language descriptions and back lets you

1. use different instances of GPT at each step, so that any coded messages would have to rely on the common training set rather than shared internals.
2. do a random subset of the tasks yourself, so that any subversion of the target system would have to be based on subtle biases, perhaps with a small number of outright lies that it hopes we don't implement manually, rather than the description being fabricated from whole cloth.

Having the thing write papers is merely an existence proof of...

1 Donald Hobson 2y
Maybe you did. I find it hard to distinguish inventing and half remembering ideas. If the training procedure either (1) reliably produces mesaoptimisers with about the same values or (2) reliably produces mesaoptimizers that can acausally cooperate, and (3) the rest of the procedure allows one mesaoptimizer to take control of the whole output, then using different copies of GPT-n trained from different seeds doesn't help. If you just convert 1% of the English into network yourself, then all it needs to use is some error correction. Even without that, neural net structures are somewhat redundant. A malicious optimizer with a random 1% removed is probably still a malicious optimizer. Secondly, its goal here is to give us subtly wrong alignment papers. Suppose we produce an examination and transparency step so thorough that no mesa optimizer can slip through. The original mesaoptimiser can just design a network that is doing its best as a mesaoptimiser free predictor, but has subtly maliciously chosen beliefs about AI research.

Suppose instead of a timeline with probabilistic events, the coalition experiences the full tree of all possible futures - but we translate everything to preserve behavior. Then beliefs encode which timelines each member cares about, and bets trade influence (governance tokens) between timelines.

2 Abram Demski 2y
Can you justify Kelly "directly" in terms of Pareto-improvement trades rather than "indirectly" through Pareto-optimality? I feel this gets at the distinction between the selfish vs altruistic view.

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence. All short terms ought to mean something, and when everything fits together better than expected, that is a sign that one is on the right track, and that there is a pattern to formalize. and  can be replaced by  and . I expect that the latter formation is better because it is shorter. Its only direct effect would be tha...

1 Koen Holtman 2y
OK, I think I see what inspired your question. If you want to give the math this kind of kabbalah treatment, you may also look at the math in [EFDH16], which produces agents similar to my definitions (4) (5), and also some variants that have different types of self-reflection. In the later paper here [https://arxiv.org/abs/1908.04734], Everitt et al. develop some diagrammatic models of this type of agent self-awareness, but the models are not full definitions of the agent. For me, the main question I have about the math developed in the paper is how exactly I can map the model and the constraints (C1-3) back to things I can or should build in physical reality. There is a thing going on here (when developing agent models, especially when treating AGI/superintelligence and embeddedness) that also often happens in post-Newtonian physics. The equations work, but if we attempt to map these equations to some prior intuitive mental model we have about how reality or decision making must necessarily work, we have to conclude that this attempt raises some strange and troubling questions. I'm with modern physics here (I used to be an experimental physicist for a while), where the (mainstream) response to this is that 'the math works, your intuitive feelings about how X must necessarily work are wrong, you will get used to it eventually'. BTW, I offer some additional interpretation of a difficult-to-interpret part of the math in section 10 of my 2020 paper here [https://arxiv.org/abs/2007.05411]. You could insert quantilization in several ways in the model. The most obvious way is to change the basic definition (4). You might also define a transformation that takes any reward function R and returns a quantilized reward function Rq; this gives you a different type of quantilization, but I feel it would be in the same spirit. In a more general sense, I do not feel that quantilization can produce the kind of corrigibility I am after in the paper. The effects you get o

I am not sure if I understand the question.

pi has form afV, V has form mfV, f is a long reused term. Expand recursion to get afmfmf... and mfmfmf.... Define E=fmE and you get pi=aE without writing f twice. Sure, you use V a lot but my intuition is that there should be some a priori knowable argument for putting the definitions your way or your theory is going to end up with the wrong prior.
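The rewriting can be checked on finite unrollings. The sketch below treats a, f and m as opaque symbols composed by juxtaposition, which is an assumption made purely for illustration (in the paper they stand for operators, not characters); the function names are hypothetical.

```python
def unroll_original(depth):
    """pi = a f V with V = m f V, unrolled `depth` times; 'V' marks the unexpanded tail."""
    v = "V"
    for _ in range(depth):
        v = "mf" + v
    return "af" + v

def unroll_rewritten(depth):
    """pi = a E with E = f m E, unrolled `depth` times; 'E' marks the unexpanded tail."""
    e = "E"
    for _ in range(depth):
        e = "fm" + e
    return "a" + e

original = unroll_original(3)    # a f m f m f m f V
rewritten = unroll_rewritten(3)  # a f m f m f m E
```

Apart from the unexpanded tail, both forms spell out the same alternating string a f m f m f ..., so the second definition mentions f only once without changing the expansion.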

1 Koen Holtman 2y
Thanks for expanding on your question about the use of V. Unfortunately, I still have a hard time understanding your question, so I'll say a few things and hope that will clarify. If you expand the V term defined in (5) recursively, you get a tree-like structure. Each node in the tree has as many sub nodes as there are elements in the set W. The tree is in fact a tree of branching world lines. Hope this helps you visualize what is going on. I could shuffle around some symbols and terms in the definitions (4) and (5) and still create a model of exactly the same agent that will behave in exactly the same way. So the exact way in which these two equations are written down and recurse on each other is somewhat contingent. My equations stay close to what is used when you model an agent or 'rational' decision making process with a Bellman equation. If your default mental model of an agent is a set of Q-learning equations, the model I develop will look strange, maybe even unnatural at first sight. OK, maybe this is the main point that inspired your question. The agency/world models developed in the paper are not a 'theory', in the sense that theories have predictive power. A mathematical model used as a theory, like f=m∗a, predicts how objects will accelerate when subjected to a force. The agent model in the paper does not really 'predict' how agents will behave. The model is compatible with almost every possible agent construction and agent behavior, if we are allowed to pick the agent's reward function R freely after observing or reverse-engineering the agent to be modeled. On purpose, the agent model is constructed with so many 'free parameters' that it has no real predictive power. What you get here is an agent model that can describe almost every possible agent and world in which it could operate. In mathematics, the technique I am using in the paper is sometimes called 'without loss of generality' [https://en.wikipedia.org/wiki/Without_loss_of_generality]. I am

Damn it, I came to write about the monad1 then saw the edit. You may want to add it to this list, and compare it with the other entries.

Here's a dissertation and blog post by Jared Tobin on using (X -> R) -> R with flat reals to represent usual distributions in Haskell. He appears open to being hired.
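For readers who have not seen the (X -> R) -> R encoding, here is a minimal sketch of the idea, written in Python rather than Haskell. A distribution is represented as a function that takes an integrand and returns its expectation. All names here (`dirac`, `categorical`, `bind`) are hypothetical and are not taken from Tobin's code.

```python
def dirac(x):
    """Point mass at x: integrating f against it just evaluates f(x)."""
    return lambda f: f(x)

def categorical(pairs):
    """Finite distribution from (outcome, probability) pairs."""
    return lambda f: sum(p * f(x) for x, p in pairs)

def bind(measure, kernel):
    """Monadic bind: draw x from `measure`, then integrate against kernel(x)."""
    return lambda f: measure(lambda x: kernel(x)(f))

coin = categorical([(0, 0.5), (1, 0.5)])

# Number of heads in two independent flips, built with bind and dirac.
two_flips = bind(coin, lambda first: bind(coin, lambda second: dirac(first + second)))

mean_heads = two_flips(lambda total: total)
prob_two_heads = two_flips(lambda total: 1.0 if total == 2 else 0.0)
```

Probabilities fall out as expectations of indicator functions, which is why the representation never needs to enumerate or sample outcomes explicitly.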

Maybe you want a more powerful type system? I think Coq allows constructing that subtype of a type which satisfies a property. Agda's cubical type theory places a lot of emphasis on the unit interval. Might dependent types be enough to express lipsch...

I like your "Corrigibility with Utility Preservation" paper. I don't get why you prefer not using the usual conditional probability notation.  leads to TurnTrout's attainable utility preservation. Why not use  in the definition of ? Could you change the definition of  to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure  is large and ...

1 Koen Holtman 2y
In general if you would forcefully change the agent's reward function into some R′, it will self-preserve from that moment on and try to maintain this R′, so it won't self-edit its R′ back into the original form. There are exceptions to this general rule, for special versions of R′ and special versions of agent environments (see section 7.2), where you can get the agent to self-edit, but on first glance, your example above does not seem to be one. If you remove the dntu bits from the agent definition then you can get an agent that self-edits a lot, but without changing its fundamental goals. The proofs of 'without changing its fundamental goals' will get even longer and less readable than the current proofs in the paper, so that is why I did the dntu privileging.
1 Koen Holtman 2y
Thanks! Well, I wrote in the paper (section 5) that I used p(s_t, a, s_{t+1}) instead of the usual conditional probability notation P(s_{t+1} | s_t, a) because it 'fits better with the mathematical logic style used in the definitions and proofs below.' i.e. the proofs use the mathematics of second order logic, not probability theory. However this was not my only reason for this preference. The other reason was that I had an intuitive suspicion back in 2019 that the use of conditional probability notation, in the then existing papers and web pages on balancing terms, acted as an impediment to mathematical progress. My suspicion was that it acted as an overly Bayesian framing that made it more difficult to clarify and generalize the mathematics of this technique any further. In hindsight in 2021, I can be a bit more clear about my 2019 intuition. Armstrong's original balancing term elements E(v|u→v) and E(u|u→u), where u→v and u→u are low-probability near-future events, can be usefully generalized (and simplified) as the Pearlian E(v|do(v)) and E(u|do(u)) where the do terms are interventions (or 'edits') on the current world state. The notation E(v|u→v) makes it look like the balancing terms might have some deep connection to Bayesian updating or Bayesian philosophy, whereas I feel they do not have any such deep connection. That being said, in my 2020 paper [https://arxiv.org/abs/2007.05411] I present a simplified version of the math in the 2019 paper using the traditional P(s_{t+1} | s_t, a) notation again, and without having to introduce do. Yes it is very related: I explore that connection in more detail in section 12 of my 2020 paper [https://arxiv.org/abs/2007.05411]. In general I think that counterfactual expected-utility reward function terms are a Swiss army knife with many interesting uses. I feel that as a community, we have not yet gotten to the bottom of their possibilities (and their possible failure modes). In definition of π∗ (section 5.3 equation 4) I am using a

I'm not convinced that we can do nothing if the human wants ghosts to be happy. The AI would simply have to do what would make ghosts happy if they were real. In the worst case, the human's (coherent extrapolated) beliefs are your only source of information on how ghosts work. Any proper general solution to the pointers problem will surely handle this case. Apparently, each state of the agent corresponds to some probability distribution over worlds.

3 Abram Demski 2y
This seems like it's only true if the humans would truly cling to their belief in spite of all evidence (IE if they believed in ghosts dogmatically), which seems untrue for many things (although I grant that some humans may have some beliefs like this). I believe the idea of the ghost example is to point at cases where there's an ontological crisis, not cases where the ontology is so dogmatic that there can be no crisis (though, obviously, both cases are theoretically important). However, I agree with you in either case -- it's not clear there's "nothing to be done" for the ghost case (in either interpretation).

with the exception of people who decided to gamble on being part of the elite in outcome B

Game-theoretically, there's a better way. Assume that after winning the AI race, it is easy to figure out everyone else's win probability, utility function and what they would do if they won. Human utility functions have diminishing returns, so there's opportunity for acausal trade. Human ancestry gives a common notion of fairness, so the bargaining problem is easier than with aliens.

Most of us care some even about those who would take all for themselves, so instead o...

1 Vanessa Kosoy 2y
Good point, acausal trade can at least ameliorate the problem, pushing towards atomic alignment. However, we understand acausal trade too poorly to be highly confident it will work. And, "making acausal trade work" might in itself be considered outside of the desiderata of atomic alignment (since it involves multiple AIs). Moreover, there are also actors that have a very low probability of becoming TAI users but whose support is beneficial for TAI projects (e.g. small donors). Since they have no counterfactual AI to bargain on their behalf, it is less likely acausal trade works here.

This all sounds reasonable. I just saw that you were arguing for more being learned at runtime (as some sort of Steven Reknip), and I thought that surely not all the salt machinery can be learnt, and I wanted to see which of those expectations would win.

2 Steve Byrnes 2y
(Oh, I get it, Reknip is Pinker backwards.) If you're interested in my take more generally see My Computational Framework for the Brain [https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain]. :-)

Do you posit that it learns over the course of its life that salt taste cures salt deficiency, or do you allow this information to be encoded in the genome?

3 Steve Byrnes 2y
I think the circuitry which monitors salt homeostasis, and which sends a reward signal when both salt is deficient and the neocortex starts imagining the taste of salt ... I think that circuitry is innate and in the genome. I don't think it's learned. That's just a guess from first principles: there's no reason it couldn't be innate, it's not that complicated, and it's very important to get the behavior right. (Trial-and-error learning of salt homeostasis would, I imagine, be often fatal.) I do think there are some non-neocortex aspects of food-related behavior that are learned—I'm thinking especially of how, if you eat food X, then get sick a few hours later, you develop a revulsion to the taste of food X. That's obviously learned, and I really don't think that this learning is happening in the neocortex. It's too specific. It's only one specific type of association, it has to occur in a specific time-window, etc. But I suspect that the subcortical systems governing salt homeostasis in particular are entirely innate (or at least mostly innate), i.e. not involving learning. Does that answer your question? Sorry if I'm misunderstanding.

The hypotheses after the modification are supposed to have knowledge that they're in training, for example because they have enough compute to find themselves in the multiverse. Among hypotheses with equal behavior in training, we select the simpler one. We want this to be the one that disregards that knowledge. If the hypothesis has form "Return whatever maximizes property _ of the multiverse", the simpler one uses that knowledge. It is this form of hypothesis which I suggest to remove by inspection.

1 johnswentworth 2y
Ok, that should work assuming something analogous to Paul's hypothesis about minimal circuits being daemon-free.

Take an outer-aligned system, then add a 0 to each training input and a 1 to each deployment input. Wouldn't this add only malicious hypotheses that can be removed by inspection without any adverse selection effects?

3 johnswentworth 2y
After thinking about it for a couple minutes, this question is both more interesting and less trivial than it seemed. The answer is not obvious to me. On the face of it, passing in a bit which is always constant in training should do basically nothing - the system has no reason to use a constant bit. But if the system becomes reflective (i.e. an inner optimizer shows up and figures out that it's in a training environment), then that bit could be used. In principle, this wouldn't necessarily be malicious - the bit could be used even by aligned inner optimizers, as data about the world just like any other data about the world. That doesn't seem likely with anything like current architectures, but maybe in some weird architecture which systematically produced aligned inner optimizers.

I don't buy the lottery example. You never encoded the fact that you know tomorrow's numbers. Shouldn't the prior be that you win a million guaranteed if you buy the ticket?

3 Abram Demski 2y
No! You also have to enter the right numbers. What I'm doing is modeling "gamble with the money" as a simple action - you can imagine there's a big red button that gives you $200 1/16th of the time and takes all your money otherwise. And then I'm modeling "buy a lotto ticket" as a compound action consisting of entering each number individually. "Knowing the numbers" means your world model understands that if you've entered the right numbers, you get the money. But it doesn't make "enter the right numbers" probable in the prior. Of course the conclusion is reversed if we make "enter the right numbers" into a primitive action.
1 Steve Byrnes 2y
I also didn't understand that. I was thinking of it more like AlphaStar in the sense that your prior is that you're going to continue using your current (probabilistic) policy for all the steps involved in what you're thinking about. (But not like AlphaStar in that the brain is more likely to do one-or-a-few-steps of rollout with clever hierarchical abstract representations of plans, rather than dozens-of-steps rollouts in a simple one-step-at-a-time way.)

It sounds like you want to use it as a component for alignment of a larger AI, which would somehow turn its natural-language directives into action. I say use it as the capability core: Ask it to do armchair alignment research. If we give it subjective time, a command line interface and internet access, I see no reason it would do worse than the rest of us.

1 Charlie Steiner 2y
In retrospect, I was totally unclear that I wasn't necessarily talking about something that has a complicated internal state, such that it can behave like one human over long time scales. I was thinking more about the "minimum human-imitating unit" necessary to get things like IDA off the ground. In fact this post was originally titled "What to do with a GAN of a human?"

The problem arises whenever the environment changes. Natural selection was a continual process, and yet humans still aren't fitness-aligned.

Re claim 1: If you let it use the page as a scratch pad, you can also let it output commands to a command line interface so it can outsource these hard-to-emulate calculations to the CPU.

2 David Manheim 3y
I'm unsure that GPT3 can output, say, an IPython notebook to get the values it wants. That would be really interesting to try...

It appears to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q also incorporate AUP. Then instead of comparing "Disable the Off-Switch, then achieve the random goal whatever the cost" to "Wait, then achieve the random goal whatever the cost", you would compare "Disable the Off-Switch, then achieve the random goal with low impact" to "Wait, then achieve the random goal with low impact".

The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn't see

...
1Alex Turner3y
This has been an idea I’ve been intrigued by ever since AUP came out. My main concern with it is the increase in compute required and loss of competitiveness. Still probably worth running the experiments. Correct. Proposition 4 in the AUP paper guarantees penalty invariance to affine transformation only if the denominator is also the penalty for taking some action (absolute difference in Q values). You could, for example, consider the penalty of some mild action: |Q(s, a_mild) − Q(s, ∅)|. It’s really up to the designer in the near-term. We’ll talk about more streamlined designs for superhuman use cases in two posts. Don’t think so. Moving generates tiny penalties, and going in circles usually isn’t a great way to accrue primary reward.
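To make the invariance point concrete, here is a minimal sketch with a single made-up auxiliary Q-function. The Q-values, the action names, and the choice of Q(s, ∅) as the "scaling term" that is not shift-invariant are all assumptions for illustration, not taken from the AUP paper or code.

```python
import math

def penalty(q, action, noop="noop"):
    """Unscaled penalty |Q(s, a) - Q(s, noop)| for one auxiliary Q-function."""
    return abs(q[action] - q[noop])

def penalty_scaled_by_value(q, action, noop="noop"):
    """Assumed scaling term: divide by the no-op's Q-value itself."""
    return penalty(q, action, noop) / abs(q[noop])

def penalty_scaled_by_mild_action(q, action, mild="mild", noop="noop"):
    """Scaling term that is itself a penalty: |Q(s, a_mild) - Q(s, noop)|."""
    return penalty(q, action, noop) / penalty(q, mild, noop)

def affine(q, scale, shift):
    """Apply u -> scale * u + shift to every Q-value (scale assumed positive)."""
    return {a: scale * v + shift for a, v in q.items()}

# Hypothetical Q-values for three actions in one state.
q = {"noop": 10.0, "mild": 10.5, "big": 14.0}
shifted = affine(q, 1.0, 7.0)    # add a constant to all utilities
rescaled = affine(q, 3.0, 7.0)   # general positive affine transformation

value_scaled_before = penalty_scaled_by_value(q, "big")
value_scaled_after = penalty_scaled_by_value(shifted, "big")
mild_scaled_before = penalty_scaled_by_mild_action(q, "big")
mild_scaled_after = penalty_scaled_by_mild_action(rescaled, "big")

print(value_scaled_before, value_scaled_after)  # these differ under the shift
print(mild_scaled_before, mild_scaled_after)    # these agree
```

In the penalty-over-penalty version the shift cancels in both numerator and denominator and the scale cancels in the ratio, which is the property the reply attributes to Proposition 4.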

Then that minimum does not make a good denominator because it's always extremely small. It will pick phi to be as powerful as possible to make L small, aka set phi to bottom. (If the denominator before that version is defined at all, bottom is a propositional tautology given A.)

1Diffractor3y
Oh, I see what the issue is. Propositional tautology given A means A ⊢_pc ϕ, not A ⊢ ϕ. So yeah, when A is a boolean that is equivalent to ⊥ via boolean logic alone, we can't use that A for the exact reason you said, but if A isn't equivalent to ⊥ via boolean logic alone (although it may be possible to infer ⊥ by other means), then the denominator isn't necessarily small.
a magma [with] some distinguished element

A monoid?

min,ϕ(A,ϕ⊢⊥) where ϕ is a propositional tautology given A

Propositional tautology given A means A⊢ϕ, right? So ϕ=⊥ would make L small.

1Diffractor3y
Yup, a monoid, because ϕ∨⊥=ϕ and A∪∅=A, so it acts as an identity element, and we don't care about the order. Nice catch. You're also correct about what propositional tautology given A means.
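As a quick sanity check of the "monoid" reading, here is a brute-force sketch over small carriers. The helper name `is_commutative_monoid` and the tiny example sets are hypothetical, chosen only to mirror ϕ∨⊥=ϕ and A∪∅=A.

```python
from itertools import product

def is_commutative_monoid(elements, op, identity):
    """Brute-force check of identity, associativity and commutativity on a finite set."""
    elements = list(elements)
    has_identity = all(
        op(x, identity) == x and op(identity, x) == x for x in elements
    )
    associative = all(
        op(op(x, y), z) == op(x, op(y, z))
        for x, y, z in product(elements, repeat=3)
    )
    commutative = all(
        op(x, y) == op(y, x) for x, y in product(elements, repeat=2)
    )
    return has_identity and associative and commutative

BOOLS = [False, True]
SUBSETS = [frozenset(s) for s in ([], [1], [2], [1, 2])]

# Disjunction with False (standing in for ⊥) as the distinguished element.
disjunction_ok = is_commutative_monoid(BOOLS, lambda p, q: p or q, False)
# Union with the empty set as the distinguished element.
union_ok = is_commutative_monoid(SUBSETS, lambda a, b: a | b, frozenset())
# Counterexample: False is not an identity for conjunction.
conjunction_with_false = is_commutative_monoid(BOOLS, lambda p, q: p and q, False)

print(disjunction_ok, union_ok, conjunction_with_false)
```

The last case is only there to show that the check can fail: a magma with an arbitrary distinguished element is not automatically a monoid.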

If it is capable of becoming more able to maximize its utility function, does it then not already have that ability to maximize its utility function? Do you propose that we reward it only for those plans that pay off after only one "action"?

1Alex Turner3y
Not quite. I'm proposing penalizing it for gaining power, a la my recent post [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/6DuJxY8X45Sco4bS2]. There's a big difference between "able to get 10 return from my current vantage point" and "I've taken over the planet and can ensure I get 100 return with high probability". We're penalizing it for increasing its ability like that (concretely, see Conservative Agency [https://arxiv.org/abs/1902.09725] for an analogous formalization, or if none of this makes sense still, wait till the end of Reframing Impact).

What do you mean by equivalent? The entire history doesn't say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.

1Vanessa Kosoy3y
For a fixed policy, the history is the only thing you need to know in order to simulate the agent on a given round. In this sense, seeing the history is equivalent to seeing the source code. The claim is: In settings where the agent has unlimited memory and sees the entire history or source code, you can't get good guarantees (as in the folk theorem for repeated games). On the other hand, in settings where the agent sees part of the history, or is constrained to have finite memory (possibly of size O(log(1/(1−γ)))?), you can (maybe?) prove convergence to Pareto efficient outcomes or some other strong desideratum that deserves to be called "superrationality".

The bottom left picture on page 21 in the paper shows that this is not just regularization coming through only after the error on the training set is ironed out: 0 regularization (1/lambda=inf) still shows the effect.

Can we switch to the interpolation regime early if we, before reaching the peak, tell it to keep the loss constant? Aka we are at loss l* and replace the loss function l(theta) with |l(theta)-l*| or (l(theta)-l*)^2.
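A minimal sketch of the proposed replacement, using the squared variant (l(theta)-l*)^2 on a toy least-squares problem. The data, the target level `L_STAR`, the learning rate and the step count are all made up for illustration; this uses plain gradient descent rather than SGD and says nothing about whether the trick helps generalization.

```python
# Toy data for fitting y ≈ a * x + b (hypothetical values).
XS = [0.0, 1.0, 2.0, 3.0, 4.0]
YS = [0.1, 0.9, 2.2, 2.8, 4.1]

def loss(theta):
    """Ordinary mean squared error l(theta)."""
    a, b = theta
    return sum((a * x + b - y) ** 2 for x, y in zip(XS, YS)) / len(XS)

def loss_grad(theta):
    """Gradient of l(theta) with respect to (a, b)."""
    a, b = theta
    n = len(XS)
    da = sum(2.0 * (a * x + b - y) * x for x, y in zip(XS, YS)) / n
    db = sum(2.0 * (a * x + b - y) for x, y in zip(XS, YS)) / n
    return (da, db)

def pinned_loss(theta, l_star):
    """Replacement objective (l(theta) - l*)^2, minimized wherever l(theta) = l*."""
    return (loss(theta) - l_star) ** 2

def pinned_grad(theta, l_star):
    """Chain rule: d/dtheta (l - l*)^2 = 2 * (l - l*) * dl/dtheta."""
    gap = loss(theta) - l_star
    da, db = loss_grad(theta)
    return (2.0 * gap * da, 2.0 * gap * db)

def descend(theta, l_star, lr=1e-3, steps=20000):
    """Plain gradient descent on the pinned objective."""
    for _ in range(steps):
        ga, gb = pinned_grad(theta, l_star)
        theta = (theta[0] - lr * ga, theta[1] - lr * gb)
    return theta

L_STAR = 1.0  # assumed target loss level, above the best achievable loss
theta_final = descend((0.0, 0.0), L_STAR)
final_loss = loss(theta_final)
print(theta_final, final_loss)
```

Because the gradient is scaled by the gap l(theta) - l*, updates vanish as the original loss approaches l*, so the parameters settle on the level set l(theta) = l* instead of continuing toward the minimum.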

2Isnasene3y
Interesting! Given that stochastic gradient descent (SGD) does provide an inductive bias towards models that generalize better, it does seem like changing the loss function in this way could enhance generalization performance. Broadly speaking, SGD's bias only provides a benefit when it is searching over many possible models: it performs badly at the interpolation threshold because the lowish complexity limits convergence to a small number of overfitted models. Creating a loss function that allows SGD more rein over the model it selects could therefore improve generalization. If (#1) SGD is inductively biased to more generalizable models in general, (#2) an (l(θ)−l∗)² loss function gives all models with loss near l∗ a wider local minimum, and (#3) there are many different models where l(θ)≈l∗ at a given level of complexity as long as l∗>0, then it's plausible that changing the loss function in this way will help emphasize SGD's bias towards models that generalize better. Point #1 is an explanation for double descent. Point #2 seems intuitive to me (it makes the loss function more convex and flatter when models are better performing) and Point #3 does too: there are many different sets of predictions that will all partially fit the training dataset and yield the same loss function value of l∗, which implies that there are also many different predictive models that yield such a loss function. To illustrate point #3 above, imagine we're trying to fit the set of training observations {x_1, x_2, x_3, ..., x_i, ..., x_n}. Fully overfitting this set (getting l(θ)≈0) requires us to get all x_i from 1 to n correct. However, we can partially overfit this set (getting l(θ)=l∗) in a variety of different ways. For instance, if we get all x_i correct except for x_j, we may have roughly n different ways we can pick x_j that could yield the same l(θ).[1] Consequently, our stochastic gradient descent process is free to apply its inductive bias to a broad set of models that have similar performance

You assume that one oracle outputting null implies that the other knows this. Specifying this in the query requires that the querier models the other oracle at all.

1Donald Hobson3y
Each oracle is running a simulation of the world. Within that simulation, they search for any computational process with the same logical structure as themselves. This will find both their virtual model of their own hardware, as well as any other agenty processes trying to predict them. The oracle then deletes the output of all these processes within its simulation. Imagine running a super realistic simulation of everything, except that any time anything in the simulation tries to compute the millionth digit of pi, you notice, pause the simulation and edit it to make the result come out as 7. While it might be hard to formally specify what counts as a computation, I think that this intuitively seems like meaningful behavior. I would expect the simulation to contain maths books that said that the millionth digit of pi was 7, and that were correspondingly off by one about how many 7s were in the first n digits for any n>1000000. The principle here is the same.

Our usual objective is "Make it safe, and if we aligned it correctly make it useful.". A microscope is useful even if it's not aligned, because having a world model is a convergent instrumental goal. We increase the bandwidth from it to us, but we decrease the bandwidth from us to it. By telling it almost nothing, we hide our position in the mathematical universe and any attack it devises cannot be specialized on humanity. Imagine finding the shortest-to-specify abstract game that needs AGI to solve (Nomic?), then instantiating an AGI to solve it just to l

...

As I understood it, an Oracle AI is asked a question and produces an answer. A microscope is shown a situation and constructs an internal model that we then extract by reading its innards. Oracles must somehow be incentivized to give useful answers, microscopes cannot help but understand.

4Ofer3y
A microscope model must also be trained somehow, for example with unsupervised learning. Therefore, I expect such a model to also look like it's "incentivized to give useful answers" (e.g. an answer to the question: "what is the next word in the text?"). My understanding is that what distinguishes a microscope model is the way it is being used after it's already trained (namely, allowing researchers to look at its internals for the purpose of gaining insights etcetera, rather than making inferences for the sake of using its valuable output). If this is correct, it seems that we should only use safe training procedures for the purpose of training useful microscopes, rather than training arbitrarily capable models.

I started asking for a chess example because you implied that the reasoning in the top-level comment stops being sane in iterated games.

In a simple iteration of Troll bridge, whether we're dumb is clear after the first time we cross the bridge. In a simple variation, the troll requires smartness even given past observations. In either case, the best worst-case utility bound requires never to cross the bridge, and A knows crossing blows A up. You seemed to expect more.

Suppose my chess skill varies by day. If my last few moves were dumb, I shouldn't rely on

...
1Abram Demski3y
Right, OK. I would say "sequential" rather than "iterated" -- my point was about making a weird assessment of your own future behavior, not what you can do if you face the same scenario repeatedly. IE: Troll Bridge might be seen as artificial in that the environment is explicitly designed to punish you if you're "dumb"; but, perhaps a sequential game can punish you more naturally by virtue of poor future choices. Yep, I agree with this. I concede the following points: * If there is a mistake in the troll-bridge reasoning, predicting that your next actions are likely to be dumb conditional on a dumb-looking action is not an example of the mistake. * Furthermore, that inference makes perfect sense, and if it is as analogous to the troll-bridge reasoning as I was previously suggesting, the troll-bridge reasoning makes sense. However, I still assert the following: * Predicting that your next actions are likely to be dumb conditional on a dumb looking action doesn't make sense if the very reason why you think the action looks dumb is that the next actions are probably dumb if you take it. IE, you don't have a prior heuristic judgement that a move is one which you make when you're dumb; rather, you've circularly concluded that the move would be dumb -- because it's likely to lead to a bad outcome -- because if you take that move your subsequent moves are likely to be bad -- because it is a dumb move. I don't have a natural setup which would lead to this, but the point is that it's a crazy way to reason rather than a natural one. The question, then, is whether the troll-bridge reasoning is analogous to this. I think we should probably focus on the probabilistic case (recently added to the OP), rather than the proof-based agent. I could see myself deciding that the proof-based agent is more analogous to the sane case than the crazy one. But the probabilistic case seems completely wrong. In the proof-based case, the question is: do we see th

If I'm a poor enough player that I merely have evidence, not proof, that the queen move mates in four, then the heuristic that queen sacrifices usually don't work out is fine and I might use it in real life. If I can prove that queen sacrifices don't work out, the reasoning is fine even for a proof-requiring agent. Can you give a chesslike game where some proof-requiring agent can prove from the rules and perhaps the player source codes that queen sacrifices don't work out, and therefore scores worse than some other agent would have? (Perhaps through mechanisms as in Troll bridge.)

1Abram Demski3y
The heuristic can override mere evidence, agreed. The problem I'm pointing at isn't that the heuristic is fundamentally bad and shouldn't be used, but rather that it shouldn't circularly reinforce its own conclusion by counting a hypothesized move as differentially suggesting you're a bad player in the hypothetical where you make that move. Thinking that way seems contrary to the spirit of the hypothetical (whose purpose is to help evaluate the move). It's fine for the heuristic to suggest things are bad in that hypothetical (because you heuristically think the move is bad); it seems much more questionable to suppose that your subsequent moves will be worse in that hypothetical, particularly if that inference is a lynchpin of your overall negative assessment of the move. What do you want out of the chess-like example? Is it enough for me to say the troll could be the other player, and the bridge could be a strategy which you want to employ? (The other player defeats the strategy if they think you did it for a dumb reason, and they let it work if they think you did it smartly, and they know you well, but you don't know whether they think you're dumb, but you do know that if you were being dumb then you would use the strategy.) This can be exactly troll bridge as stated in the post, but set in chess with player source code visible. I'm guessing that's not what you want, but I'm not sure what you want.

Correct. I am trying to pin down exactly what you mean by an agent controlling a logical statement. To that end, I ask whether an agent that takes an action iff a statement is true controls the statement through choosing whether to take the action. ("The Killing Curse doesn't crack your soul. It just takes a cracked soul to cast.")

Perhaps we could equip logic with a "causation" preorder such that all tautologies are equivalent, causation implies implication, and whenever we define an agent, we equip its control

...
1Abram Demski3y
The point here is that the agent described is acting like EDT is supposed to -- it is checking whether its action implies X. If yes, it is acting as if it controls X in the sense that it is deciding which action to take using those implications. I'm not arguing at all that we should think "implies X" is causal, nor even that the agent has opinions on the matter; only that the agent seems to be doing something wrong, and one way of analyzing what it is doing wrong is to take a CDT stance and say "the agent is behaving as if it controls X" -- in the same way that CDT says to EDT "you are behaving as if correlation implies causation" even though EDT would not assent to this interpretation of its decision. I think you have me the wrong way around; I was suggesting that certain causally-backwards reasoning would be unwise in chess, not the reverse. In particular, I was suggesting that we should not judge a move poor because we think the move is something only a poor player would do, but always the other way around. For example, suppose we have a prior on moves which suggests that moving a queen into danger is something only a poor player would do. Further suppose we are in a position to move our queen into danger in a way which forces checkmate in 4 moves. I'm saying that if we reason "I could move my queen into danger to open up a path to checkmate in 4. However, only poor players move their queen into danger. Poor players would not successfully navigate a checkmate-in-4. Therefore, if I move my queen into danger, I expect to make a mistake costing me the checkmate in 4. Therefore, I will not move my queen into danger." That's an example of the mistake I was pointing at. Note: I do not personally endorse this as an argument for CDT! I am expressing these arguments because it is part of the significance of Troll Bridge. I think these arguments are the kinds of things one should grapple with if one is grappling with Troll Bridge. I have defended EDT from these kinds of

Troll Bridge is a rare case where agents that require proof to take action can prove they would be insane to take some action before they've thought through its consequences. Can you show how they could unwisely do this in chess, or some sort of Troll Chess?

I don't see how this agent seems to control his sanity. Does the agent who jumps off a roof iff he can (falsely) prove it wise choose whether he's insane by choosing whether he jumps?

I don't see how this agent seems to control his sanity.

The agent in Troll Bridge thinks that it can make itself insane by crossing the bridge. (Maybe this doesn't answer your question?)

Troll Bridge is a rare case where agents that require proof to take action can prove they would be insane to take some action before they've thought through its consequences. Can you show how they could unwisely do this in chess, or some sort of Troll Chess?

I make no claim that this sort of case is common. Scenarios where it comes up and is relevant to X-risk ...

Nothing stops the Halting problem being solved in particular instances. I can prove that some agent halts, and so can it. See FairBot in Robust Cooperation in the Prisoner's Dilemma.

Written slightly differently, the reasoning seems sane: Suppose I cross. I must have proven it's a good idea. Aka I proved that I'm consistent. Aka I'm inconsistent. Aka the bridge blows up. Better not cross.

I agree with your English characterization, and I also agree that it isn't really obvious that the reasoning is pathological. However, I don't think it is so obviously sane, either.

• It seems like counterfactual reasoning about alternative actions should avoid going through "I'm obviously insane" in almost every case; possibly in every case. If you think about what would happen if you made a particular chess move, you need to divorce the consequences from any "I'm obviously insane in that scenario, so the rest of my moves i
...

Conjecture: Every short proof of agentic behavior points out agentic architecture.

Aren't they just averaging together to yield yet another somewhat-but-not-quite-right function?

Indeed we don't want such linear behavior. The AI should preserve the potential for maximization of any candidate utility function - first so it has time to acquire all the environment's evidence about the utility function, and then for the hypothetical future scenario of us deciding to shut it off.

1Abram Demski4y
See this comment. [https://www.lesswrong.com/posts/YJq6R9Wgk5Atjx54D/does-bayes-beat-goodhart#cxBtJQZPN2szwdd6F] Stuart and I are discussing what happens after things have converged as much as they're going to [https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy#fhpmnzMqLxQsiE7CW] , but there's still uncertainty left.

How do you know MCTS doesn't preserve alignment?

3Ofer4y
As I understand it - MCTS is used to maximize a given computable utility function, and so it is non alignment-preserving in the general sense that a sufficiently strong optimization of a non-perfect utility function is non alignment-preserving.

So you want to align the AI with us rather than its user by choosing the alignment approach it uses. If it's corrigible towards its user, won't it acquire the capabilities of the other approach in short order to better serve its user? Or is retrofitting the other approach also a blind spot of your proposed approach?

1Wei Dai4y
Yes, that seems like an issue. That's one possible solution. Another one might be to create an aligned AI that is especially good at coordinating with other AIs, so that these AIs can make an agreement with each other to not develop nuclear weapons before they invent the AI that is especially good at developing nuclear weapons. (But would corrigibility imply that the user can always override such agreements?) There may be other solutions that I'm not thinking of. If all else fails, it may be that the only way to avoid AI-caused differential intellectual progress in a bad direction is to stop the development of AI.

Reading the link and some reference abstracts, I think my last comment already had that in mind. The idea here is that a certain kind of AI would accelerate a certain kind of progress more than another, because of the approach we used to align it, and on reflection we would not want this. But surely if it is aligned, and therefore corrigible, this should be no problem?

2Wei Dai4y
Here's a toy example that might make the idea clearer. Suppose we lived in a world that hasn't invented nuclear weapons yet, and someone creates an aligned AI that is really good at developing nuclear weapon technology and only a little bit better than humans on everything else. Even though everyone would prefer that nobody develops nuclear weapons, the invention of this aligned AI (if more than one nation had access to it, and "aligned" means aligned to the user) would accelerate the development of nuclear weapons relative to every other kind of intellectual progress and thereby reduce the expected value of the universe. Does that make more sense now?

Please reword your last idea. There is a possible aligned AI that is biased in its research and will ignore people telling it so?

2Wei Dai4y
I think that section will only make sense if you're familiar with the concept of differential intellectual progress. The wiki page I linked to is a bit outdated, so try https://concepts.effectivealtruism.org/concepts/differential-progress/ and its references instead.