(To show my idea is compatible with Boolean attention.)

I use a single head, and the ranks add up linearly.

113d

You're right that accounting for the softmax could be used to get around the
argument in the appendix. We'll mention this when we update the paper.
The scheme as you described it relies on an always-on dummy token, which would
conflict with our implementation of default values in aggregates: we fix the BOS
token attention to 0.5, so at low softmax temp it's attended to iff nothing else
is; however with this scheme we'd want 0.5 to always round off to zero after
softmax. This is plausibly surmountable but we put it out of scope for the pilot
release since we didn't seem to need this feature, whereas default values come
up pretty often.
Also, while this construction will work for just and (with arbitrarily many
conjuncts), I don't think it works for arbitrary compositions using and and or.
(Of course it would be better than no composition at all.)
In general, I expect there to be a number of potentially low-hanging
improvements to be made to the compiler -- many of them we deliberately omitted
and are mentioned in the limitations section, and many we haven't yet come up
with. There are tons of features one could add, each of which takes time to
think about and increases overall complexity, so we had to be judicious about
which lines to
pursue -- and even then, we barely had time to get into superposition
experiments. We currently aren't prioritising further Tracr development until we
see it used for research in practice, but I'd be excited to help anyone who's
interested in working with it or contributing.

Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values {0,1000,2000}, the outputs will take only two values.

From the RASP paper: https://arxiv.org/pdf/2106.06981

The uniform weights dictated by our selectors reflect an attention pattern in which ‘unselected’ pairs are all given strongly negative scores, while the selected pairs all have higher, similar, scores.

Addition of such attention patterns corresponds to anding of such selectors.

114d

(The quote refers to the usage of binary attention patterns in general, so I'm
not sure why you're quoting it)
I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0
and 1 entries.
iiuc, the statement in the tracr paper is not that you can't have attention
patterns which implement this logical operation, but that you can't have a
single head implementing this attention pattern (without exponential blowup)

Isn't (select(a, b, ==) and select(c, d, ==)) just the sum of the two QK matrices? It'll produce NxN entries in {0,1,2}, and softargmax with low temp treats 1s like 0s in the presence of a dummy token's 2.
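The low-temperature claim is easy to check numerically. A minimal sketch, assuming the scheme above (summed Boolean selector scores in {0, 1, 2} plus an always-on dummy token fixed at 2; the specific selector patterns are made up):

```python
import math

def softmax(scores, temp=1.0):
    # numerically stable softmax at temperature `temp`
    scaled = [s / temp for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Two Boolean selector score rows for one query position (1 = selected).
sel_1 = [1, 0, 1, 1]
sel_2 = [1, 1, 0, 1]
summed = [a + b for a, b in zip(sel_1, sel_2)]  # entries in {0, 1, 2}
scores = summed + [2]                           # always-on dummy token at 2

weights = softmax(scores, temp=1e-3)  # low temperature
```

At low temperature the weight concentrates uniformly on the maximal score: the positions where both selectors fire share attention with the dummy token, while positions scoring 1 get essentially zero weight, i.e. the 1s round off like 0s.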

114d

I don't think that's right. Iiuc this is a logical and, so the values would be
in {0, 1} (as required, since tracr operates with Boolean attention). For a more
extensive discussion of the original problem see appendix C.

It seems important to clarify the difference between `From ⊢A, conclude ⊢□A`
and `⊢□A→□□A`. I don't feel like I *get* why we don't just set `conclude := →`
and `⊢ := □`.

11mo

Well, A→B is just short for ¬A∨B, i.e., "(not A) or B". By contrast, A⊢B means
that there exists a sequence of (very mechanical) applications of modus ponens,
starting from the axioms of Peano Arithmetic (PA) with A appended, ending in B.
We tried hard to make the rules of ⊢ so that it would agree with → in a lot of
cases (i.e., we tried to design ⊢ to make the deduction theorem true), but it
took a lot of work in the design of Peano Arithmetic and can't be taken for
granted.
For instance, consider the statement ¬□(1=0). If you believe Peano Arithmetic
is consistent, then you believe that ¬□(1=0), and therefore you also believe
that □(1=0)→(2=3). But PA cannot prove ¬□(1=0) (by Gödel's Theorem, or Löb's
theorem with p=(1=0)), so we don't have ⊢□(1=0)→(2=3).

Similarly:

```
löb = □ (□ A → A) → □ A
□löb = □ (□ (□ A → A) → □ A)
□löb -> löb:
löb premises □ (□ A → A).
By internal necessitation, □ (□ (□ A → A)).
By □löb, □ (□ A).
By löb's premise, □ A.
```

12mo

Noice :)

The recursive self-improvement isn't necessarily human-out-of-the-loop: If an AGI comes up with simpler math, everything gets easier.

13mo

I have strong doubts this is actually any help. Ok, maybe the AI comes up with
new simpler math. It's still new math that will take the humans a while to get
used to. There will probably be lots of options to add performance and
complexity. So either we are banning the extra complexity, using our advanced
understanding, and just using the simpler math at the cost of performance, or
the AI soon gets complex again before humans have understood the simple math.

It wouldn't just guess the next line of the proof, it'd guess the next conjecture. It could translate our math papers into proofs that compile, then write libraries that reduce code duplication. I expect canonical solutions to our confusions to fall out of a good enough such library.

16mo

I would expect such libraries to be a million lines of unenlightening
trivialities. And the "libraries to reduce code duplication", to mostly contain
theorems that were proved in multiple places.

Oh, I wasn't expecting you to have addressed the issue! 10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

You're right on all counts in your last paragraph.

11y

Not sure if a short answer will help, so I will write a long one.
In 10.2.4 I talk about the possibility of an unwanted learned predictive
function L−(s′,s,a) that makes predictions without using the argument a. This is
possible for example by using s′ together with a (learned) model πl of the
compute core to predict a: so a viable L− could be defined as
L−(s′,s,a)=S(s′,s,πl(s)). This L− could make predictions fully compatible with
the observational record o, but I claim it would not be a reasonable learned L
according to the reasonableness criterion L≈S. How so?
The reasonableness criterion L≈S is similar to that used in supervised machine
learning: we evaluate the learned L not primarily by how it matches the training
set (how well it predicts the observations in o), but by evaluating it on a
separate test set. This test set can be constructed by sampling S to create
samples not contained in o. Mathematically, perfect reasonableness is defined as
L=S, which implies that L predicts all samples from S fully accurately.
Philosophically/ontologically speaking, the agent specification in my paper,
specifically the learning world diagram and the descriptive text around it of
how this diagram is a model of reality, gives the engineer an unambiguous
prescription of how they might build experimental equipment that can measure the
properties of the S in the learning world diagram by sampling reality. A version
of this equipment must of course be built into the agent, to create the
observations that drive machine learning of L, but another version can be used
stand-alone to construct a test set.
A sampling action to construct a member of the test set would set up a desired
state s and action a, and then observe the resulting s′. Mathematically
speaking, this observation gives additional information about the numeric value
of S(s′,s,a) and of all S(s′′,s,a) for all s′′≠s′.
I discuss in the section that, if we take an observational record o sampled from
S, then two le

it is very interpretable to humans

Misunderstanding: I expect we can't construct a counterfactual planner because we can't pick out the compute core in the black-box learned model.

And my Eliezer's problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.

31y

Agree it is hard to pick the compute core out of a black-box learned model that
includes the compute core.
But one important point I am trying to make in the counterfactual planning
sequence/paper is that you do not have to solve that problem. I show that it is
tractable to route around it, and still get an AGI.
I don't understand your second paragraph 'And my Eliezer's problem...'. Can you
unpack this a bit more? Do you mean that counterfactual planning does not
automatically solve the problem of cleaning up an already in-progress mess when
you press the emergency stop button too late? It does not intend to, and I do
not think that the cleanup issue is among the corrigibility-related problems
Eliezer has been emphasizing in the discussion above.

If you don't wish to reply to Eliezer, I'm an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world - my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.

21y

See above for my reply to Eliezer.
Indeed, a counterfactual planner
[https://www.lesswrong.com/posts/7EnZgaepSBwaZXA5y/counterfactual-planning-in-agi-systems]
will plan coherently inside its planning world.
In general, when you want to think about coherence without getting deeply
confused, you need to keep track of what reward function you are using to rule
on your coherency criterion. I don't see that fact mentioned often on this
forum, so I will expand.
An agent that plans coherently given a reward function Rp to maximize paperclips
will be an incoherent planner if you judge its actions by a reward function Rs
that values the maximization of staples instead. In section 6.3 of the paper
[https://arxiv.org/abs/2102.00834] I show that you can perfectly well interpret
a counterfactual planner as an agent that plans coherently even inside its
learning world (inside the real world), as long as you are willing to evaluate
its coherency according to the somewhat strange reward function Rπ. Armstrong's
indifference methods use this approach to create corrigibility without losing
coherency: they construct an equivalent somewhat strange reward function by
including balancing terms.
One thing I like about counterfactual planning is that, in my view, it is very
interpretable to humans. Humans are very good at predicting what other humans
will do, when these other humans are planning coherently inside a specifically
incorrect world model, for example in a world model where global warming is a
hoax. The same skill can also be applied to interpreting and anticipating the
actions of AIs which are counterfactual planners. But maybe I am
misunderstanding your concern about interpretability.

Someone needs to check if we can use ML to guess activations in one set of neurons from activations in another set of neurons. The losses would give straightforward estimates of such statistical quantities as mutual information. Generating inputs that have the same activations in a set of neurons illustrates what the set of neurons does. I might do this myself if nobody else does.
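As a toy version of this exercise, here is a sketch with synthetic one-dimensional "activations" standing in for two sets of neurons; under a Gaussian assumption, the quality of the fit pins down a mutual-information estimate (the data and the Gaussian assumption are illustrative, not from the comment):

```python
import math
import random

random.seed(0)

# Synthetic "activations": neuron b is a noisy function of neuron a
# (stand-ins for activations in two different sets of neurons).
a = [random.gauss(0, 1) for _ in range(10000)]
b = [0.8 * x + random.gauss(0, 0.6) for x in a]

def mean(xs):
    return sum(xs) / len(xs)

def corr(xs, ys):
    # Pearson correlation, computed from scratch
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    sx = math.sqrt(mean([(x - mx) ** 2 for x in xs]))
    sy = math.sqrt(mean([(y - my) ** 2 for y in ys]))
    return cov / (sx * sy)

rho = corr(a, b)
# For jointly Gaussian variables, predictability translates directly into
# mutual information: I(a; b) = -0.5 * ln(1 - rho^2) nats.
mi_nats = -0.5 * math.log(1 - rho ** 2)
```

With real networks one would use the held-out loss of a learned activation-predicting model rather than a closed form, but the principle is the same: better predictability means a higher estimated mutual information.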

21y

I'm not clear on what you'd do with the results of that exercise. Suppose that
on a certain distribution of texts you can explain 40% of the variance in half
of layer 7 by using the other half of layer 7 (and the % gradually increases as
you make the activation-predicting model bigger, perhaps you guess it's
approaching 55% in the limit). What's the upshot of models being that
predictable rather than more or less, or the use of the actual predictor that
you learned?
Given an input x, generating other inputs that "look the same as x" to part of
the model but not other parts seems like it reveals something about what that
part of the model does. As a component of interpretability research that seems
pretty similar to doing feature visualization or selecting input examples that
activate a given neuron, and I'd guess it would fit in the same way into the
project of doing interpretability.
I'd mostly be excited about people developing these techniques as part of a
focused project to understand what models are thinking. I'm not really sure what
to make of them in isolation.

Suppose the bridge is safe iff there's a proof that the bridge is safe. Then you would forbid the reasoning "Suppose I cross. I must have proven it's safe. Then it's safe, and I get 10. Let's cross.", which seems sane enough in the face of Löb.

Translating to a tree of natural language descriptions and back lets you

- use different instances of GPT at each step, so that any coded messages would have to rely on the common training set rather than shared internals.
- do a random subset of the tasks yourself, so that any subversion of the target system would have to be based on subtle biases, perhaps with a small number of outright lies that it hopes we don't implement manually, rather than the description being fabricated from whole cloth.

Having the thing write papers is merely an existence proof of...

12y

Maybe you did. I find it hard to distinguish inventing and half remembering
ideas.
If the training procedure either
1. reliably produces mesaoptimizers with about the same values, or
2. reliably produces mesaoptimizers that can acausally cooperate, or
3. lets the rest of the procedure allow one mesaoptimizer to take control of
the whole output,
then using different copies of GPT-n trained from different seeds doesn't help.
If you just convert 1% of the English into network yourself, then all it needs
to use is some error correction. Even without that, neural net structures are
somewhat redundant. A malicious optimizer with a random 1% removed is probably
still a malicious optimizer.
Secondly, its goal here is to give us subtly wrong alignment papers. Suppose we
produce an examination and transparency step so thorough that no mesaoptimizer
can slip through. The original mesaoptimizer can just design a network that is
doing its best as a mesaoptimizer-free predictor, but has subtly maliciously
chosen beliefs about AI research.

Suppose instead of a timeline with probabilistic events, the coalition experiences the full tree of all possible futures - but we translate everything to preserve behavior. Then beliefs encode which timelines each member cares about, and bets trade influence (governance tokens) between timelines.

22y

Can you justify Kelly "directly" in terms of Pareto-improvement trades rather
than "indirectly" through Pareto-optimality? I feel this gets at the distinction
between the selfish vs altruistic view.
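For reference, the "indirect" story being contrasted here is the standard one: the Kelly fraction is whatever maximizes expected log growth. A quick numerical check (even-odds bet with an illustrative win probability):

```python
import math

def log_growth(f, p):
    # Expected log growth of wealth when betting fraction f of bankroll
    # on a biased coin with win probability p at even odds.
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)

p = 0.6
fracs = [i / 100 for i in range(0, 100)]
best = max(fracs, key=lambda f: log_growth(f, p))
# Kelly predicts f* = 2p - 1 = 0.2 for even odds.
```

The question above asks whether the same fraction can instead be derived "directly" from a sequence of Pareto-improving trades among the coalition members; this sketch only exhibits the indirect optimality criterion.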

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence. All short terms ought to mean something, and when everything fits together better than expected, that is a sign that one is on the right track, and that there is a pattern to formalize. and can be replaced by and . I expect that the latter formation is better because it is shorter. Its only direct effect would be tha...

12y

OK, I think I see what inspired your question.
If you want to give the math this kind of kabbalah treatment, you may
also look at the math in [EFDH16], which produces agents similar to my
definitions (4) (5), and also some variants that have different types of
self-reflection. In the later paper here [https://arxiv.org/abs/1908.04734],
Everitt et al. develop some diagrammatic models of this type of agent
self-awareness, but the models are not full definitions of the agent.
For me, the main questions I have about the math developed in the paper is how
exactly I can map the model and the constraints (C1-3) back to things I can or
should build in physical reality.
There is a thing going on here (when developing agent models, especially when
treating AGI/superintelligence and embeddeness) that also often happens in
post-Newtonian physics. The equations work, but if we attempt to map these
equations to some prior intuitive mental model we have about how reality or
decision making must necessarily work, we have to conclude that this attempt
raises some strange and troubling questions.
I'm with modern physics here (I used to be an experimental physicist for a
while), where the (mainstream) response to this is that 'the math works, your
intuitive feelings about how X must necessarily work are wrong, you will get
used to it eventually'.
BTW, I offer some additional interpretation of a difficult-to-interpret part of
the math in section 10 of my 2020 paper here [https://arxiv.org/abs/2007.05411].
You could insert quantilization in several ways in the model. Most obvious way
is to change the basic definition (4). You might also define a transformation
that takes any reward function R and returns a quantilized reward function Rq,
this gives you a different type of quantilization, but I feel it would be in the
same spirit.
In a more general sense, I do not feel that quantilization can produce the kind
of corrigibility I am after in the paper. The effects you get o

I am not sure if I understand the question.

pi has form afV, V has form mfV, f is a long reused term. Expand recursion to get afmfmf... and mfmfmf.... Define E=fmE and you get pi=aE without writing f twice. Sure, you use V a lot but my intuition is that there should be some a priori knowable argument for putting the definitions your way or your theory is going to end up with the wrong prior.

12y

Thanks for expanding on your question about the use of V. Unfortunately, I still
have a hard time understanding your question, so I'll say a few things and hope
that will clarify.
If you expand the V term defined in (5) recursively, you get a tree-like
structure. Each node in the tree has as many sub nodes as there are elements in
the set W. The tree is in fact a tree of branching world lines. Hope this helps
you visualize what is going on.
I could shuffle around some symbols and terms in the definitions (4) and (5) and
still create a model of exactly the same agent that will behave in exactly the
same way. So the exact way in which these two equations are written down and
recurse on each other is somewhat contingent. My equations stay close to what is
used when you model an agent or 'rational' decision making process with a
Bellman equation. If your default mental model of an agent is a set of
Q-learning equations, the model I develop will look strange, maybe even
unnatural at first sight.
OK, maybe this is the main point that inspired your question. The agency/world
models developed in the paper are not a 'theory', in the sense that theories
have predictive power. A mathematical model used as a theory, like f=m∗a,
predicts how objects will accelerate when subjected to a force.
The agent model in the paper does not really 'predict' how agents will behave.
The model is compatible with almost every possible agent construction and agent
behavior, if we are allowed to pick the agent's reward function R freely after
observing or reverse-engineering the agent to be modeled.
On purpose, the agent model is constructed with so many 'free parameters' that
it has no real predictive power. What you get here is an agent model that can
describe almost every possible agent and world in which it could operate.
In mathematics, the technique I am using in the paper is sometimes called
'without loss of generality'
[https://en.wikipedia.org/wiki/Without_loss_of_generality]. I am

Damn it, I came to write about the monad^{1} then saw the edit. You may want to add it to this list, and compare it with the other entries.

Here's a dissertation and blog post by Jared Tobin on using `(X -> R) -> R`

with flat reals to represent usual distributions in Haskell. He appears open to getting hired.
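The `(X -> R) -> R` encoding can be sketched outside Haskell too. Here is a hypothetical Python rendering, where a distribution is its own expectation functional (the names `unit`, `bind`, and `weighted` are mine, not Tobin's):

```python
# A distribution over X represented as an expectation functional:
# it takes f : X -> float and returns E[f(X)]  (the (X -> R) -> R encoding).

def unit(x):
    # point mass at x
    return lambda f: f(x)

def bind(dist, k):
    # sequence: "sample" x ~ dist, then continue with the distribution k(x)
    return lambda f: dist(lambda x: k(x)(f))

def weighted(pairs):
    # finite distribution from (probability, value) pairs
    return lambda f: sum(p * f(x) for p, x in pairs)

coin = weighted([(0.5, 0), (0.5, 1)])
two_flips = bind(coin, lambda a: bind(coin, lambda b: unit(a + b)))

mean_val = two_flips(lambda x: x)                     # E[a + b] = 1.0
p_two = two_flips(lambda x: 1.0 if x == 2 else 0.0)   # P(sum = 2) = 0.25
```

`bind` here is just the continuation monad's bind, which is what makes this encoding a monad of distributions.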

Maybe you want a more powerful type system? I think Coq allows constructing that subtype of a type which satisfies a property. Agda's cubical type theory places a lot of emphasis on the unit interval. Might dependent types be enough to express lipsch...

I like your "Corrigibility with Utility Preservation" paper. I don't get why you prefer not using the usual conditional probability notation. leads to TurnTrout's attainable utility preservation. Why not use in the definition of ? Could you change the definition of to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure is large and ...

12y

In general if you would forcefully change the agent's reward function into some
R′, it will self-preserve from that moment on and try to maintain this R′, so it
won't self-edit its R′ back into the original form.
There are exceptions to this general rule, for special versions of R′ and
special versions of agent environments (see section 7.2), where you can get the
agent to self-edit, but on first glance, your example above does not seem to be
one.
If you remove the dntu bits from the agent definition then you can get an agent
that self-edits a lot, but without changing its fundamental goals. The proofs of
'without changing its fundamental goals' will get even longer and less readable
than the current proofs in the paper, so that is why I did the dntu privileging.

12y

Thanks!
Well, I wrote in the paper (section 5) that I used p(st,a,st+1) instead of the
usual conditional probability notation P(st+1|st,a) because it 'fits better with
the mathematical logic style used in the definitions and proofs below', i.e. the
proofs use the mathematics of second order logic, not probability theory.
However this was not my only reason for this preference. The other reason was
that I had an intuitive suspicion back in 2019 that the use of conditional
probability notation, in the then existing papers and web pages on balancing
terms, acted as an impediment to mathematical progress. My suspicion was that
it acted as an overly Bayesian framing that made it more difficult to clarify
and generalize the mathematics of this technique any further.
In hindsight in 2021, I can be a bit more clear about my 2019 intuition.
Armstrong's original balancing term elements E(v|u→v) and E(u|u→u), where u→v
and u→u are low-probability near-future events, can be usefully generalized
(and simplified) as the Pearlian E(v|do(v)) and E(u|do(u)), where the do terms
are interventions (or 'edits') on the current world state.
The notation E(v|u→v) makes it look like the balancing terms might have some
deep connection to Bayesian updating or Bayesian philosophy, whereas I feel they
do not have any such deep connection.
That being said, in my 2020 paper [https://arxiv.org/abs/2007.05411] I present a
simplified version of the math in the 2019 paper using the traditional
P(st+1|st,a) notation again, and without having to introduce do.
Yes it is very related: I explore that connection in more detail in section 12
of my 2020 paper [https://arxiv.org/abs/2007.05411]. In general I think that
counterfactual expected-utility reward function terms are Swiss army knives
with many interesting uses. I feel that as a community, we have not yet gotten
to the bottom of their possibilities (and their possible failure modes).
In definition of π∗ (section 5.3 equation 4) I am using a

I'm not convinced that we can do nothing if the human wants ghosts to be happy. The AI would simply have to do what would make ghosts happy if they were real. In the worst case, the human's (coherent extrapolated) beliefs are your only source of information on how ghosts work. Any proper general solution to the pointers problem will surely handle this case. Apparently, each state of the agent corresponds to some probability distribution over worlds.

32y

This seems like it's only true if the humans would truly cling to their belief
in spite of all evidence (IE if they believed in ghosts dogmatically), which
seems untrue for many things (although I grant that some humans may have some
beliefs like this). I believe the idea of the ghost example is to point at cases
where there's an ontological crisis, not cases where the ontology is so dogmatic
that there can be no crisis (though, obviously, both cases are theoretically
important).
However, I agree with you in either case -- it's not clear there's "nothing to
be done" for the ghost case (in either interpretation).

with the exception of people who decided to gamble on being part of the elite in outcome B

Game-theoretically, there's a better way. Assume that after winning the AI race, it is easy to figure out everyone else's win probability, utility function and what they would do if they won. Human utility functions have diminishing returns, so there's opportunity for acausal trade. Human ancestry gives a common notion of fairness, so the bargaining problem is easier than with aliens.

Most of us care some even about those who would take all for themselves, so instead o...

12y

Good point, acausal trade can at least ameliorate the problem, pushing towards
atomic alignment. However, we understand acausal trade too poorly to be highly
confident it will work. And, "making acausal trade work" might in itself be
considered outside of the desiderata of atomic alignment (since it involves
multiple AIs). Moreover, there are also actors that have a very low probability
of becoming TAI users but whose support is beneficial for TAI projects (e.g.
small donors). Since they have no counterfactual AI to bargain on their behalf,
it is less likely acausal trade works here.

This all sounds reasonable. I just saw that you were arguing for more being learned at runtime (as some sort of Steven Reknip), and I thought that surely not all the salt machinery can be learnt, and I wanted to see which of those expectations would win.

22y

(Oh, I get it, Reknip is Pinker backwards.) If you're interested in my take more
generally see My Computational Framework for the Brain
[https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain]
. :-)

Do you posit that it learns over the course of its life that salt taste cures salt deficiency, or do you allow this information to be encoded in the genome?

32y

I think the circuitry which monitors salt homeostasis, and which sends a reward
signal when both salt is deficient and the neocortex starts imagining the taste
of salt ... I think that circuitry is innate and in the genome. I don't think
it's learned.
That's just a guess from first principles: there's no reason it couldn't be
innate, it's not that complicated, and it's very important to get the behavior
right. (Trial-and-error learning of salt homeostasis would, I imagine, be often
fatal.)
I do think there are some non-neocortex aspects of food-related behavior that
are learned—I'm thinking especially of how, if you eat food X, then get sick a
few hours later, you develop a revulsion to the taste of food X. That's
obviously learned, and I really don't think that this learning is happening in
the neocortex. It's too specific. It's only one specific type of association, it
has to occur in a specific time-window, etc.
But I suspect that the subcortical systems governing salt homeostasis in
particular are entirely innate (or at least mostly innate), i.e. not involving
learning.
Does that answer your question? Sorry if I'm misunderstanding.

The hypotheses after the modification are supposed to have knowledge that they're in training, for example because they have enough compute to find themselves in the multiverse. Among hypotheses with equal behavior in training, we select the simpler one. We want this to be the one that disregards that knowledge. If the hypothesis has form "Return whatever maximizes property _ of the multiverse", the simpler one uses that knowledge. It is this form of hypothesis which I suggest to remove by inspection.

12y

Ok, that should work assuming something analogous to Paul's hypothesis about
minimal circuits being daemon-free.

Take an outer-aligned system, then add a 0 to each training input and a 1 to each deployment input. Wouldn't this add only malicious hypotheses that can be removed by inspection without any adverse selection effects?

32y

After thinking about it for a couple minutes, this question is both more
interesting and less trivial than it seemed. The answer is not obvious to me.
On the face of it, passing in a bit which is always constant in training should
do basically nothing - the system has no reason to use a constant bit. But if
the system becomes reflective (i.e. an inner optimizer shows up and figures out
that it's in a training environment), then that bit could be used. In principle,
this wouldn't necessarily be malicious - the bit could be used even by aligned
inner optimizers, as data about the world just like any other data about the
world. That doesn't seem likely with anything like current architectures, but
maybe in some weird architecture which systematically produced aligned inner
optimizers.

I don't buy the lottery example. You never encoded the fact that you know tomorrow's numbers. Shouldn't the prior be that you win a million guaranteed if you buy the ticket?

32y

No! You also have to enter the right numbers.
What I'm doing is modeling "gamble with the money" as a simple action - you can
imagine there's a big red button that gives you $200 1/16th of the time and
takes all your money otherwise.
And then I'm modeling "buy a lotto ticket" as a compound action consisting of
entering each number individually.
"Knowing the numbers" means your world model understands that if you've entered
the right numbers, you get the money. But it doesn't make "enter the right
numbers" probable in the prior.
Of course the conclusion is reversed if we make "enter the right numbers" into
a primitive action.

12y

I also didn't understand that. I was thinking of it more like AlphaStar in the
sense that your prior is that you're going to continue using your current
(probabilistic) policy for all the steps involved in what you're thinking about.
(But not like AlphaStar in that the brain is more likely to do
one-or-a-few-steps of rollout with clever hierarchical abstract representations
of plans, rather than dozens-of-steps rollouts in a simple one-step-at-a-time
way.)

It sounds like you want to use it as a component for alignment of a larger AI, which would somehow turn its natural-language directives into action. I say use it as the capability core: Ask it to do armchair alignment research. If we give it subjective time, a command line interface and internet access, I see no reason it would do worse than the rest of us.

12y

In retrospect, I was totally unclear that I wasn't necessarily talking about
something that has a complicated internal state, such that it can behave like
one human over long time scales. I was thinking more about the "minimum
human-imitating unit" necessary to get things like IDA off the ground.
In fact this post was originally titled "What to do with a GAN of a human?"

The problem arises whenever the environment changes. Natural selection was a continual process, and yet humans still aren't fitness-aligned.

Re claim 1: If you let it use the page as a scratch pad, you can also let it output commands to a command line interface so it can outsource these hard-to-emulate calculations to the CPU.

23y

I'm unsure that GPT-3 can output, say, an IPython notebook to get the values it
wants.
That would be really interesting to try...

It appears to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q also incorporate AUP. Then instead of comparing "Disable the Off-Switch, then achieve the random goal whatever the cost" to "Wait, then achieve the random goal whatever the cost", you would compare "Disable the Off-Switch, then achieve the random goal with low impact" to "Wait, then achieve the random goal with low impact".

The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn't see

...

This has been an idea I’ve been intrigued by ever since AUP came out. My main
concern with it is the increase in compute required and loss of competitiveness.
Still probably worth running the experiments.
Correct. Proposition 4 in the AUP paper guarantees penalty invariance to affine
transformation only if the denominator is also the penalty for taking some
action (absolute difference in Q values). You could, for example, consider the
penalty of some mild action: |Q(s, a_mild) − Q(s, ∅)|. It's really up to the designer
in the near-term. We’ll talk about more streamlined designs for superhuman use
cases in two posts.
Don’t think so. Moving generates tiny penalties, and going in circles usually
isn’t a great way to accrue primary reward.
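The affine-invariance point is easy to check numerically. A minimal sketch, assuming Q values transform affinely when the utility does (the action names and numbers are made up):

```python
# Toy check: a penalty normalized by |Q(s, a_mild) - Q(s, noop)| is
# invariant to positive affine transformations u -> c*u + b of the
# utility, because the Q values inherit the transform: the shift b
# cancels inside each absolute difference, and the scale c cancels
# in the ratio.

def penalty(Q, a, a_mild="mild", noop="noop"):
    return abs(Q[a] - Q[noop]) / abs(Q[a_mild] - Q[noop])

Q = {"disable_off_switch": 10.0, "mild": 2.5, "noop": 2.0}

# Apply an affine transform u -> 3*u + 7 to every Q value.
Q_affine = {a: 3 * q + 7 for a, q in Q.items()}

p1 = penalty(Q, "disable_off_switch")
p2 = penalty(Q_affine, "disable_off_switch")
assert abs(p1 - p2) < 1e-9  # normalized penalty unchanged
```

This is why the mild-action denominator restores the invariance that Proposition 4 requires.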

Then that minimum does not make a good denominator because it's always extremely small. It will pick ϕ to be as powerful as possible to make L small, aka set ϕ to ⊥. (If the denominator before that version is defined at all, ⊥ is a propositional tautology given A.)


Oh, I see what the issue is. Propositional tautology given A means A ⊢_pc ϕ
(provable from A by propositional calculus alone), not A ⊢ ϕ. So yeah, when A is
a boolean that is equivalent to ⊥ via boolean logic alone,
we can't use that A for the exact reason you said, but if A isn't equivalent to
⊥ via boolean logic alone (although it may be possible to infer ⊥ by other
means), then the denominator isn't necessarily small.

a magma [with] some distinguished element

A monoid?

min_ϕ(A, ϕ ⊢ ⊥) where ϕ is a propositional tautology given A

Propositional tautology given A means A⊢ϕ, right? So ϕ=⊥ would make L small.


Yup, a monoid, because ϕ∨⊥=ϕ and A∪∅=A, so it acts as an identity element, and
we don't care about the order. Nice catch.
You're also correct about what propositional tautology given A means.

If it is capable of becoming more able to maximize its utility function, does it then not already have that ability to maximize its utility function? Do you propose that we reward it only for those plans that pay off after only one "action"?


Not quite. I'm proposing penalizing it for gaining power, a la my recent post
[https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/6DuJxY8X45Sco4bS2]. There's a
big difference between "able to get 10 return from my current vantage point" and
"I've taken over the planet and can ensure I get 100 return with high
probability". We're penalizing it for increasing its ability like that
(concretely, see Conservative Agency [https://arxiv.org/abs/1902.09725] for an
analogous formalization, or if none of this makes sense still, wait till the end
of Reframing Impact).

What do you mean by equivalent? The entire history doesn't say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.


For a fixed policy, the history is the only thing you need to know in order to
simulate the agent on a given round. In this sense, seeing the history is
equivalent to seeing the source code.
The claim is: In settings where the agent has unlimited memory and sees the
entire history or source code, you can't get good guarantees (as in the folk
theorem for repeated games). On the other hand, in settings where the agent sees
part of the history, or is constrained to have finite memory (possibly of size
O(log(1/(1−γ)))?), you can (maybe?) prove convergence to Pareto efficient outcomes or
some other strong desideratum that deserves to be called "superrationality".

The bottom left picture on page 21 in the paper shows that this is not just regularization coming through only after the error on the training set is ironed out: zero regularization (1/λ = ∞) still shows the effect.

Can we switch to the interpolation regime early if, before reaching the peak, we tell it to keep the loss constant? Aka we are at loss l∗ and replace the loss function l(θ) with |l(θ)−l∗| or (l(θ)−l∗)².


Interesting! Given that stochastic gradient descent (SGD) does provide an
inductive bias towards models that generalize better, it does seem like changing
the loss function in this way could enhance generalization performance. Broadly
speaking, SGD's bias only provides a benefit when it is searching over many
possible models: it performs badly at the interpolation threshold because the
lowish complexity limits convergence to a small number of overfitted models.
Creating a loss function that allows SGD freer rein over the model it selects
could therefore improve generalization.
If
#1 SGD is inductively biased to more generalizable models in general
#2 an (l(θ)−l∗)² loss-function gives all models with loss near l∗ a wider local
minimum
#3 there are many different models where l(θ)≈l∗ at a given level of complexity
as long as l∗>0
then it's plausible that changing the loss-function in this way will help
emphasize SGD's bias towards models that generalize better. Point #1 is an
explanation for double-descent. Point #2 seems intuitive to me (it makes the
loss-function more convex and flatter when models are better performing) and
Point #3 does too: there are many different sets of prediction that will all
partially fit the training-dataset and yield the same loss function value of l∗,
which implies that there are also many different predictive models that yield
such a loss function.
To illustrate point #3 above, imagine we're trying to fit the set of training
observations {→x1,→x2,→x3,...,→xi,...,→xn}. Fully overfitting this set (getting
l(θ)≈0) requires us to get all →xi from 1 to n correct. However, we can
partially overfit this set (getting l(θ)=l∗) in a variety of different ways. For
instance, if we get all →xi correct except for →xj, we may have roughly n
different ways we can pick →xj that could yield the same l(θ).[1] Consequently,
our stochastic gradient descent process is free to apply its inductive bias to a
broad set of models that have similar performance.
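Point #2 can be made concrete with a one-dimensional toy (all names and numbers are mine): replacing l(θ) with (l(θ)−l∗)² makes every parameter value on the level set l(θ)=l∗ a stationary point, so a whole family of equally-performing models becomes a minimum rather than a single point.

```python
# Toy: original loss l(theta) = theta^2 (unique minimum at theta = 0).
# Modified loss m(theta) = (l(theta) - l_star)^2 is minimized on the
# whole level set {theta : l(theta) = l_star}, not at a single point.

l_star = 1.0

def l(theta):
    return theta ** 2

def m(theta):
    return (l(theta) - l_star) ** 2

def dm(theta, eps=1e-6):
    # numerical derivative of the modified loss
    return (m(theta + eps) - m(theta - eps)) / (2 * eps)

# Both theta = 1 and theta = -1 satisfy l(theta) = l_star, and both
# are minima of m with (numerically) vanishing gradient.
assert abs(m(1.0)) < 1e-12 and abs(m(-1.0)) < 1e-12
assert abs(dm(1.0)) < 1e-5 and abs(dm(-1.0)) < 1e-5
```

With more parameters than constraints, that level set is typically a large manifold, which is where SGD's inductive bias would get to choose.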

You assume that one oracle outputting null implies that the other knows this. Specifying this in the query requires that the querier models the other oracle at all.


Each oracle is running a simulation of the world. Within that simulation, they
search for any computational process with the same logical structure as
themselves. This will find both their virtual model of their own hardware, as
well as any other agenty processes trying to predict them. The oracle then
deletes the output of all these processes within its simulation.
Imagine running a super realistic simulation of everything, except that any time
anything in the simulation tries to compute the millionth digit of pi, you
notice, pause the simulation and edit it to make the result come out as 7. While
it might be hard to formally specify what counts as a computation, I think that
this intuitively seems like meaningful behavior. I would expect the simulation
to contain maths books that said that the millionth digit of pi was 7, and that
were correspondingly off by one about how many 7s were in the first n digits for
any n>1000000.
The principle here is the same.
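The pi-interception idea can be toy-modeled by wrapping the computation (the function names are mine, and the digit routine is a stand-in stub; formally specifying "what counts as a computation" is the hard part the text acknowledges):

```python
# Toy "simulation edit": any process in the simulation that asks for
# digit 1,000,000 of pi gets the edited answer 7; all other queries
# pass through unmodified.

def digit_of_pi(n):
    # Stand-in stub: only knows the first few digits 3.14159...
    # (a real spigot algorithm would go here).
    return int("314159"[n]) if n < 6 else None

def edited_digit_of_pi(n):
    if n == 1_000_000:
        return 7  # the simulation's edit
    return digit_of_pi(n)

assert edited_digit_of_pi(1_000_000) == 7
assert edited_digit_of_pi(1) == 1  # unedited queries are unaffected
```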

Our usual objective is "Make it safe, and if we aligned it correctly make it useful.". A microscope is useful even if it's not aligned, because having a world model is a convergent instrumental goal. We increase the bandwidth from it to us, but we decrease the bandwidth from us to it. By telling it almost nothing, we hide our position in the mathematical universe and any attack it devises cannot be specialized on humanity. Imagine finding the shortest-to-specify abstract game that needs AGI to solve (Nomic?), then instantiating an AGI to solve it just to l

...As I understood it, an Oracle AI is asked a question and produces an answer. A microscope is shown a situation and constructs an internal model that we then extract by reading its innards. Oracles must somehow be incentivized to give useful answers, microscopes cannot help but understand.


A microscope model must also be trained somehow, for example with unsupervised
learning. Therefore, I expect such a model to also look like it's "incentivized
to give useful answers" (e.g. an answer to the question: "what is the next word
in the text?").
My understanding is that what distinguishes a microscope model is the way it is
being used after it's already trained (namely, allowing researchers to look at
its internals for the purpose of gaining insights etcetera, rather than making
inferences for the sake of using its valuable output). If this is correct, it
seems that we should only use safe training procedures for the purpose of
training useful microscopes, rather than training arbitrarily capable models.

I started asking for a chess example because you implied that the reasoning in the top-level comment stops being sane in iterated games.

In a simple iteration of Troll bridge, whether we're dumb is clear after the first time we cross the bridge. In a simple variation, the troll requires smartness even given past observations. In either case, the best worst-case utility bound requires never to cross the bridge, and A knows crossing blows A up. You seemed to expect more.

Suppose my chess skill varies by day. If my last few moves were dumb, I shouldn't rely on

...

Right, OK. I would say "sequential" rather than "iterated" -- my point was about
making a weird assessment of your own future behavior, not what you can do if
you face the same scenario repeatedly. IE: Troll Bridge might be seen as
artificial in that the environment is explicitly designed to punish you if
you're "dumb"; but, perhaps a sequential game can punish you more naturally by
virtue of poor future choices.
Yep, I agree with this.
I concede the following points:
* If there is a mistake in the troll-bridge reasoning, predicting that your
next actions are likely to be dumb conditional on a dumb-looking action is
not an example of the mistake.
* Furthermore, that inference makes perfect sense, and if it is as analogous to
the troll-bridge reasoning as I was previously suggesting, the troll-bridge
reasoning makes sense.
However, I still assert the following:
* Predicting that your next actions are likely to be dumb conditional on a dumb
looking action doesn't make sense if the very reason why you think the action
looks dumb is that the next actions are probably dumb if you take it.
IE, you don't have a prior heuristic judgement that a move is one which you make
when you're dumb; rather, you've circularly concluded that the move would be
dumb -- because it's likely to lead to a bad outcome -- because if you take that
move your subsequent moves are likely to be bad -- because it is a dumb move.
I don't have a natural setup which would lead to this, but the point is that
it's a crazy way to reason rather than a natural one.
The question, then, is whether the troll-bridge reasoning is analogous to this.
I think we should probably focus on the probabilistic case (recently added to
the OP), rather than the proof-based agent. I could see myself deciding that the
proof-based agent is more analogous to the sane case than the crazy one. But the
probabilistic case seems completely wrong.
In the proof-based case, the question is: do we see th

If I'm a poor enough player that I merely have evidence, not proof, that the queen move mates in four, then the heuristic that queen sacrifices usually don't work out is fine and I might use it in real life. If I can prove that queen sacrifices don't work out, the reasoning is fine even for a proof-requiring agent. Can you give a chesslike game where some proof-requiring agent can prove from the rules and perhaps the player source codes that queen sacrifices don't work out, and therefore scores worse than some other agent would have? (Perhaps through mechanisms as in Troll bridge.)


The heuristic can override mere evidence, agreed. The problem I'm pointing at
isn't that the heuristic is fundamentally bad and shouldn't be used, but rather
that it shouldn't circularly reinforce its own conclusion by counting a
hypothesized move as differentially suggesting you're a bad player in the
hypothetical where you make that move. Thinking that way seems contrary to the
spirit of the hypothetical (whose purpose is to help evaluate the move). It's
fine for the heuristic to suggest things are bad in that hypothetical (because
you heuristically think the move is bad); it seems much more questionable to
suppose that your subsequent moves will be worse in that hypothetical,
particularly if that inference is a lynchpin of your overall negative assessment
of the move.
What do you want out of the chess-like example? Is it enough for me to say the
troll could be the other player, and the bridge could be a strategy which you
want to employ? (The other player defeats the strategy if they think you did it
for a dumb reason, and they let it work if they think you did it smartly, and
they know you well, but you don't know whether they think you're dumb, but you
do know that if you were being dumb then you would use the strategy.) This can
be exactly Troll Bridge as stated in the post, but set in chess with player
source code visible.
I'm guessing that's not what you want, but I'm not sure what you want.

(Maybe this doesn't answer your question?)

Correct. I am trying to pin down exactly what you mean by an agent controlling a logical statement. To that end, I ask whether an agent that takes an action iff a statement is true controls the statement through choosing whether to take the action. ("The Killing Curse doesn't crack your soul. It just takes a cracked soul to cast.")

Perhaps we could equip logic with a "causation" preorder such that all tautologies are equivalent, causation implies implication, and whenever we define an agent, we equip its control

...

The point here is that the agent described is acting like EDT is supposed to --
it is checking whether its action implies X. If yes, it is acting as if it
controls X in the sense that it is deciding which action to take using those
implications. I'm not arguing at all that we should think "implies X" is causal,
nor even that the agent has opinions on the matter; only that the agent seems to
be doing something wrong, and one way of analyzing what it is doing wrong is to
take a CDT stance and say "the agent is behaving as if it controls X" -- in the
same way that CDT says to EDT "you are behaving as if correlation implies
causation" even though EDT would not assent to this interpretation of its
decision.
I think you have me the wrong way around; I was suggesting that certain
causally-backwards reasoning would be unwise in chess, not the reverse. In
particular, I was suggesting that we should not judge a move poor because we
think the move is something only a poor player would do, but always the other
way around. For example, suppose we have a prior on moves which suggests that
moving a queen into danger is something only a poor player would do. Further
suppose we are in a position to move our queen into danger in a way which forces
checkmate in 4 moves. Suppose we reason: "I could move my queen into
danger to open up a path to checkmate in 4. However, only poor players move
their queen into danger. Poor players would not successfully navigate a
checkmate-in-4. Therefore, if I move my queen into danger, I expect to make a
mistake costing me the checkmate in 4. Therefore, I will not move my queen into
danger." That's an example of the mistake I was pointing at.
Note: I do not personally endorse this as an argument for CDT! I am expressing
these arguments because it is part of the significance of Troll Bridge. I think
these arguments are the kinds of things one should grapple with if one is
grappling with Troll Bridge. I have defended EDT from these kinds of

Troll Bridge is a rare case where agents that require proof to take action can prove they would be insane to take some action before they've thought through its consequences. Can you show how they could unwisely do this in chess, or some sort of Troll Chess?

I don't see how this agent seems to control his sanity. Does the agent who jumps off a roof iff he can (falsely) prove it wise choose whether he's insane by choosing whether he jumps?

I don't see how this agent seems to control his sanity.

The agent in Troll Bridge thinks that it can make itself insane by crossing the bridge. (Maybe this doesn't answer your question?)

Troll Bridge is a rare case where agents that require proof to take action can prove they would be insane to take some action before they've thought through its consequences. Can you show how they could unwisely do this in chess, or some sort of Troll Chess?

I make no claim that this sort of case is common. Scenarios where it comes up and is relevant to X-risk ...

Nothing stops the Halting problem being solved in particular instances. I can prove that some agent halts, and so can it. See FairBot in Robust Cooperation in the Prisoner's Dilemma.
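A bounded version of this is easy to exhibit. Below is a toy sketch in the spirit of FairBot (my names throughout), substituting bounded simulation for the paper's proof search; the depth-0 default to "C" crudely stands in for the Löbian step that lets the paper's proof-based FairBots cooperate with each other:

```python
# Toy FairBot for the one-shot Prisoner's Dilemma with source access:
# agents are functions opponent -> "C" or "D".

def fairbot(opponent, depth=3):
    # Cooperate iff a bounded simulation of the opponent playing
    # against (a shallower copy of) ourselves yields cooperation.
    if depth == 0:
        # Out of search budget: default to "C". This stands in for the
        # Löbian step; the real proof-based FairBot instead defects
        # when no proof of the opponent's cooperation is found.
        return "C"
    me = lambda opp: fairbot(opp, depth - 1)
    return "C" if opponent(me) == "C" else "D"

def cooperatebot(opponent):
    return "C"

def defectbot(opponent):
    return "D"

assert fairbot(cooperatebot) == "C"  # rewards cooperators
assert fairbot(defectbot) == "D"     # punishes defectors
assert fairbot(fairbot) == "C"       # mutual cooperation with itself
```

The recursion terminates because each simulated copy carries a strictly smaller depth, which is the bounded analogue of proving halting "in particular instances".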

Written slightly differently, the reasoning seems sane: Suppose I cross. I must have proven it's a good idea. Aka I proved that I'm consistent. Aka I'm inconsistent. Aka the bridge blows up. Better not cross.

I agree with your English characterization, and I also agree that it isn't really obvious that the reasoning is pathological. However, I don't think it is so obviously sane, either.

- It seems like counterfactual reasoning about alternative actions should avoid going through "I'm obviously insane" in almost every case; possibly in every case. If you think about what would happen if you made a particular chess move, you need to divorce the consequences from any "I'm obviously insane in that scenario, so the rest of my moves i

Conjecture: Every short proof of agentic behavior points out agentic architecture.

Aren't they just averaging together to yield yet another somewhat-but-not-quite-right function?

Indeed we don't want such linear behavior. The AI should preserve the potential for maximization of any candidate utility function - first so it has time to acquire all the environment's evidence about the utility function, and then for the hypothetical future scenario of us deciding to shut it off.


See this comment.
[https://www.lesswrong.com/posts/YJq6R9Wgk5Atjx54D/does-bayes-beat-goodhart#cxBtJQZPN2szwdd6F]
Stuart and I are discussing what happens after things have converged as much as
they're going to
[https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy#fhpmnzMqLxQsiE7CW]
, but there's still uncertainty left.

How do you know MCTS doesn't preserve alignment?


As I understand it, MCTS is used to maximize a given computable utility
function, and so it is not alignment-preserving in the general sense that
sufficiently strong optimization of an imperfect utility function is not
alignment-preserving.

So you want to align the AI with us rather than its user by choosing the alignment approach it uses. If it's corrigible towards its user, won't it acquire the capabilities of the other approach in short order to better serve its user? Or is retrofitting the other approach also a blind spot of your proposed approach?


Yes, that seems like an issue.
That's one possible solution. Another one might be to create an aligned AI that
is especially good at coordinating with other AIs, so that these AIs can make an
agreement with each other to not develop nuclear weapons before they invent the
AI that is especially good at developing nuclear weapons. (But would
corrigibility imply that the user can always override such agreements?) There
may be other solutions that I'm not thinking of. If all else fails, it may be
that the only way to avoid AI-caused differential intellectual progress in a bad
direction is to stop the development of AI.

Reading the link and some reference abstracts, I think my last comment already had that in mind. The idea here is that a certain kind of AI would accelerate a certain kind of progress more than another, because of the approach we used to align it, and on reflection we would not want this. But surely if it is aligned, and therefore corrigible, this should be no problem?


Here's a toy example that might make the idea clearer. Suppose we lived in a
world that hasn't invented nuclear weapons yet, and someone creates an aligned
AI that is really good at developing nuclear weapon technology and only a little
bit better than humans on everything else. Even though everyone would prefer
that nobody develops nuclear weapons, the invention of this aligned AI (if more
than one nation had access to it, and "aligned" means aligned to the user) would
accelerate the development of nuclear weapons relative to every other kind of
intellectual progress and thereby reduce the expected value of the universe.
Does that make more sense now?

Please reword your last idea. There is a possible aligned AI that is biased in its research and will ignore people telling it so?


I think that section will only make sense if you're familiar with the concept of
differential intellectual progress. The wiki page I linked to is a bit outdated,
so try https://concepts.effectivealtruism.org/concepts/differential-progress/
and its references instead.

A threshold of 1.5 does conjunction without my dummy token.

When at most one clause is true, a threshold of 0.5 does disjunction instead.
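The threshold picture can be checked with a tiny low-temperature softmax (my toy numbers and function names; note the reference position pinned at the threshold is the scheme under discussion here, not Tracr's fixed 0.5 BOS):

```python
import math

def softmax(scores, temp):
    exps = [math.exp(s / temp) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gate(token_scores, threshold, temp=0.01):
    # Position 0 is an always-present reference position pinned at the
    # threshold; the gate "fires" iff some token outscores it, which at
    # low temperature the softmax resolves almost exactly.
    weights = softmax([threshold] + token_scores, temp)
    return weights[0] < 0.5  # reference loses iff some score > threshold

# Each token's score = number of satisfied conjuncts, in {0, 1, 2}.
assert gate([2], threshold=1.5) is True    # AND: both conjuncts true
assert gate([1], threshold=1.5) is False   # AND: only one true
assert gate([1], threshold=0.5) is True    # OR: at least one true
assert gate([0], threshold=0.5) is False   # OR: none true
```

This is the sense in which inputs taking values in a small discrete set make the softmax outputs effectively two-valued.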