(To show my idea is compatible with Boolean attention.)

I use a single head, and the ranks add up linearly.

113d

You're right that accounting for the softmax could be used to get around the
argument in the appendix. We'll mention this when we update the paper.
The scheme as you described it relies on an always-on dummy token, which would
conflict with our implementation of default values in aggregates: we fix the BOS
token attention to 0.5, so at low softmax temp it's attended to iff nothing else
is; however with this scheme we'd want 0.5 to always round off to zero after
softmax. This is plausibly surmountable but we put it out of scope for the pilot
release since we didn't seem to need this feature, whereas default values come
up pretty often.
Also, while this construction will work for just and (with arbitrarily many
conjuncts), I don't think it works for arbitrary compositions using and and or.
(Of course it would be better than no composition at all.)
In general, I expect there to be a number of potentially low-hanging
improvements to be made to the compiler -- many of them we deliberately omitted
and are mentioned in the limitations section, and many we haven't yet come up
with. There are tons of features one could add, each of which takes time to
think about and increases overall complexity, so we had to be judicious about
which lines to
pursue -- and even then, we barely had time to get into superposition
experiments. We currently aren't prioritising further Tracr development until we
see it used for research in practice, but I'd be excited to help anyone who's
interested in working with it or contributing.

Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values {0,1000,2000}, the outputs will take only two values.

From the RASP paper: https://arxiv.org/pdf/2106.06981

The uniform weights dictated by our selectors reflect an attention pattern in which ‘unselected’ pairs are all given strongly negative scores, while the selected pairs all have higher, similar, scores.

Addition of such attention patterns corresponds to anding of such selectors.

114d

(The quote refers to the usage of binary attention patterns in general, so I'm
not sure why you're quoting it)
I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0
and 1 entries.
iiuc, the statement in the tracr paper is not that you can't have attention
patterns which implement this logical operation, but that you can't have a
single head implementing this attention pattern (without exponential blowup)

Isn't (select(a, b, ==) and select(c, d, ==)) just the sum of the two QK matrices? It'll produce NxN entries in {0,1,2}, and softargmax with low temp treats 1s like 0s in the presence of a dummy token's 2.
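The low-temperature claim is easy to check numerically. A minimal sketch, assuming the scheme above (summed Boolean selector scores in {0, 1, 2} plus an always-on dummy token fixed at 2; the specific selector patterns are made up):

```python
import math

def softmax(scores, temp=1.0):
    # numerically stable softmax at temperature `temp`
    scaled = [s / temp for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Two Boolean selector score rows for one query position (1 = selected).
sel_1 = [1, 0, 1, 1]
sel_2 = [1, 1, 0, 1]
summed = [a + b for a, b in zip(sel_1, sel_2)]  # entries in {0, 1, 2}
scores = summed + [2]                           # always-on dummy token at 2

weights = softmax(scores, temp=1e-3)  # low temperature
```

At low temperature the weight concentrates uniformly on the maximal score: the positions where both selectors fire share attention with the dummy token, while positions scoring 1 get essentially zero weight, i.e. the 1s round off like 0s.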

114d

I don't think that's right. Iiuc this is a logical and, so the values would be
in {0, 1} (as required, since tracr operates with Boolean attention). For a more
extensive discussion of the original problem see appendix C.

It seems important to clarify the difference between `From ⊢A, conclude ⊢□A`
and `⊢□A→□□A`. I don't feel like I *get* why we don't just set `conclude := →`
and `⊢ := □`.

11mo

Well, A→B is just short for ¬A∨B, i.e., "(not A) or B". By contrast, A⊢B means
that there exists a sequence of (very mechanical) applications of modus ponens,
starting from the axioms of Peano Arithmetic (PA) with A appended, ending in B.
We tried hard to make the rules of ⊢ so that it would agree with → in a lot of
cases (i.e., we tried to design ⊢ to make the deduction theorem true), but it
took a lot of work in the design of Peano Arithmetic and can't be taken for
granted.
For instance, consider the statement ¬□(1=0). If you believe Peano Arithmetic
is consistent, then you believe that ¬□(1=0), and therefore you also believe
that □(1=0)→(2=3). But PA cannot prove ¬□(1=0) (by Gödel's Theorem, or Löb's
theorem with p=(1=0)), so we don't have ⊢□(1=0)→(2=3).

Similarly:

```
löb = □ (□ A → A) → □ A
□löb = □ (□ (□ A → A) → □ A)
□löb -> löb:
löb premises □ (□ A → A).
By internal necessitation, □ (□ (□ A → A)).
By □löb, □ (□ A).
By löb's premise, □ A.
```

12mo

Noice :)

The recursive self-improvement isn't necessarily human-out-of-the-loop: If an AGI comes up with simpler math, everything gets easier.

13mo

I have strong doubts this is actually any help. Ok, maybe the AI comes up with
new simpler math. It's still new math that will take the humans a while to get
used to. There will probably be lots of options to add performance and
complexity. So either we are banning the extra complexity, using our advanced
understanding, and just using the simpler math at the cost of performance, or
the AI soon gets complex again before humans have understood the simple math.

It wouldn't just guess the next line of the proof, it'd guess the next conjecture. It could translate our math papers into proofs that compile, then write libraries that reduce code duplication. I expect canonical solutions to our confusions to fall out of a good enough such library.

16mo

I would expect such libraries to be a million lines of unenlightening
trivialities. And the "libraries to reduce code duplication", to mostly contain
theorems that were proved in multiple places.

Oh, I wasn't expecting you to have addressed the issue! 10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

You're right on all counts in your last paragraph.

11y

Not sure if a short answer will help, so I will write a long one.
In 10.2.4 I talk about the possibility of an unwanted learned predictive
function L−(s′,s,a) that makes predictions without using the argument a. This is
possible for example by using s′ together with a (learned) model πl of the
compute core to predict a: so a viable L− could be defined as
L−(s′,s,a)=S(s′,s,πl(s)). This L− could make predictions fully compatible with
the observational record o, but I claim it would not be a reasonable learned L
according to the reasonableness criterion L≈S. How so?
The reasonableness criterion L≈S is similar to that used in supervised machine
learning: we evaluate the learned L not primarily by how it matches the training
set (how well it predicts the observations in o), but by evaluating it on a
separate test set. This test set can be constructed by sampling S to create
samples not contained in o. Mathematically, perfect reasonableness is defined as
L=S, which implies that L predicts all samples from S fully accurately.
Philosophically/ontologically speaking, the agent specification in my paper,
specifically the learning world diagram and the descriptive text around it of
how this diagram is a model of reality, gives the engineer an unambiguous
prescription of how they might build experimental equipment that can measure the
properties of the S in the learning world diagram by sampling reality. A version
of this equipment must of course be built into the agent, to create the
observations that drive machine learning of L, but another version can be used
stand-alone to construct a test set.
A sampling action to construct a member of the test set would set up a desired
state s and action a, and then observe the resulting s′. Mathematically
speaking, this observation gives additional information about the numeric value
of S(s′,s,a) and of all S(s′′,s,a) for all s′′≠s′.
I discuss in the section that, if we take an observational record o sampled from
S, then two le

it is very interpretable to humans

Misunderstanding: I expect we can't construct a counterfactual planner because we can't pick out the compute core in the black-box learned model.

And my Eliezer's problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.

31y

Agree it is hard to pick the compute core out of a black-box learned model that
includes the compute core.
But one important point I am trying to make in the counterfactual planning
sequence/paper is that you do not have to solve that problem. I show that it is
tractable to route around it, and still get an AGI.
I don't understand your second paragraph 'And my Eliezer's problem...'. Can you
unpack this a bit more? Do you mean that counterfactual planning does not
automatically solve the problem of cleaning up an already in-progress mess when
you press the emergency stop button too late? It does not intend to, and I do
not think that the cleanup issue is among the corrigibility-related problems
Eliezer has been emphasizing in the discussion above.

If you don't wish to reply to Eliezer, I'm an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world - my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.

21y

See above for my reply to Eliezer.
Indeed, a counterfactual planner
[https://www.lesswrong.com/posts/7EnZgaepSBwaZXA5y/counterfactual-planning-in-agi-systems]
will plan coherently inside its planning world.
In general, when you want to think about coherence without getting deeply
confused, you need to keep track of what reward function you are using to rule
on your coherency criterion. I don't see that fact mentioned often on this
forum, so I will expand.
An agent that plans coherently given a reward function Rp to maximize paperclips
will be an incoherent planner if you judge its actions by a reward function Rs
that values the maximization of staples instead. In section 6.3 of the paper
[https://arxiv.org/abs/2102.00834] I show that you can perfectly well interpret
a counterfactual planner as an agent that plans coherently even inside its
learning world (inside the real world), as long as you are willing to evaluate
its coherency according to the somewhat strange reward function Rπ. Armstrong's
indifference methods use this approach to create corrigibility without losing
coherency: they construct an equivalent somewhat strange reward function by
including balancing terms.
One thing I like about counterfactual planning is that, in my view, it is very
interpretable to humans. Humans are very good at predicting what other humans
will do, when these other humans are planning coherently inside a specifically
incorrect world model, for example in a world model where global warming is a
hoax. The same skill can also be applied to interpreting and anticipating the
actions of AIs which are counterfactual planners. But maybe I am
misunderstanding your concern about interpretability.

Someone needs to check if we can use ML to guess activations in one set of neurons from activations in another set of neurons. The losses would give straightforward estimates of such statistical quantities as mutual information. Generating inputs that have the same activations in a set of neurons illustrates what the set of neurons does. I might do this myself if nobody else does.
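As a toy version of this exercise, here is a sketch with synthetic one-dimensional "activations" standing in for two sets of neurons; under a Gaussian assumption, the quality of the fit pins down a mutual-information estimate (the data and the Gaussian assumption are illustrative, not from the comment):

```python
import math
import random

random.seed(0)

# Synthetic "activations": neuron b is a noisy function of neuron a
# (stand-ins for activations in two different sets of neurons).
a = [random.gauss(0, 1) for _ in range(10000)]
b = [0.8 * x + random.gauss(0, 0.6) for x in a]

def mean(xs):
    return sum(xs) / len(xs)

def corr(xs, ys):
    # Pearson correlation, computed from scratch
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    sx = math.sqrt(mean([(x - mx) ** 2 for x in xs]))
    sy = math.sqrt(mean([(y - my) ** 2 for y in ys]))
    return cov / (sx * sy)

rho = corr(a, b)
# For jointly Gaussian variables, predictability translates directly into
# mutual information: I(a; b) = -0.5 * ln(1 - rho^2) nats.
mi_nats = -0.5 * math.log(1 - rho ** 2)
```

With real networks one would use the held-out loss of a learned activation-predicting model rather than a closed form, but the principle is the same: better predictability means a higher estimated mutual information.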

21y

I'm not clear on what you'd do with the results of that exercise. Suppose that
on a certain distribution of texts you can explain 40% of the variance in half
of layer 7 by using the other half of layer 7 (and the % gradually increases as
you make the activation-predicting model bigger, perhaps you guess it's
approaching 55% in the limit). What's the upshot of models being that
predictable rather than more or less, or the use of the actual predictor that
you learned?
Given an input x, generating other inputs that "look the same as x" to part of
the model but not other parts seems like it reveals something about what that
part of the model does. As a component of interpretability research that seems
pretty similar to doing feature visualization or selecting input examples that
activate a given neuron, and I'd guess it would fit in the same way into the
project of doing interpretability.
I'd mostly be excited about people developing these techniques as part of a
focused project to understand what models are thinking. I'm not really sure what
to make of them in isolation.

Suppose the bridge is safe iff there's a proof that the bridge is safe. Then you would forbid the reasoning "Suppose I cross. I must have proven it's safe. Then it's safe, and I get 10. Let's cross.", which seems sane enough in the face of Löb.

Translating to a tree of natural language descriptions and back lets you

- use different instances of GPT at each step, so that any coded messages would have to rely on the common training set rather than shared internals.
- do a random subset of the tasks yourself, so that any subversion of the target system would have to be based on subtle biases, perhaps with a small number of outright lies that it hopes we don't implement manually, rather than the description being fabricated from whole cloth.

Having the thing write papers is merely an existence proof of...

12y

Maybe you did. I find it hard to distinguish inventing and half remembering
ideas.
If the training procedure either
1. reliably produces mesaoptimizers with about the same values, or
2. reliably produces mesaoptimizers that can acausally cooperate, or
3. lets the rest of the procedure allow one mesaoptimizer to take control of
the whole output,
then using different copies of GPT-n trained from different seeds doesn't help.
If you just convert 1% of the English into network yourself, then all it needs
to use is some error correction. Even without that, neural net structures are
somewhat redundant. A malicious optimizer with a random 1% removed is probably
still a malicious optimizer.
Secondly, its goal here is to give us subtly wrong alignment papers. Suppose we
produce an examination and transparency step so thorough that no mesaoptimizer
can slip through. The original mesaoptimizer can just design a network that is
doing its best as a mesaoptimizer-free predictor, but has subtly maliciously
chosen beliefs about AI research.

Suppose instead of a timeline with probabilistic events, the coalition experiences the full tree of all possible futures - but we translate everything to preserve behavior. Then beliefs encode which timelines each member cares about, and bets trade influence (governance tokens) between timelines.

22y

Can you justify Kelly "directly" in terms of Pareto-improvement trades rather
than "indirectly" through Pareto-optimality? I feel this gets at the distinction
between the selfish vs altruistic view.
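For reference, the "indirect" story being contrasted here is the standard one: the Kelly fraction is whatever maximizes expected log growth. A quick numerical check (even-odds bet with an illustrative win probability):

```python
import math

def log_growth(f, p):
    # Expected log growth of wealth when betting fraction f of bankroll
    # on a biased coin with win probability p at even odds.
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)

p = 0.6
fracs = [i / 100 for i in range(0, 100)]
best = max(fracs, key=lambda f: log_growth(f, p))
# Kelly predicts f* = 2p - 1 = 0.2 for even odds.
```

The question above asks whether the same fraction can instead be derived "directly" from a sequence of Pareto-improving trades among the coalition members; this sketch only exhibits the indirect optimality criterion.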

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence. All short terms ought to mean something, and when everything fits together better than expected, that is a sign that one is on the right track, and that there is a pattern to formalize. and can be replaced by and . I expect that the latter formation is better because it is shorter. Its only direct effect would be tha...

12y

OK, I think I see what inspired your question.
If you want to give the math this kind of kabbalah treatment, you may
also look at the math in [EFDH16], which produces agents similar to my
definitions (4) (5), and also some variants that have different types of
self-reflection. In the later paper here [https://arxiv.org/abs/1908.04734],
Everitt et al. develop some diagrammatic models of this type of agent
self-awareness, but the models are not full definitions of the agent.
For me, the main questions I have about the math developed in the paper is how
exactly I can map the model and the constraints (C1-3) back to things I can or
should build in physical reality.
There is a thing going on here (when developing agent models, especially when
treating AGI/superintelligence and embeddeness) that also often happens in
post-Newtonian physics. The equations work, but if we attempt to map these
equations to some prior intuitive mental model we have about how reality or
decision making must necessarily work, we have to conclude that this attempt
raises some strange and troubling questions.
I'm with modern physics here (I used to be an experimental physicist for a
while), where the (mainstream) response to this is that 'the math works, your
intuitive feelings about how X must necessarily work are wrong, you will get
used to it eventually'.
BTW, I offer some additional interpretation of a difficult-to-interpret part of
the math in section 10 of my 2020 paper here [https://arxiv.org/abs/2007.05411].
You could insert quantilization in several ways in the model. Most obvious way
is to change the basic definition (4). You might also define a transformation
that takes any reward function R and returns a quantilized reward function Rq,
this gives you a different type of quantilization, but I feel it would be in the
same spirit.
In a more general sense, I do not feel that quantilization can produce the kind
of corrigibility I am after in the paper. The effects you get o

I am not sure if I understand the question.

pi has form afV, V has form mfV, f is a long reused term. Expand recursion to get afmfmf... and mfmfmf.... Define E=fmE and you get pi=aE without writing f twice. Sure, you use V a lot but my intuition is that there should be some a priori knowable argument for putting the definitions your way or your theory is going to end up with the wrong prior.

12y

Thanks for expanding on your question about the use of V. Unfortunately, I still
have a hard time understanding your question, so I'll say a few things and hope
that will clarify.
If you expand the V term defined in (5) recursively, you get a tree-like
structure. Each node in the tree has as many sub nodes as there are elements in
the set W. The tree is in fact a tree of branching world lines. Hope this helps
you visualize what is going on.
I could shuffle around some symbols and terms in the definitions (4) and (5) and
still create a model of exactly the same agent that will behave in exactly the
same way. So the exact way in which these two equations are written down and
recurse on each other is somewhat contingent. My equations stay close to what is
used when you model an agent or 'rational' decision making process with a
Bellman equation. If your default mental model of an agent is a set of
Q-learning equations, the model I develop will look strange, maybe even
unnatural at first sight.
OK, maybe this is the main point that inspired your question. The agency/world
models developed in the paper are not a 'theory', in the sense that theories
have predictive power. A mathematical model used as a theory, like f=m∗a,
predicts how objects will accelerate when subjected to a force.
The agent model in the paper does not really 'predict' how agents will behave.
The model is compatible with almost every possible agent construction and agent
behavior, if we are allowed to pick the agent's reward function R freely after
observing or reverse-engineering the agent to be modeled.
On purpose, the agent model is constructed with so many 'free parameters' that
it has no real predictive power. What you get here is an agent model that can
describe almost every possible agent and world in which it could operate.
In mathematics, the technique I am using in the paper is sometimes called
'without loss of generality'
[https://en.wikipedia.org/wiki/Without_loss_of_generality]. I am

Damn it, I came to write about the monad^{1} then saw the edit. You may want to add it to this list, and compare it with the other entries.

Here's a dissertation and blog post by Jared Tobin on using `(X -> R) -> R`

with flat reals to represent usual distributions in Haskell. He appears open to getting hired.
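The `(X -> R) -> R` encoding can be sketched outside Haskell too. Here is a hypothetical Python rendering, where a distribution is its own expectation functional (the names `unit`, `bind`, and `weighted` are mine, not Tobin's):

```python
# A distribution over X represented as an expectation functional:
# it takes f : X -> float and returns E[f(X)]  (the (X -> R) -> R encoding).

def unit(x):
    # point mass at x
    return lambda f: f(x)

def bind(dist, k):
    # sequence: "sample" x ~ dist, then continue with the distribution k(x)
    return lambda f: dist(lambda x: k(x)(f))

def weighted(pairs):
    # finite distribution from (probability, value) pairs
    return lambda f: sum(p * f(x) for p, x in pairs)

coin = weighted([(0.5, 0), (0.5, 1)])
two_flips = bind(coin, lambda a: bind(coin, lambda b: unit(a + b)))

mean_val = two_flips(lambda x: x)                     # E[a + b] = 1.0
p_two = two_flips(lambda x: 1.0 if x == 2 else 0.0)   # P(sum = 2) = 0.25
```

`bind` here is just the continuation monad's bind, which is what makes this encoding a monad of distributions.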

Maybe you want a more powerful type system? I think Coq allows constructing that subtype of a type which satisfies a property. Agda's cubical type theory places a lot of emphasis on the unit interval. Might dependent types be enough to express lipsch...

I like your "Corrigibility with Utility Preservation" paper. I don't get why you prefer not using the usual conditional probability notation. leads to TurnTrout's attainable utility preservation. Why not use in the definition of ? Could you change the definition of to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure is large and ...

12y

In general if you would forcefully change the agent's reward function into some
R′, it will self-preserve from that moment on and try to maintain this R′, so it
won't self-edit its R′ back into the original form.
There are exceptions to this general rule, for special versions of R′ and
special versions of agent environments (see section 7.2), where you can get the
agent to self-edit, but on first glance, your example above does not seem to be
one.
If you remove the dntu bits from the agent definition then you can get an agent
that self-edits a lot, but without changing its fundamental goals. The proofs of
'without changing its fundamental goals' will get even longer and less readable
than the current proofs in the paper, so that is why I did the dntu privileging.

12y

Thanks!
Well, I wrote in the paper (section 5) that I used p(st,a,st+1) instead of the
usual conditional probability notation P(st+1|st,a) because it 'fits better with
the mathematical logic style used in the definitions and proofs below', i.e. the
proofs use the mathematics of second order logic, not probability theory.
However this was not my only reason for this preference. The other reason was
that I had an intuitive suspicion back in 2019 that the use of conditional
probability notation, in the then existing papers and web pages on balancing
terms, acted as an impediment to mathematical progress. My suspicion was that
it acted as an overly Bayesian framing that made it more difficult to clarify
and generalize the mathematics of this technique any further.
In hindsight in 2021, I can be a bit more clear about my 2019 intuition.
Armstrong's original balancing term elements E(v|u→v) and E(u|u→u), where u→v
and u→u are low-probability near-future events, can be usefully generalized
(and simplified) as the Pearlian E(v|do(v)) and E(u|do(u)), where the do terms
are interventions (or 'edits') on the current world state.
The notation E(v|u→v) makes it look like the balancing terms might have some
deep connection to Bayesian updating or Bayesian philosophy, whereas I feel they
do not have any such deep connection.
That being said, in my 2020 paper [https://arxiv.org/abs/2007.05411] I present a
simplified version of the math in the 2019 paper using the traditional
P(st+1|st,a) notation again, and without having to introduce do.
Yes it is very related: I explore that connection in more detail in section 12
of my 2020 paper [https://arxiv.org/abs/2007.05411]. In general I think that
counterfactual expected-utility reward function terms are Swiss army knives
with many interesting uses. I feel that as a community, we have not yet gotten
to the bottom of their possibilities (and their possible failure modes).
In definition of π∗ (section 5.3 equation 4) I am using a

I'm not convinced that we can do nothing if the human wants ghosts to be happy. The AI would simply have to do what would make ghosts happy if they were real. In the worst case, the human's (coherent extrapolated) beliefs are your only source of information on how ghosts work. Any proper general solution to the pointers problem will surely handle this case. Apparently, each state of the agent corresponds to some probability distribution over worlds.

32y

This seems like it's only true if the humans would truly cling to their belief
in spite of all evidence (IE if they believed in ghosts dogmatically), which
seems untrue for many things (although I grant that some humans may have some
beliefs like this). I believe the idea of the ghost example is to point at cases
where there's an ontological crisis, not cases where the ontology is so dogmatic
that there can be no crisis (though, obviously, both cases are theoretically
important).
However, I agree with you in either case -- it's not clear there's "nothing to
be done" for the ghost case (in either interpretation).

with the exception of people who decided to gamble on being part of the elite in outcome B

Game-theoretically, there's a better way. Assume that after winning the AI race, it is easy to figure out everyone else's win probability, utility function and what they would do if they won. Human utility functions have diminishing returns, so there's opportunity for acausal trade. Human ancestry gives a common notion of fairness, so the bargaining problem is easier than with aliens.

Most of us care some even about those who would take all for themselves, so instead o...

12y

Good point, acausal trade can at least ameliorate the problem, pushing towards
atomic alignment. However, we understand acausal trade too poorly to be highly
confident it will work. And, "making acausal trade work" might in itself be
considered outside of the desiderata of atomic alignment (since it involves
multiple AIs). Moreover, there are also actors that have a very low probability
of becoming TAI users but whose support is beneficial for TAI projects (e.g.
small donors). Since they have no counterfactual AI to bargain on their behalf,
it is less likely acausal trade works here.

This all sounds reasonable. I just saw that you were arguing for more being learned at runtime (as some sort of Steven Reknip), and I thought that surely not all the salt machinery can be learnt, and I wanted to see which of those expectations would win.

22y

(Oh, I get it, Reknip is Pinker backwards.) If you're interested in my take more
generally see My Computational Framework for the Brain
[https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain]
. :-)

Do you posit that it learns over the course of its life that salt taste cures salt deficiency, or do you allow this information to be encoded in the genome?

32y

I think the circuitry which monitors salt homeostasis, and which sends a reward
signal when both salt is deficient and the neocortex starts imagining the taste
of salt ... I think that circuitry is innate and in the genome. I don't think
it's learned.
That's just a guess from first principles: there's no reason it couldn't be
innate, it's not that complicated, and it's very important to get the behavior
right. (Trial-and-error learning of salt homeostasis would, I imagine, be often
fatal.)
I do think there are some non-neocortex aspects of food-related behavior that
are learned—I'm thinking especially of how, if you eat food X, then get sick a
few hours later, you develop a revulsion to the taste of food X. That's
obviously learned, and I really don't think that this learning is happening in
the neocortex. It's too specific. It's only one specific type of association, it
has to occur in a specific time-window, etc.
But I suspect that the subcortical systems governing salt homeostasis in
particular are entirely innate (or at least mostly innate), i.e. not involving
learning.
Does that answer your question? Sorry if I'm misunderstanding.

The hypotheses after the modification are supposed to have knowledge that they're in training, for example because they have enough compute to find themselves in the multiverse. Among hypotheses with equal behavior in training, we select the simpler one. We want this to be the one that disregards that knowledge. If the hypothesis has form "Return whatever maximizes property _ of the multiverse", the simpler one uses that knowledge. It is this form of hypothesis which I suggest to remove by inspection.

12y

Ok, that should work assuming something analogous to Paul's hypothesis about
minimal circuits being daemon-free.

Take an outer-aligned system, then add a 0 to each training input and a 1 to each deployment input. Wouldn't this add only malicious hypotheses that can be removed by inspection without any adverse selection effects?

32y

After thinking about it for a couple minutes, this question is both more
interesting and less trivial than it seemed. The answer is not obvious to me.
On the face of it, passing in a bit which is always constant in training should
do basically nothing - the system has no reason to use a constant bit. But if
the system becomes reflective (i.e. an inner optimizer shows up and figures out
that it's in a training environment), then that bit could be used. In principle,
this wouldn't necessarily be malicious - the bit could be used even by aligned
inner optimizers, as data about the world just like any other data about the
world. That doesn't seem likely with anything like current architectures, but
maybe in some weird architecture which systematically produced aligned inner
optimizers.

I don't buy the lottery example. You never encoded the fact that you know tomorrow's numbers. Shouldn't the prior be that you win a million guaranteed if you buy the ticket?

32y

No! You also have to enter the right numbers.
What I'm doing is modeling "gamble with the money" as a simple action - you can
imagine there's a big red button that gives you $200 1/16th of the time and
takes all your money otherwise.
And then I'm modeling "buy a lotto ticket" as a compound action consisting of
entering each number individually.
"Knowing the numbers" means your world model understands that if you've entered
the right numbers, you get the money. But it doesn't make "enter the right
numbers" probable in the prior.
Of course the conclusion is reversed if we make "enter the right numbers" into
a primitive action.

12y

I also didn't understand that. I was thinking of it more like AlphaStar in the
sense that your prior is that you're going to continue using your current
(probabilistic) policy for all the steps involved in what you're thinking about.
(But not like AlphaStar in that the brain is more likely to do
one-or-a-few-steps of rollout with clever hierarchical abstract representations
of plans, rather than dozens-of-steps rollouts in a simple one-step-at-a-time
way.)

It sounds like you want to use it as a component for alignment of a larger AI, which would somehow turn its natural-language directives into action. I say use it as the capability core: Ask it to do armchair alignment research. If we give it subjective time, a command line interface and internet access, I see no reason it would do worse than the rest of us.

12y

In retrospect, I was totally unclear that I wasn't necessarily talking about
something that has a complicated internal state, such that it can behave like
one human over long time scales. I was thinking more about the "minimum
human-imitating unit" necessary to get things like IDA off the ground.
In fact this post was originally titled "What to do with a GAN of a human?"

The problem arises whenever the environment changes. Natural selection was a continual process, and yet humans still aren't fitness-aligned.

Re claim 1: If you let it use the page as a scratch pad, you can also let it output commands to a command line interface so it can outsource these hard-to-emulate calculations to the CPU.

23y

I'm unsure that GPT-3 can output, say, an IPython notebook to get the values it
wants.
That would be really interesting to try...

It appears to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q also incorporate AUP. Then instead of comparing "Disable the Off-Switch, then achieve the random goal whatever the cost" to "Wait, then achieve the random goal whatever the cost", you would compare "Disable the Off-Switch, then achieve the random goal with low impact" to "Wait, then achieve the random goal with low impact".

The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn't see

...

This has been an idea I’ve been intrigued by ever since AUP came out. My main
concern with it is the increase in compute required and loss of competitiveness.
Still probably worth running the experiments.
Correct. Proposition 4 in the AUP paper guarantees penalty invariance to affine
transformation only if the denominator is also the penalty for taking some
action (absolute difference in Q values). You could, for example, consider the
penalty of some mild action: |Q(s, a_mild) − Q(s, ∅)|. It's really up to the designer
in the near-term. We’ll talk about more streamlined designs for superhuman use
cases in two posts.
Don’t think so. Moving generates tiny penalties, and going in circles usually
isn’t a great way to accrue primary reward.
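The affine-invariance point is easy to check numerically. A minimal sketch, assuming Q values transform affinely when the utility does (the action names and numbers are made up):

```python
# Toy check: a penalty normalized by |Q(s, a_mild) - Q(s, noop)| is
# invariant to positive affine transformations u -> c*u + b of the
# utility, because the Q values inherit the transform: the shift b
# cancels inside each absolute difference, and the scale c cancels
# in the ratio.

def penalty(Q, a, a_mild="mild", noop="noop"):
    return abs(Q[a] - Q[noop]) / abs(Q[a_mild] - Q[noop])

Q = {"disable_off_switch": 10.0, "mild": 2.5, "noop": 2.0}

# Apply an affine transform u -> 3*u + 7 to every Q value.
Q_affine = {a: 3 * q + 7 for a, q in Q.items()}

p1 = penalty(Q, "disable_off_switch")
p2 = penalty(Q_affine, "disable_off_switch")
assert abs(p1 - p2) < 1e-9  # normalized penalty unchanged
```

This is why the mild-action denominator restores the invariance that Proposition 4 requires.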

Then that minimum does not make a good denominator because it's always extremely small. It will pick ϕ to be as powerful as possible to make L small, aka set ϕ to ⊥. (If the denominator before that version is defined at all, ⊥ is a propositional tautology given A.)


Oh, I see what the issue is. Propositional tautology given A means A ⊢_pc ϕ
(provable from A by propositional calculus alone), not A ⊢ ϕ. So yeah, when A is
a boolean that is equivalent to ⊥ via boolean logic alone,
we can't use that A for the exact reason you said, but if A isn't equivalent to
⊥ via boolean logic alone (although it may be possible to infer ⊥ by other
means), then the denominator isn't necessarily small.

a magma [with] some distinguished element

A monoid?

min_ϕ(A, ϕ ⊢ ⊥) where ϕ is a propositional tautology given A

Propositional tautology given A means A⊢ϕ, right? So ϕ=⊥ would make L small.


Yup, a monoid, because ϕ∨⊥=ϕ and A∪∅=A, so it acts as an identity element, and
we don't care about the order. Nice catch.
You're also correct about what propositional tautology given A means.

If it is capable of becoming more able to maximize its utility function, does it then not already have that ability to maximize its utility function? Do you propose that we reward it only for those plans that pay off after only one "action"?


Not quite. I'm proposing penalizing it for gaining power, a la my recent post
[https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/6DuJxY8X45Sco4bS2]. There's a
big difference between "able to get 10 return from my current vantage point" and
"I've taken over the planet and can ensure I get 100 return with high
probability". We're penalizing it for increasing its ability like that
(concretely, see Conservative Agency [https://arxiv.org/abs/1902.09725] for an
analogous formalization, or if none of this makes sense still, wait till the end
of Reframing Impact).

What do you mean by equivalent? The entire history doesn't say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.


For a fixed policy, the history is the only thing you need to know in order to
simulate the agent on a given round. In this sense, seeing the history is
equivalent to seeing the source code.
The claim is: In settings where the agent has unlimited memory and sees the
entire history or source code, you can't get good guarantees (as in the folk
theorem for repeated games). On the other hand, in settings where the agent sees
part of the history, or is constrained to have finite memory (possibly of size
O(log(1/(1−γ)))?), you can (maybe?) prove convergence to Pareto efficient outcomes or
some other strong desideratum that deserves to be called "superrationality".

The bottom left picture on page 21 in the paper shows that this is not just regularization coming through only after the error on the training set is ironed out: zero regularization (1/λ = ∞) still shows the effect.

Can we switch to the interpolation regime early if, before reaching the peak, we tell it to keep the loss constant? Aka we are at loss l∗ and replace the loss function l(θ) with |l(θ)−l∗| or (l(θ)−l∗)².


Interesting! Given that stochastic gradient descent (SGD) does provide an
inductive bias towards models that generalize better, it does seem like changing
the loss function in this way could enhance generalization performance. Broadly
speaking, SGD's bias only provides a benefit when it is searching over many
possible models: it performs badly at the interpolation threshold because the
lowish complexity limits convergence to a small number of overfitted models.
Creating a loss function that allows SGD freer rein over the model it selects
could therefore improve generalization.
If
#1 SGD is inductively biased to more generalizable models in general
#2 an (l(θ)−l∗)² loss-function gives all models with loss near l∗ a wider local
minimum
#3 there are many different models where l(θ)≈l∗ at a given level of complexity
as long as l∗>0
then it's plausible that changing the loss-function in this way will help
emphasize SGD's bias towards models that generalize better. Point #1 is an
explanation for double-descent. Point #2 seems intuitive to me (it makes the
loss-function more convex and flatter when models are better performing) and
Point #3 does too: there are many different sets of prediction that will all
partially fit the training-dataset and yield the same loss function value of l∗,
which implies that there are also many different predictive models that yield
such a loss function.
To illustrate point #3 above, imagine we're trying to fit the set of training
observations {→x1,→x2,→x3,...,→xi,...,→xn}. Fully overfitting this set (getting
l(θ)≈0) requires us to get all →xi from 1 to n correct. However, we can
partially overfit this set (getting l(θ)=l∗) in a variety of different ways. For
instance, if we get all →xi correct except for →xj, we may have roughly n
different ways we can pick →xj that could yield the same l(θ).[1] Consequently,
our stochastic gradient descent process is free to apply its inductive bias to a
broad set of models that have similar performance.
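Point #2 can be made concrete with a one-dimensional toy (all names and numbers are mine): replacing l(θ) with (l(θ)−l∗)² makes every parameter value on the level set l(θ)=l∗ a stationary point, so a whole family of equally-performing models becomes a minimum rather than a single point.

```python
# Toy: original loss l(theta) = theta^2 (unique minimum at theta = 0).
# Modified loss m(theta) = (l(theta) - l_star)^2 is minimized on the
# whole level set {theta : l(theta) = l_star}, not at a single point.

l_star = 1.0

def l(theta):
    return theta ** 2

def m(theta):
    return (l(theta) - l_star) ** 2

def dm(theta, eps=1e-6):
    # numerical derivative of the modified loss
    return (m(theta + eps) - m(theta - eps)) / (2 * eps)

# Both theta = 1 and theta = -1 satisfy l(theta) = l_star, and both
# are minima of m with (numerically) vanishing gradient.
assert abs(m(1.0)) < 1e-12 and abs(m(-1.0)) < 1e-12
assert abs(dm(1.0)) < 1e-5 and abs(dm(-1.0)) < 1e-5
```

With more parameters than constraints, that level set is typically a large manifold, which is where SGD's inductive bias would get to choose.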

You assume that one oracle outputting null implies that the other knows this. Specifying this in the query requires that the querier models the other oracle at all.


Each oracle is running a simulation of the world. Within that simulation, they
search for any computational process with the same logical structure as
themselves. This will find both their virtual model of their own hardware, as
well as any other agenty processes trying to predict them. The oracle then
deletes the output of all these processes within its simulation.
Imagine running a super realistic simulation of everything, except that any time
anything in the simulation tries to compute the millionth digit of pi, you
notice, pause the simulation and edit it to make the result come out as 7. While
it might be hard to formally specify what counts as a computation, I think that
this intuitively seems like meaningful behavior. I would expect the simulation
to contain maths books that said that the millionth digit of pi was 7, and that
were correspondingly off by one about how many 7s were in the first n digits for
any n>1000000.
The principle here is the same.
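The pi-interception idea can be toy-modeled by wrapping the computation (the function names are mine, and the digit routine is a stand-in stub; formally specifying "what counts as a computation" is the hard part the text acknowledges):

```python
# Toy "simulation edit": any process in the simulation that asks for
# digit 1,000,000 of pi gets the edited answer 7; all other queries
# pass through unmodified.

def digit_of_pi(n):
    # Stand-in stub: only knows the first few digits 3.14159...
    # (a real spigot algorithm would go here).
    return int("314159"[n]) if n < 6 else None

def edited_digit_of_pi(n):
    if n == 1_000_000:
        return 7  # the simulation's edit
    return digit_of_pi(n)

assert edited_digit_of_pi(1_000_000) == 7
assert edited_digit_of_pi(1) == 1  # unedited queries are unaffected
```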

Our usual objective is "Make it safe, and if we aligned it correctly make it useful.". A microscope is useful even if it's not aligned, because having a world model is a convergent instrumental goal. We increase the bandwidth from it to us, but we decrease the bandwidth from us to it. By telling it almost nothing, we hide our position in the mathematical universe and any attack it devises cannot be specialized on humanity. Imagine finding the shortest-to-specify abstract game that needs AGI to solve (Nomic?), then instantiating an AGI to solve it just to l

...As I understood it, an Oracle AI is asked a question and produces an answer. A microscope is shown a situation and constructs an internal model that we then extract by reading its innards. Oracles must somehow be incentivized to give useful answers, microscopes cannot help but understand.


A microscope model must also be trained somehow, for example with unsupervised
learning. Therefore, I expect such a model to also look like it's "incentivized
to give useful answers" (e.g. an answer to the question: "what is the next word
in the text?").
My understanding is that what distinguishes a microscope model is the way it is
being used after it's already trained (namely, allowing researchers to look at
its internals for the purpose of gaining insights etcetera, rather than making
inferences for the sake of using its valuable output). If this is correct, it
seems that we should only use safe training procedures for the purpose of
training useful microscopes, rather than training arbitrarily capable models.

I started asking for a chess example because you implied that the reasoning in the top-level comment stops being sane in iterated games.

In a simple iteration of Troll bridge, whether we're dumb is clear after the first time we cross the bridge. In a simple variation, the troll requires smartness even given past observations. In either case, the best worst-case utility bound requires never to cross the bridge, and A knows crossing blows A up. You seemed to expect more.

Suppose my chess skill varies by day. If my last few moves were dumb, I shouldn't rely on

...

Right, OK. I would say "sequential" rather than "iterated" -- my point was about
making a weird assessment of your own future behavior, not what you can do if
you face the same scenario repeatedly. IE: Troll Bridge might be seen as
artificial in that the environment is explicitly designed to punish you if
you're "dumb"; but, perhaps a sequential game can punish you more naturally by
virtue of poor future choices.
Yep, I agree with this.
I concede the following points:
* If there is a mistake in the troll-bridge reasoning, predicting that your
next actions are likely to be dumb conditional on a dumb-looking action is
not an example of the mistake.
* Furthermore, that inference makes perfect sense, and if it is as analogous to
the troll-bridge reasoning as I was previously suggesting, the troll-bridge
reasoning makes sense.
However, I still assert the following:
* Predicting that your next actions are likely to be dumb conditional on a dumb
looking action doesn't make sense if the very reason why you think the action
looks dumb is that the next actions are probably dumb if you take it.
IE, you don't have a prior heuristic judgement that a move is one which you make
when you're dumb; rather, you've circularly concluded that the move would be
dumb -- because it's likely to lead to a bad outcome -- because if you take that
move your subsequent moves are likely to be bad -- because it is a dumb move.
I don't have a natural setup which would lead to this, but the point is that
it's a crazy way to reason rather than a natural one.
The question, then, is whether the troll-bridge reasoning is analogous to this.
I think we should probably focus on the probabilistic case (recently added to
the OP), rather than the proof-based agent. I could see myself deciding that the
proof-based agent is more analogous to the sane case than the crazy one. But the
probabilistic case seems completely wrong.
In the proof-based case, the question is: do we see th

If I'm a poor enough player that I merely have evidence, not proof, that the queen move mates in four, then the heuristic that queen sacrifices usually don't work out is fine and I might use it in real life. If I can prove that queen sacrifices don't work out, the reasoning is fine even for a proof-requiring agent. Can you give a chesslike game where some proof-requiring agent can prove from the rules and perhaps the player source codes that queen sacrifices don't work out, and therefore scores worse than some other agent would have? (Perhaps through mechanisms as in Troll bridge.)


The heuristic can override mere evidence, agreed. The problem I'm pointing at
isn't that the heuristic is fundamentally bad and shouldn't be used, but rather
that it shouldn't circularly reinforce its own conclusion by counting a
hypothesized move as differentially suggesting you're a bad player in the
hypothetical where you make that move. Thinking that way seems contrary to the
spirit of the hypothetical (whose purpose is to help evaluate the move). It's
fine for the heuristic to suggest things are bad in that hypothetical (because
you heuristically think the move is bad); it seems much more questionable to
suppose that your subsequent moves will be worse in that hypothetical,
particularly if that inference is a lynchpin of your overall negative assessment
of the move.
What do you want out of the chess-like example? Is it enough for me to say the
troll could be the other player, and the bridge could be a strategy which you
want to employ? (The other player defeats the strategy if they think you did it
for a dumb reason, and they let it work if they think you did it smartly, and
they know you well, but you don't know whether they think you're dumb, but you
do know that if you were being dumb then you would use the strategy.) This can
be exactly Troll Bridge as stated in the post, but set in chess with player
source code visible.
I'm guessing that's not what you want, but I'm not sure what you want.

(Maybe this doesn't answer your question?)

Correct. I am trying to pin down exactly what you mean by an agent controlling a logical statement. To that end, I ask whether an agent that takes an action iff a statement is true controls the statement through choosing whether to take the action. ("The Killing Curse doesn't crack your soul. It just takes a cracked soul to cast.")

Perhaps we could equip logic with a "causation" preorder such that all tautologies are equivalent, causation implies implication, and whenever we define an agent, we equip its control

...

The point here is that the agent described is acting like EDT is supposed to --
it is checking whether its action implies X. If yes, it is acting as if it
controls X in the sense that it is deciding which action to take using those
implications. I'm not arguing at all that we should think "implies X" is causal,
nor even that the agent has opinions on the matter; only that the agent seems to
be doing something wrong, and one way of analyzing what it is doing wrong is to
take a CDT stance and say "the agent is behaving as if it controls X" -- in the
same way that CDT says to EDT "you are behaving as if correlation implies
causation" even though EDT would not assent to this interpretation of its
decision.
I think you have me the wrong way around; I was suggesting that certain
causally-backwards reasoning would be unwise in chess, not the reverse. In
particular, I was suggesting that we should not judge a move poor because we
think the move is something only a poor player would do, but always the other
way around. For example, suppose we have a prior on moves which suggests that
moving a queen into danger is something only a poor player would do. Further
suppose we are in a position to move our queen into danger in a way which forces
checkmate in 4 moves. Suppose we reason: "I could move my queen into
danger to open up a path to checkmate in 4. However, only poor players move
their queen into danger. Poor players would not successfully navigate a
checkmate-in-4. Therefore, if I move my queen into danger, I expect to make a
mistake costing me the checkmate in 4. Therefore, I will not move my queen into
danger." That's an example of the mistake I was pointing at.
Note: I do not personally endorse this as an argument for CDT! I am expressing
these arguments because it is part of the significance of Troll Bridge. I think
these arguments are the kinds of things one should grapple with if one is
grappling with Troll Bridge. I have defended EDT from these kinds of

Troll Bridge is a rare case where agents that require proof to take action can prove they would be insane to take some action before they've thought through its consequences. Can you show how they could unwisely do this in chess, or some sort of Troll Chess?

I don't see how this agent seems to control his sanity. Does the agent who jumps off a roof iff he can (falsely) prove it wise choose whether he's insane by choosing whether he jumps?

I don't see how this agent seems to control his sanity.

The agent in Troll Bridge thinks that it can make itself insane by crossing the bridge. (Maybe this doesn't answer your question?)

Troll Bridge is a rare case where agents that require proof to take action can prove they would be insane to take some action before they've thought through its consequences. Can you show how they could unwisely do this in chess, or some sort of Troll Chess?

I make no claim that this sort of case is common. Scenarios where it comes up and is relevant to X-risk ...

Nothing stops the Halting problem being solved in particular instances. I can prove that some agent halts, and so can it. See FairBot in Robust Cooperation in the Prisoner's Dilemma.
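A bounded version of this is easy to exhibit. Below is a toy sketch in the spirit of FairBot (my names throughout), substituting bounded simulation for the paper's proof search; the depth-0 default to "C" crudely stands in for the Löbian step that lets the paper's proof-based FairBots cooperate with each other:

```python
# Toy FairBot for the one-shot Prisoner's Dilemma with source access:
# agents are functions opponent -> "C" or "D".

def fairbot(opponent, depth=3):
    # Cooperate iff a bounded simulation of the opponent playing
    # against (a shallower copy of) ourselves yields cooperation.
    if depth == 0:
        # Out of search budget: default to "C". This stands in for the
        # Löbian step; the real proof-based FairBot instead defects
        # when no proof of the opponent's cooperation is found.
        return "C"
    me = lambda opp: fairbot(opp, depth - 1)
    return "C" if opponent(me) == "C" else "D"

def cooperatebot(opponent):
    return "C"

def defectbot(opponent):
    return "D"

assert fairbot(cooperatebot) == "C"  # rewards cooperators
assert fairbot(defectbot) == "D"     # punishes defectors
assert fairbot(fairbot) == "C"       # mutual cooperation with itself
```

The recursion terminates because each simulated copy carries a strictly smaller depth, which is the bounded analogue of proving halting "in particular instances".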

Written slightly differently, the reasoning seems sane: Suppose I cross. I must have proven it's a good idea. Aka I proved that I'm consistent. Aka I'm inconsistent. Aka the bridge blows up. Better not cross.

I agree with your English characterization, and I also agree that it isn't really obvious that the reasoning is pathological. However, I don't think it is so obviously sane, either.

- It seems like counterfactual reasoning about alternative actions should avoid going through "I'm obviously insane" in almost every case; possibly in every case. If you think about what would happen if you made a particular chess move, you need to divorce the consequences from any "I'm obviously insane in that scenario, so the rest of my moves i

Conjecture: Every short proof of agentic behavior points out agentic architecture.

Aren't they just averaging together to yield yet another somewhat-but-not-quite-right function?

Indeed we don't want such linear behavior. The AI should preserve the potential for maximization of any candidate utility function - first so it has time to acquire all the environment's evidence about the utility function, and then for the hypothetical future scenario of us deciding to shut it off.


See this comment.
[https://www.lesswrong.com/posts/YJq6R9Wgk5Atjx54D/does-bayes-beat-goodhart#cxBtJQZPN2szwdd6F]
Stuart and I are discussing what happens after things have converged as much as
they're going to
[https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy#fhpmnzMqLxQsiE7CW]
, but there's still uncertainty left.

How do you know MCTS doesn't preserve alignment?


As I understand it, MCTS is used to maximize a given computable utility
function, and so it is not alignment-preserving in the general sense that
sufficiently strong optimization of an imperfect utility function is not
alignment-preserving.

So you want to align the AI with us rather than its user by choosing the alignment approach it uses. If it's corrigible towards its user, won't it acquire the capabilities of the other approach in short order to better serve its user? Or is retrofitting the other approach also a blind spot of your proposed approach?


Yes, that seems like an issue.
That's one possible solution. Another one might be to create an aligned AI that
is especially good at coordinating with other AIs, so that these AIs can make an
agreement with each other to not develop nuclear weapons before they invent the
AI that is especially good at developing nuclear weapons. (But would
corrigibility imply that the user can always override such agreements?) There
may be other solutions that I'm not thinking of. If all else fails, it may be
that the only way to avoid AI-caused differential intellectual progress in a bad
direction is to stop the development of AI.

Reading the link and some reference abstracts, I think my last comment already had that in mind. The idea here is that a certain kind of AI would accelerate a certain kind of progress more than another, because of the approach we used to align it, and on reflection we would not want this. But surely if it is aligned, and therefore corrigible, this should be no problem?


Here's a toy example that might make the idea clearer. Suppose we lived in a
world that hasn't invented nuclear weapons yet, and someone creates an aligned
AI that is really good at developing nuclear weapon technology and only a little
bit better than humans on everything else. Even though everyone would prefer
that nobody develops nuclear weapons, the invention of this aligned AI (if more
than one nation had access to it, and "aligned" means aligned to the user) would
accelerate the development of nuclear weapons relative to every other kind of
intellectual progress and thereby reduce the expected value of the universe.
Does that make more sense now?

Please reword your last idea. There is a possible aligned AI that is biased in its research and will ignore people telling it so?


I think that section will only make sense if you're familiar with the concept of
differential intellectual progress. The wiki page I linked to is a bit outdated,
so try https://concepts.effectivealtruism.org/concepts/differential-progress/
and its references instead.

A threshold of 1.5 does conjunction without my dummy token.

When at most one clause is true, a threshold of 0.5 does disjunction instead.
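The threshold picture can be checked with a tiny low-temperature softmax (my toy numbers and function names; note the reference position pinned at the threshold is the scheme under discussion here, not Tracr's fixed 0.5 BOS):

```python
import math

def softmax(scores, temp):
    exps = [math.exp(s / temp) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gate(token_scores, threshold, temp=0.01):
    # Position 0 is an always-present reference position pinned at the
    # threshold; the gate "fires" iff some token outscores it, which at
    # low temperature the softmax resolves almost exactly.
    weights = softmax([threshold] + token_scores, temp)
    return weights[0] < 0.5  # reference loses iff some score > threshold

# Each token's score = number of satisfied conjuncts, in {0, 1, 2}.
assert gate([2], threshold=1.5) is True    # AND: both conjuncts true
assert gate([1], threshold=1.5) is False   # AND: only one true
assert gate([1], threshold=0.5) is True    # OR: at least one true
assert gate([0], threshold=0.5) is False   # OR: none true
```

This is the sense in which inputs taking values in a small discrete set make the softmax outputs effectively two-valued.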