MMath Cambridge. Currently studying postgrad at Edinburgh.
If AI labs are slamming ahead on recursive self-improvement ASAP, it may be that Autonomous Replicating Agents are irrelevant. But that's an "ARA can't destroy the world if AI labs do it first" argument.
ARA may well have more compute than AI labs. Especially if the AI labs are trying to stay within the law, and the ARA is stealing any money/compute that it can hack its way into. (Which could be >90% of the internet, if it's good at hacking.)
there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement).
Ok. That's a world model in which humans are being INCREDIBLY stupid.
If we want to actually win, we need to both be careful about deploying those other misaligned models, and stop ARA.
Alice: That snake bite looks pretty nasty, it could kill you if you don't get it treated.
Bob: That snake bite won't kill me, this hand grenade will. Pulls out pin.
If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.
I disagree. If your simulation is perfectly realistic, the simulated humans might screw up at alignment and create an unfriendly superintelligence, for much the same reason real humans might.
Also, if the space of goals that evolution + culture can produce is large, then you may be handing control to a mind with rather different goals. Rerolling the same dice won't give the same answer.
These problems may be solvable, depending on what the capabilities here are, but they aren't trivial.
Taking IID samples can actually be hard. Suppose you train an LLM on news articles, and each important real-world event has 10 basically identical news articles written about it. Then a random split of the articles will leave the network being tested mostly on the same newsworthy events that were in the training data.
This lets it pass the test even if it's hopeless at predicting new events and can only generate more articles about events it has already seen.
When data duplication is extensive, making a meaningful train/test split is hard.
If the data were exact copy-paste duplicates, they could be filtered out. But often things are rephrased a bit.
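To make the fix concrete, here is a minimal sketch of a split that avoids the problem, assuming each article already carries some event_id tag (producing that tag, e.g. by clustering or near-duplicate detection, is the hard part and is simply assumed here):

import random

def split_by_event(articles, test_fraction=0.2, seed=0):
    # Split so that every article about a given event lands on the same side.
    # "articles" is assumed to be a list of dicts with an "event_id" field;
    # a plain random split over articles would leak near-duplicates into test.
    events = sorted({a["event_id"] for a in articles})
    random.Random(seed).shuffle(events)
    test_events = set(events[:max(1, int(len(events) * test_fraction))])
    train = [a for a in articles if a["event_id"] not in test_events]
    test = [a for a in articles if a["event_id"] in test_events]
    return train, test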
Suppose you're looking at an AI that is currently placed in a game of chess.
It has a variety of behaviours. It moves pawns forward in some circumstances. It takes a knight with a bishop in a different circumstance.
You could describe the actions of this AI by producing a giant table of "behaviours". Bishop taking behaviours in this circumstance. Castling behaviour in that circumstance. ...
But there is a more compact way to represent similar predictions. You can say it's trying to win at chess.
The "trying to win at chess" model makes a bunch of predictions that the giant list of behaviour model doesn't.
Suppose you have never seen it promote a pawn to a knight before. (A highly distinctive move that is only occasionally legal, and only occasionally a good one.)
The list-of-behaviours model has no reason to suspect the AI also has a "promote pawn to knight" behaviour.
Put the AI in a circumstance where such a promotion is a good move, and the "trying to win" model predicts it clearly.
Now it's possible to construct a model that internally stores a huge list of behaviours. For example, a giant lookup table trained on an unphysically huge number of human chess games.
But neural networks have at least some tendency to pick up simple general patterns, as opposed to memorizing giant lists of data. And "do whichever move will win" is a simple and general pattern.
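Here is a rough sketch of the two models of the same AI. The chess mechanics (legal_moves, result_of, evaluate) are placeholder functions, not a real engine:

def lookup_table_policy(table, position):
    # Giant list of behaviours: return the memorized move for this exact
    # position, and nothing at all if the position was never seen.
    return table.get(position)

def trying_to_win_policy(position, legal_moves, result_of, evaluate):
    # "Trying to win": score every legal move by evaluating the resulting
    # position and pick the best one. This predicts moves the table never
    # contained (e.g. promoting a pawn to a knight) whenever they score well.
    return max(legal_moves(position),
               key=lambda move: evaluate(result_of(position, move)))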
Now on to making snarky remarks about the arguments in this post.
There is no true underlying goal that an AI has— rather, the AI simply learns a bunch of contextually-activated heuristics, and humans may or may not decide to interpret the AI as having a goal that compactly explains its behavior.
There is no true ontologically fundamental nuclear explosion. There is no minimum number of nuclei that need to fission to make an explosion. Instead there is merely a large number of highly energetic neutrons and fissioning uranium atoms, that humans may decide to interpret as an explosion or not as they see fit.
Non-fundamental descriptions of reality, while not being perfect everywhere, are often pretty spot on for a pretty wide variety of situations. If you want to break down the notion of goals into contextually activated heuristics, you need to understand how and why those heuristics might form a goal-like shape.
Should we actually expect SGD to produce AIs with a separate goal slot and goal-achieving engine?
Not really, no. As a matter of empirical fact, it is generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules. As Beren Millidge writes,
This is not the strong evidence that you seem to think it is. Any efficient mind design is going to have the capability of simulating potential futures at multiple different levels of resolution: a low-res simulation to weed out obviously dumb plans before trying the higher-res simulation. Those simulations are ideally going to want to share data with each other (so you don't need to recompute when faced with several similar dumb plans). You want to be able to backpropagate through your simulation: if a plan failed in simulation because of one tiny detail, that indicates you may be able to fix the plan by changing that detail. There are a whole pile of optimization tricks like this.

An end-to-end trained network can, if it's implementing goal-directed behaviour, stumble into some of these tricks. At the very least, it can choose where to focus its compute. A module-based system can't use any optimization that humans didn't design into its interfaces.
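As a toy illustration of the first of those tricks, a plan search that filters with a cheap low-resolution score before spending the expensive simulation budget (both scoring functions are placeholders):

def pick_plan(candidate_plans, cheap_score, expensive_simulate, top_k=10):
    # Weed out obviously dumb plans with the low-res score, then run the
    # expensive high-res simulation only on the survivors.
    shortlisted = sorted(candidate_plans, key=cheap_score, reverse=True)[:top_k]
    return max(shortlisted, key=expensive_simulate)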
Also, the evolution analogy: evolution produced animals with simple hard-coded behaviours long before it started getting to the more goal-directed animals. This suggests simple hard-coded behaviours in small dumb networks, and more goal-directed behaviour in large networks. I mean, this is kind of trivial: a 5-parameter network has no space for goal-directedness. Simple dumb behaviour is the only possibility for toy models.
In general, full [separation between goal and goal-achieving engine] and the resulting full flexibility is expensive. It requires you to keep around and learn information (at maximum all information) that is not relevant for the current goal but could be relevant for some possible goal where there is an extremely wide space of all possible goals.
That is not how this works. That is not how any of this works.
Back to our chess AI. Let's say it's a robot playing on a physical board. It gets lots of info about the wood grain, which it promptly discards. It currently wants to play chess, and so has no interest in any of those other possible goals.
I mean, it would be possible to design an agent that works as described here. You would need a probability distribution over new goals, and a tradeoff rate between optimizing the current goal and whatever new goal got put in the slot. Making sure it didn't wirehead by giving itself a really easy goal would be tricky.
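A minimal sketch of what that design would need; all the names and the scoring rule are made up for illustration:

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class GoalSlotAgent:
    current_goal: Callable[[object], float]               # utility of an outcome under the current goal
    possible_goals: Dict[str, Callable[[object], float]]  # goals that might get put in the slot
    goal_distribution: Dict[str, float]                    # probability of each of those goals
    tradeoff: float                                        # weight on possible future goals vs the current one

    def score(self, outcome) -> float:
        # Optimize the current goal plus a weighted average over goals that
        # might land in the slot later. Nothing here stops the agent steering
        # towards worlds where it gets handed a really easy goal (the
        # wireheading worry above).
        expected_future = sum(p * self.possible_goals[name](outcome)
                              for name, p in self.goal_distribution.items())
        return self.current_goal(outcome) + self.tradeoff * expected_future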
For AI risk arguments to hold water, we only need that the chess-playing AI will pursue new, never-seen-before strategies for winning at chess. And that, in general, AIs doing various tasks will be able to invent highly effective and novel strategies. The exact "goal" they are pursuing may not be rigorously specified to 10 decimal places. The frog-AI might not know whether it wants to catch flies or black dots. But if it builds a Dyson sphere to make more flies which are also black dots, it doesn't matter to us which it "really wants".
What are you expecting? An AI that says "I'm not really sure whether I want flies or black dots. I'll just sit here not taking over the world and not get either of those things"?
We can salvage a counting argument. But it needs to be a little subtle. And it's all about the comments, not the code.
Suppose a neural network has 1 megabyte of memory. To slightly oversimplify, let's say it can represent a python file of 1 megabyte.
One option is for the network to store a giant lookup table. Let's say the network needs half a megabyte to store the training data in this table. This leaves the other half free to be any rubbish. Hence around 2^4,000,000 possible networks.
The other option is for the network to implement a simple algorithm, using up only 1kb. Then the remaining 999kb can be used for gibberish comments. This gives around 2^7,992,000 possible networks. Which is a lot more.
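The arithmetic behind those counts, taking 1 megabyte = 1000 kB and 1 kB = 8000 bits for round numbers:

BITS_PER_KB = 8_000
total_kb = 1_000

free_bits_lookup = (total_kb - 500) * BITS_PER_KB  # half a megabyte of rubbish = 4,000,000 bits
free_bits_algo = (total_kb - 1) * BITS_PER_KB      # 999 kB of comments = 7,992,000 bits

# Count of distinct 1 MB "python files" of each kind:
print(f"lookup-table networks:     2^{free_bits_lookup}")
print(f"simple-algorithm networks: 2^{free_bits_algo}")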
The comments can be any form of data that doesn't show up during training. Whether it can show up in other circumstances or is a pure comment doesn't matter to the training dynamics.
If the line between training and test is simple, there isn't a strong counting argument against nonsense showing up in test.
But programs that go
if in_training():
    return sensible_algorithm()
else:
    return "random nonsense goes here"
have to pay the extra cost of an "in_training" function that returns true in training. If the test data is similar to the training data, the cost of a step that returns true in training but false in test can be large. This is assuming that there is a unique sensible algorithm.
Yes. I was assuming a standard conditional for the button.
I can't currently see any flaws with the CDT style. Other than the following:
Suppose these 3 scenarios are judged as equally likely.
A) Button pressed. U_1=U_2=0
B) Button not pressed. U_1=U_2=0
C) AI gets control of button. U_1=6, U_2=-6
Suppose both agents agree that in option C, they should press the button. (this is the only 1 bit of influence they have in this toy problem)
Then agent 1 has a "certainty" of B. E(U_1|not pressed)=E(U_1|B)=0
Agent 2 has remaining uncertainty: E(U_2|pressed) = 1/2 (E(U_2|A) + E(U_2|C)) = 1/2 (0 + (-6)) = -3
Change it around, making the AI decide not to press the button in case C, and
E(U_1|not pressed)=3, E(U_2|pressed)=0.
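Spelling out the same calculation (the three scenarios are equally likely, and the only choice is whether the AI presses the button in scenario C):

def conditional_expectations(press_in_C):
    # (button pressed?, U_1, U_2) for the three equally likely scenarios
    scenarios = [(True, 0, 0),           # A: button pressed
                 (False, 0, 0),          # B: button not pressed
                 (press_in_C, 6, -6)]    # C: AI gets control of the button
    u1_not_pressed = [u1 for pressed, u1, _ in scenarios if not pressed]
    u2_pressed = [u2 for pressed, _, u2 in scenarios if pressed]
    return (sum(u1_not_pressed) / len(u1_not_pressed),
            sum(u2_pressed) / len(u2_pressed))

print(conditional_expectations(press_in_C=True))   # E(U_1|not pressed)=0, E(U_2|pressed)=-3
print(conditional_expectations(press_in_C=False))  # E(U_1|not pressed)=3, E(U_2|pressed)=0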
Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state.
I disagree.
Suppose there is a paperclip lottery: if the AI wins, it gets loads of clips, and it becomes harder to shut down.
The AI has been entered by someone else; not playing isn't an option.
Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.
To the paperclip AI, this is a good deal. The conditional expectation acts as a sort of average over all scenarios where the button isn't pressed, and this scenario has more clips than most, so it increases the average.
For the shutdown AI, this is an unusually bad scenario, and taking it out of their expectation also increases conditional expected utility.
So both subagents agree on a plan that conditionally manipulates the button.
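With some made-up numbers (a 10% chance of winning, the button otherwise pressed half the time, winning worth a lot of clips and unusually bad for the shutdown utility), both conditional expectations go up when the plan is adopted:

P_WIN = 0.1
U1 = {"win": 10, "lose": 1}    # paperclip utility: winning means lots of clips
U2 = {"win": -10, "lose": 0}   # shutdown utility: a hard-to-shut-down AI is unusually bad

def conditional_utilities(p_press_given_win):
    # Enumerate (probability, lottery outcome, button pressed?) worlds.
    worlds = []
    for outcome, p_outcome in (("win", P_WIN), ("lose", 1 - P_WIN)):
        p_press = p_press_given_win if outcome == "win" else 0.5
        worlds.append((p_outcome * p_press, outcome, True))
        worlds.append((p_outcome * (1 - p_press), outcome, False))

    def cond_e(utility, want_pressed):
        sel = [(p, o) for p, o, pressed in worlds if pressed == want_pressed]
        return sum(p * utility[o] for p, o in sel) / sum(p for p, _ in sel)

    return cond_e(U1, want_pressed=False), cond_e(U2, want_pressed=True)

print(conditional_utilities(p_press_given_win=0.5))  # default: E(U_1|not pressed)=1.9, E(U_2|pressed)=-1.0
print(conditional_utilities(p_press_given_win=0.0))  # the plan: E(U_1|not pressed)≈2.64, E(U_2|pressed)=0.0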
Take this set. It contains exponentially many points. Is there any function such that all exponentially many XOR combos can be found by a linear probe?
This is a question of pure maths; it involves no neural networks. And I think it would be highly informative.
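A brute-force check at tiny n is at least suggestive (not a proof either way): given a candidate embedding of {0,1}^n, test every XOR of a subset of bits for linear separability. Perfect training accuracy of a linear classifier stands in, heuristically, for "findable by a linear probe"; the interesting version of the question is presumably about embeddings of dimension much smaller than 2^n.

import itertools
import numpy as np
from sklearn.svm import LinearSVC

n = 4
points = np.array(list(itertools.product([0, 1], repeat=n)))

def all_xors_linearly_separable(embed):
    X = np.array([embed(p) for p in points])
    for size in range(1, n + 1):
        for subset in itertools.combinations(range(n), size):
            y = points[:, list(subset)].sum(axis=1) % 2   # XOR of those bits
            clf = LinearSVC(C=1e6, max_iter=100_000).fit(X, y)
            if clf.score(X, y) < 1.0:                     # heuristic separability check
                return False
    return True

def one_hot(p):
    # One coordinate per point: dimension 2^n, so everything is separable.
    return np.eye(2 ** n)[int("".join(str(int(b)) for b in p), 2)]

print(all_xors_linearly_separable(lambda p: p))  # raw bits: False, parity isn't linearly separable
print(all_xors_linearly_separable(one_hot))      # one-hot: True, but at exponential dimension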
Physics Myths vs reality.
Myth: Ball bearings are perfect spheres.
Reality: The ball bearings have slight lumps and imperfections due to manufacturing processes.
Myth: Gravity pulls things straight down at 9.8 m/s/s.
Reality: Gravitational force varies depending on local geology.
You can do this for any topic. Everything is approximations. The only question is whether they are good approximations.