Currently studying postgrad at Edinburgh.
I don't think this research, if done, would give you strong information about the field of AI as a whole.
I think that, of the many topics researched by AI researchers, chess playing is far from the typical case.
It's [chess] not the most relevant domain to future AI, but it's one with an unusually long history and unusually clear (and consistent) performance metrics.
An unusually long history implies unusually slow progress. There are problems that computers couldn't do at all a few years ago that they can do fairly efficiently now. Are there problems where people basically figured out how to solve them decades ago and no significant progress has been made since?
The consistency of chess performance looks like more selection bias. You aren't choosing a problem domain where there was one huge breakthrough. You are choosing a problem domain that has had slow, consistent progress.
For most of the development of chess AI (all the way from alpha-beta pruning to AlphaZero), chess AIs improved through an accumulation of narrow, chess-specific tricks (and more compute): how to represent chess states in memory in a fast and efficient manner, better evaluation functions, tables of openings and endgames. Progress on chess AIs contained no breakthroughs, no fundamental insights, only a slow accumulation of little tricks.
There are cases of problems that we basically knew how to solve from the early days of computers, where any performance improvements are almost purely hardware improvements.
There are problems where one paper reduces the compute requirements by 20 orders of magnitude, or gets us from being unable to do X at all to being able to do X easily.
The pattern of which algorithms are considered AI and which are considered maths and which are considered just programming is somewhat arbitrary. A chess playing algorithm is AI, a prime factoring algorithm is maths, a sorting algorithm is programming or computer science. Why? Well those are the names of the academic departments that work on them.
You have a spectrum of possible reference classes for transformative AI that range from the almost purely software driven progress, to the almost totally hardware driven progress.
To gain more info about transformative AI, someone would have to make either a good case for why it should be at a particular position on the scale, or a good case for why its position on the scale should be similar to the position of some previous piece of past research. In the latter case, we can gain from examining the position of that research topic. If hypothetically that topic was chess, then the research you propose would be useful. If the reason you chose chess was purely that you thought it was easier to measure, then the results are likely useless.
Consider a game with any finite number of players and any finite number of actions per player.
Let O = A1 × A2 × ... be the set of possible outcomes.
Player i implements policy Pi : P(O) → Ai. For each outcome o ∈ O, each player searches for proofs (in PA) that the outcome is impossible. It then takes the set of outcomes it has proved impossible and maps that set to an action.
There is always a unique action that is chosen. What's more, given oracles for
I.e. the set of actions you might take if you can prove at least the impossibility results in U and possibly some others.
Given such an oracle Qi for each agent, there is an algorithm for their behaviour that outputs the fixed point in polynomial (in |O| ) time.
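As a rough illustration of the kind of fixed-point computation this could be, here is a Python sketch that grows a set of ruled-out outcomes until it stops changing. The oracle definition is an assumption reconstructed from the description above: Qi(U) is taken to be the set of actions Pi(V) over all V ⊇ U. The names make_oracle and fixed_point, the brute-force (exponential) stand-in for the oracle, and the two-player toy game are all hypothetical and only there to make the sketch runnable. With real oracles, the loop in fixed_point runs at most |O| + 1 times.

```python
from itertools import combinations, product


def supersets(base, universe):
    """Yield every frozenset V with base ⊆ V ⊆ universe."""
    rest = [o for o in universe if o not in base]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            yield frozenset(base) | frozenset(extra)


def make_oracle(policy, universe):
    """Brute-force stand-in for the assumed oracle Q_i(U) = {P_i(V) : U ⊆ V}."""
    def oracle(ruled_out):
        return {policy(v) for v in supersets(ruled_out, universe)}
    return oracle


def fixed_point(action_sets, oracles):
    """Grow the set of ruled-out outcomes until no new outcome can be excluded."""
    outcomes = list(product(*action_sets))
    ruled_out = frozenset()
    while True:
        allowed = [q(ruled_out) for q in oracles]
        # An outcome is impossible if some player could never pick their part of it.
        newly = frozenset(
            o for o in outcomes
            if any(o[i] not in allowed[i] for i in range(len(oracles)))
        )
        if newly <= ruled_out:
            return ruled_out
        ruled_out = ruled_out | newly


# Hypothetical toy game: player 0 picks from (a, b), player 1 from (x, y).
actions = [("a", "b"), ("x", "y")]
all_outcomes = list(product(*actions))


def policy_0(proved_impossible):
    return "a"  # ignores what it has proved


def policy_1(proved_impossible):
    # Plays y only once every outcome where player 0 plays b is ruled out.
    b_outcomes = {o for o in all_outcomes if o[0] == "b"}
    return "y" if b_outcomes <= proved_impossible else "x"


oracles = [make_oracle(policy_0, all_outcomes), make_oracle(policy_1, all_outcomes)]
ruled_out = fixed_point(actions, oracles)
chosen = (policy_0(ruled_out), policy_1(ruled_out))
print(sorted(ruled_out), chosen)
```

In this toy game the only outcome left standing is the one the players actually choose, which matches the uniqueness claim above, though the sketch is not a proof of it.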
The next task to fall to narrow AI is adversarial attacks against humans. Virulent memes and convincing ideologies become easy to generate on demand. A small number of people might see what is happening, and try to shield themselves off from dangerous ideas. They might even develop tools that auto-filter web content. Most of society becomes increasingly ideologized, with more decisions being made on political rather than practical grounds. Educational and research institutions become full of ideologues crowding out real research. There are some wars. The lines of division are between people and their neighbours, so the wars are small scale civil wars.
Researchers have been replaced with people parroting the party line. Society is struggling to produce chips of the same quality as before. Depending on how far along renewables are, there may be an energy crisis. Ideologies targeted at baseline humans are no longer as appealing. The people who first developed the ideology-generating AI didn't share it widely. The tech to AI-generate new ideologies is lost.
The clear scientific thinking needed for major breakthroughs has been lost. But people can still follow recipes and make rare minor technical improvements to some things. Gradually, ideological immunity develops. The beliefs are still crazy by a truth-tracking standard, but they are crazy beliefs that imply relatively non-detrimental actions. Many years of high, stagnant tech pass, until the culture is ready to re-embrace scientific thought.
I would be potentially concerned that this is a trick that evolution can use, but human AI designers can't use safely.
In particular, I think this is the sort of trick that produces usually fairly good results when you have a fixed environment, and can optimize the parameters and settings for that environment. Evolution can try millions of birds, tweaking the strengths of desire, to get something that kind of works. When the environment will be changing rapidly; when the relative capabilities of cognitive modules are highly uncertain and when self modification is on the table, these tricks will tend to fail. (I think)
Use the same brain architecture in a moderately different environment, and you get people freezing their credit card in blocks of ice so they can't spend it, and other self-defeating behaviour. I suspect the tricks will fail much worse with any change to mental architecture.
On your equivalence to an AI with an interpretability/oversight module: data shouldn't be flowing back from the oversight into the AI.
There are various ideas along the lines of "however much you tell the AI X it just forgets it". https://www.lesswrong.com/posts/BDXvRp8w9T8KkDw5A/policy-restrictions-and-secret-keeping-ai
I think that would be the direction to look in if you have a design that's safe as long as it doesn't know X.
There may be predictable errors in the training data, such that instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to errors).
If you are answering questions as text, there is a lot of choice in wording. There are many strings of text that are a correct answer, and the AI has to pick the one the human would use. In order to predict how a human would word an answer, you need a fairly good understanding of how they think (I think).
Maybe you did. I find it hard to distinguish inventing and half remembering ideas.
If the training procedure either
Then using different copies of GPT-n trained from different seeds doesn't help.
If you just convert 1% of the English into network yourself, then all it needs to use is some error correction. Even without that, neural net structures are somewhat redundant. A malicious optimizer with a random 1% removed is probably still a malicious optimizer.
Secondly, its goal here is to give us subtly wrong alignment papers. Suppose we produce an examination and transparency step so thorough that no mesa-optimizer can slip through. The original mesa-optimizer can just design a network that is doing its best as a mesa-optimizer-free predictor, but has subtly maliciously chosen beliefs about AI research.
I don't think that learning is moving around in codespace. In the simplest case, the AI is like any other non-self-modifying program. The code stays fixed as the programmers wrote it. The variables update. The AI doesn't start from null. The programmer starts from a blank text file, and adds code. Then they run the code. The AI can start with sophisticated behaviour the moment it's turned on.
So are we talking about a program that could change from an X-er to a Y-er with a small change in the code written, or with a small amount of extra observation of the world?
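A minimal Python sketch of what I mean by "the code stays fixed, the variables update". The RunningMean class is a made-up toy learner, not any particular AI design: it learns an average from observations, and the bytecode of its update rule is identical before and after learning.

```python
class RunningMean:
    """Toy learner (hypothetical): the update rule is fixed, only state changes."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def observe(self, x):
        # Incremental mean update; this code never rewrites itself.
        self.count += 1
        self.mean += (x - self.mean) / self.count


learner = RunningMean()
code_before = RunningMean.observe.__code__.co_code  # bytecode before learning

for x in (2.0, 4.0, 6.0):
    learner.observe(x)

code_after = RunningMean.observe.__code__.co_code  # bytecode after learning

print(learner.mean)               # the variables have changed
print(code_before == code_after)  # the code has not
```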
There seems to be some technical problem with the link. It gives me an "Our apologies, your invite link has now expired (actually several hours ago, but we hate to rush people).
We hope you had a really great time! :)" message. Edit: As of a few minutes after stated start time. It worked last week.
My picture of an X-and-only-X-er is that the actual program you run should optimize only for X. I wasn't considering similarity in code space at all.
Getting the lexicographically first formal ZFC proof of, say, the Collatz conjecture should be safe. Getting a random proof sampled from the set of all proofs < 1 terabyte long should be safe. But I think that there exist proofs that wouldn't be safe. There might be a valid proof of the conjecture that had the code for a paperclip maximizer encoded into the proof, and that exploited some flaw in computers or humans to bootstrap this code into existence. This is what I want to avoid.
Your picture might be coherent and formalizable into some different technical definition. But you would have to start talking about difference in codespace, which can differ depending on different programming languages.
The program if True: x() else: y() is very similar in codespace to if False: x() else: y() .
If code space is defined in terms of minimum edit distance, then layers of interpreters, error correction and homomorphic encryption can change it. This might be what you are after, I don't know.
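To make the edit distance point concrete, here is a small Python sketch that computes the Levenshtein distance between the two program texts from the example above. Treating "distance in codespace" as character-level Levenshtein distance is my assumption for illustration, and edit_distance is just a textbook dynamic-programming implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


# The two programs are treated purely as text here, not executed.
prog_x = "if True: x() else: y()"
prog_y = "if False: x() else: y()"

distance = edit_distance(prog_x, prog_y)
print(distance)  # a handful of character edits separates an x-runner from a y-runner
```

The two texts differ by only a few characters, yet one always runs x and the other always runs y, so closeness under this metric says little about closeness in behaviour.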