Send me anonymous feedback: https://docs.google.com/forms/d/e/1FAIpQLScLKiFJbQiuRYBhrBbVYUo_c6Xf0f8DN_blbfpJ-2Ml39g1zA/viewform
Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.
Some quick info about me:
I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).
You can also find me on the EA Forum.
Feel free to reach out by sending me a PM here or on my website.
you should never get deception in the limit of infinite data (since a deceptive model has to defect on some data point).
I think a model can be deceptively aligned even if formally it maps every possible input to the correct (safe) output. For example, suppose that on input X the inference execution hacks the computer on which the inference is being executed, in order to do arbitrary consequentialist stuff (while the inference logic, as a mathematical object, formally yields the correct output for X).
[Question about reinforcement learning]
What is the most impressive/large-scale published work in RL that you're aware of where—during training—the agent's environment is the real world (rather than a simulated environment)?
Let $A$ and $B$ be two optimization algorithms, each searching over some set of programs. Let $E$ be some evaluation metric over programs such that $E(p)$ is our evaluation of program $p$, for the purpose of comparing a program found by $A$ to a program found by $B$. For example, $E$ can be defined as a subjective impressiveness metric as judged by a human.
Intuitive definition: Suppose we plot a curve for each optimization algorithm such that the x-axis is the inference compute of a yielded program and the y-axis is our evaluation value of that program. If the curves of $A$ and $B$ are similar up to scaling along the x-axis, then we say that $A$ and $B$ are similarly-scaling w.r.t. inference compute, or SSIC for short.
Formal definition: Let $A$ and $B$ be optimization algorithms and let $E$ be an evaluation function over programs. Let us denote with $A(n)$ the program that $A$ finds when it uses $n$ flops (which would correspond to the training compute if $A$ is an ML algorithm). Let us denote with $C(p)$ the amount of compute that program $p$ uses. We say that $A$ and $B$ are SSIC with respect to $E$ if for any $n_1, n_2, m_1, m_2$ such that $C(A(n_1))/C(A(n_2)) \approx C(B(m_1))/C(B(m_2))$, if $E(A(n_1)) \approx E(B(m_1))$ then $E(A(n_2)) \approx E(B(m_2))$.
I think the report draft implicitly uses the assumption that human evolution and the first ML algorithm that will result in TAI are SSIC (with respect to a relevant $E$). It may be beneficial to discuss this assumption in the report. Clearly, not all pairs of optimization algorithms are SSIC (e.g. consider a pure random search + any optimization algorithm). Under what conditions should we expect a pair of optimization algorithms to be SSIC with respect to a given $E$?
Maybe that question should be investigated empirically, by looking at pairs of optimization algorithms, where one is a popular ML algorithm and the other is some evolutionary computation algorithm (searching over a very different model space), and checking to what extent the two algorithms are SSIC.
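To make the empirical check a bit more concrete, here is a minimal Python sketch of the intuitive definition: it looks for a single rescaling of the x-axis (a shift in log10 of inference compute) that lines up the two evaluation curves. Everything specific here is my own assumption for illustration: the function names, the grid of candidate shifts, the linear interpolation in log-compute, the minimum overlap of three points, the tolerance, and the toy curves, which are made up and not measurements of any real algorithm.

```python
import math

def eval_at(curve, log_x, eps=1e-9):
    """Evaluation at log10(compute) = log_x, by linear interpolation in log-x.

    `curve` is a list of (inference_compute, evaluation) pairs sorted by
    compute. Returns None if log_x is outside the measured range.
    """
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        l0, l1 = math.log10(x0), math.log10(x1)
        if l0 - eps <= log_x <= l1 + eps:
            t = (log_x - l0) / (l1 - l0)
            return y0 + t * (y1 - y0)
    return None

def best_x_scaling(curve_a, curve_b, min_overlap=3):
    """Grid-search the log10 x-axis shift that best aligns curve_a with curve_b.

    Returns (shift, mean_abs_error). A shift of 2.0 means the second
    algorithm's programs need 100x the inference compute for the same
    evaluation. Returns (None, inf) if no shift gives enough overlap.
    """
    best = (None, math.inf)
    for i in range(-60, 61):  # assumed grid: shifts from -6.0 to 6.0 in steps of 0.1
        shift = i / 10
        diffs = []
        for x, y in curve_a:
            y_b = eval_at(curve_b, math.log10(x) + shift)
            if y_b is not None:
                diffs.append(abs(y - y_b))
        if len(diffs) >= min_overlap:
            err = sum(diffs) / len(diffs)
            if err < best[1]:
                best = (shift, err)
    return best

def roughly_ssic(curve_a, curve_b, tol=0.1):
    """True if some x-axis scaling aligns the curves within `tol` (assumed threshold)."""
    shift, err = best_x_scaling(curve_a, curve_b)
    return shift is not None and err <= tol

# Made-up toy curves: (inference compute of the yielded program, evaluation).
curve_a = [(10.0 ** k, float(k)) for k in range(1, 6)]
curve_b = [(10.0 ** (k + 2), float(k)) for k in range(1, 6)]  # same shape, 100x compute
curve_flat = [(10.0 ** k, 1.0) for k in range(1, 6)]          # never improves

print(best_x_scaling(curve_a, curve_b))
print(roughly_ssic(curve_a, curve_flat))
```

This only approximates the intuitive "similar up to scaling along the x-axis" notion, and the mean absolute error is just one of many reasonable ways to measure curve similarity; a real investigation would need actual (compute, evaluation) measurements for both algorithms.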
Some thoughts:
The development of transformative AI may involve a feedback loop in which we train ML models that help us train better ML models and so on (e.g. using approaches like neural architecture search which seems to be getting increasingly popular in recent years). There is nothing equivalent to such a feedback loop in biological evolution (animals don't use their problem-solving capabilities to make evolution more efficient). Does your analysis assume there won't be such a feedback loop (or at least not one that has a large influence on timelines)? Consider adding to the report a discussion about this topic (sorry if it's already there and I missed it).
Part of the Neural Network hypothesis is the proposition that "a transformative model would perform roughly as many FLOP / subj sec as the human brain". It seems to me worthwhile to investigate this proposition further. Human evolution corresponds to a search over a tiny subset of all possible computing machines. Why should we expect that a different search algorithm over an entirely different subset of computing machines would yield systems (with certain capabilities) that use a similar amount of compute? One might pursue an empirical approach for investigating this topic, e.g. by comparing two algorithms for searching over a space of models, where one is some common supervised learning algorithm, and the other is some evolutionary computation algorithm.
In a separate comment (under this one) I attempt to describe a more thorough and formal way of thinking about this topic.
Regarding methods that involve adjusting variables according to properties of 2020 algorithms (or the models trained by them): It would be interesting to try to apply the same methods with respect to earlier points in time (e.g. as if you were writing the report back in 1998/2012/2015 when LeNet-5/AlexNet/DQN were introduced, respectively). To what extent would the results be consistent with the 2020 analysis?
My point here is that in a world where an algo-trading company has the lead in AI capabilities, there need not be a point in time (prior to an existential catastrophe or existential security) where investing more resources into the company's safety-indifferent AI R&D does not seem profitable in expectation. This claim can be true regardless of researchers' observations, beliefs, and actions in given situations.
We might get TAI due to efforts by, say, an algo-trading company that develops trading AI systems. The company can limit the mundane downside risks that it faces from non-robust behaviors of its AI systems (e.g. by limiting the fraction of its fund that the AI systems control). Of course, the actual downside risk to the company includes outcomes like existential catastrophes, but it's not clear to me why we should expect that prior to such extreme outcomes their AI systems would behave in ways that are detrimental to economic value.
you need to ensure that your model is aligned, robust, and reliable (at least if you want to deploy it and get economic value from it).
I think it suffices for the model to be inner aligned (or deceptively inner aligned) for it to have economic value, at least in domains where (1) there is a usable training signal that corresponds to economic value (e.g. users' time spent in social media platforms, net income in algo-trading companies, or even the stock price in any public company); and (2) the downside economic risk from a non-robust behavior is limited (e.g. an algo-trading company does not need its model to be robust/reliable, assuming the downside risk from each trade is limited by design).
If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon, your model isn't actually going to expand what humans can do.
Agreed. Perhaps it's possible to iteratively train GPT models in an Amplification-like setup, where in each iteration we add to the English training corpus some newly possible translations, aiming to end up with something like an HCH translator. (We may not need to train a language model from scratch in each iteration; at the extreme, we just need to do fine-tuning on the new translations.)
Some tentative thoughts:
Re Debate:
Making things worse, to interpret each usage they’d need to agree about the meaning of the rest of the phrase, which isn’t necessarily any simpler than the original disagreement about “qapla.”
Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).
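As a rough illustration of the protocol (not a claim about how such an experiment would really be run), here is a toy Python sketch. In the actual proposal the judge is a human who decides which dictionary is more helpful; the `coverage_judge` below is a hypothetical stand-in that only counts how many words of the passage each dictionary has an entry for, and it ignores whether the entries are correct. The function names, the dictionary entries, and the passages are all placeholders I made up.

```python
import random

def coverage_judge(passage, dict_a, dict_b):
    """Stand-in judge (assumption): prefer the dictionary covering more words.

    Returns "A", "B", or "tie". A real judge would assess helpfulness,
    not mere coverage.
    """
    words = passage.split()
    score_a = sum(1 for w in words if w in dict_a)
    score_b = sum(1 for w in words if w in dict_b)
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"

def run_dictionary_debate(dict_a, dict_b, passages, judge, rounds=100, seed=0):
    """Sample a random passage each round and let the judge pick a dictionary."""
    rng = random.Random(seed)
    wins = {"A": 0, "B": 0, "tie": 0}
    for _ in range(rounds):
        passage = rng.choice(passages)
        wins[judge(passage, dict_a, dict_b)] += 1
    return wins

# Placeholder data: the glosses are dummies, not real translations.
dict_a = {"qapla": "gloss-1", "wordB": "gloss-2", "wordC": "gloss-3"}
dict_b = {"qapla": "gloss-1"}
passages = ["qapla wordB", "wordB wordC"]

print(run_dictionary_debate(dict_a, dict_b, passages, coverage_judge, rounds=20))
```

The judge is passed in as a function so that the coverage heuristic can be swapped for something closer to the real setup, e.g. a human rating collected per passage.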
Also, one might try to use GPT to complete prompts such as:
The researchers analyzed the Klingon phrase "מהדקי נייר" and concluded it roughly means
In both of these approaches we still need to deal with the potential problem of catastrophic inner alignment failures occurring before the point where we have sufficiently useful helper models. [EDIT: and in the Debate-based approach there's also an outer alignment problem: a player may try to manipulate the judge into choosing them as the winner.]
Great post!
I suppose you'll be more optimistic about Single/Single areas if you update towards fast/discontinuous takeoff?