Aryeh Englander is a mathematician and AI safety researcher at the Johns Hopkins University Applied Physics Laboratory. His work is focused on AI safety and AI risk analysis.
I'd like to hear more thoughts, from Rohin or anybody else, about how the scaling hypothesis might affect safety work.
New post on the EA Forum: Some AI Governance Research Ideas
Just came across this: Research ideas to study humans with AI Safety in mind
Thanks Adam for setting this up! I have no idea if my experience is representative, but that was definitely one of the highest-quality discussion sessions I've had at events of this type.
I don't think this is quite an example of a treacherous turn, but this still looks relevant:
Lewis et al., Deal or no deal? end-to-end learning for negotiation dialogues (2017):
Analysing the performance of our agents, we find evidence of sophisticated negotiation strategies. For example, we find instances of the model feigning interest in a valueless issue, so that it can later ‘compromise’ by conceding it. Deceit is a complex skill that requires hypothesising the other agent’s beliefs, and is learnt relatively late in child development (Talwar and Lee, 2002). Our agents have learnt to deceive without any explicit human design, simply by trying to achieve their goals.
(I found this reference cited in Kenton et al., Alignment of Language Agents (2021).)
That's later in the linked wiki page: https://timelines.issarice.com/wiki/Timeline_of_AI_safety#Full_timeline
Excellent, thanks! Now I just need a similar timeline for near-term safety engineering / assured autonomy as they relate to AI, and then a good part of a paper I'm working on will have just written itself.
Also - particular papers that you think are important, especially if you think they might be harder to find in a quick literature search. I'm part of an AI Ethics team at work, and I would like to find out about these as well.
This was actually part of a conversation I was having with this colleague regarding whether or not evolution can be viewed as an optimization process. Here are some follow-up comments to what she wrote above related to the evolution angle:
We could define the natural selection system as:
All configurations = all arrangements of matter on a planet (both arrangements that are living and those that are non-living)
Basis of attraction = all arrangements of matter on a planet that meet the definition of a living thing
Target configuration set = all arrangements of living things where the type and number of living things remains approximately stable.
I think that this system meets the definition of an optimizing system given in the Ground for Optimization essay. For example, predator and prey co-evolve to be about “equal” in survival ability. If a predator become so much better than its prey that it eats them all, the predator will die out along with its prey; the remaining animals will be in balance. I think this works for climate perturbations, etc. too.
HOWEVER, it should be clear that there are numerous ways in which this can happen – like the ball on bumpy surface with a lot of convex “valleys” (local minima), there is not just one way that living things can be in balance. So, to say that “natural selection optimized for intelligence” is quite not right – it just fell into a “valley” where intelligence happened. FURTHER, it’s not clear that we have reached the local minimum! Humans may be that predator that is going to fall “prey” to its own success. If that happened (and any intelligent animals remain at all), I guess we could say that natural selection optimized for less-than-human intelligence!
Further, this definition of optimization has no connotation of “best” or even better – just equal to a defined set. The word “optimize” is loaded. And its use in connection with natural selection has led to a lot of trouble in terms of human races, and humans v. animal rights.
Finally, in the essay’s definition, there is no imperative that the target set be reached. As long as the set of living things is “tending” toward intelligence, then the system is optimizing. So even if natural selection was optimizing for intelligence there is no guarantee that it will be achieved (in its highest manifestation). Like a billiards system where the table is slick (but not frictionless) and the collisions are close to elastic, the balls may come to rest with some of the balls outside the pockets. The reason I think this is important for AI research, especially AGI and ASI, is perhaps we should be looking for those perturbations to prevent us from ever reaching what we may think of as the target configuration, despite our best efforts.
I shared this essay with a colleague where I work (Johns Hopkins University Applied Physics Lab). Here are her comments, which she asked me to share:
This essay proposes a very interesting definition of optimization as the manifestation of a particular behavior of a closed, physical system. I haven’t finished thinking this over, but I suspect it will be (as is suggested in the essay) a useful construct. The reasoning leading to the definition is clearly laid out (thank you!), with examples that are very useful in understanding the concept. The downside of being clearly laid out, however, is that it makes critique easier. I have a few thoughts about the reasoning in the essay.
The first thing I will note is that the essay gives three definitions for an optimizing system. These definitions are close, but not exactly equivalent. The nuances can be important. For example, that the target configuration set and the basin of attraction cannot be equal is obvious; that is made explicit in definition 3, but only implied in definitions 1 and 2. A bigger issue is that there are no criteria or rationale for their extent and relative size.
For example, the essay offers two reasons why the posterchild of non-optimizers - the bottle with a cap - is not an optimizing system; they both arise from the rather arbitrary definition of the basin of attraction as equal to the target configuration set. I see no necessary reason why the basin of attraction couldn’t be defined as the set of all configurations of water molecules both inside and outside the bottle. That way, the definitional requirement of a target configuration set smaller than the basis of attraction is met. The important point is: will water molecules in this new, larger basin of attraction tend to the target configuration set?
Let’s suppose that capped bottle is in a sealed room (not necessary but easier to think about), and that the cap is made of a special material that allows water molecules to pass through it in only one direction: from outside the bottle to inside. The water molecules inside the bottle stay inside the bottle, as for any cap. The water molecules inside the room, but outside the bottle, are zooming about (thermodynamic energy), bouncing off the walls, each other, and the bottle. Although it will take some time, sooner or later all the molecules outside the bottle will hit the bottle cap, go through, and be trapped in the bottle. Voila!
Originally, the bottle-with-a-cap system was a non-optimizing system by definition; the bottle cap type was irrelevant and could have been the rather special one I described. Simply by changing the definition of the basin of attraction, we could turn it into an optimizing system. Further, the original, “non-optimizing” system (with the original definitions of the basin of attraction and target set) would have behaved exactly the same as my optimizing system. On the other hand, changing the bottle cap from our special one to a regular cap will change the system into a non-optimizing system, regardless of the definitions of the basin of attraction and the target configuration set. Perhaps, we should insist that a properly formed system description has a basin of attraction that is larger than the target set, and count on the system behavior to make the optimizing/non-optimizing distinction.
Definitions 1 and 2 both contain the phrase “a small set of target configurations” which implies that the target set << than the basin of attraction. This is a problem for the notion of the universe as a system with maximum entropy as the target configuration set because the target set is most of the possible configurations. For this reason, the essay’s author concludes that universe-with-entropy system is not an optimizing system, or at best, a weak one. Stars, galaxies, black holes – there are strong forces that pull matter into these structures. I would say that any system that has succeeded in getting nearly everything within the basin of attraction into the target configuration is a strong optimizer!
Regardless of the way we chose to think about strong or weak, the universe is a system that tends to a set of configurations smaller than the set of possible configurations despite perturbations (the occasional house-building project for example!). Personally, I see no value in a definitional limitation. The behavior of the system (tending toward a smaller set of configurations out of a larger set) should govern the definition of an optimizing system, regardless of relative sizes of the sets.
Between the universe-with-entropy and bottle-with-a-cap systems, I question the utility of the “all configurations >= basin of attraction >> target set configuration” structure in the definition of optimizing systems. I believe it is worth thinking about what the necessary relationships among these configurations are, and how they are chosen.
The example of the billiards system raised another (to me) interesting question. The essay did not offer a system description but says “Consider a billiard table with some billiard balls that are currently bouncing around in motion. Left alone, the balls will eventually come to rest in some configuration…. If we reach in while the billiard balls are bouncing around and move one of the balls that is in motion, the system will now come to rest in a different configuration. Therefore this is not an optimizing system, because there is no set of target configurations towards which the system evolves despite perturbations.”
This example has some odd features. Friction between the balls and the table surface, along with the loss of energy during non-elastic collisions, cause the balls to slow down and stop. The minutia of their travels determines where they stop. The final arrangement is unpredictable (ok, it could be modeled given complete information, but let’s skip that as beside the point), and any arrangement is as likely as another. This suggests that the billiards system is a non-optimizing system even without the proposed perturbation of moving the balls around while the balls are in motion.
Looked at another way, billiards system does tend to a certain target configuration set, while friction and the non-elasticity of the collisions are perturbations. If we make the surface frictionless and the collisions perfectly elastic, the balls will bounce around the table without stopping. Much like the water molecules in the bottle-with-a-cap example, each will eventually fall into one pocket or another during its travels. Once in the pocket, the ball cannot get out, and thus eventually all will end up in the pockets. So, this system tends to a target configuration set of all balls in pockets.
Adding back in the perturbing friction and energy loss does not mean that this system is not tending to the target configuration set. Reaching in and moving a ball to a different point, or even redirecting any ball heading for a pocket, will not keep this system from tending towards the target configuration. It seems as though the billiards system was an optimizing system all along! The larger point is that it seems, by definition, an optimizing system is an optimizing system even if there are a set of perturbations that prevent it from ever reaching the target configuration! “Tending toward”, not “reaching”, a target configuration set is in all three definitions. It is worth thinking about an optimizing system that never actually optimizes. This may have some bearing on the AGI question.
[And for you readers who, like me, would say, whoa - it is possible that the balls will enter some repeating pattern of motion where some do not enter pockets. Maybe we need a robot to move the balls around randomly if they seem stuck, just like the ball-in-valley+robot system where the robot moves the ball over barriers. I maintain that the point is the same.]
The satellite system illustrates (perhaps an obvious point) that the definition of the target configuration set can change a single system from optimizing and to non-optimizing. What is a little more subtle is that the definition of the system boundaries is essential to the characterization of the system as optimizing or non-optimizing, even if the behavior of the system is the same under both definitions. In particular, what we consider to be part of the system and what is considered to be a perturbation can flip a system between characterizations. [This latter point is illustrated by the billiards system as well, as I will explain below.]
The essay says that a satellite in orbit is a non-optimizing system because if its position or velocity is perturbed, it has no tendency to return to its original orbit; that is, the author defines the target configuration as a particular orbit. With respect to another target configuration that may be described as “a scorched pile of junk on the surface of the Earth”, a satellite in orbit is an optimizing system exactly like a ball in a valley. As soon as the launch rocket stops firing, a satellite starts falling to the center of the earth because atmospheric drag and solar radiation pressure continuously decrease the component of the satellite’s velocity perpendicular to the force of gravity. So, unless a perturbation is big enough to send it out of orbit altogether, a satellite tends towards a target configuration of junk located on Earth’s surface.
Since a particular orbit is usually the desired target configuration (!), many satellites incorporate a rocket system to force them to stay in a chosen orbit. If a rocket system is included in the system definition, then the satellite is an optimizing system relative to the desired orbit. What is a little more interesting, with respect to the junk-on-the-Earth target set, drag and solar pressure are the part of the optimizing system; an orbit correction system is a perturbation. If the target set is the particular orbit the satellite started in, these definitions swap.
This observation has bearing on the billiards system example. If we include drag and non-elastic collisions as part of the billiards system, then the system is non-optimizing. If we see them as perturbations outside the system, then the billiards system is optimizing. I find this flexibility as a little curious, although I haven’t completely thought through the implications.
A completely different sort of question is suggested by the section on Drexler. There the essay sets out a hierarchy of all AI systems, optimizing systems, and goal-directed agent systems. This makes sense with respect to AI systems, but I do not see how optimizing systems, as defined, can be wholly contained within the category of AI systems, unless you define AI systems pretty broadly. For example, I think that pretty much any control system is an optimizing system by the definition in the essay. If we accept this definition of optimizing system, and hold that all optimizing systems are a subset of AI systems, do we have to accept our thermostats as AI systems? What about the program that determined the square root of 2? Is that AI? Is this an issue for this definition, or does its broadness matter in an AI context?
And a nitpick: The first example of an optimizing system offered in the essay is a program calculating the square root of 2. It meets the definition of an optimizing system, but it seems to contradict the earlier assertion that “… optimizing systems are not something that are designed but are discovered.” The algorithm and the program were both designed. I’m not sure why this point is necessary. Either I do not understand something fundamental, or the only purpose of the statement of discovery is to give people like me something to argue about!
In summary, the definition in the essay suggests a few questions that could have a bearing on its application:
Finally, it might be better to have one, consistent definition that covers all the possibilities, including (in my opinion) that perturbations may be confined to certain dimensions.