AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So for example AlphaGo could suffer the error of instrumental power grabbing in order to be able to get better at winning Go because we misspecified what we asked it to measure. This is a kind of failure introduced into the systems by humans failing to make m(X) adequately evaluate X as we intended, since we cared about winning Go games while also minimizing side effects, but maybe when we cons... (read more)
Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.
I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.
"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.
For what it's worth, I think this is trying to get at the same insight as logical time but via a different path.
For the curious reader, this is also the same reason we use vector clocks to build distributed systems when we can't synchronize the clocks very well.
And there's something quite interesting about computation as a partial order. It might seem that this only comes up when you have a "distributed" system, but actually you need partial orders to reason about unitary programs when they are non-deterministic (any program with loops and conditiona... (read more)
I actually don't think that model is general enough. Like, I think Goodharting is just a fact of control system's observing.
Suppose we have a simple control system with output X and a governor G. G takes a measurement m(X) (an observation) of X. So long as m(X) is not error free (and I think we can agree that no real world system can be actually error free), then X=m(X)+ϵ for some error factor ϵ. Since G uses m(X) to regulate the system to change X, we now have error ... (read more)
I'm fairly pessimistic on our ability to build aligned AI. My take is roughly that it's theoretically impossible and at best we might build AI that is aligned well enough that we don't lose. I've not written one thing to really summarize this or prove it, though.
The source of my take comes from two facts:
This paper gives a mathematical model of when Goodharting will occur. To summarize: if
(1) a human has some collection s1,…,sn of things which she values,
(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and
(3) the robot can freely vary how much of s1,…,sn there are in the world, subject only to resource constraints that make the si trade off against each other,
then when the robot optimizes for its proxy utility, it will minimize all si's which its proxy utility... (read more)
This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'm fairly surprised if there's something big here.
Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I k... (read more)
This doesn't really seem like solving symbol grounding, partially or not, so much as an argument that it's a non-problem for the purposes of value alignment.
Agreed. That said, I don't think counterfactuals are in the territory. I think I said before that they were in the map, although I'm now leaning away from that characterisation as I feel that they are more of a fundamental category that we use to draw the map.
Yes, I think there is something interesting going on where human brains seem to operate in a way that makes counterfactuals natural. I actually don't think there's anything special about counterfactuals, though, just that the human brain is designed such that thoughts are not strongly tethered to sens... (read more)
I don't think they're really at odds. Zack's analysis cuts off at a point where the circularity exists below it. There's still the standard epistemic circularity that exists whenever you try to ground out any proposition, counterfactual or not, but there's a level of abstraction where you can remove the seeming circularity by shoving it lower or deeper into the reduction of the proposition towards grounding out in some experience.
Another way to put this is that we can choose what to be pragmatic about. Zack's analysis choosing to be pragmatic about counter... (read more)
I think A is solved, though I wouldn't exactly phrase it like that, more like counterfactuals make sense because they are what they are and knowledge works the way it does.
Zack seems to be making a claim to B, but I'm not expert enough in decision theory to say much about it.
I mostly agree with Zack_M_Davis that this is a solved problem, although rather than talking about a formalization of causality I'd say this is a special case of epistemic circularity and thus an instance of the problem of the criterion. There's nothing unusual going on with counterfactuals other than that people sometimes get confused about what propositions are (e.g. they believe propositions have some sort of absolute truth beyond causality because they fail to realize epistemology is grounded in purpose rather than something eternal and external to the... (read more)
Thanks for this thorough summary. At this point the content has become spread over a books worth of posts, so it's handy to have this high level, if long, summary!
Thanks for this interesting read.
I think there's similar work that can be done to find safety analogues from a large number of fields. Some that come to mind include organizational design, market analysis, and design of public institutions.
Some of my own:
Interesting. Selection theorems seem like a way of identifying the purposes or source of goal directness in agents that seems obvious to us yet hard to pin down. Compare also the ground of optimization.
I don't really have a whole picture that I think says more than what others have. I think there's something to knowing as the act of operationalizing information, by which I mean a capacity to act based on information.
To make this more concrete, consider a simple control system like a thermostat or a steam engine governor. These systems contain information in the physical interactions we abstract away to call "signal" that's sent to the "controller". If we had only signal there'd be no knowledge because that's information that is not used to act. The contr... (read more)
Quick thought: reading this I get a sense that some of our collective confusion here revolves around "knowledge" as a quantifiable noun rather than "knowing" as a verb, and if we give up on the idea that knowledge is first a quantifiable thing (rather than a convenient post hoc reification) we open up new avenues of understanding knowledge.
Small insight why reading this: I'm starting to suspect that most (all???) unintuitive things that happen with Oracles are the result of them violating our intuitions about causality because they actually deliver no information, in that nothing can be conditioned on what the Oracle says because if we could then the Oracle would fail to actually be an Oracle, so we can only condition on the existence of the Oracle and how it functions and not what it actually says, e.g. you should still 1-box but it's mistaken to think anything an Oracle tells you allows you to do anything different.
There's no observer-independent fact of the matter about whether a system "is" an agent
Worth saying, I think, that this is fully generally true that there's no observer-independent fact of the matter about whether X "is" Y. That this is true of agents is just particularly relevant to AI.
I'm not convinced there's an actual distinction to be made here.
Using your mass comparison example, arguably the only meaningful different between the two is where information is stored. In search-in-map it's stored in an auxiliary system; in search-in-territory it's embedded in the system. The same information is still there, though, all that's changed is the mechanism, and I'm not sure map and territory is the right way to talk about this since both are embedded/embodied in actual systems.
My guess is that search-in-map looks like a thing apart from searc... (read more)
For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.
+1 to this and excited and happy to hear about this update in your view!
Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting.
It's a bit hard to think when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left... (read more)
Maybe I'm missing something, but this seems already captured by the normal notion of what Goodharting is in that it's about deviation from the objective, not the direction of that deviation.
Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).
In the spirit then of caring about stories about how algorithms lead to bad consequences, a story about how I see not making a clear distinction between instrumental and intended mode... (read more)
I want to consider models that learn to predict both “how a human will answer question Q” (the instrumental model) and “the real answer to question Q” (the intended model). These two models share almost all of their computation — which is dedicated to figuring out what actually happens in the world. They differ only when it comes time to actually extract the answer. I’ll describe the resulting model as having a “world model,” an “instrumental head,” and an “intended head.”
This seems massively underspecified in that it's really unclear to me what's actually... (read more)
I wrote a research agenda that suggests additional work to be done and that I'm not doing.
Firstly, we don't understand where this logical time might come from, or how to learn it
Okay, you can't write a sentence like that and expect me not to say that it's another manifestation of the problem of the criterion.
Yes, I realize this is not the problem you're interested in, but it's one I'm interested in, so this seems like a good opportunity to think about it anyway.
The issue seems to be that we don't have a good way to ground the order on world states (or, subjectively speaking if we want to be maximally cautious here, experience moments) since we ... (read more)
Somewhat ironically, some of these failures from thinking of oneself or others as agents causes a lack of agency! Maybe this is just a trick of language, but here's what I have in mind from thinking about some of the pitfalls:
On... (read more)
Yep, that accords well with my own current view.
Oh, I don't think those things exactly sidestep the problem of the criterion so much as commit to a response to it without necessarily realizing that's what they're doing. All of them sort of punt on it by saying "let humans figure out that part", which at the end of the day is what any solution is going to do because we're the ones trying to build the AI and making the decisions, but we can be more or less deliberate about how we do this part.
Right. For example, I think Stuart Armstrong is hitting something very important about AI alignment with his pursuit of the idea that there's no free lunch in value learning. We only close the gap by making an "arbitrary" assumption, but it's only arbitrary if you assume there's some kind of context-free version of the truth. Instead we can choose in a non-arbitrary way based on what we care about and is useful to us.
I realize lots of people are bored by this point because they're non-arbitrary solution that is useful is some version of rationality criteri... (read more)
Not really. If we were Cartesian, in order to fit the way we find the world, it seems to be that it'd have to be that agentiness is created outside the observable universe, possibly somewhere hypercomputation is possible, which might only admit an answer about how to build AI that would look roughly like "put a soul in it", i.e. link it up to this other place where agentiness is coming from. Although I guess if the world really looked like that maybe the way to do the "soul linkage" part would be visible, but it's not so seems unlikely.
I think this is right and underappreciated. However I struggle myself to make a clear case of what to do about it. There's something here, but I think it mostly shows up in not getting confused that the agent model just is how reality is, which underwhelms people who perhaps most fail to deeply grok what that means because they have a surface understanding of it.
Well stated. For what it's worth I think this is a great explanation of why I'm always going on about the problem of the criterion: as embedded, finite agents without access to hypercomputation or perfect, a priori knowledge we're stuck in this mess of trying to figure things out from the inside and always getting it a little bit wrong, no matter how hard we try, so it's worth paying attention to that because solving, for example, alignment for idealized mathematical systems that don't exist is maybe interesting but also not an actual solution to the alignment problem.
Largely agree. I think you're exploring what I'd call the deep implications of the fact that agents are embedded rather than Cartesian.
Nice! From my perspective this would be pretty exciting because, if natural abstractions exist, it solves at least some of the inference problem I view at the root of solving alignment, i.e. how do you know that the AI really understands you/humans and isn't misunderstanding you/humans in some way that looks like it does understand from the outside but it doesn't. Although I phrased this in terms of reified experiences (noemata/qualia as a generalization of axia), abstractions are essentially the same thing in more familiar language, so I'm quite excited f... (read more)
Regarding conservatism, there seems to be an open question of just how robust Goodhart effects are in that we all agree Goodhart is a problem but it's not clear how much of a problem it is and when. We have opinions ranging from mine, which is basically that Goodharting happens the moment you try to apply even the weakest optimization pressure and this will be a problem (or at least a problem in expectation; you might get lucky) for any system you need to never deviate, to what I read to be Paul's position: it's not that bad and we can do a lot to correct ... (read more)
Much of this, especially the story of the production web and especially especially the story of the factorial DAOs, reminds me a lot of PKD's autofac. I'm sure there are other fictional examples worth highlighting, but I point out autofac since it's the earliest instance of this idea I'm aware of (published 1955).
I think here it makes sense to talk about internal parts, separate from behavior, and real. And similarly in the single agent case: there are physical mechanisms producing the behavior, which can have different characteristics, and which in particular can be ‘in conflict’—in a way that motivates change—or not. I think it is also worth observing that humans find their preferences ‘in conflict’ and try to resolve them, which is suggests that they at least are better understood in terms of both behavior and underlying preferences that are separate from it.&nb
I like this idea. AI alignment research is more like engineering than math or science, and engineering is definitely full of multiple paradigms, not just because it's a big field with lots of specialties that have different requirements, but also because different problems require different solutions and sometimes the same problem can be solved by approaching it in multiple ways.
A classic example from computer science is the equivalence of loops and recursion. In a lot of ways these create two very different approaches to designing systems, writing code, a... (read more)
Thanks both! I definitely had the idea that Paul had mentioned something similar somewhere but hadn't made it a top-level concept. I think there's similar echos in how Eliezer talked about seed AI in the early Friendly AI work.
Seems like it probably does, but only incidentally.
I instead tend to view ML research as the background over which alignment work is now progressing. That is, we're in a race against capabilities research that we have little power to stop, so our best bets are either that it turns out capabilities are about to hit the upper inflection point of an S-curve, buying us some time, or that the capabilities can be safely turned to helping us solve alignment.
I do think there's something interesting about a direction not considered in this post related to intellige... (read more)
Looks good to me! Thanks for planning to include this in the AN!
I think the generalized insight from Armstrong's no free lunch paper is still underappreciated in that I sometimes see papers that, to me, seem to run up against this and fail to realize there's a free variable in their mechanisms that needs to be fixed if they want them to not go off in random directions.
Another post of mine I'll recommend you:
This is the culmination of a series of post on "formal alignment", where I start out saying "what it would mean to formally state what it would mean to build aligned AI" and then from that try to figure out what we'd have to figure out in order to achieve that.
Over the last year I've gotten pulled in other directions so not pushed this line of research forward much, plus I reached a point with it where it was clear it require... (read more)
I wrote this post as a summary of a paper I published. It didn't get much attention, so I'd be interesting in having you all review it.
To say a little more, I think the general approach I lay out in here for taking towards safety work is worth considering more deeply and points towards a better process for choosing interventions in attempts to build aligned AI. I think what's more important than the specific examples where I apply the method is t... (read more)
Okay, so here's a more adequate follow up.
In this seminal cybernetics essay a way of thinking about this is layed out.
First, they consider systems that have observable behavior, i.e. systems that take inputs and produce outputs. Such systems can be either active, in that the system itself is the source of energy that produces the outputs, or passive, in that some outside source supplies the energy to power the mechanism. Compare an active plant or animal to something passive like a rock, though obviously whether or not something is active or passive depend... (read more)
Doing a little digging, I realized that the idea of "teleological mechanism" from cybernetics is probably a better handle for the idea and will provide a more accessible presentation of the idea. Some decent references:
I don't know of anywhere that presents the idea quite how I think of it, though. If you read Dreyfus on Heidegger you might manage to pick this out. Similarly I think this idea underlies Sartre's talk about freedom... (read more)
Reading this, I'm realizing again something I may have realized before and forgotten, but I think ideas about goal-directedness in AI have a lot of overlap with the philosophical topic of telos and Heideggerian care/concern.
The way I think about this is that ontological beings (that is, any process we can identify as producing information) have some ability to optimize (because information is produced by feedback) and must optimize for something rather than nothing (else they are not optimizers) or everything (in which case they are not finite, which they ... (read more)