Interesting. Selection theorems seem like a way of identifying the purpose or source of goal-directedness in agents, something that seems obvious to us yet is hard to pin down. Compare also the ground of optimization.
I don't really have a whole picture that I think says more than what others have. I think there's something to knowing as the act of operationalizing information, by which I mean a capacity to act based on information.
To make this more concrete, consider a simple control system like a thermostat or a steam engine governor. These systems contain information in the physical interactions we abstract away to call "signal" that's sent to the "controller". If we had only signal there'd be no knowledge because that's information that is not used to act. The contr... (read more)
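To make the "signal vs. acting on signal" point concrete in code, here's a minimal sketch (my own toy construction, not from any particular source) of a bang-bang thermostat controller:

```python
def thermostat_step(temperature, setpoint, heater_on):
    """One step of a bang-bang thermostat controller.

    The temperature reading is the 'signal'; on its own it is just
    information. On this view the controller 'knows' the temperature
    only insofar as it acts on it, switching the heater on or off.
    """
    if temperature < setpoint - 0.5:  # too cold: act by heating
        return True
    if temperature > setpoint + 0.5:  # too warm: act by idling
        return False
    return heater_on  # within the deadband: keep doing what we were doing

# A bare signal with no controller attached is information that is
# never used to act; by the argument above, no knowing is happening.
print(thermostat_step(18.0, 21.0, heater_on=False))  # → True
print(thermostat_step(23.0, 21.0, heater_on=True))   # → False
```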
Quick thought: reading this I get a sense that some of our collective confusion here revolves around "knowledge" as a quantifiable noun rather than "knowing" as a verb, and if we give up on the idea that knowledge is first a quantifiable thing (rather than a convenient post hoc reification) we open up new avenues of understanding knowledge.
Small insight while reading this: I'm starting to suspect that most (all???) unintuitive things that happen with Oracles are the result of them violating our intuitions about causality because they actually deliver no information: nothing can be conditioned on what the Oracle says, because if it could then the Oracle would fail to actually be an Oracle. We can only condition on the existence of the Oracle and how it functions, not on what it actually says. E.g. you should still 1-box, but it's mistaken to think anything an Oracle tells you allows you to do anything different.
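For what it's worth, the one-boxing point falls out of the standard expected-value arithmetic for Newcomb's problem; a quick sketch (the usual $1M/$1k payoffs, with predictor accuracy as a free parameter):

```python
def newcomb_ev(accuracy, big=1_000_000, small=1_000):
    """Expected payoffs against a predictor of the given accuracy.

    The key move: we condition on the *existence and reliability* of
    the predictor, not on any particular thing it says.
    """
    one_box = accuracy * big  # predictor foresaw one-boxing; big box is full
    two_box = accuracy * small + (1 - accuracy) * (big + small)
    return one_box, two_box

one, two = newcomb_ev(0.99)
print(one > two)  # → True: one-boxing wins for any reasonably accurate predictor
```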
There's no observer-independent fact of the matter about whether a system "is" an agent
Worth saying, I think, that this is fully generally true that there's no observer-independent fact of the matter about whether X "is" Y. That this is true of agents is just particularly relevant to AI.
I'm not convinced there's an actual distinction to be made here.
Using your mass comparison example, arguably the only meaningful difference between the two is where information is stored. In search-in-map it's stored in an auxiliary system; in search-in-territory it's embedded in the system itself. The same information is still there; all that's changed is the mechanism, and I'm not sure map and territory is the right way to talk about this since both are embedded/embodied in actual systems.
My guess is that search-in-map looks like a thing apart from searc... (read more)
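A toy contrast along these lines (my own construction; the function names are made up for illustration): finding the low point of a landscape by scanning an explicitly stored copy of it, versus letting local dynamics settle there. The same height information is present either way; what differs is where it's stored and how it's used:

```python
heights = [5.0, 3.0, 4.0, 1.0, 2.0]

# Search-in-map: the height information sits in an auxiliary data
# structure that we scan directly.
def lowest_in_map(hs):
    return min(range(len(hs)), key=lambda i: hs[i])

# Search-in-territory: the same information is embodied in the system's
# dynamics; a "ball" hops downhill until no neighbor is lower. (Local
# descent can stall in local minima, so here we start it in the right basin.)
def lowest_in_territory(hs, start):
    pos = start
    while True:
        options = [i for i in (pos - 1, pos, pos + 1) if 0 <= i < len(hs)]
        best = min(options, key=lambda i: hs[i])
        if best == pos:
            return pos
        pos = best

print(lowest_in_map(heights))           # → 3
print(lowest_in_territory(heights, 4))  # → 3
```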
For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.
+1 to this and excited and happy to hear about this update in your view!
Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting.
It's a bit hard to think of when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left... (read more)
Maybe I'm missing something, but this seems already captured by the normal notion of what Goodharting is in that it's about deviation from the objective, not the direction of that deviation.
Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).
In the spirit then of caring about stories about how algorithms lead to bad consequences, a story about how I see not making a clear distinction between instrumental and intended mode... (read more)
I want to consider models that learn to predict both “how a human will answer question Q” (the instrumental model) and “the real answer to question Q” (the intended model). These two models share almost all of their computation — which is dedicated to figuring out what actually happens in the world. They differ only when it comes time to actually extract the answer. I’ll describe the resulting model as having a “world model,” an “instrumental head,” and an “intended head.”
This seems massively underspecified in that it's really unclear to me what's actually... (read more)
I wrote a research agenda that suggests additional work to be done and that I'm not doing.
Firstly, we don't understand where this logical time might come from, or how to learn it
Okay, you can't write a sentence like that and expect me not to say that it's another manifestation of the problem of the criterion.
Yes, I realize this is not the problem you're interested in, but it's one I'm interested in, so this seems like a good opportunity to think about it anyway.
The issue seems to be that we don't have a good way to ground the order on world states (or, subjectively speaking if we want to be maximally cautious here, experience moments) since we ... (read more)
Somewhat ironically, some of these failures from thinking of oneself or others as agents cause a lack of agency! Maybe this is just a trick of language, but here's what I have in mind from thinking about some of the pitfalls:
On... (read more)
Yep, that accords well with my own current view.
Oh, I don't think those things exactly sidestep the problem of the criterion so much as commit to a response to it without necessarily realizing that's what they're doing. All of them sort of punt on it by saying "let humans figure out that part", which at the end of the day is what any solution is going to do because we're the ones trying to build the AI and making the decisions, but we can be more or less deliberate about how we do this part.
Right. For example, I think Stuart Armstrong is hitting something very important about AI alignment with his pursuit of the idea that there's no free lunch in value learning. We only close the gap by making an "arbitrary" assumption, but it's only arbitrary if you assume there's some kind of context-free version of the truth. Instead we can choose in a non-arbitrary way based on what we care about and is useful to us.
I realize lots of people are bored by this point because their non-arbitrary solution that is useful is some version of rationality criteri... (read more)
Not really. If we were Cartesian, then in order to fit the way we find the world, it seems it'd have to be that agentiness is created outside the observable universe, possibly somewhere hypercomputation is possible, which might only admit an answer about how to build AI that would look roughly like "put a soul in it", i.e. link it up to this other place where agentiness is coming from. Although I guess if the world really looked like that, maybe the way to do the "soul linkage" part would be visible, but it's not, so that seems unlikely.
I think this is right and underappreciated. However, I myself struggle to make a clear case for what to do about it. There's something here, but I think it mostly shows up in not getting confused that the agent model just is how reality is, which underwhelms the people who perhaps most fail to deeply grok what that means, because they have only a surface understanding of it.
Well stated. For what it's worth I think this is a great explanation of why I'm always going on about the problem of the criterion: as embedded, finite agents without access to hypercomputation or perfect, a priori knowledge we're stuck in this mess of trying to figure things out from the inside and always getting it a little bit wrong, no matter how hard we try, so it's worth paying attention to that because solving, for example, alignment for idealized mathematical systems that don't exist is maybe interesting but also not an actual solution to the alignment problem.
Largely agree. I think you're exploring what I'd call the deep implications of the fact that agents are embedded rather than Cartesian.
Nice! From my perspective this would be pretty exciting because, if natural abstractions exist, it solves at least some of the inference problem I view at the root of solving alignment, i.e. how do you know that the AI really understands you/humans and isn't misunderstanding you/humans in some way that looks like it does understand from the outside but it doesn't. Although I phrased this in terms of reified experiences (noemata/qualia as a generalization of axia), abstractions are essentially the same thing in more familiar language, so I'm quite excited f... (read more)
Regarding conservatism, there seems to be an open question of just how robust Goodhart effects are in that we all agree Goodhart is a problem but it's not clear how much of a problem it is and when. We have opinions ranging from mine, which is basically that Goodharting happens the moment you try to apply even the weakest optimization pressure and this will be a problem (or at least a problem in expectation; you might get lucky) for any system you need to never deviate, to what I read to be Paul's position: it's not that bad and we can do a lot to correct ... (read more)
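The "even the weakest optimization pressure" claim can be illustrated with a standard toy model (my sketch; the setup is just selection on a noisy proxy): select the candidate with the best proxy score and watch the gap between proxy and true value grow with selection pressure:

```python
import random

random.seed(0)

def goodhart_gap(n_candidates, noise=1.0, trials=2000):
    """Pick the candidate with the best *proxy* score (true value plus
    noise) and report the average amount by which the proxy overstates
    the true value of the winner. More candidates = more optimization
    pressure = a bigger gap."""
    total = 0.0
    for _ in range(trials):
        cands = [(random.gauss(0, 1), random.gauss(0, noise))
                 for _ in range(n_candidates)]
        true_value, err = max(cands, key=lambda c: c[0] + c[1])
        total += err  # proxy minus true value for the selected candidate
    return total / trials

weak, strong = goodhart_gap(2), goodhart_gap(100)
print(weak < strong)  # → True: harder selection inflates the proxy more
```

Even `n_candidates=2` (about the weakest selection possible) already yields a positive gap; the disagreement is over how fast it grows and how much it matters, not whether it exists.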
Much of this, especially the story of the production web and especially especially the story of the factorial DAOs, reminds me a lot of PKD's "Autofac". I'm sure there are other fictional examples worth highlighting, but I point out "Autofac" since it's the earliest instance of this idea I'm aware of (published 1955).
I think here it makes sense to talk about internal parts, separate from behavior, and real. And similarly in the single agent case: there are physical mechanisms producing the behavior, which can have different characteristics, and which in particular can be ‘in conflict’—in a way that motivates change—or not. I think it is also worth observing that humans find their preferences ‘in conflict’ and try to resolve them, which suggests that they at least are better understood in terms of both behavior and underlying preferences that are separate from it.
I like this idea. AI alignment research is more like engineering than math or science, and engineering is definitely full of multiple paradigms, not just because it's a big field with lots of specialties that have different requirements, but also because different problems require different solutions and sometimes the same problem can be solved by approaching it in multiple ways.
A classic example from computer science is the equivalence of loops and recursion. In a lot of ways these create two very different approaches to designing systems, writing code, a... (read more)
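The classic example, sketched minimally (factorial, since it's the textbook case):

```python
def factorial_loop(n):
    """Iterative style: explicit mutable state updated in a loop."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

def factorial_rec(n):
    """Recursive style: the same computation via self-reference."""
    return 1 if n <= 1 else n * factorial_rec(n - 1)

print(factorial_loop(5), factorial_rec(5))  # → 120 120
```

Same problem, same answer, but the two styles suggest very different ways of structuring a larger system.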
Thanks both! I definitely had the idea that Paul had mentioned something similar somewhere but hadn't made it a top-level concept. I think there are similar echoes in how Eliezer talked about seed AI in the early Friendly AI work.
Seems like it probably does, but only incidentally.
I instead tend to view ML research as the background over which alignment work is now progressing. That is, we're in a race against capabilities research that we have little power to stop, so our best bets are either that it turns out capabilities are about to hit the upper inflection point of an S-curve, buying us some time, or that the capabilities can be safely turned to helping us solve alignment.
I do think there's something interesting about a direction not considered in this post related to intellige... (read more)
Looks good to me! Thanks for planning to include this in the AN!
I think the generalized insight from Armstrong's no free lunch paper is still underappreciated in that I sometimes see papers that, to me, seem to run up against this and fail to realize there's a free variable in their mechanisms that needs to be fixed if they want them to not go off in random directions.
Another post of mine I'll recommend to you:
This is the culmination of a series of posts on "formal alignment", where I start out by asking what it would mean to formally state what it would mean to build aligned AI, and then from that try to figure out what we'd have to figure out in order to achieve it.
Over the last year I've gotten pulled in other directions so haven't pushed this line of research forward much, plus I reached a point with it where it was clear it require... (read more)
I wrote this post as a summary of a paper I published. It didn't get much attention, so I'd be interested in having you all review it.
To say a little more, I think the general approach I lay out here for approaching safety work is worth considering more deeply, and it points towards a better process for choosing interventions in attempts to build aligned AI. I think what's more important than the specific examples where I apply the method is t... (read more)
Okay, so here's a more adequate follow up.
A way of thinking about this is laid out in this seminal cybernetics essay.
First, they consider systems that have observable behavior, i.e. systems that take inputs and produce outputs. Such systems can be either active, in that the system itself is the source of energy that produces the outputs, or passive, in that some outside source supplies the energy to power the mechanism. Compare an active plant or animal to something passive like a rock, though obviously whether or not something is active or passive depend... (read more)
Doing a little digging, I realized that the idea of "teleological mechanism" from cybernetics is probably a better handle for the idea and will provide a more accessible presentation of the idea. Some decent references:
I don't know of anywhere that presents the idea quite how I think of it, though. If you read Dreyfus on Heidegger you might manage to pick this out. Similarly I think this idea underlies Sartre's talk about freedom... (read more)
Reading this, I'm realizing again something I may have realized before and forgotten, but I think ideas about goal-directedness in AI have a lot of overlap with the philosophical topic of telos and Heideggerian care/concern.
The way I think about this is that ontological beings (that is, any process we can identify as producing information) have some ability to optimize (because information is produced by feedback) and must optimize for something rather than nothing (else they are not optimizers) or everything (in which case they are not finite, which they ... (read more)
I like that this post is fairly accessible, although I found the charts confusing, largely because it's not always clear to me what's being measured on each axis. I basically get what's going on, but I find myself disliking something about the way the charts are presented for that reason.
(In some cases I think of them as more like being multidimensional spaces you've put on a line, but that still makes the visuals kind of confusing.)
None of this is really meant to be a big complaint, though. Graphics are hard; I ... (read more)
An important caveat is that many non-AI systems have humans in the loop somewhere that can intervene if they don't like what the automated system is doing. Some examples:
Just want to register that I agree with your assessment of concerns, and that it generally reflects my concerns with attempts to bound risk. I generally think evolution is a good example of this: "risk" is bounded within an individual generation because organisms cannot arbitrarily change themselves, but over generations there's little bounding where things can go, other than getting trapped in local maxima.
Interesting. I can't recall if I commented on the alignment as translation post about this, but I think this is in fact the key thing standing in the way of addressing alignment, and put together a formal model that identified this as the problem, i.e. how do you ensure that two minds agree about preference ordering, or really even the statements being ordered.
This post I wrote a while back has some references you might find useful: "A developmentally-situated approach to teaching normative behavior to AI".
Also I think some of the references in this paper I wrote might be useful: "Robustness to fundamental uncertainty in AGI alignment".
Topics that seem important to cover to me include not only AI impact on humans but also questions surrounding the subjective experience of AI, which largely revolve around the question of whether AIs have subjective experience or are otherwise moral patients at all.
Branch predictors for sure, but modern CPUs do other things too: managing multiple layers of cache using relatively simple algorithms that nonetheless get high hit rates in practice; converting instructions into microcode, because it turns out small, simple instructions execute faster, but CPUs need to do a lot of things, so the tradeoff is to have the CPU interpret instructions in real time into simpler instructions sent to specialized processing units inside the CPU; and maybe even speculative branch execution, where instructions in the pipeline are partially executed provisionally ahead of branches being confirmed. All of these seem like tricks of the sort I wouldn't be surprised to find parallels to in the brain.
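To give a flavor of how simple these mechanisms can be while still performing well, here's a toy version of the textbook 2-bit saturating-counter branch predictor (my sketch of the standard design, not anything specific to this discussion):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict 'not taken',
    2-3 predict 'taken'. One surprising branch nudges the counter
    but doesn't immediately flip the prediction."""

    def __init__(self):
        self.state = 1  # start weakly not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
# A loop-like branch: taken 9 times, then not taken once, repeated.
history = ([True] * 9 + [False]) * 10
hits = 0
for taken in history:
    hits += p.predict() == taken
    p.update(taken)
print(hits / len(history))  # → 0.89
```

A few bits of state and a trivial update rule are enough to get nearly 90% accuracy on a loopy workload, which is the kind of cheap-but-effective trick I'd expect to find analogues of in the brain.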
Your toy models drew a parallel for me to modern CPU architectures. That is, doing computation the "complete" way involves loading things from memory, doing math, writing to memory, and then that memory might affect later instructions. CPUs have all kinds of tricks to get around this to go faster, and it's sort of like your models of brain parts, only with a reversed etiology, since the ACU came first whereas the neocortex came last, as I understand it.
Hmm, I see some problems here.
By looking for manipulation on the basis of counterfactuals, you're at the mercy of your ability to find such counterfactuals, and that ability can also be manipulated, such that you can't notice either the object-level counterfactuals that would make you suspect manipulation, or the counterfactuals about your counterfactual reasoning that would make you suspect manipulation. This seems like an insufficiently robust way to detect manipulation, or even to define it, since the mechanism for detecting it can itself be manipulated to not notice ... (read more)
So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.
One, this seems likely to have some overlap with notions of impact and impact measures.
Two, it seems like there's no real way to eliminate manipulation in a very broad sense, because we'd expect our AI to be causally entangled with the human, so there's no action the AI could take that would not influence the human in some... (read more)
Not Stuart, but I agree there's overlap here. Personally, I think about manipulation as when an agent's policy robustly steers the human into taking a certain kind of action, in a way that's robust to the human's counterfactual preferences. Like if I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated. A non-manipulative AI would act in a way that increases my knowledge and lets me condition my actions on my preferences.
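A crude operationalization of that definition (my own toy sketch; the names are made up): call an interaction manipulative-looking if the human ends up with the same outcome no matter which counterfactual preference they started from:

```python
def looks_manipulative(policy, preferences):
    """Toy check: an interaction looks manipulative if the human ends
    up with the same outcome regardless of their counterfactual
    starting preferences."""
    outcomes = {policy(pref) for pref in preferences}
    return len(outcomes) == 1

prefs = ["red", "blue", "cheap", "durable"]

# Steers toward blue shoes no matter what I wanted.
manipulative_ai = lambda pref: "blue shoes"
# Lets my action condition on my preference.
helpful_ai = lambda pref: f"{pref} shoes"

print(looks_manipulative(manipulative_ai, prefs))  # → True
print(looks_manipulative(helpful_ai, prefs))       # → False
```

A real test would have to look at distributions over outcomes rather than exact equality, but the shape of the definition — robustness of the outcome to counterfactual preferences — survives.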
I recently watched all 6 seasons of HBO's "Silicon Valley", and the final episode (or really the final 4 episodes leading up to the final one) did a really great job of hitting on some important ideas we talk about in AI safety.
Now, the show in earlier seasons has played with the idea of AI with things like an obvious parody of Ben Goertzel and Sophia, discussion of Roko's Basilisk, and of course AI that Goodharts. In fact, Goodharting is a pivotal plot point in how the show ends, along with a Petrov-esque ending where hard choices have to be made under u... (read more)
One thing I like about this series is that it puts all this online in a fairly condensed form; I often feel like I'm not quite sure what to link to in order to present these kinds of arguments. That you do it better than perhaps we have done in the past makes it all the better!
Any model is going to be in the head of some onlooker. This is the tough part about the white box approach: it's always an inference about what's "really" going on. Of course, this is true even of the boundaries of black boxes, so it's a fully general problem. And I think that suggests it's not a problem except insofar as we have normal problems setting up correspondence between map and territory.
I'm excited to see this cross-over into AI safety discussions. I work on what we often call "reliability engineering" in software, and I think there's a lot of lessons there that apply here, especially the systems-based or highly-contextualized approach, since it acknowledges the same kind of failure as, say, was pointed out in The Design of Everyday Things: just because you build something to spec doesn't mean it works if humans make mistakes using it.
I've not done a lot to bring that over to LW or AF, other than a half-assed post about normalization of d... (read more)
So far, we’ve only talked about one AI ending up aligned, or a handful ending up aligned at one particular time. However, that isn’t really the ultimate goal of AI alignment research. What we really want is for AI to remain aligned in the long run, as we (and AIs themselves) continue to build new and more powerful systems and/or scale up existing systems over time.
I think this suggests an interesting path where alignment by default might be able to serve as a bridge to better alignment mechanisms, i.e. if it works and we can select for AIs that contain re... (read more)
I'm somewhat hopeful that this is right, but I'm also not so confident that I feel like we can ignore the risks of GPT-N.
For example, this post makes the argument that, because of GPT's design and learning mechanism, we need not worry about it coming up with significantly novel things or outperforming humans because it's optimizing for imitating existing human writing, not saying true things. On the other hand, it's managing to do powerful things it wasn't trained for, like solve math equations we have no reason to believe it saw in the training set or write... (read more)