All of G Gordon Worley III's Comments + Replies

Selection Theorems: A Program For Understanding Agents

Interesting. Selection theorems seem like a way of identifying the purposes or source of goal directness in agents that seems obvious to us yet hard to pin down. Compare also the ground of optimization.

David Wolpert on Knowledge

I don't really have a whole picture that I think says more than what others have. I think there's something to knowing as the act of operationalizing information, by which I mean a capacity to act based on information.

To make this more concrete, consider a simple control system like a thermostat or a steam engine governor. These systems contain information in the physical interactions we abstract away to call "signal" that's sent to the "controller". If we had only signal there'd be no knowledge because that's information that is not used to act. The contr... (read more)

David Wolpert on Knowledge

Quick thought: reading this I get a sense that some of our collective confusion here revolves around "knowledge" as a quantifiable noun rather than "knowing" as a verb, and if we give up on the idea that knowledge is first a quantifiable thing (rather than a convenient post hoc reification) we open up new avenues of understanding knowledge.

1Alex Flint25dYeah that resonates with me. I'd be interested in any more thoughts you have on this. Particularly anything about how we might recognize knowing in another entity or in a physical system.
Oracle predictions don't apply to non-existent worlds

Small insight why reading this: I'm starting to suspect that most (all???) unintuitive things that happen with Oracles are the result of them violating our intuitions about causality because they actually deliver no information, in that nothing can be conditioned on what the Oracle says because if we could then the Oracle would fail to actually be an Oracle, so we can only condition on the existence of the Oracle and how it functions and not what it actually says, e.g. you should still 1-box but it's mistaken to think anything an Oracle tells you allows you to do anything different.

2Chris_Leong1moYeah, you want either information about the available counterfactuals or information independent of your decision. Information about just the path taken isn't something you can condition on.
Grokking the Intentional Stance

There's no observer-independent fact of the matter about whether a system "is" an agent[9]

Worth saying, I think, that this is fully generally true that there's no observer-independent fact of the matter about whether X "is" Y. That this is true of agents is just particularly relevant to AI.

Search-in-Territory vs Search-in-Map

I'm not convinced there's an actual distinction to be made here.

Using your mass comparison example, arguably the only meaningful different between the two is where information is stored. In search-in-map it's stored in an auxiliary system; in search-in-territory it's embedded in the system. The same information is still there, though, all that's changed is the mechanism, and I'm not sure map and territory is the right way to talk about this since both are embedded/embodied in actual systems.

My guess is that search-in-map looks like a thing apart from searc... (read more)

A naive alignment strategy and optimism about generalization

For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.

+1 to this and excited and happy to hear about this update in your view!

The reverse Goodhart problem

Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting.

It's a bit hard to think when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left... (read more)

The reverse Goodhart problem

Maybe I'm missing something, but this seems already captured by the normal notion of what Goodharting is in that it's about deviation from the objective, not the direction of that deviation.

3Stuart Armstrong4moThe idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting the way it's used in practice. After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".
Teaching ML to answer questions honestly instead of predicting human answers
  • Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).

In the spirit then of caring about stories about how algorithms lead to bad consequences, a story about how I see not making a clear distinction between instrumental and intended mode... (read more)

Teaching ML to answer questions honestly instead of predicting human answers

I want to consider models that learn to predict both “how a human will answer question Q” (the instrumental model) and “the real answer to question Q” (the intended model). These two models share almost all of their computation — which is dedicated to figuring out what actually happens in the world. They differ only when it comes time to actually extract the answer. I’ll describe the resulting model as having a “world model,” an “instrumental head,” and an “intended head.”

This seems massively underspecified in that it's really unclear to me what's actually... (read more)

3Paul Christiano5moI don't think anyone has a precise general definition of "answer questions honestly" (though I often consider simple examples in which the meaning is clear). But we do all understand how "imitate what a human would say" is completely different (since we all grant the possibility of humans being mistaken or manipulated), and so a strong inductive bias towards "imitate what a human would say" is clearly a problem to be solved even if other concepts are philosophically ambiguous. Sometimes a model might say something like "No one entered the datacenter" when what they really mean is "Someone entered the datacenter, got control of the hard drives with surveillance logs, and modified them to show no trace of their presence." In this case I'd say the answer is "wrong;" when such wrong answers appear as a critical part of a story about catastrophic failure, I'm tempted to look at why they were wrong to try to find a root cause of failure, and to try to look for algorithms that avoid the failure by not being "wrong" in the same intuitive sense. The mechanism in this post is one way that you can get this kind of wrong answer, namely by imitating human answers, and so that's something we can try to fix. On my perspective, the only things that are really fundamental are: * Algorithms to train ML systems. These are programs you can run. * Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with). Everything else is just a heuristic to help us understand why an algorithm might work or where we might look for a possible failure story. I think this is one of the upsides of my research methodology [
Saving Time

Firstly, we don't understand where this logical time might come from, or how to learn it

Okay, you can't write a sentence like that and expect me not to say that it's another manifestation of the problem of the criterion.

Yes, I realize this is not the problem you're interested in, but it's one I'm interested in, so this seems like a good opportunity to think about it anyway.

The issue seems to be that we don't have a good way to ground the order on world states (or, subjectively speaking if we want to be maximally cautious here, experience moments) since we ... (read more)

Pitfalls of the agent model

Somewhat ironically, some of these failures from thinking of oneself or others as agents causes a lack of agency! Maybe this is just a trick of language, but here's what I have in mind from thinking about some of the pitfalls:

  • Self-hatred results in less agency (freedom to do what you want) rather than more because effort is placed on hating the self rather than trying to change the self to be more in the desired state.
  • Procrastination is basically the textbook example of a failure of agency.
  • Hatred of others is basically the same story here as self-hatred.

On... (read more)

2Alex Flint5moYeah right, I agree with those three bullet points very much. Could also say "thinking of oneself or others as Cartesian agents causes a lack of power". Does agency=power? I'm not sure what the appropriate words are but I agree with your point. Yeah, that seems well said to me. This gradual process of taking more things as object seems to lead towards very good things. Circling, I think, has a lot to do with taking the emotions we are used to treating as subject and getting a bit more of an object lens on them just by talking about them. Gendlin's focussing seems to have a lot to do with this, too. Yeah right, it's a great lens to pick up and use when its helpful. But nice to know that it's there and also to be able to put it down by choice.
Where are intentions to be found?

Oh, I don't think those things exactly sidestep the problem of the criterion so much as commit to a response to it without necessarily realizing that's what they're doing. All of them sort of punt on it by saying "let humans figure out that part", which at the end of the day is what any solution is going to do because we're the ones trying to build the AI and making the decisions, but we can be more or less deliberate about how we do this part.

Probability theory and logical induction as lenses

Right. For example, I think Stuart Armstrong is hitting something very important about AI alignment with his pursuit of the idea that there's no free lunch in value learning. We only close the gap by making an "arbitrary" assumption, but it's only arbitrary if you assume there's some kind of context-free version of the truth. Instead we can choose in a non-arbitrary way based on what we care about and is useful to us.

I realize lots of people are bored by this point because they're non-arbitrary solution that is useful is some version of rationality criteri... (read more)

2Alex Flint5moYou're talking about how we ground out our thinking in something that is true but is not just further conceptualization? Look if we just make a choice about the truth by making an assumption then eventually the world really does "bite back". It's possible to try this out by just picking a certain fundamental orientation towards the world and sticking to it no matter what throughout your life for a little while. The more rigidly you adhere to it the more quickly the world will bite back. So I don't think we can just pick a grounding. But at the same time I very much agree that there is no concept that corresponds to the truth in a context-free or absolute way. The analogy I like the most is dance: imagine if I danced a dance that beautifully expressed what it's like to walk in the forest at night. It might be an incredibly evocative dance and it might point towards a deep truth about the forest at night, but it would be strange to claim that a particular dance is the final, absolute, context-free truth. It would be strange to seek after a final, absolute, context-free dance that expresses what it's like to walk in the forest at night in a way that finally captures the actual truth about the forest at night. When we engage in conceptualization, we are engaging in something like a dance. It's a dance with real consequence, real power, real impacts on the world, and real importance. It matters that we dance it and that we get it right. It's hard to think of anything at this point that matters more. But its significance is not a function of its capturing the truth in a final or context-free way. So when I consider "grounding out" my thinking in reality, I think of it in the same way that a dance should "ground out" in reality. That is: it should be about something real. It's also possible to pick some idea about what it's really like to walk in the forest at night and dance in a way that adheres to that idea but not to the reality of what it's actually like to walk
Where are intentions to be found?

Not really. If we were Cartesian, in order to fit the way we find the world, it seems to be that it'd have to be that agentiness is created outside the observable universe, possibly somewhere hypercomputation is possible, which might only admit an answer about how to build AI that would look roughly like "put a soul in it", i.e. link it up to this other place where agentiness is coming from. Although I guess if the world really looked like that maybe the way to do the "soul linkage" part would be visible, but it's not so seems unlikely.

1Alex Flint5moWell ok, agreed, but even if we were Cartesian, we would still have questions about what is the right way to link up our machines with this place where agentiness is coming from, how we discern whether we are in fact Cartesian or embedded, and so on down to the problem of the criterion as you described it. One common response to any such difficult philosophical problems seems to be to just build AI that uses some form of indirect normativity such as CEV or HCH or AI debate to work out what wise humans would do about those philosophical problems. But I don't think it's so easy to sidestep the problem of the criterion.
Beware over-use of the agent model

I think this is right and underappreciated. However I struggle myself to make a clear case of what to do about it. There's something here, but I think it mostly shows up in not getting confused that the agent model just is how reality is, which underwhelms people who perhaps most fail to deeply grok what that means because they have a surface understanding of it.

Probability theory and logical induction as lenses

Well stated. For what it's worth I think this is a great explanation of why I'm always going on about the problem of the criterion: as embedded, finite agents without access to hypercomputation or perfect, a priori knowledge we're stuck in this mess of trying to figure things out from the inside and always getting it a little bit wrong, no matter how hard we try, so it's worth paying attention to that because solving, for example, alignment for idealized mathematical systems that don't exist is maybe interesting but also not an actual solution to the alignment problem.

2Alex Flint5moThat post was a delightful read! Thanks for the pointer. It seems that we cannot ever find, among concepts, a firm foundation on which we can be absolutely sure of our footing. For the same reason, our basic systems of logic, ethics, and empiricism can never be put on absolutely sure footing (Godel, Humean is/ought gap, radical skepticism).
Where are intentions to be found?

Largely agree. I think you're exploring what I'd call the deep implications of the fact that agents are embedded rather than Cartesian.

1Alex Flint5moInteresting. Is it that if we were Caresian, you'd expect to be able to look at the agent-outside-the-world to find answers to questions about what even is the right way to go about building AI?
Testing The Natural Abstraction Hypothesis: Project Intro

Nice! From my perspective this would be pretty exciting because, if natural abstractions exist, it solves at least some of the inference problem I view at the root of solving alignment, i.e. how do you know that the AI really understands you/humans and isn't misunderstanding you/humans in some way that looks like it does understand from the outside but it doesn't. Although I phrased this in terms of reified experiences (noemata/qualia as a generalization of axia), abstractions are essentially the same thing in more familiar language, so I'm quite excited f... (read more)

Solving the whole AGI control problem, version 0.0001

Regarding conservatism, there seems to be an open question of just how robust Goodhart effects are in that we all agree Goodhart is a problem but it's not clear how much of a problem it is and when. We have opinions ranging from mine, which is basically that Goodharting happens the moment you try to apply even the weakest optimization pressure and this will be a problem (or at least a problem in expectation; you might get lucky) for any system you need to never deviate, to what I read to be Paul's position: it's not that bad and we can do a lot to correct ... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Much of this, especially the story of the production web and especially especially the story of the factorial DAOs, reminds me a lot of PKD's autofac. I'm sure there are other fictional examples worth highlighting, but I point out autofac since it's the earliest instance of this idea I'm aware of (published 1955).

2Andrew Critch6moI hadn't read it (nor almost any science fiction books/stories) but yes, you're right! I've now added a callback to Autofac after the "facotiral DAO" story. Thanks.
Coherence arguments imply a force for goal-directed behavior

I think here it makes sense to talk about internal parts, separate from behavior, and real. And similarly in the single agent case: there are physical mechanisms producing the behavior, which can have different characteristics, and which in particular can be ‘in conflict’—in a way that motivates change—or not. I think it is also worth observing that humans find their preferences ‘in conflict’ and try to resolve them, which is suggests that they at least are better understood in terms of both behavior and underlying preferences that are separate from it.&nb

... (read more)
Epistemological Framing for AI Alignment Research

I like this idea. AI alignment research is more like engineering than math or science, and engineering is definitely full of multiple paradigms, not just because it's a big field with lots of specialties that have different requirements, but also because different problems require different solutions and sometimes the same problem can be solved by approaching it in multiple ways.

A classic example from computer science is the equivalence of loops and recursion. In a lot of ways these create two very different approaches to designing systems, writing code, a... (read more)

Bootstrapped Alignment

Thanks both! I definitely had the idea that Paul had mentioned something similar somewhere but hadn't made it a top-level concept. I think there's similar echos in how Eliezer talked about seed AI in the early Friendly AI work.

Bootstrapped Alignment

Seems like it probably does, but only incidentally.

I instead tend to view ML research as the background over which alignment work is now progressing. That is, we're in a race against capabilities research that we have little power to stop, so our best bets are either that it turns out capabilities are about to hit the upper inflection point of an S-curve, buying us some time, or that the capabilities can be safely turned to helping us solve alignment.

I do think there's something interesting about a direction not considered in this post related to intellige... (read more)

Bootstrapped Alignment

Looks good to me! Thanks for planning to include this in the AN!

Suggestions of posts on the AF to review

I think the generalized insight from Armstrong's no free lunch paper is still underappreciated in that I sometimes see papers that, to me, seem to run up against this and fail to realize there's a free variable in their mechanisms that needs to be fixed if they want them to not go off in random directions.

1Adam Shimi8moThanks for the suggestion! I didn't know about this post. We'll consider it. :)
Suggestions of posts on the AF to review

Another post of mine I'll recommend you:

This is the culmination of a series of post on "formal alignment", where I start out saying "what it would mean to formally state what it would mean to build aligned AI" and then from that try to figure out what we'd have to figure out in order to achieve that.

Over the last year I've gotten pulled in other directions so not pushed this line of research forward much, plus I reached a point with it where it was clear it require... (read more)

1Adam Shimi8moThanks for the suggestion! We want to go through the different research agendas (and I already knew about yours), as they give different views/paradigms on AI Alignment. Yet I'm not sure how relevant a review of such posts are. In a sense, the "reviewable" part is the actual research that underlies the agenda, right?
Suggestions of posts on the AF to review

I wrote this post as a summary of a paper I published. It didn't get much attention, so I'd be interesting in having you all review it.

To say a little more, I think the general approach I lay out in here for taking towards safety work is worth considering more deeply and points towards a better process for choosing interventions in attempts to build aligned AI. I think what's more important than the specific examples where I apply the method is t... (read more)

1Adam Shimi8moThanks for the suggestion! It's great to have some methodological posts! We'll consider it. :)
Literature Review on Goal-Directedness

Okay, so here's a more adequate follow up.

In this seminal cybernetics essay a way of thinking about this is layed out.

First, they consider systems that have observable behavior, i.e. systems that take inputs and produce outputs. Such systems can be either active, in that the system itself is the source of energy that produces the outputs, or passive, in that some outside source supplies the energy to power the mechanism. Compare an active plant or animal to something passive like a rock, though obviously whether or not something is active or passive depend... (read more)

Literature Review on Goal-Directedness

Doing a little digging, I realized that the idea of "teleological mechanism" from cybernetics is probably a better handle for the idea and will provide a more accessible presentation of the idea. Some decent references:

I don't know of anywhere that presents the idea quite how I think of it, though. If you read Dreyfus on Heidegger you might manage to pick this out. Similarly I think this idea underlies Sartre's talk about freedom... (read more)

Literature Review on Goal-Directedness

Reading this, I'm realizing again something I may have realized before and forgotten, but I think ideas about goal-directedness in AI have a lot of overlap with the philosophical topic of telos and Heideggerian care/concern.

The way I think about this is that ontological beings (that is, any process we can identify as producing information) have some ability to optimize (because information is produced by feedback) and must optimize for something rather than nothing (else they are not optimizers) or everything (in which case they are not finite, which they ... (read more)

1Adam Shimi9moThanks for the proposed idea! Yet I find myself lost when trying to find more information about this concept of care. It is mentioned in both the chapter on Heidegger in The History of Philosophy [] and the section on care in the SEP article on Heidegger [], but I don't get a single thing written there. I think the ideas of "thrownness" and "disposedness" are related? Do you have specific pointers to deeper discussions of this concept? Specifically, I'm interested in new intuitions for how a goal is revealed by actions.
Values Form a Shifting Landscape (and why you might care)

I like that this post is fairly accessible, although I found the charts confusing, largely because it's not always that clear to me what's being measured on each axis. I basically get what's going on, but I find myself disliking something about way the charts are presented because it's not always very clear what each axis measures.

(In some cases I think of them as more like being multidimensional spaces you've put on a line, but that still makes the visuals kind of confusing.)

None of this is really meant to be a big complaint, though. Graphics are hard; I ... (read more)

1Vojtech Kovarik10moThank you for the comment. As for the axes, the y-axis always denotes the desirability of the given value-system (except for Figure 1). And you are exactly right with the x-axis --- that is a multidimensional space of value-systems that we put on a line, because drawing this in 3D (well, (multi+1)-D :-) ) would be a mess. I will see if I can make it somewhat clearer in the post.
AI Problems Shared by Non-AI Systems

An important caveat is that many non-AI systems have humans in the loop somewhere that can intervene if they don't like what the automated system is doing. Some examples:

  • we shut down stock markets that seem to be out of control
  • employees ignore standard operating procedures when they get into corner cases and SOPs would have them do something that would hurt them or that they'd get in trouble for
  • an advertiser might manually override their automated ad bidding algorithm if it tries to spend too much or too little money
  • customer service reps (or their managers
... (read more)
2Vojtech Kovarik10moAll of these possible causes for the lack of support are valid. I would like to add one more: when the humans that could provide this kind of support don't care about providing it or have incentives against providing it. For example, I could report a bug in some system, but this would cost me time and only benefit people I don't know, so I will happily ignore it :-).
Recursive Quantilizers II

Just want to register I agree with your assessment of concerns, and generally reflects my concerns with attempts to bound risk. I generally think evolution is a good example of this: "risk" is bounded within an individual generation because organisms cannot arbitrarily change themselves, but over generations there's little bounding where things can go other than getting trapped in local maxima.

Early Thoughts on Ontology/Grounding Problems

Interesting. I can't recall if I commented on the alignment as translation post about this, but I think this is in fact the key thing standing in the way of addressing alignment, and put together a formal model that identified this as the problem, i.e. how do you ensure that two minds agree about preference ordering, or really even the statements being ordered.

The ethics of AI for the Routledge Encyclopedia of Philosophy

This post I wrote a while back has some references you might find useful: "A developmentally-situated approach to teaching normative behavior to AI".

Also I think some of the references in this paper I wrote might be useful: "Robustness to fundamental uncertainty in AGI alignment".

Topics that seem important to cover to me include not only AI impact on humans but also questions surrounding the subjective experience of AI, which largely revolves around the question of if AI have subjective experience or are otherwise moral patients at all.

2Stuart Armstrong1yThanks!
Supervised learning of outputs in the brain

Branch predictors for sure, but modern CPUs also do things like managing multiple layers of cache using relatively simple algorithms that nonetheless in practice get high hit rates, conversion of instructions into microcode because it turns out small, simple instructions execute faster but CPUs need to do a lot of things so the tradeoff is to have the CPU interpret the instructions in real time into simpler instructions sent to specialized processing units inside the CPU, and maybe even optimistic branch execution, where instructions in the pipeline are partially executed provisionally ahead of branches being confirmed. All of these things seem like tricks of the sort I wouldn't be surprised to find parallels to in the brain.

Supervised learning of outputs in the brain

Your toy models drew a parallel for me to modern CPU architectures. That is, doing computation the "complete" way involves loading things from memory, doing math, writing to memory, and then that memory might affect later instructions. CPUs have all kinds of tricks to get around this to go faster, and it sort of like like your models of brain parts, only with a reversed etiology, since the ACU came first whereas the neocortex came last, as i understand it.

2Steve Byrnes1yInteresting! I'm thinking of CPU brach predictors [] , are you? (Are there other examples? Don't know much about CPUs.) If so, that did seem like a suggestive analogy to what I was calling "the accelerator". Not sure about etiology. How different is a neocortex from the pallium in a bird or lizard? I'm inclined to say "only superficially different", although I don't think it's known for sure. But if so, then there's a version of it even in lampreys, if memory serves. I don't know the evolutionary history of the cerebellum, or of the cerebellum-pallium loops. It might be in this paper by Cisek [] which I read but haven't fully processed / internalized.
Knowledge, manipulation, and free will

Hmm, I see some problems here.

By looking for manipulation on the basis of counterfactuals, you're at the mercy of your ability to find such counterfactuals, and that ability can also be manipulated such that you can't notice either the object level counterfactuals that would make you suspect manipulation of the counterfactuals about your counterfactual reasoning that would make you suspect manipulation. This seems insufficiently robust way to detect manipulation, or even define it since the mechanism of detecting it can itself be manipulated to not notice ... (read more)

1Alex Turner1yOK, but there's a difference between "here's a definition of manipulation that's so waterproof you couldn't break it if you optimized against it with arbitrarily large optimization power" and "here's my current best way of thinking about manipulation." I was presenting the latter, because it helps me be less confused than if I just stuck to my previous gut-level, intuitive understanding of manipulation. Edit: Put otherwise, I was replying more to your point (1) than your point (2) in the original comment. Sorry for the ambiguity!
Knowledge, manipulation, and free will

So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.

Two thoughts.

One, this seems likely to have some overlap with notions of impact and impact measures.

Two, it seems like there's no real way to eliminate manipulation in a very broad sense, because we'd expect our AI to be causally entangled with the human, so there's no action the AI could take that would not influence the human in some... (read more)

3Charlie Steiner1yI agree. The important part of cases 5 & 6, where some other agent "manipulates" Petrov, is that suddenly, to us human readers, it seems like the protagonist of the story (and we do model it as a story) is the cook/kidnapper, not Petrov. I'm fine with the AI choosing actions using a model of the world that includes me. I'm not fine with it supplanting me from my agent-shaped place in the story I tell about my life.

Not Stuart, but I agree there's overlap here. Personally, I think about manipulation as when an agent's policy robustly steers the human into taking a certain kind of action, in a way that's robust to the human's counterfactual preferences. Like if I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated. A non-manipulative AI would act in a way that increases my knowledge and lets me condition my actions on my preferences.

G Gordon Worley III's Shortform

I recently watched all 7 seasons of HBO's "Silicon Valley" and the final episode (or really the final 4 episodes leading up into the final one) did a really great job of hitting on some important ideas we talk about in AI safety.

Now, the show in earlier seasons has played with the idea of AI with things like an obvious parody of Ben Goertzel and Sophia, discussion of Roko's Basilisk, and of course AI that Goodharts. In fact, Goodharting is a pivotal plot point in how the show ends, along with a Petrov-esque ending where hard choices have to be made under u... (read more)

AGI safety from first principles: Conclusion

One thing I like about this series is that it puts all this online in a fairly condensed form, which I feel like I often am not quite sure what to link to in order to present these kinds of arguments. That you do it better than perhaps we have done in the past makes it all the better!

Learning human preferences: black-box, white-box, and structured white-box access

Any model is going to be in the head of some onlooker. This is the tough part about the white box approach: it's always an inference about what's "really" going on. Of course, this is true even of the boundaries of black boxes, so it's a fully general problem. And I think that suggests it's not a problem except insofar as we have normal problems setting up correspondence between map and territory.

2Steve Byrnes1yMy understanding of the OP was that there is a robot, and the robot has source code, and "black box" means we don't see the source code but get an impenetrable binary and can do tests of what its input-output behavior is, and "white box" means we get the source code and run it step-by-step in debugging mode but the names of variables, functions, modules, etc. are replaced by random strings. We can still see the structure of the code, like "module A calls module B". And "labeled white box" means we get the source code along with well-chosen names of variables, functions, etc. Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer? But now this conversation is suggesting that I'm not quite understanding it right. "Black box" is what I thought, but "white box" is any source code that produces the same input-output behavior—not necessarily the robot's actual source code—and that includes source code that does extra pointless calculations internally. And then my question doesn't really make sense, because whatever "preferences" is, I can come up a white-box model wherein "preferences" is calculated and then immediately deleted, such that it's not part of the input-output behavior. Something like that?
[AN #112]: Engineering a Safer World

I'm excited to see this cross-over into AI safety discussions. I work on what we often call "reliability engineering" in software, and I think there's a lot of lessons there that apply here, especially the systems-based or highly-contextualized approach, since it acknowledges the same kind of failure as, say, was pointed out in The Design of Everyday Things: just because you build something to spec doesn't mean it works if humans make mistakes using it.

I've not done a lot to bring that over to LW or AF, other than a half-assed post about normalization of d... (read more)

Alignment By Default

So far, we’ve only talked about one AI ending up aligned, or a handful ending up aligned at one particular time. However, that isn’t really the ultimate goal of AI alignment research. What we really want is for AI to remain aligned in the long run, as we (and AIs themselves) continue to build new and more powerful systems and/or scale up existing systems over time.

I think this suggests an interesting path where alignment by default might be able to serve as a bridge to better alignment mechanisms, i.e. if it works and we can select for AIs that contains re... (read more)

2johnswentworth1yI think of this as the Rohin trajectory, since he's the main person I've heard talk about it. I agree it's a natural approach to consider, though deceptiveness-type problems are a big potential issue.
2Adam Shimi1yIsn't remaining aligned an example of robust delegation [] ? If so, there have been both discussions and technical work on this problem before.
The Fusion Power Generator Scenario

I somewhat hopeful that this is right, but I'm also not so confident that I feel like we can ignore the risks of GPT-N.

For example, this post makes the argument that, because of GPT's design and learning mechanism, we need not worry about it coming up with significantly novel things or outperforming humans because it's optimizing for imitating existing human writing, not saying true things. On the other hand, it's managing to do powerful things it wasn't trained for, like solve math equations we have no reason to believe it saw in the training set or write... (read more)

2Donald Hobson1yAny of the risks of being like a group of humans, only much faster, apply. There are also the mesa alignment issues. I suspect that a sufficiently powerful GPT-n might form deceptively aligned mesa optimisers. I would also worry that off distribution attractors could be malign and intelligent. Suppose you give GPT-n an off training distribution prompt. You get it to generate text from this prompt. Sometimes it might wander back into the distribution, other times it might stay off distribution. How wide is the border between processes that are safely immitating humans, and processes that aren't performing significant optimization? You could get "viruses", patterns of text that encourage GPT-n to repeat them so they don't drop out of context. GPT-n already has an accurate world model, a world model that probably models the thought processes of humans in detail. You have all the components needed to create powerful malign intelligences, and a process that smashes them together indiscriminately.
Load More