G Gordon Worley III

Director of Research at PAISRI

Bootstrapped Alignment

Seems like it probably does, but only incidentally.

I instead tend to view ML research as the background over which alignment work is now progressing. That is, we're in a race against capabilities research that we have little power to stop, so our best bets are either that it turns out capabilities are about to hit the upper inflection point of an S-curve, buying us some time, or that the capabilities can be safely turned to helping us solve alignment.

I do think there's something interesting about a direction not considered in this post: intelligence enhancement of humans and human emulations (ems) as a means of working on alignment. Realistically, though, current projections of AI capability timelines suggest these are unlikely to have much opportunity for impact.

Bootstrapped Alignment

Looks good to me! Thanks for planning to include this in the AN!

Suggestions of posts on the AF to review

I think the generalized insight from Armstrong's no free lunch paper is still underappreciated: I sometimes see papers that, to me, seem to run up against it without realizing there's a free variable in their mechanisms that needs to be fixed if they don't want them to go off in random directions.

https://www.lesswrong.com/posts/LRYwpq8i9ym7Wuyoc/other-versions-of-no-free-lunch-in-value-learning

Suggestions of posts on the AF to review

Another post of mine I'll recommend to you:

https://www.lesswrong.com/posts/k8F8TBzuZtLheJt47/deconfusing-human-values-research-agenda-v1

This is the culmination of a series of posts on "formal alignment", where I start out by asking what it would mean to formally state what it means to build aligned AI, and then from that try to figure out what we'd have to figure out in order to achieve it.

Over the last year I've gotten pulled in other directions, so I haven't pushed this line of research forward much, and I reached a point with it where it was clear that making additional progress would require different specialization than I have. Still, I think it presents a different approach from what others are doing in the space of work toward AI alignment, and you might find it interesting to review (along with the preceding posts in the series) for that reason.

Suggestions of posts on the AF to review

I wrote this post as a summary of a paper I published. It didn't get much attention, so I'd be interested in having you all review it.

https://www.lesswrong.com/posts/JYdGCrD55FhS4iHvY/robustness-to-fundamental-uncertainty-in-agi-alignment-1

To say a little more, I think the general approach to safety work I lay out here is worth considering more deeply, and it points toward a better process for choosing interventions in attempts to build aligned AI. The method itself matters more to me than the specific examples where I apply it, but thus far, as best I can tell, folks haven't engaged much with that, so it's unclear to me whether that's because they disagree, think it's too obvious, or something else.

Literature Review on Goal-Directedness

Okay, so here's a more adequate follow-up.

This seminal cybernetics essay lays out a way of thinking about this.

First, they consider systems that have observable behavior, i.e. systems that take inputs and produce outputs. Such systems can be either active, in that the system itself is the source of energy that produces the outputs, or passive, in that some outside source supplies the energy to power the mechanism. Compare an active plant or animal to something passive like a rock, though obviously whether or not something is active or passive depends a lot on where you draw the boundaries of its inside vs. its outside.

Active behavior is subdivided into two classes: purposeful and purposeless. They say that purposeful behavior is that which can be interpreted as directed at attaining a goal; purposeless behavior cannot. They spend some time in the paper defending the idea of purposefulness, and I think it doesn't go well. So I'd instead propose we think of these terms differently: I prefer to think of purposeful behavior as that which creates a reduction in entropy within the system and its outputs, and purposeless behavior as that which does not.

They then go on to divide purposeful behavior into teleological and non-teleological behavior, by which they simply mean behavior that's the result of feedback (and they specify negative feedback) or not. In LessWrong terms, I'd say this is like the difference between optimizers ("fitness maximizers") and adaptation executors.

They then go on to make a few additional distinctions that are not relevant to the present topic, although they do have some relevance to AI alignment via the predictability of systems.

I'd say, then, that systems with active, purposeful, teleological behavior are the ones that "care", and the teleological mechanism is the aspect of the system by which it is made to care.
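To make the taxonomy concrete, here's a toy Python sketch of the classification as a decision procedure. This is my own illustration, not anything from the essay: the predicate names (supplies_own_energy, reduces_entropy, uses_negative_feedback) are hypothetical stand-ins for judgment calls the essay treats informally, and reduces_entropy encodes my entropy gloss on "purposeful" rather than their own defense of purpose.

```python
from enum import Enum, auto

class BehaviorClass(Enum):
    """The taxonomy sketched above: passive vs. active, purposeful vs.
    purposeless, teleological (negative-feedback-driven) vs. not."""
    PASSIVE = auto()
    ACTIVE_PURPOSELESS = auto()
    ACTIVE_PURPOSEFUL_NON_TELEOLOGICAL = auto()
    ACTIVE_PURPOSEFUL_TELEOLOGICAL = auto()

def classify(supplies_own_energy: bool,
             reduces_entropy: bool,
             uses_negative_feedback: bool) -> BehaviorClass:
    # The three booleans are idealized judgment calls that depend on where
    # you draw the system's boundary; they aren't directly measurable.
    if not supplies_own_energy:
        return BehaviorClass.PASSIVE
    if not reduces_entropy:  # my entropy-reduction gloss on "purposeful"
        return BehaviorClass.ACTIVE_PURPOSELESS
    if uses_negative_feedback:
        return BehaviorClass.ACTIVE_PURPOSEFUL_TELEOLOGICAL
    return BehaviorClass.ACTIVE_PURPOSEFUL_NON_TELEOLOGICAL

# A rock is passive; a thermostat is active, purposeful, and teleological,
# since it corrects toward a setpoint via negative feedback.
assert classify(False, False, False) is BehaviorClass.PASSIVE
assert classify(True, True, True) is BehaviorClass.ACTIVE_PURPOSEFUL_TELEOLOGICAL
```

On this sketch, the systems that "care" are exactly the ones that land in the last branch.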

Literature Review on Goal-Directedness

Doing a little digging, I realized that the idea of "teleological mechanism" from cybernetics is probably a better handle for this and will provide a more accessible presentation of the idea. Some decent references:

https://www.jstor.org/stable/184878

https://www.jstor.org/stable/2103479

https://nyaspubs.onlinelibrary.wiley.com/toc/17496632/50/4

I don't know of anywhere that presents the idea quite how I think of it, though. If you read Dreyfus on Heidegger you might manage to pick this out. Similarly, I think this idea underlies Sartre's talk about freedom, but I can't recall that he explicitly makes the connection the way I would. To the best of my knowledge, philosophers have unfortunately not said enough about this topic: it's omnipresent in humans and often comes up incidentally while considering other things, but it's not deeply explored for its own sake except when people are confused (cf. Hegel on teleology).

Literature Review on Goal-Directedness

Reading this, I'm realizing again something I may have realized before and forgotten: ideas about goal-directedness in AI have a lot of overlap with the philosophical topic of telos and Heideggerian care/concern.

The way I think about this is that ontological beings (that is, any process we can identify as producing information) have some ability to optimize (because information is produced by feedback) and must optimize for something rather than nothing (else they are not optimizers) or everything (in which case they are not finite, which they must be in our world). Thus we should expect that anything we might think of as optimizing will have something it cares about, where "caring about" is not the self-reflective way humans may knowingly care for something but the implicit way that acts demonstrate care for something.

That something might not be very crisp, might be hard to specify, or might be incoherent (or at least incoherent when not conditioned on the entire state of the world), so we might not be able to line it up perfectly with a notion like a utility function, although we could say a utility function is an attempt to represent the concern of an optimization process.

That optimization processes must care about something is similar to the way that the intentionality of thought/belief means thoughts must be about something; I think this underlies some of the discussion around Dennett's position, though it's not discussed here.

Values Form a Shifting Landscape (and why you might care)

I like that this post is fairly accessible, although I found the charts confusing, largely because it's not always clear to me what's being measured on each axis. I basically get what's going on, but something about the way the charts are presented bothers me for that reason.

(In some cases I think of them more like multidimensional spaces you've projected onto a line, but that still makes the visuals kind of confusing.)

None of this is really meant to be a big complaint, though. Graphics are hard; I probably wouldn't have even tried to illustrate it, so kudos to you for trying. Just felt it was also useful to register my feedback that they didn't quite land for me even though I got the gist of them.

AI Problems Shared by Non-AI Systems

An important caveat is that many non-AI systems have humans in the loop somewhere who can intervene if they don't like what the automated system is doing. Some examples:

  • we shut down stock markets that seem to be out of control
  • employees ignore standard operating procedures when they hit corner cases where following the SOP would hurt them or get them in trouble
  • an advertiser might manually override their automated ad bidding algorithm if it tries to spend too much or too little money
  • customer service reps (or their managers) are empowered to override automated support systems (e.g. talk to the operator when an automated phone system can't handle your request)

Much of the concern about AI systems is about cases where they lack support for these kinds of interventions, whether because they are too fast or too complex, or because they can outsmart the would-be intervening human trying to correct what they see as an error.
