All of G Gordon Worley III's Comments + Replies

G Gordon Worley III's Shortform

AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So for example AlphaGo could suffer the error of instrumental power grabbing in order to be able to get better at winning Go because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make the measure adequately evaluate outcomes as we intended, since we cared about winning Go games while also minimizing side effects, but maybe when we cons... (read more)

Optimization at a Distance

Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.

I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.

G Gordon Worley III's Shortform

"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.

rhollerith_dot_com · 4d
Can you explain where there is an error term in AlphaGo, or where an error term might appear in a hypothetical model similar to AlphaGo trained much longer with much more numerous parameters and computational resources?
Against Time in Agent Models

For what it's worth, I think this is trying to get at the same insight as logical time but via a different path.

For the curious reader, this is also the same reason we use vector clocks to build distributed systems when we can't synchronize the clocks very well. 

And there's something quite interesting about computation as a partial order. It might seem that this only comes up when you have a "distributed" system, but actually you need partial orders to reason about unitary programs when they are non-deterministic (any program with loops and conditiona... (read more)
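
For readers who haven't run into them, here is a minimal sketch of the vector-clock idea; the process names and dictionary representation are just illustrative:

```python
# Minimal vector clock sketch: each process keeps a counter per process.
# Events are only partially ordered: happened_before(a, b) iff a is in b's causal past;
# otherwise the events are concurrent (incomparable).

def increment(clock, pid):
    """Local event at process `pid`: bump that process's counter."""
    new = dict(clock)
    new[pid] = new.get(pid, 0) + 1
    return new

def merge(local, received, pid):
    """On message receipt: take the element-wise max, then count the receive event."""
    keys = set(local) | set(received)
    merged = {k: max(local.get(k, 0), received.get(k, 0)) for k in keys}
    return increment(merged, pid)

def happened_before(a, b):
    """Partial order: a <= b element-wise and a != b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

# Two processes, no shared wall clock:
p0 = increment({}, "P0")           # {'P0': 1}
p1 = increment({}, "P1")           # {'P1': 1}
p1 = merge(p1, p0, "P1")           # P1 receives P0's event
print(happened_before(p0, p1))     # True: P0's event is in P1's causal past
print(happened_before(p1, p0))     # False
```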

G Gordon Worley III's Shortform

I actually don't think that model is general enough. Like, I think Goodharting is just a fact of a control system's observing.

Suppose we have a simple control system with output O and a governor G that takes a measurement M (an observation) of O. So long as M is not error free (and I think we can agree that no real world system can be actually error free), then M = O + ε for some error factor ε. Since G uses M to regulate the system to change O, we now have error ... (read more)
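
Here is a minimal simulation of that sort of loop. The names (output O, measurement M) are chosen to match the notation above and are otherwise arbitrary; the setpoint, gain, and noise model are invented for illustration.

```python
import random

# Toy control loop: the governor reads a noisy measurement M of output O and
# pushes O toward a setpoint. The loop regulates M = O + error, not O itself,
# so any bias in the error shows up directly in where O settles.

def run(setpoint=20.0, bias=0.0, noise=0.5, steps=2000, gain=0.1):
    O = 0.0
    for _ in range(steps):
        M = O + random.gauss(bias, noise)   # measurement = output + error
        O += gain * (setpoint - M)          # governor acts on the measurement
    return O

random.seed(0)
print(run(bias=0.0))   # settles near 20: zero-mean error mostly averages out
print(run(bias=2.0))   # settles near 18: the loop optimizes M, so O absorbs the bias
```

With zero-mean error the governor still settles near the setpoint, but with a biased error it faithfully regulates M while O ends up offset by the bias, which is one concrete sense in which the error term matters.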

Sam Marks · 6d
Hmm, I'm not sure I understand -- it doesn't seem to me like noisy observations ought to pose a big problem to control systems in general. For example, suppose we want to minimize the number of mosquitos in the U.S., and we have access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we'll still always be doing the approximately correct thing and mosquito counts will go down. In particular, I don't see a sense in which the error "comes to dominate" the thing we're optimizing.

One concern which does make sense to me (and I'm not sure if I'm steelmanning your point or just saying something completely different) is that under extreme optimization pressure, measurements might become decoupled from the thing they're supposed to measure. In the mosquito example, this would look like us bribing the surveyors to report artificially low mosquito counts instead of actually trying to affect real-world mosquito counts. If this is your primary concern regarding Goodhart's Law, then I agree the model above doesn't obviously capture it. I guess it's more precisely a model of proxy misspecification.
G Gordon Worley III's Shortform

I'm fairly pessimistic about our ability to build aligned AI. My take is roughly that it's theoretically impossible, and at best we might build AI that is aligned well enough that we don't lose. I've not written anything that really summarizes this or proves it, though.

The source of my take comes from two facts:

  1. Goodharting is robust. That is, the mechanism of Goodharting seems impossible to overcome. Goodharting is just a fact of any control system.
  2. It's impossible to infer the inner experience (and thus values) of another being perfectly without making normative
... (read more)
rhollerith_dot_com · 8d
At least one person here disagrees with you on Goodharting. (I do.) You've written before on this site, if I recall correctly, that Eliezer's 2004 CEV proposal is unworkable because of Goodharting. I am granting myself the luxury of not bothering to look up your previous statement because you can contradict me if my recollection is incorrect. I believe that the CEV proposal is probably achievable by humans if those humans had enough time and enough resources (money, talent, protection from meddling), and that if it is not achievable, it is because of reasons other than Goodhart's law. (Sadly, an unaligned superintelligence is much easier for humans living in 2022 to create than a CEV-aligned superintelligence is, so we are probably all going to die IMHO.)

Perhaps before discussing the CEV proposal we should discuss a simpler question, namely, whether you believe that Goodharting inevitably ruins the plans of any group setting out intentionally to create a superintelligent paperclip maximizer. Another simple goal we might discuss is a superintelligence (SI) whose goal is to shove as much matter as possible into a black hole, or an SI that "shuts itself off" within 3 months of its launching, where "shuts itself off" means stops trying to survive or to affect reality in any way.

This paper gives a mathematical model of when Goodharting will occur. To summarize: if

(1) a human has some collection J_1, ..., J_n of things which she values,

(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and

(3) the robot can freely vary how much of each J_i there is in the world, subject only to resource constraints that make the J_i trade off against each other,

then when the robot optimizes for its proxy utility, it will minimize all the J_i's which its proxy utility... (read more)
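
A small numerical sketch of conditions (1)-(3), with invented utility functions and the J_i names from the summary above: true utility increases in all three attributes, the proxy only sees the first two, and a shared budget makes them trade off.

```python
from math import log1p

# Three things the human values (J1, J2, J3) trade off under a shared budget of 10.
# The robot's proxy utility only sees J1 and J2; the true utility cares about all three.

def true_utility(j1, j2, j3):
    return log1p(j1) + log1p(j2) + log1p(j3)

def proxy_utility(j1, j2, j3):
    return log1p(j1) + log1p(j2)            # J3 is simply not represented

budget = 10
grid = [x / 10 for x in range(0, 101)]      # candidate allocations in steps of 0.1
best = max(
    ((a, b, budget - a - b) for a in grid for b in grid if a + b <= budget),
    key=lambda j: proxy_utility(*j),
)
print(best)   # (5.0, 5.0, 0.0): whatever the proxy ignores gets driven to zero
print(true_utility(*best), true_utility(10 / 3, 10 / 3, 10 / 3))  # and true utility suffers
```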

Why I'm co-founding Aligned AI

This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'd be fairly surprised if there's something big here.

Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I k... (read more)

Value extrapolation partially resolves symbol grounding

This doesn't really seem like solving symbol grounding, partially or not, so much as an argument that it's a non-problem for the purposes of value alignment.

$1000 USD prize - Circular Dependency of Counterfactuals

Agreed. That said, I don't think counterfactuals are in the territory. I think I said before that they were in the map, although I'm now leaning away from that characterisation as I feel that they are more of a fundamental category that we use to draw the map.

Yes, I think there is something interesting going on where human brains seem to operate in a way that makes counterfactuals natural. I actually don't think there's anything special about counterfactuals, though, just that the human brain is designed such that thoughts are not strongly tethered to sens... (read more)

$1000 USD prize - Circular Dependency of Counterfactuals

I don't think they're really at odds. Zack's analysis cuts off at a point where the circularity exists below it. There's still the standard epistemic circularity that exists whenever you try to ground out any proposition, counterfactual or not, but there's a level of abstraction where you can remove the seeming circularity by shoving it lower or deeper into the reduction of the proposition towards grounding out in some experience.

Another way to put this is that we can choose what to be pragmatic about. Zack's analysis choosing to be pragmatic about counter... (read more)

Chris_Leong · 5mo
I wouldn't be surprised if other concepts such as probability were circular in the same way as counterfactuals, although I feel that this is more than just a special case of epistemic circularity. Like I agree that we can only reason starting from where we are - rather than from the view from nowhere - but counterfactuals feel different because they are such a fundamental concept that appears everywhere. As an example, our understanding of chairs doesn't seem circular in quite the same sense. That said, I'd love to see someone explore this line of thought.

I could be wrong, but I suspect Zack would disagree with the notion that there is a circularity below it involving counterfactuals. I wouldn't be surprised though if Zack acknowledged a circularity not involving counterfactuals.

Agreed. That said, I don't think counterfactuals are in the territory. I think I said before that they were in the map, although I'm now leaning away from that characterisation as I feel that they are more of a fundamental category that we use to draw the map.
$1000 USD prize - Circular Dependency of Counterfactuals

I think A is solved, though I wouldn't exactly phrase it like that, more like counterfactuals make sense because they are what they are and knowledge works the way it does.

Zack seems to be making a claim to B, but I'm not expert enough in decision theory to say much about it.

Chris_Leong · 5mo
Sorry, when you say A is solved, you're claiming that the circularity is known to be true, right? Zack seems to be claiming that Bayesian Networks both draw out the implications and show that the circularity is false. So unless I'm misunderstanding you, your answer seems to be at odds with Zack.
$1000 USD prize - Circular Dependency of Counterfactuals

I mostly agree with Zack_M_Davis that this is a solved problem, although rather than talking about a formalization of causality I'd say this is a special case of epistemic circularity and thus an instance of the problem of the criterion. There's nothing unusual going on with counterfactuals other than that people sometimes get confused about what propositions are (e.g. they believe propositions have some sort of absolute truth beyond causality because they fail to realize epistemology is grounded in purpose rather than something eternal and external to the... (read more)

Chris_Leong · 5mo
Which part are you claiming is a solved problem? Is it: a) That counterfactuals can only be understood within the counterfactual perspective OR b) The implications of this for decision theory OR c) Both
Integrating Three Models of (Human) Cognition

Thanks for this thorough summary. At this point the content has become spread over a book's worth of posts, so it's handy to have this high-level, if long, summary!

Drug addicts and deceptively aligned agents - a comparative analysis

Thanks for this interesting read.

I think there's similar work that can be done to find safety analogues from a large number of fields. Some that come to mind include organizational design, market analysis, and design of public institutions.

Daniel Kokotajlo's Shortform

Some of my own:

  • SSDs
  • laptops
  • CDs
  • digital cameras
  • modems
  • genome sequencing
  • automatic transmissions for cars that perform better than a moderately skilled human using a manual transmission can
  • cheap shipping
  • solar panels with reasonable power generation
  • breathable wrinkle free fabrics that you can put in the washing machine
  • bamboo textiles
  • good virtual keyboards for phones
  • scissor switches
  • USB
  • GPS
Daniel Kokotajlo · 7mo
Oh yeah, cheap shipping! I grew up in a military family, all around the world, and I remember thinking it was so cool that my parents could go on "ebay" and order things and then they would be shipped to us! And then now look where we are -- groceries delivered in ten minutes! Almost everything I buy, I buy online!
Selection Theorems: A Program For Understanding Agents

Interesting. Selection theorems seem like a way of identifying the purposes or source of the goal-directedness in agents that seems obvious to us yet is hard to pin down. Compare also the ground of optimization.

David Wolpert on Knowledge

I don't really have a whole picture that I think says more than what others have. I think there's something to knowing as the act of operationalizing information, by which I mean a capacity to act based on information.

To make this more concrete, consider a simple control system like a thermostat or a steam engine governor. These systems contain information in the physical interactions we abstract away to call "signal" that's sent to the "controller". If we had only signal there'd be no knowledge because that's information that is not used to act. The contr... (read more)
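
A toy contrast along these lines (everything here is invented for illustration): the same temperature signal is merely stored by a logger, but is operationalized into action by a thermostat.

```python
# On the framing above: a "signal" is just information until some component uses
# it to act. The logger stores readings; the thermostat turns the same readings
# into switching behavior.

class Logger:
    def __init__(self):
        self.readings = []
    def observe(self, temp):
        self.readings.append(temp)      # information retained, nothing done with it

class Thermostat:
    def __init__(self, setpoint):
        self.setpoint = setpoint
        self.heater_on = False
    def observe(self, temp):
        self.heater_on = temp < self.setpoint   # the reading is operationalized into action

logger, thermostat = Logger(), Thermostat(setpoint=20)
for temp in [22, 19, 18, 21]:
    logger.observe(temp)
    thermostat.observe(temp)

print(logger.readings)       # [22, 19, 18, 21]: stored signal
print(thermostat.heater_on)  # False: the last reading (21) switched the heater off
```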

David Wolpert on Knowledge

Quick thought: reading this I get a sense that some of our collective confusion here revolves around "knowledge" as a quantifiable noun rather than "knowing" as a verb, and if we give up on the idea that knowledge is first a quantifiable thing (rather than a convenient post hoc reification) we open up new avenues of understanding knowledge.

Alex Flint · 8mo
Yeah that resonates with me. I'd be interested in any more thoughts you have on this. Particularly anything about how we might recognize knowing in another entity or in a physical system.
Oracle predictions don't apply to non-existent worlds

Small insight while reading this: I'm starting to suspect that most (all???) unintuitive things that happen with Oracles are the result of them violating our intuitions about causality because they actually deliver no information. Nothing can be conditioned on what the Oracle says, because if we could condition on it then the Oracle would fail to actually be an Oracle; we can only condition on the existence of the Oracle and how it functions, not on what it actually says. E.g. you should still 1-box, but it's mistaken to think anything an Oracle tells you allows you to do anything different.
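
As a back-of-the-envelope check on the 1-boxing claim, assuming the standard Newcomb payoffs ($1,000,000 in the opaque box iff 1-boxing was predicted, $1,000 always in the transparent box):

```python
# Expected value of 1-boxing vs 2-boxing against a predictor of given accuracy,
# with the standard (assumed) Newcomb payoffs.

def expected_value(one_box, predictor_accuracy):
    if one_box:
        return predictor_accuracy * 1_000_000
    return predictor_accuracy * 1_000 + (1 - predictor_accuracy) * (1_000_000 + 1_000)

for acc in (1.0, 0.99):
    print(acc, expected_value(True, acc), expected_value(False, acc))
# With a reliable Oracle, 1-boxing dominates: the prediction already screens off
# any advantage you might hope to get from acting on what the Oracle said.
```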

Chris_Leong · 8mo
Yeah, you want either information about the available counterfactuals or information independent of your decision. Information about just the path taken isn't something you can condition on.
Grokking the Intentional Stance

There's no observer-independent fact of the matter about whether a system "is" an agent[9]

Worth saying, I think, that this is fully generally true that there's no observer-independent fact of the matter about whether X "is" Y. That this is true of agents is just particularly relevant to AI.

Search-in-Territory vs Search-in-Map

I'm not convinced there's an actual distinction to be made here.

Using your mass comparison example, arguably the only meaningful difference between the two is where information is stored. In search-in-map it's stored in an auxiliary system; in search-in-territory it's embedded in the system. The same information is still there, though; all that's changed is the mechanism, and I'm not sure map and territory is the right way to talk about this since both are embedded/embodied in actual systems.

My guess is that search-in-map looks like a thing apart from searc... (read more)
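
A toy version of the mass-comparison example (object names and numbers invented): the same comparison information is available either from an auxiliary record or by "weighing" the objects directly, which is the sense in which only the storage location differs.

```python
import random

# Two ways to find the heaviest of some objects. The comparison information is
# the same in both cases; what differs is whether it lives in an auxiliary
# record (a "map") or is read off the objects themselves (the "territory").

random.seed(0)
objects = {"rock": 3.1, "brick": 2.4, "log": 5.7}   # stand-in for physical things

# Search-in-map: consult stored measurements.
recorded_masses = dict(objects)                      # the auxiliary record
heaviest_by_map = max(recorded_masses, key=recorded_masses.get)

# Search-in-territory: "weigh" the objects directly (here, a slightly noisy balance).
def weigh(name):
    return objects[name] + random.gauss(0, 0.01)

heaviest_by_territory = max(objects, key=weigh)

print(heaviest_by_map, heaviest_by_territory)        # typically both 'log'
```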

johnswentworth · 10mo
+1
A naive alignment strategy and optimism about generalization

For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.

+1 to this and excited and happy to hear about this update in your view!

The reverse Goodhart problem

Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting.

It's a bit hard to think of when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left... (read more)

The reverse Goodhart problem

Maybe I'm missing something, but this seems already captured by the normal notion of what Goodharting is in that it's about deviation from the objective, not the direction of that deviation.

Stuart Armstrong · 1y
The idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting the way it's used in practice. After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".
Teaching ML to answer questions honestly instead of predicting human answers
  • Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).

In the spirit then of caring about stories about how algorithms lead to bad consequences, a story about how I see not making a clear distinction between instrumental and intended mode... (read more)

Adam Shimi · 7mo
Reading this thread, I wonder if the apparent disagreement doesn't come from the use of the word "honestly". The way I understand Paul's statement of the problem is that "answer questions honestly" could be replaced by "answer questions appropriately to the best of your knowledge", and his point is that "answer what a human would have answered" is not a good proxy for that (yet still an incentivized one due to how we train neural nets).

From my reading of it, this post's proposal does provide some plausible ways to incentivize the model to actually search for appropriate answers instead of the ones a human would have given, and I don't think it assumes the existence of true categories and/or essences.
Teaching ML to answer questions honestly instead of predicting human answers

I want to consider models that learn to predict both “how a human will answer question Q” (the instrumental model) and “the real answer to question Q” (the intended model). These two models share almost all of their computation — which is dedicated to figuring out what actually happens in the world. They differ only when it comes time to actually extract the answer. I’ll describe the resulting model as having a “world model,” an “instrumental head,” and an “intended head.”

This seems massively underspecified in that it's really unclear to me what's actually... (read more)
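
For concreteness, here is one minimal way the quoted setup could be realized; this is my own sketch with invented layer sizes and names, not anything specified in the post.

```python
import torch
import torch.nn as nn

# A shared "world model" trunk whose features feed two separate answer heads:
# one trained to imitate human answers, one intended to give the real answer.

class TwoHeadedQA(nn.Module):
    def __init__(self, obs_dim=512, hidden=256, vocab=10_000):
        super().__init__()
        self.world_model = nn.Sequential(          # shared computation
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.instrumental_head = nn.Linear(hidden, vocab)  # "what a human would answer"
        self.intended_head = nn.Linear(hidden, vocab)      # "the real answer"

    def forward(self, question_embedding):
        z = self.world_model(question_embedding)
        return self.instrumental_head(z), self.intended_head(z)

model = TwoHeadedQA()
human_logits, intended_logits = model(torch.randn(1, 512))
print(human_logits.shape, intended_logits.shape)   # torch.Size([1, 10000]) each
```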

Paul Christiano · 1y
I don't think anyone has a precise general definition of "answer questions honestly" (though I often consider simple examples in which the meaning is clear). But we do all understand how "imitate what a human would say" is completely different (since we all grant the possibility of humans being mistaken or manipulated), and so a strong inductive bias towards "imitate what a human would say" is clearly a problem to be solved even if other concepts are philosophically ambiguous.

Sometimes a model might say something like "No one entered the datacenter" when what they really mean is "Someone entered the datacenter, got control of the hard drives with surveillance logs, and modified them to show no trace of their presence." In this case I'd say the answer is "wrong;" when such wrong answers appear as a critical part of a story about catastrophic failure, I'm tempted to look at why they were wrong to try to find a root cause of failure, and to try to look for algorithms that avoid the failure by not being "wrong" in the same intuitive sense. The mechanism in this post is one way that you can get this kind of wrong answer, namely by imitating human answers, and so that's something we can try to fix.

On my perspective, the only things that are really fundamental are:

  • Algorithms to train ML systems. These are programs you can run.
  • Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).

Everything else is just a heuristic to help us understand why an algorithm might work or where we might look for a possible failure story. I think this is one of the upsides of my research methodology [https://www.alignmentforum.org/posts/EF5M6CmKRd
Saving Time

Firstly, we don't understand where this logical time might come from, or how to learn it

Okay, you can't write a sentence like that and expect me not to say that it's another manifestation of the problem of the criterion.

Yes, I realize this is not the problem you're interested in, but it's one I'm interested in, so this seems like a good opportunity to think about it anyway.

The issue seems to be that we don't have a good way to ground the order on world states (or, subjectively speaking if we want to be maximally cautious here, experience moments) since we ... (read more)

Pitfalls of the agent model

Somewhat ironically, some of these failures from thinking of oneself or others as agents cause a lack of agency! Maybe this is just a trick of language, but here's what I have in mind from thinking about some of the pitfalls:

  • Self-hatred results in less agency (freedom to do what you want) rather than more because effort is placed on hating the self rather than trying to change the self to be more in the desired state.
  • Procrastination is basically the textbook example of a failure of agency.
  • Hatred of others is basically the same story here as self-hatred.

On... (read more)

Alex Flint · 1y
Yeah right, I agree with those three bullet points very much. Could also say "thinking of oneself or others as Cartesian agents causes a lack of power". Does agency=power? I'm not sure what the appropriate words are but I agree with your point.

Yeah, that seems well said to me. This gradual process of taking more things as object seems to lead towards very good things. Circling, I think, has a lot to do with taking the emotions we are used to treating as subject and getting a bit more of an object lens on them just by talking about them. Gendlin's focussing seems to have a lot to do with this, too.

Yeah right, it's a great lens to pick up and use when it's helpful. But nice to know that it's there and also to be able to put it down by choice.
Where are intentions to be found?

Oh, I don't think those things exactly sidestep the problem of the criterion so much as commit to a response to it without necessarily realizing that's what they're doing. All of them sort of punt on it by saying "let humans figure out that part", which at the end of the day is what any solution is going to do because we're the ones trying to build the AI and making the decisions, but we can be more or less deliberate about how we do this part.

Probability theory and logical induction as lenses

Right. For example, I think Stuart Armstrong is hitting on something very important about AI alignment with his pursuit of the idea that there's no free lunch in value learning. We only close the gap by making an "arbitrary" assumption, but it's only arbitrary if you assume there's some kind of context-free version of the truth. Instead we can choose in a non-arbitrary way based on what we care about and what is useful to us.

I realize lots of people are bored by this point because their non-arbitrary solution that is useful is some version of rationality criteri... (read more)

Alex Flint · 1y
You're talking about how we ground out our thinking in something that is true but is not just further conceptualization?

Look if we just make a choice about the truth by making an assumption then eventually the world really does "bite back". It's possible to try this out by just picking a certain fundamental orientation towards the world and sticking to it no matter what throughout your life for a little while. The more rigidly you adhere to it the more quickly the world will bite back. So I don't think we can just pick a grounding. But at the same time I very much agree that there is no concept that corresponds to the truth in a context-free or absolute way.

The analogy I like the most is dance: imagine if I danced a dance that beautifully expressed what it's like to walk in the forest at night. It might be an incredibly evocative dance and it might point towards a deep truth about the forest at night, but it would be strange to claim that a particular dance is the final, absolute, context-free truth. It would be strange to seek after a final, absolute, context-free dance that expresses what it's like to walk in the forest at night in a way that finally captures the actual truth about the forest at night.

When we engage in conceptualization, we are engaging in something like a dance. It's a dance with real consequence, real power, real impacts on the world, and real importance. It matters that we dance it and that we get it right. It's hard to think of anything at this point that matters more. But its significance is not a function of its capturing the truth in a final or context-free way.

So when I consider "grounding out" my thinking in reality, I think of it in the same way that a dance should "ground out" in reality. That is: it should be about something real. It's also possible to pick some idea about what it's really like to walk in the forest at night and dance in a way that adheres to that idea but not to the reality of what it's actually like to walk
Where are intentions to be found?

Not really. If we were Cartesian, then in order to fit the way we find the world, it seems it would have to be that agentiness is created outside the observable universe, possibly somewhere hypercomputation is possible, which might only admit an answer about how to build AI that would look roughly like "put a soul in it", i.e. link it up to this other place where agentiness is coming from. Although I guess if the world really looked like that, maybe the way to do the "soul linkage" part would be visible, but it's not, so that seems unlikely.

Alex Flint · 1y
Well ok, agreed, but even if we were Cartesian, we would still have questions about what is the right way to link up our machines with this place where agentiness is coming from, how we discern whether we are in fact Cartesian or embedded, and so on down to the problem of the criterion as you described it. One common response to any such difficult philosophical problems seems to be to just build AI that uses some form of indirect normativity such as CEV or HCH or AI debate to work out what wise humans would do about those philosophical problems. But I don't think it's so easy to sidestep the problem of the criterion.
Beware over-use of the agent model

I think this is right and underappreciated. However, I myself struggle to make a clear case for what to do about it. There's something here, but I think it mostly shows up in not getting confused into thinking that the agent model just is how reality is, which underwhelms the people who perhaps most fail to deeply grok what that means because they have only a surface understanding of it.

Probability theory and logical induction as lenses

Well stated. For what it's worth, I think this is a great explanation of why I'm always going on about the problem of the criterion: as embedded, finite agents without access to hypercomputation or perfect, a priori knowledge, we're stuck in this mess of trying to figure things out from the inside and always getting it a little bit wrong, no matter how hard we try. So it's worth paying attention to that, because solving, for example, alignment for idealized mathematical systems that don't exist is maybe interesting but not an actual solution to the alignment problem.

Alex Flint · 1y
That post was a delightful read! Thanks for the pointer. It seems that we cannot ever find, among concepts, a firm foundation on which we can be absolutely sure of our footing. For the same reason, our basic systems of logic, ethics, and empiricism can never be put on absolutely sure footing (Gödel, the Humean is/ought gap, radical skepticism).
Where are intentions to be found?

Largely agree. I think you're exploring what I'd call the deep implications of the fact that agents are embedded rather than Cartesian.

Alex Flint · 1y
Interesting. Is it that if we were Caresian, you'd expect to be able to look at the agent-outside-the-world to find answers to questions about what even is the right way to go about building AI?
Testing The Natural Abstraction Hypothesis: Project Intro

Nice! From my perspective this would be pretty exciting because, if natural abstractions exist, it solves at least some of the inference problem I view as being at the root of solving alignment, i.e. how do you know that the AI really understands you/humans and isn't misunderstanding you/humans in some way that looks, from the outside, like it understands when it doesn't. Although I phrased this in terms of reified experiences (noemata/qualia as a generalization of axia), abstractions are essentially the same thing in more familiar language, so I'm quite excited f... (read more)

Solving the whole AGI control problem, version 0.0001

Regarding conservatism, there seems to be an open question of just how robust Goodhart effects are: we all agree Goodhart is a problem, but it's not clear how much of a problem it is and when. We have opinions ranging from mine, which is basically that Goodharting happens the moment you try to apply even the weakest optimization pressure, and that this will be a problem (or at least a problem in expectation; you might get lucky) for any system you need to never deviate, to what I read to be Paul's position: it's not that bad and we can do a lot to correct ... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Much of this, especially the story of the production web and especially especially the story of the factorial DAOs, reminds me a lot of PKD's "Autofac". I'm sure there are other fictional examples worth highlighting, but I point out "Autofac" since it's the earliest instance of this idea I'm aware of (published 1955).

Andrew Critch · 1y
I hadn't read it (nor almost any science fiction books/stories) but yes, you're right! I've now added a callback to Autofac after the "factorial DAO" story. Thanks.
Coherence arguments imply a force for goal-directed behavior

I think here it makes sense to talk about internal parts, separate from behavior, and real. And similarly in the single agent case: there are physical mechanisms producing the behavior, which can have different characteristics, and which in particular can be ‘in conflict’—in a way that motivates change—or not. I think it is also worth observing that humans find their preferences ‘in conflict’ and try to resolve them, which suggests that they at least are better understood in terms of both behavior and underlying preferences that are separate from it.

... (read more)
Epistemological Framing for AI Alignment Research

I like this idea. AI alignment research is more like engineering than math or science, and engineering is definitely full of multiple paradigms, not just because it's a big field with lots of specialties that have different requirements, but also because different problems require different solutions and sometimes the same problem can be solved by approaching it in multiple ways.

A classic example from computer science is the equivalence of loops and recursion. In a lot of ways these create two very different approaches to designing systems, writing code, a... (read more)
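
A minimal illustration of the loop/recursion equivalence mentioned above:

```python
# The same computation expressed both ways: a loop and a recursion that are
# straightforwardly equivalent, even though they suggest different designs.

def sum_iterative(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_recursive(xs):
    if not xs:
        return 0
    return xs[0] + sum_recursive(xs[1:])

assert sum_iterative([1, 2, 3, 4]) == sum_recursive([1, 2, 3, 4]) == 10
```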

Bootstrapped Alignment

Thanks both! I definitely had the idea that Paul had mentioned something similar somewhere but hadn't made it a top-level concept. I think there are similar echoes in how Eliezer talked about seed AI in the early Friendly AI work.

Bootstrapped Alignment

Seems like it probably does, but only incidentally.

I instead tend to view ML research as the background over which alignment work is now progressing. That is, we're in a race against capabilities research that we have little power to stop, so our best bets are either that it turns out capabilities are about to hit the upper inflection point of an S-curve, buying us some time, or that the capabilities can be safely turned to helping us solve alignment.

I do think there's something interesting about a direction not considered in this post related to intellige... (read more)

Bootstrapped Alignment

Looks good to me! Thanks for planning to include this in the AN!

Suggestions of posts on the AF to review

I think the generalized insight from Armstrong's no free lunch paper is still underappreciated in that I sometimes see papers that, to me, seem to run up against this and fail to realize there's a free variable in their mechanisms that needs to be fixed if they want them to not go off in random directions.

https://www.lesswrong.com/posts/LRYwpq8i9ym7Wuyoc/other-versions-of-no-free-lunch-in-value-learning

Adam Shimi · 1y
Thanks for the suggestion! I didn't know about this post. We'll consider it. :)
Suggestions of posts on the AF to review

Another post of mine I'll recommend to you:

https://www.lesswrong.com/posts/k8F8TBzuZtLheJt47/deconfusing-human-values-research-agenda-v1

This is the culmination of a series of posts on "formal alignment", where I start out by asking what it would mean to formally state what it would mean to build aligned AI, and then from that try to figure out what we'd have to figure out in order to achieve that.

Over the last year I've gotten pulled in other directions so not pushed this line of research forward much, plus I reached a point with it where it was clear it require... (read more)

Adam Shimi · 1y
Thanks for the suggestion! We want to go through the different research agendas (and I already knew about yours), as they give different views/paradigms on AI Alignment. Yet I'm not sure how relevant a review of such posts is. In a sense, the "reviewable" part is the actual research that underlies the agenda, right?
Suggestions of posts on the AF to review

I wrote this post as a summary of a paper I published. It didn't get much attention, so I'd be interested in having you all review it.

 https://www.lesswrong.com/posts/JYdGCrD55FhS4iHvY/robustness-to-fundamental-uncertainty-in-agi-alignment-1

To say a little more, I think the general approach to safety work I lay out in there is worth considering more deeply, and it points towards a better process for choosing interventions in attempts to build aligned AI. I think what's more important than the specific examples where I apply the method is t... (read more)

Adam Shimi · 1y
Thanks for the suggestion! It's great to have some methodological posts! We'll consider it. :)
Literature Review on Goal-Directedness

Okay, so here's a more adequate follow up.

In this seminal cybernetics essay, a way of thinking about this is laid out.

First, they consider systems that have observable behavior, i.e. systems that take inputs and produce outputs. Such systems can be either active, in that the system itself is the source of energy that produces the outputs, or passive, in that some outside source supplies the energy to power the mechanism. Compare an active plant or animal to something passive like a rock, though obviously whether or not something is active or passive depend... (read more)

Literature Review on Goal-Directedness

Doing a little digging, I realized that the idea of "teleological mechanism" from cybernetics is probably a better handle for the idea and will provide a more accessible presentation of the idea. Some decent references:

https://www.jstor.org/stable/184878

https://www.jstor.org/stable/2103479

https://nyaspubs.onlinelibrary.wiley.com/toc/17496632/50/4

I don't know of anywhere that presents the idea quite how I think of it, though. If you read Dreyfus on Heidegger you might manage to pick this out. Similarly I think this idea underlies Sartre's talk about freedom... (read more)

Literature Review on Goal-Directedness

Reading this, I'm realizing again something I may have realized before and forgotten, but I think ideas about goal-directedness in AI have a lot of overlap with the philosophical topic of telos and Heideggerian care/concern.

The way I think about this is that ontological beings (that is, any process we can identify as producing information) have some ability to optimize (because information is produced by feedback) and must optimize for something rather than nothing (else they are not optimizers) or everything (in which case they are not finite, which they ... (read more)

Adam Shimi · 1y
Thanks for the proposed idea! Yet I find myself lost when trying to find more information about this concept of care. It is mentioned in both the chapter on Heidegger in The History of Philosophy [https://www.penguinrandomhouse.com/books/610800/the-history-of-philosophy-by-a-c-grayling/] and the section on care in the SEP article on Heidegger [https://plato.stanford.edu/entries/heidegger/#Car], but I don't get a single thing written there. I think the ideas of "thrownness" and "disposedness" are related? Do you have specific pointers to deeper discussions of this concept? Specifically, I'm interested in new intuitions for how a goal is revealed by actions.