I left a comment over in the other thread, but I think Joachim misunderstands my position.
In the above comment I've taken for granted that there's a non-trivial possibility that AGI is near. So I'm not arguing we should say "AGI is near" regardless of whether it is or not; we don't know whether it is, we only have our guesses. But so long as there's a non-trivial chance that AGI is near, I think that's the more important message to communicate.
Overall it would be better if we can communicate something like "AGI is probably near", bu... (read more)
From a broad policy perspective, it can be tricky to know what to communicate. I think it helps if we think a bit more about the effects of our communication and a bit less about correctly conveying our level of credence in particular claims. Let me explain.
If we communicate the simple idea that AGI is near then it pushes people to work on safety projects that would be good to work on even if AGI is not near while paying some costs in terms of reputation, mental health, and personal wealth.
If we communicate the simple idea that AGI is not near then people ... (read more)
Fair. For what it's worth, I strongly agree that causality is just one domain where this problem becomes apparent, and that we should be worried about it generally for superintelligent agents, much more so than many folks seem to worry about it today.
Yes, the variables constitute a reference frame, which is to say an ultimately subjective way of viewing the world. Even if there is high inter-observer agreement about the shape of the reference frame, it's not guaranteed unless you also posit something like Wentworth's natural abstraction hypothesis to be true.
Perhaps a toy example will help explain my point. Suppose the grass should only be watered when there's a violet cube on the lawn. To automate this a sensor is attached to the sprinklers that turns them on only when the sensor sees a violet cube. I... (read more)
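To make the toy concrete, here's a minimal sketch (the function names and RGB thresholds are invented purely for illustration) of how two sensors can carve the same color space differently, so that the same physical reading counts as "violet" in one reference frame but not the other:

```python
# Two sensors, two ways of carving color space into "violet" vs. "not violet".
# The thresholds are arbitrary illustrative choices, not from any real device.

def sensor_a_is_violet(rgb):
    r, g, b = rgb
    return b > 150 and r > 100 and g < 100  # one way to draw the category boundary

def sensor_b_is_violet(rgb):
    r, g, b = rgb
    return b > 200 and r > 140 and g < 60   # a stricter carving of the same space

reading = (120, 80, 180)  # the same physical light hitting both sensors
print(sensor_a_is_violet(reading))  # True: frame A says "violet cube, water the lawn"
print(sensor_b_is_violet(reading))  # False: frame B says no such thing exists here
```

The point of the sketch is that nothing in the light itself settles which carving is "correct"; the sprinkler's behavior depends on a frame-relative judgment.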
I think there's something big left out of this post, which is accounting for the agent observing and judging the causal relationships. Something has to decide how to carve up the world into parts and calculate counterfactuals. It's something that exists implicitly in your approach to causality but you don't address it here, which I think is unfortunate because although humans generally have the same frame of reference for judging causality, alien minds, like AI, may not.
Actually, I kind of forgot what ended up in the paper, but then I remembered so wanted to update my comment.
There was an early draft of this paper that talked about deontology, but because there are so many different forms of deontology it was hard to come up with arguments where there wasn't some version of deontological reasoning that broke the argument, so I instead switched to talking about the question of moral facts independent of ethical system. That said, the argument I make in the paper suggesting that moral realism is more dangerous than moral an... (read more)
I don't see it in the references so you might find this paper of mine (link is to Less Wrong summary, which links to full thing) interesting because within it I include an argument suggesting building AI that assumes deontology is strictly more risky than building one that does not.
If the mind becomes much more capable than the surrounding minds, it does so by being on a trajectory of creativity: something about the mind implies that it generates understanding that is novel to the mind and its environment.
I don't really understand this claim enough to evaluate it. Can you expand a bit on what you mean by it? I'm unsure about the rest of the post because it's unclear to me what the premise your top-line claim rests upon means.
to answer my own question:
Level of AI risk concern: high
General level of risk tolerance in everyday life: low
Brief summary of what you do in AI: first tried to formalize what alignment would mean, which led me to work on a program of deconfusing human values; that reached the end of what I could do, and I've now moved on to writing about epistemology that I think is critical to understand if we want to get alignment right
Anything weird about you: prone to anxiety, previously dealt with OCD, mostly cured it with meditation but still pops up sometimes
I think I disagree. Based on your presentation here, I think someone following a policy inspired by this post would be more likely to cause existential catastrophe by pursuing a promising false positive that actually destroys all future value in our Hubble volume. I've argued we need to focus on minimizing false positive risk rather than optimizing for max expected value, which is what I read this as proposing we do.
This post brought to mind a thought: I actually don't care very much about arguments about how likely doom is and how pessimistic or optimistic to be since they are irrelevant, to my style of thinking, for making decisions related to building TAI. Instead, I mostly focus on downside risks and avoiding them because they are so extreme, which makes me look "pessimistic" but actually I'm just trying to minimize the risk of false positives in building aligned AI. Given this framing, it's actually less important, in most cases, to figure out how likely somethin... (read more)
A good specific example of trying to pull this kind of shell game is perhaps HCH. I don't recall if someone made this specific critique of it before, but it seems like there's some real concern that it's just hiding the misalignment rather than actually generating an aligned system.
In classical Chinese philosophy there's the concept of shi-fei, or "this, not that". A key part of the idea is that all knowledge involves making distinctions, and those distinctions are judgments; so if you want to have knowledge and put things into words, you have to make this-not-that style judgments of distinction to decide what goes in what category.
More recently here on the forum, Abram has written about teleosemantics, which seems quite relevant to your investigations in this post.
The teleosemantic picture is that epistemic accuracy is a common, instrumentally convergent subgoal; and "meaning" (in the sense of semantic content) arises precisely where this subgoal is being optimized.
I think this is exactly right. I often say things like "accurate maps are extremely useful to things like survival, so you and every other living thing has strong incentives to draw accurate maps, but this is contingent on the extent to which you care about e.g. survival".
So, to see if I have this right: the difference is that I'm trying to point at a larger phenomenon, while you mean teleosemantics to point just at the way beliefs get constrained to be useful.
Cool. For what it's worth, I also disagree with many of my old framings. Basically anything written more than ~1 year ago is probably vaguely but not specifically endorsed.
Oh man, I kind of wish I could go back in time and wipe out all the cringe stuff I wrote when I was trying to figure things out (like, why did I need to pull in Gödel or reify my confusion?). With that said, here are some updated thoughts on holons. I'm not really familiar with OOO, so I'll be going off your summary here.
I think I started out really not getting what the holon idea points at, but I understood enough to get myself confused in new ways for a while. So first off there's only ~1 holon, such that it doesn't make sense to talk about it as anything ot... (read more)
I very much agree and really like the coining of the term "teleosemantics". I might steal it! :-)
I'm not sure how much you've read my work on this topic or how much it influenced you, but in case you're not very aware of it I think it's worth pointing out some things I've been working on in this space for a while that you might find interesting.
I got nervous about how truth works when I tried to tackle the alignment problem head on. I ended up having to write a sequence of posts to sort out my ideas. At the time, I really failed to appreciate how deep telo... (read more)
So there are different notions of "more" here.
There's "more" in the sense I'm thinking of: it's not clear that additional levels of abstraction enable deeper understanding, given enough time. If 3 really is all the levels you need, because that's how many it takes to think about any number of levels of depth (again, by swapping out levels in your "abstraction registers"), then additional levels end up being in the same category.
And then there's "more" in the sense of doing things faster, which makes things cheaper. I'm perhaps more skeptical of scaling than you are. I do agree th... (read more)
Alright, fair warning, this is an out there kind of comment. But I think there's some kind of there there, so I'll make it anyway.
Although I don't have much of anything new to say about it lately, I spent several years really diving into developmental psychology, and my take on most of it is that it's an attempt to map changes in the order of complexity of the structure thoughts can take on. I view the stages of human psychological development as building up the mental infrastructure to be able to hold up to three levels of fully-formed structure (yes, this ... (read more)
Why does there need to be structure? We can just have a non-uniform distribution of energy around the universe in order for there to be information to extract. I guess you could call this "structure" but that seems like a stretch to me.
I don't know if I can convince you. You seem pretty convinced that there are natural abstractions or something like them. I'm pretty suspicious that there are natural abstractions; instead I think there are useful abstractions, but they are all contingent on how the minds creating those abstractions are organized, and that no... (read more)
Sure, differences are as real as the minds making them are. Once you have minds those minds start perceiving differentiation since they need to extract information from the environment to function. So I guess I'm saying I don't see what your objection is in this last comment as you've not posited anything that seems to claim something that actually disagrees with my point as far as I can tell. I think it's a bit weird to call the differentiation you're referring to "objective", but you explained what you mean.
Isn't "the goals we would want it to have" a special case of aiming at any target we want? And whatever goals we'd want it to have would be informed by our ontology? So what I'm saying is, I think there's a case where the generality of your claim breaks down.
I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.
It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.
This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.
If this is the case, my concern seems ye... (read more)
For what it's worth, I think you're running headlong into an instance of the problem of the criterion and enjoy seeing how you're grappling with it. I've tagged this post as such.
Reading this post, I think it insufficiently addresses motivations, purpose, reward functions, etc. to make the bold claim that perfect world-model interpretability is sufficient for alignment. I think this because ontology is not the whole of action: two agents with the same ontology and very different purposes would behave in very different ways.
Perhaps I'm being unfair, but I'm not convinced that you're not making the same mistake as when people claim any sufficiently intelligent AI would be naturally good.
This seems straightforward to me: reification is a process by which our brain picks out patterns/features and encodes them so we can recognize them again and make sense of the world given our limited hardware. We can then think in terms of those patterns and gloss over the details because the details often aren't relevant for various things.
The reason we reify things one way versus another depends on what we care about, i.e. our purposes.
To me this seems obvious: noumena feel real to most people because they're captured by their ontology. It takes a lot of work for a human mind to learn not to jump straight from sensation to reification, and even with training there's only so much a person can do because the mind has lots of low-level reification "built in" that happens prior to conscious awareness. Cf. noticing
Oh, I thought I already explained that. There's at least two different ways "exist" can be meant here, and I think we're talking past each other.
For some thing to exist implies that it exists ontologically, i.e. in the map. Otherwise it is not yet a thing. So I'm saying there's a difference between what we might call existence and being. You exist, in the sense of being an ontological thing, only by virtue of reification, but you are by virtue of the whole world being.
Yep, so I think this gets into a different question of epistemology, not directly related to things but rather to what we care about, since positing a theory that what looks to me like a table implies something table-shaped about the universe requires caring about parsimony.
(Aside: it's kind of related, because to talk about caring about things we need reifications that enable us to point at what we care about, but I think that's just an artifact of using words: care is patterns of behavior and preference we can reify and call "parsimonious" or something else,... (read more)
Yes, though note you can observe yourself.
I didn't link it in my original reply, but work on natural abstractions is also related. My take is that if natural abstractions exist they don't actually rehabilitate noumena, but they do explain why it intuitively feels like there are noumena. However, abstractions are still phenomena (except insofar as all phenomena are of course embedded in the world), even if they are picking up on what I might metaphorically describe as the natural contours of the territory.
This is confusing two different notions of exist. There is existence as part of the wholeness of the world that is as yet undifferentiated and there is your existence in the minds of people. "You" exist lots of places in many minds, and also "you" don't have a clearly defined existence separate and independent from the rest of the world.
I realize this is unintuitive to many folks. The thing you have to notice is that the world has an existence independent of ontology and ontology-less existence can't be fathomed in terms of ontology.
I very much appreciate the attempt to figure out what things are. I think, though, you've added more complication than needed. However, my take depends on a particular view of philosophy.
So, first I think Kant is wrong about noumena. They don't exist. There are no things in themselves, there are only phenomena: things that exist because we reify them into existence to fit some concern we have. Things are reified out of sensory experience of the world (though note that "sensory" is redundant here), and the world is the unified non-thing that we can only reify by... (read more)
On the one hand, cool; on the other, the abstract is deceptive, because it claims that the AI trained is a "harmless but nonevasive AI assistant", but what the paper in fact shows is that Anthropic trained an AI with higher harmlessness and helpfulness scores, and thus a Pareto improvement over previous models, not one that is definitively across some bar separating harmless from not-harmless or helpful from not-helpful. As much is also stated in the included figure.
The work is cool, don't get me wrong. We should celebrate it. But also I want a... (read more)
These are good intuitive arguments against these sorts of solutions, but I think there's a more formal argument we can make that these solutions are dangerous because they pose excess false positive risk. In particular, I think they fail to fully account for the risks of generalized Goodharting, as do most proposed solutions other than something like agent foundations.
Right. Nothing that happens in the same Hubble volume can really be said to not be causally connected. Nonetheless I like the point of the OP even if it's made in an imprecise way.
I continue to be excited about this line of work. I feel like you're slowly figuring out how to formalize ontology in a way reflective of what we actually do and generalizing it. This is something missing from a lot of other approaches.
This is pretty exciting. I've not really done any direct work to push forward alignment in the last couple years, but this is exactly the sort of direction I was hoping someone would go when I wrote my research agenda for deconfusing human values. What came out of it was that there was some research to do that I wasn't equipped to do myself, and I'm very happy to say you've done the sort of thing I had hoped for.
On first pass this seems to address many of the common problems with traditional approaches to formalizing values. I hope that this proves a fruitful line of research!
Re Project 4, you might find my semi-abandoned (mostly because I wasn't and still am not in a position to make further progress on it) research agenda for deconfusing human values useful.
Re: Project 2
This project’s goal is to better understand the bridge principles needed between subjective, first person optimality and objective, third person success.
This seems quite valuable, because there is, properly speaking, no objective, third person perspective of which we can speak, only the inferred sense, from our first person perspectives, that there exists something that looks to us like a third person perspective. Thus I think this seems like a potentially fruitful line of research, since the proposed premise contains the confusion that needs to... (read more)
As it happens, I think this is a rather important topic. Failure to consider and mitigate the risk of assumptions creates both false negative (less concerning) and false positive (most concerning) risks when attempting to build aligned AI.
AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So, for example, AlphaGo could suffer the error of instrumental power grabbing in order to get better at winning Go, because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make m(X) adequately evaluate X as we intended, since we cared about winning Go games while also minimizing side effects, but maybe when we cons... (read more)
Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.
I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.
"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.
Can you explain where there is an error term in AlphaGo, or where an error term might appear in a hypothetical model similar to AlphaGo trained much longer with many more parameters and computational resources?
For what it's worth, I think this is trying to get at the same insight as logical time but via a different path.
For the curious reader, this is also the same reason we use vector clocks to build distributed systems when we can't synchronize the clocks very well.
And there's something quite interesting about computation as a partial order. It might seem that this only comes up when you have a "distributed" system, but actually you need partial orders to reason about unitary programs when they are non-deterministic (any program with loops and conditiona... (read more)
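To make the vector-clock aside concrete, here's a minimal sketch (my own illustrative code, not from any particular library) showing how message passing induces a partial order on events: some pairs are ordered by happens-before, while others are simply incomparable, i.e. concurrent:

```python
# Minimal vector clock: each process keeps one counter per process.
# Local events increment your own slot; receiving a message takes the
# pointwise max of the two clocks before incrementing.

class VectorClock:
    def __init__(self, n, pid):
        self.clock = [0] * n
        self.pid = pid

    def tick(self):  # record a local event, return a snapshot of the clock
        self.clock[self.pid] += 1
        return list(self.clock)

    def receive(self, other_clock):  # message receipt: pointwise max, then tick
        self.clock = [max(a, b) for a, b in zip(self.clock, other_clock)]
        return self.tick()

def happens_before(c1, c2):
    # the partial order: c1 precedes c2 iff it is <= in every slot and not equal
    return all(a <= b for a, b in zip(c1, c2)) and c1 != c2

p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
e1 = p0.tick()  # event on process 0: [1, 0]
e2 = p1.tick()  # event on process 1: [0, 1]
print(happens_before(e1, e2), happens_before(e2, e1))  # False False -> concurrent
e3 = p1.receive(e1)  # p1 learns of e1; e3 = [1, 2]
print(happens_before(e1, e3))  # True: sending created an ordering
```

Note that e1 and e2 are incomparable under happens-before, which is exactly the "partial" in partial order: no amount of clock-reading settles which came first.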
I actually don't think that model is general enough. Like, I think Goodharting is just a fact of control systems making observations.
Suppose we have a simple control system with output X and a governor G. G takes a measurement m(X) (an observation) of X. So long as m(X) is not error-free (and I think we can agree that no real-world system can be actually error-free), then X = m(X) + ϵ for some error factor ϵ. Since G uses m(X) to regulate the system to change X, we now have error ... (read more)
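A minimal simulation of the kind of loop I have in mind (all numbers, including the bias term, are invented for illustration): a governor that regulates using a measurement with a systematic error component ends up controlling the measurement rather than X itself, so the error persists in X no matter how long the loop runs:

```python
import random

random.seed(0)

def measure(x, bias=0.5, noise=0.2):
    # the observation of X: true value plus an error term, part of which
    # is systematic and therefore invisible to the governor
    return x + bias + random.gauss(0, noise)

x, target = 0.0, 10.0
for _ in range(200):
    m = measure(x)
    x += 0.1 * (target - m)  # the governor drives the *measurement* toward the target

# x settles near target - bias (about 9.5), not at 10.0: the governor has
# regulated m(X) to the target while the true X carries the systematic error
print(round(x, 2))
```

The noise alone averages out; it's the component of the error the governor can't distinguish from signal that gets baked into the controlled outcome.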
Hmm, I'm not sure I understand -- it doesn't seem to me like noisy observations ought to pose a big problem to control systems in general.
For example, suppose we want to minimize the number of mosquitos in the U.S., and we access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we'll still always be doing the approximately correct thing and mosquito counts will go down. In particular, I don't see a se... (read more)
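The mosquito intuition is easy to check in a toy simulation (all numbers invented for illustration): when the estimation noise is small relative to the spread of true counts, allocating a fixed budget by noisy estimates captures nearly all the value of allocating by the true counts:

```python
import random

random.seed(1)

# 50 counties with true mosquito counts, plus noisy estimates of each
true_counts = [random.randint(100, 1000) for _ in range(50)]
estimates = [c + random.gauss(0, 50) for c in true_counts]

def mosquitos_removed(scores, budget=10):
    # spend the whole budget on the 10 highest-scoring counties,
    # treating a targeted county's mosquitos as eliminated
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sum(true_counts[i] for i in top)

ideal = mosquitos_removed(true_counts)  # allocation under perfect information
noisy = mosquitos_removed(estimates)    # allocation under noisy estimates
print(noisy / ideal)  # close to 1.0: approximately the right counties get sprayed
```

Misrankings only happen near the cutoff, between counties whose true counts are nearly tied, so the loss from noise is small in this regime.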
I'm fairly pessimistic on our ability to build aligned AI. My take is roughly that it's theoretically impossible and at best we might build AI that is aligned well enough that we don't lose. I've not written one thing to really summarize this or prove it, though.
The source of my take comes from two facts:
At least one person here disagrees with you on Goodharting. (I do.)
You've written before on this site if I recall correctly that Eliezer's 2004 CEV proposal is unworkable because of Goodharting. I am granting myself the luxury of not bothering to look up your previous statement because you can contradict me if my recollection is incorrect.
I believe that the CEV proposal is probably achievable by humans if those humans had enough time and enough resources (money, talent, protection from meddling) and that if it is not achievable, it is because of reasons ot... (read more)
This paper gives a mathematical model of when Goodharting will occur. To summarize: if
(1) a human has some collection s1,…,sn of things which she values,
(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and
(3) the robot can freely vary how much of s1,…,sn there are in the world, subject only to resource constraints that make the si trade off against each other,
then when the robot optimizes for its proxy utility, it will minimize all si's which its proxy utility... (read more)
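A toy discrete version of this setup (my own encoding for illustration, not the paper's actual model): with resources forcing the si to trade off against each other, maximizing a proxy that omits s3 drives s3 all the way to zero:

```python
# Three valued quantities s1, s2, s3 share a resource budget; the robot's
# proxy utility only counts s1 and s2. The s3-omitting optimum zeroes out s3.
from itertools import product

R = 10  # total resources; producing one unit of any si costs one unit
allocations = (s for s in product(range(R + 1), repeat=3) if sum(s) <= R)
best = max(allocations, key=lambda s: s[0] + s[1])  # proxy ignores s3
print(best)  # (0, 10, 0): every resource goes to the proxy; s3 is driven to zero
```

The mechanism is just condition (3) above: because the si compete for the same resources, any unit left in s3 is a unit the proxy would rather spend elsewhere.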
This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'd be fairly surprised if there's something big here.
Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I k... (read more)
This doesn't really seem like solving symbol grounding, partially or not, so much as an argument that it's a non-problem for the purposes of value alignment.
Agreed. That said, I don't think counterfactuals are in the territory. I think I said before that they were in the map, although I'm now leaning away from that characterisation as I feel that they are more of a fundamental category that we use to draw the map.
Yes, I think there is something interesting going on where human brains seem to operate in a way that makes counterfactuals natural. I actually don't think there's anything special about counterfactuals, though, just that the human brain is designed such that thoughts are not strongly tethered to sens... (read more)