Alex Flint

Independent AI safety researcher


The accumulation of knowledge

Wiki Contributions


Very very good question IMO. Thank you for this.

Consider a person who comes to a very clear understanding of the world, such that they are extremely capable of designing things, building things, fundraising, collaborating, and so on. Consider a moment where this person is just about to embark on a project but has not yet acquired any resources, perhaps has not even made any connections with anyone at all, yet is highly likely to succeed in their project when they do embark. Would you say this person has “resources”? If not, there is a kind of continuous trading-in that will take place as this person exchanges their understanding of things for resources, then later for final outcomes. Is there really a line between understanding-of-things, resources, and outcomes? The interesting part is the thing that gives this person power, and that seems to be their understanding-of-things.

Ah so I think what you're saying is that for a given outcome, we can ask whether there is a goal we can give to the system such that it steers towards that outcome. Then, as a system becomes more powerful, the range of outcomes that it can steer towards expands. That seems very reasonable to me, though the question that strikes me as most interesting is: what can be said about the internal structure of physical objects that have power in this sense?

I don't understand why a strong simplicity guarantee places most of the difficulty on the learning problem. In the diamond situation, a strong simplicity requirement on the reporter can mean that the direct translator gets ruled out, since it may have to translate from a very large and sophisticated AI predictor?

What we're actually doing is here is defining "automated ontology identification" as an algorithm that only has to work if the predictor computes intermediate results that are sufficiently "close" to what is needed to implement a conservative helpful decision boundary. Because we're working towards an impossibility result, we wanted to make it as easy as possible for an algorithm to meet the requirements of "automated ontology identification". If some proposed automated ontology identifier works without the need for any such "sufficiently close intermediate computation" guarantee then it should certainly work in the presence of such a guarantee.

So this "sufficiently close intermediate computation" guarantee kind of changes the learning problem from "find a predictor that predicts well" to "find a predictor that predicts well and also computes intermediate results meeting a certain requirement". That is a strange requirement to place on a learning process, but it's actually really hard to see any way to avoid make some such requirement, because if we place no requirement at all then what if the predictor is just a giant lookup table? You might say that such a predictor would not physically fit inside the whole universe, and that's correct, and is why we wanted to operationalize this "sufficiently close intermediate computation" guarantee, even though it changes the definition of the learning problem in a very important way.

But this is all just to make the definition of "automated ontology identification" not unreasonably difficult, in order that we would not be analyzing a kind of "straw problem". You could ignore the "sufficiently close intermediate computation" guarantee completely and treat the write-up as analyzing the more difficult problem of automated ontology identification without any guarantee about the intermediate results computed by the predictor.

Perhaps I miss the mystery. My first reaction is, "It came from the assumption of a true decision boundary, and the ability to recursively deploy monotonically better generalization while maintaining conservativeness."

Well yeah of course but if you don't think it's reasonable that any algorithm could meet this requirement then you have to deny exactly one of the three things that you pointed out: the assumption of a true decision boundary, the monotonically better generalization, or the maintenance of conservativeness. I don't think it's so easy to pick one of these to deny without also denying the feasibility of automated ontology identification (from a finite narrow dataset, with a safety guarantee).

  1. If you deny the existence of a true decision boundary then you're saying that there is just no fact of the matter about the questions that we're asking to automated ontology identification. How then would we get any kind of safety guarantee (conservativeness or anything else)?

  2. If you deny the generalization then you're saying that there is some easy set where no matter which prediction problem you solve, there is just no way for the reporter to generalize beyond given a dataset that is sampled entirely within , not even by one single case. That's of course logically possible, but it would mean that automated ontology identification as we have defined it is impossible. You might say "yes but you have defined it wrong". But if we define it without any generalization guarantee at all then the problem is trivial: an automated ontology identifier can just memorize the dataset and refuse to answer any cases that were not literally present in the dataset. So we need some generalization guarantee. Maybe there is a better one than what we have used.

  3. If you deny the maintenance of conservativeness then, again, that means automated ontology identification as we have defined it is impossible, and again you might say that we have defined it badly. But again, if we remove the conservative requirement completely then we can solve automated ontology identification by just returning random noise. So we need some safety guarantee. I suspect a lot of alternative safety guarantees are going to be susceptible to the same iteration scheme, but again I'm interested in safety guarantees that sidestep this issue.

I feel like this part is getting slippery with how words are used, in a way which is possibly concealing unimagined resolutions to the apparent tension.

Indeed. I also feel that this part is getting slippery with words. The fact that we don't have a formal impossibility result (or a formal understanding of why this line cannot lead to an impossibility result) indicates that there is further work needed to clarify what's really going on here.

Why can't I dodge the puzzle? Why can't I have an intended ELK reporter which answers the easy questions, and a small subset of the hard questions, without also being able to infinitely recurse an ELK solution to get better and better conservative reporters?

Fundamentally I do think you have to deny one of the 3 points above. That can be done, of course, but it seems to us that none of them are easy to deny without also denying the feasibility of automated ontology identification.

Thanks for these clarifying questions. It's been helpful to write up this reply.

My understanding of the argument: if we can always come up with a conservative reporter (one that answers yes only when the true answer is yes), and this reporter can label at least one additional data point that we couldn't label before, we can use this newly expanded dataset to pick a new reporter, feed this process back into itself ad infinitum to label more and more data, and the fixed point of iterating this process is the perfect oracle. This would imply an ability to solve arbitrary model splintering problems, which seems like it would need to either incorporate a perfect understanding of human value extrapolation baked into the process somehow, or implies that such an automated ontology identification process is not possible.

That is a good summary of the argument.

Personally, I think that it's possible that we don't really need a lot of understanding of human values beyond what we can extract from the easy set to figure out "when to stop."

Thanks for this question.

Consider a problem involving robotic surgery of somebody's pet dog, and suppose that there is a plan that would transform all the dog's entire body as if it were mirror-imaged (left<->right). This dog will have a perfectly healthy body, but it will actually be unable to consume any food currently on Earth because the chirality of its biology will render it incompatible with the biology of any other living thing on the planet, so it will starve. We would not want to execute this plan and there are of course ordinary questions that, if we think to ask them, will reveal that the plan is bad ("will the dog be able to eat food from planet Earth?"). But if we only think to ask "is the dog healthy?" then we would really like our system not to try to extrapolate concepts of "healthy" to cases where a dog's entire body is mirror-imaged. But how would any system know that this particular simple geometric transformation (mirror-imaging) is dangerous, while other simple geometrical transformations on the dog's body -- such as physically moving or rotating the dog's body in space -- are benign? I think it would have to know what we value with respect to the dog.

To make this example sharper, let the human providing the training examples be someone alive today who has no idea that biological chirality is a thing. How surprised they would be that their seemingly-healthy pet dog died a few days after the seemingly-successful surgery. Even if they did an autopsy on their dog to find out why it died, it would appear that all the organs and such were perfectly healthy and normal. It would be quite strange to them.

Now what if it was the diamond that was mirror-imaged inside the vault? Is the diamond still in the vault? Yeah, of course, we don't care about a diamond being mirror imaged. It seems to me that the reason we don't care in the case of the diamond is that the qualities we value in a diamond are not affected by mirror-imaging.

Now you might say that atomically mirror-imaging a thing should always be classified as "going too far" in conceptual extrapolation, and that an oracle should refuse to answer questions like "is the diamond in the vault?" or "is the dog healthy?" when the thing under consideration is a mirror image of its original. I agree this kind of refusal would be good, I just think it's hard to build this without knowing a lot about our values. Why is it that some basic geometric operations like translation and rotation are totally fine to extrapolate over, but mirror-imaging is not? We could hard-code that mirror-imaging is "going too far" but then we just come to another example of the same phenomenon.

Yes, I think what you're saying is that there is (1) the set of all possible outcomes, (2) within that, the set of outcomes where the company succeeds with respect to any goal, and (3) within that, the set of outcomes where the company succeeds with respect to the operator's goal. The capability-increasing interventions, then, are things that concentrate probability mass onto (2), whereas the alignment-increasing interventions are things that concentrate probability mass onto (3). This is a very interesting way to say it and I think it explains why there is a spectrum from alignment to capabilities.

Very roughly, (1) corresponds to any system whatsoever, (2) corresponds to a system that is generally powerful, and (3) corresponds to a system that is powerful and aligned. We are not so worried about non-powerful unaligned systems, and we are not worried at all about powerful aligned systems. We are worried about the awkward middle ground - powerful unaligned systems.

Thank you.

I was thinking of the incentive structure of a company (to focus on one example) as an affordance for aligning a company with a particular goal because if you set the incentive structure up right then you don’t have to keep track of everything that everyone does within the company, you can just (if you do it well) trust that the net effect of all those actions will optimize something that you want it to optimize (much like steering via the goals of an AI or steering via the taxes and regulations of a market).

But I think actually you are pointing to a very important way that alignment generally requires clarity, and clarity generally increases capabilities. This is present also in AI development: if we gained the insight necessary to build a very clear consequentialist AI that we knew how to align, we would simultaneously increase capabilities due to the same clarity.

Interested in your thoughts.

Summing up all that, this post made me realize Alignment Research should be its own discipline.

Yeah I agree! It seems that AI alignment is not really something that any existing disciplines is well set up to study. The existing disciplines that study human values are generally very far away from engineering, and the existing disciplines that have an engineering mindset tend to be very far away from directly studying human values. If we merely created a new "subject area" that studies human values + engineering under the standard paradigm of academic STEM, or social science, or philosophy, I don't think it would go well. It seems like a new discipline/paradigm is innovation at a deeper level of reality. (I understand adamShimi's work to be figuring out what this new discipline/paradigm really is.)

The habit formation example seems weirdly 'acausal decision theory' flavored to me (though this might be a 'tetris effect' like instance). It seems like habits similar to this are a mechanism of making trades across time/contexts with yourself. This makes me more optimistic about acausal decision theories being a natural way of expressing some key concepts in alignment.

Interesting! I hadn't thought of habit formation as relating to acausal decision theory. I see the analogy to making trades across time/contexts with yourself but I have the sense that you're referring to something quite different to ordinary trades across time that we would make e.g. with other people. Is the thing you're seeing something like when we're executing a habit we kind of have no space/time left over to be trading with other parts of ourselves, so we just "do the thing such that, if the other parts of ourselves knew we would do that and responded in kind, would lead to overall harmony" ?

Proxies are mentioned but it feels like we could have a rich science or taxonomy of proxies. There's a lot to study with historical use of proxies, or analyzing proxies in current examples of intelligence alignment.

We could definitely study proxies in detail. We could look at all the market/government/company failures that we can get data on and try to pinpoint what exactly folks were trying to align the intelligent system with, what operationalization was used, and how exactly that failed. I think this could be useful beyond merely cataloging failures as a cautionary tale -- I think it could really give us insight into the nature of intelligent systems. We may also find some modest successes!

Hope you are well Alex!

First, an ontology is just an agents way of organizing information about the world...

Second, a third-person perspective is a "view from nowhere" which has the capacity to be rooted at specific locations...

Yep I'm with you here

Well, what's a 3rd-person perspective good for? Why do we invent such things in the first place? It's good for communication.

Yeah I very much agree with justifying the use of 3rd person perspectives on practical grounds.

we should be able to consider the [first person] viewpoint of any physical object.

Well if we are choosing to work with third-person perspectives then maybe we don't need first person perspectives at all. We can describe gravity and entropy without any first person perspectives at all, for example.

I'm not against first person perspectives, but if we're working with third person perspectives then we might start by sticking to third person perspectives exclusively.

Let's look at a different type of knowledge, which I will call tacit knowledge -- stuff like being able to ride a bike (aka "know-how"). I think this can be defined (following my "very basic" theme) from an object's ability to participate successfully in patterns.

Yeah right. A screw that fits into a hole does have mutual information with the hole. I like the idea that knowledge is about the capacity to harmonize within a particular environment because it might avoid the need to define goal-directedness.

Now we can start to think about measuring the extent to which mutual information contributes to learning of tacit knowledge. Something happens to our object. It gains some mutual information w/ external stuff. If this mutual information increases its ability to pursue some goal predicate, we can say that the information is accessible wrt that goal predicate. We can imagine the goal predicate being "active" in the agent, and having a "translation system" whereby it unpacks the mutual information into what it needs.

The only problem is that now we have to say what a goal predicate is. Do you have a sense of how to do that? I have also come to the conclusion that knowledge has a lot to do with being useful in service of a goal, and that then requires some way to talk about goals and usefulness.

The hope is to eventually be able to build up to complicated types of knowledge (such as the definition you seek here), but starting with really basic forms.

I very much resonate with keeping it as simple as possible, especially when doing this kind of conceptual engineering, which can become so lost. I have been grounding my thinking in wanting to know whether or not a certain entity in the world has an understanding of a certain phenomenon, in order to use that to overcome the deceptive misalignment problem. Do you also have go-to practical problems against which to test these kinds of definitions?

Today this link does not seem to be working for me, I see:

Our apologies, your invite link has now expired (actually several hours ago, but we hate to rush people).

I also notice that the date is still 10/25 so perhaps the event is not happening today?

Yeah that resonates with me. I'd be interested in any more thoughts you have on this. Particularly anything about how we might recognize knowing in another entity or in a physical system.

Load More