Charlie Steiner

LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.


Hierarchical planning: context agents

Oh wait, are you the first author on this paper? I didn't make the connection until I got around to reading your recent post.

So when you talk about moving to a hierarchical human model, how practical do you think it is to also move to a higher-dimensional space of possible human-models, rather than using a few hand-crafted goals? This necessitates some loss function or prior probability over models, and I'm not sure how many orders of magnitude more computationally expensive it makes everything.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

My first idea is, you take your common sense AI, and rather than saying "build me a spaceship, but, like, use common sense," you can tell it "do the right thing, but, like, use common sense." (Obviously with "saying" and "tell" in invisible finger quotes.) Bam, Type-1 FAI.

Of course, whether this will go wrong or not depends on the specifics. I'm reminded of Adam Shimi et al.'s recent post that mentioned "Ideal Accomplishment" (how close to an explicit goal a system eventually gets) and "Efficiency" (how fast it gets there). If you have a general purpose "common sensical optimizer" that optimizes any goal but, like, does it in a common sense way, then before you turn it on you'd better know whether it's affecting ideal accomplishment, or just efficiency.

That is to say, if I tell it to make me the best spaceship it can or something similarly stupid, will the AI "know that the goal is stupid" and only make a normal spaceship before stopping? Or will it eventually turn the galaxy into spaceship, just taking common-sense actions along the way? The truly idiot-proof common sensical optimizer changes its final destination so that it does what we "obviously" meant, not what we actually said. The flaws in this process seem to determine if it's trustworthy enough to tell to "do the right thing," or trustworthy enough to tell to do anything at all.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

I'm a lot less excited about the literature of the world's philosophy than I am about the living students of it.

Of course, there are some choices in designing an AI that are ethical choices, for which there's no standard by which one culture's choice is better than another's. In this case, incorporating diverse perspectives is "merely" a fair way to choose how to steer the future - a thing to do because we want to, not because it solves some technical problem.

But there are also philosophical problems faced in the construction of AI that are technical problems, and I think the philosophy literature is just not going to contain a solution to these problems, because they require highly specific solutions that you're not going to think of if you're not even aware of the problem. You bring up ontological shifts, and I think the Madhyamaka Buddhist sutra you quote is a typical example - it's interesting as a human, especially with the creativity in interpretation afforded to us by hindsight, but the criteria for "interesting as a human" are so much fewer and more lenient than what's necessary to design a goal system that responds capably to ontological shifts.

The Anglo-American tradition of philosophy is in no way superior to Buddhist philosophy on this score. What is really necessary is "bespoke" philosophy oriented to the problems at hand in AI alignment. This philosophy is going to superficially sound more like analytic philosophy than, say, continental philosophy or vedic philosophy, just because of what we need it to do, but that doesn't mean it can't benefit from a diversity of viewpoints and mental toolboxes.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

Children learn to follow common sense, despite not having (explicit) meta-ethical and meta-normative beliefs at all.

Children also learn right from wrong - I'd be interested in where you draw the line between "An AI that learns common sense" and "An AI that learns right from wrong." (You say this argument doesn't apply in the case of human values, but it seems like you mean only explicit human values, not implicit ones.)

My suspicion, which is interesting to me so I'll explain it even if you're going to tell me that I'm off base, is that you're thinking that part of common sense is to avoid uncertain or extreme situations (e.g. reshaping the galaxy with nanotechnology), and that common sense is generally safe and trustworthy for an AI to follow, in a way that doesn't carry over to "knowing right from wrong." An AI that has learned right from wrong to the same extent that humans learn it might make dangerous moral mistakes.

But when I think about it like that, it actually makes me less trusting of learned common sense. After all, one of the most universally acknowledged things about common sense is that it's uncommon among humans! Merely doing common sense as well as humans seems like a recipe for making a horrible mistake because it seemed like the right thing at the time - this opens the door to the same old alignment problems (like self-reflection and meta-preferences [or should that be meta-common-sense]).


P.S. I'm not sure I quite agree with this particular setting of normativity. The reason is the possibility of "subjective objectivity", where you can make what you mean by "Quality Y" arbitrarily precise and formal if given long enough to split hairs. Thus equipped, you can turn "Does this have quality Y?" into an objective question by checking against the (sufficiently) formal, precise definition.

The point is that the aliens are going to be able to evaluate this formal definition just as well as you. They just don't care about it. Even if you both call something "Quality Y," that doesn't avail you much if you're using that word to mean very different things. (Obligatory old Eliezer post)

Anyhow, I'd bet that xuan is not saying that it is impossible to build an AI with common sense - they're saying that building an AI with common sense is in the same epistemological category as building an AI that knows right from wrong.

Literature Review on Goal-Directedness

The little quizzes were highly effective in getting me to actually read the post :)

I think that, depending on what position you take, there are differences in how much "room for a lot of work in this sphere" one thinks there is. The more you treat goal-directedness as important because it's a useful category in our map for predicting certain systems, the less important it is to be precise about it. On the other hand, if you want to treat goal-directedness in a human-independent way or otherwise care about it "for its own sake" for some reason, then it's a different story.

Why I'm excited about Debate

Good question. There's a big roadblock to your idea as stated, which is that asking something to define "alignment" is a moral question. But suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up - could we then use Debate on the question "does this candidate match the verbal specification?"

I don't know - I think it still depends on how bad humans are as judges of arguments - we've made the domain more objective, but maybe there's some policy of argumentation that still wins by what we would consider cheating. I can imagine being convinced that it would work by seeing Debates play out with superhuman litigators, but since that's a very high bar maybe I should apply more creativity to my expectations.

Why I'm excited about Debate

I think the Go example really gets to the heart of why I think Debate doesn't cut it.

The reason Go is hard is that it has a large game tree despite simple rules. When we treat an AI game as information about the value of a state of the Go board, we know exactly what the rules are and how the game should be scored; the superhuman work the AIs are doing is in searching this game tree that's too big for us. The adversarial gameplay provides a check that the search through the game tree is actually finding high-scoring policies.

What does this framework need to apply to moral arguments? That humans "know the rules" of argumentation, that we can recognize good arguments when we see them, and that what we really need help with is searching the game tree of arguments to find high-scoring policies of argumentation.

This immediately should sound a little off. If humans have any exploits (or phrased differently, if there are places where our meta-preferences and our behavior conflict), then this search process will try to find them. We can imagine trying to patch humans (e.g. giving them computer assistants), but this patching process has to already be the process of bringing human behavior in line with human meta-preferences! It's the patching process that's doing all the alignment work, reducing the Debate part to a fancy search for high-approval actions.

No, the dream of Debate is that it's a game where human meta-preferences and behavior are already aligned. For all places where they diverge, the dream is that there's some argument that will point this out and permanently fix it, and that this inconsistency-resolution process does not itself violate too many of our meta-preferences. That Debate is fair like Go is fair - each move is incremental, you can't place a Go stone that changes the layout of the board to make it impossible for your opponent to win.

Transparency and AGI safety

Re: non-agenty AGI. The typical problem is that there are incentives for individual actors to build AI systems that pursue goals in the world. So even if you postulate non-agenty AGI, you then have to further figure out why nobody has asked the Oracle AI "What's the code to an AI that will make me rich?" or asked it for the motor output of a robot given various sense data and natural-language goals, then used that output to control a robot (also see ).

Transparency and AGI safety


I'm reminded a bit of the reason why Sudoku and quantum computing are difficult: the possibilities you have to track are not purely local, they can be a nonlocal combination of different things. General NNs seem like they'd be at least NP-hard to interpret.

But this is what dropout is useful for, penalizing reliance on correlations. So maybe if you're having trouble interpreting something you can just crank up the dropout parameters. On the other hand, dropout also promotes redundancy, which might make interpretation confusing - perhaps there's something similar to dropout that's even better for interpretability.

Edit for unfiltered ideas:

You could automatically sample an image, find neurons excited, sample neurons, sample images based on how much they excite that neuron, etc, until you end up with a sampled pool of similar images and similar neurons. Then you drop out all similar neurons.
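To make the sampling loop above concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: `acts` is a toy table of non-negative activations (image by neuron), sampling is proportional to activation, and the names `co_activation_pool` and `drop_neurons` are hypothetical helpers, not part of any real library.

```python
import random

def co_activation_pool(acts, start_image, rounds=3, seed=0):
    """Alternate image -> neuron -> image sampling to collect a related pool.

    acts[i][j] is the (assumed non-negative) activation of neuron j on image i.
    """
    rng = random.Random(seed)
    n_images, n_neurons = len(acts), len(acts[0])
    images, neurons = {start_image}, set()
    image = start_image
    for _ in range(rounds):
        # Sample a neuron in proportion to how strongly the current image excites it.
        neuron = rng.choices(range(n_neurons), weights=acts[image])[0]
        neurons.add(neuron)
        # Sample the next image in proportion to how strongly it excites that neuron.
        column = [row[neuron] for row in acts]
        image = rng.choices(range(n_images), weights=column)[0]
        images.add(image)
    return images, neurons

def drop_neurons(activations, neurons):
    """Zero out every neuron in the pool (the 'drop out all similar neurons' step)."""
    return [0.0 if j in neurons else a for j, a in enumerate(activations)]

# Toy data: images 0-1 excite only neurons 0-1, images 2-3 excite only neurons 2-3.
acts = [
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7],
    [0.0, 0.0, 0.6, 1.0],
]
images, neurons = co_activation_pool(acts, start_image=0, rounds=5)
masked = drop_neurons(acts[0], neurons)
```

In the toy data the walk cannot leave the first block, so the pool stays among images 0-1 and neurons 0-1. In a real network the activations would come from a forward pass, and the pool boundaries would be much fuzzier.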

You could try anti-dropout: punishing the NN for redundancy and rewarding it for fragility/specificity. However, to avoid the incentive to create fine tuned activation/inhibition pairs, you only use positive activations for this step.
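One possible reading of the anti-dropout idea, as a sketch only: add a penalty term that grows when different neurons carry the same information, and compute it from rectified (positive-part) activations so that an activation/inhibition pair does not register as a match. The choice of mean pairwise cosine similarity and the name `redundancy_penalty` are my assumptions here, not something the idea above pins down.

```python
import math

def redundancy_penalty(acts):
    """Mean pairwise cosine similarity between neurons, using positive parts only.

    acts[i][j] is the activation of neuron j on input i. Negative activations
    are clipped to zero before comparing neurons.
    """
    n_neurons = len(acts[0])
    # One column of rectified activations per neuron.
    cols = [[max(row[j], 0.0) for row in acts] for j in range(n_neurons)]
    norms = [math.sqrt(sum(a * a for a in col)) for col in cols]
    total, pairs = 0.0, 0
    for j in range(n_neurons):
        for k in range(j + 1, n_neurons):
            pairs += 1
            if norms[j] == 0.0 or norms[k] == 0.0:
                continue  # a silent neuron is not redundant with anything
            dot = sum(a * b for a, b in zip(cols[j], cols[k]))
            total += dot / (norms[j] * norms[k])
    return total / pairs if pairs else 0.0

duplicated = redundancy_penalty([[1.0, 1.0], [2.0, 2.0]])    # two copies of one neuron
disjoint = redundancy_penalty([[1.0, 0.0], [0.0, 1.0]])      # neurons fire on different inputs
mirrored = redundancy_penalty([[1.0, -1.0], [2.0, -2.0]])    # activation/inhibition pair
```

Duplicated neurons get the maximum penalty, while disjoint neurons and the mirrored pair get none. Whether ignoring the mirrored pair is the right behavior depends on what "fine tuned activation/inhibition pairs" is supposed to rule out, so treat this as a starting point.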

Hierarchical planning: context agents

Sorry for being slow :) No, I haven't read anything of Bratman's. Should I? The synopsis looks like it might have some interesting ideas but I'm worried he could get bogged down in what human planning "really is" rather than what models are useful.

I'd totally be happy to chat either here or in PMs. Full Bayesian reasoning seems tricky if the environment is complicated enough to make hierarchical planning attractive - or do you mean optimizing a model for posterior probability (the prior being something like MML?) by local search?

I think one interesting question there is whether it can learn human foibles. For example, suppose we're playing a racing game and I want to win the race, but fail because my driving skills are bad. How diverse a dataset about me do you need to actually be able to infer that a) I am capable of conceptualizing how good my performance is, b) I wanted it to be good, and c) it wasn't good, from a hierarchical perspective, because of the lower-level planning faculties I have? I think maybe you could actually learn this only from racing game data (no need to make an AGI that can ask me about my goals and do top-down inference), so long as you had diverse enough driving data to make the "bottom-up" generalization that my low-level driving skill can be modeled as bad almost no matter the higher-level goal, and therefore it's simplest to explain me not winning a race by taking the bad driving I display elsewhere as a given and asking what simple higher-level goal fits on top.
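Here is a toy version of that "bottom-up" argument, as a hedged sketch rather than a real human model. It assumes two goals, two skill levels, a uniform prior, and made-up likelihoods (`P_BOTCH`, `P_LOSE`); the only thing it is meant to show is the explaining-away effect, where seeing bad driving elsewhere stops a lost race from counting as evidence that I didn't want to win.

```python
from itertools import product

GOALS = ("win", "cruise")
SKILLS = ("good", "bad")

# All numbers below are illustrative assumptions.
# Chance of botching a corner depends only on low-level skill.
P_BOTCH = {"good": 0.1, "bad": 0.6}
# Chance of losing the race depends on both the goal and the skill.
P_LOSE = {
    ("win", "good"): 0.2,
    ("win", "bad"): 0.9,
    ("cruise", "good"): 0.95,
    ("cruise", "bad"): 0.95,
}

def posterior_wants_to_win(botched, clean):
    """P(goal = win | lost one race, plus corner data from other driving)."""
    weights = {}
    for goal, skill in product(GOALS, SKILLS):
        prior = 0.25  # uniform over the four (goal, skill) hypotheses
        skill_lik = P_BOTCH[skill] ** botched * (1 - P_BOTCH[skill]) ** clean
        weights[(goal, skill)] = prior * skill_lik * P_LOSE[(goal, skill)]
    total = sum(weights.values())
    return sum(w for (goal, _), w in weights.items() if goal == "win") / total

race_only = posterior_wants_to_win(botched=0, clean=0)
with_driving_data = posterior_wants_to_win(botched=8, clean=2)
```

With only the lost race, the model leans toward "didn't care about winning". Once the other driving data pins my skill as bad, the loss is mostly explained by skill and the posterior on "wanted to win" moves back up toward the prior. Actually identifying the goal would still need some observation that depends on the goal itself.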
