(I don't follow it all, for instance I don't recall why it's important that the former view assumes that utility is computable.)
Partly because the "reductive utility" view is made a bit more extreme than it absolutely had to be. Partly because I think it's extremely natural, in the "LessWrong circa 2014 view", to say sentences like "I don't even know what it would mean for humans to have uncomputable utility functions -- unless you think the brain is uncomputable". (I think there is, or at least was, a big overlap between the LW crowd and the set of people... (read more)
I think we could get a GPT-like model to do this if we inserted other random sequences, in the same way, in the training data; it should learn a pattern like "non-word-like sequences that repeat at least twice tend to repeat a few more times" or something like that.
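For concreteness, here's a minimal sketch of the kind of training-data modification I have in mind (the vocabulary size, junk length, repeat count, and insertion rate are all arbitrary, made-up choices -- the point is just that the same random sequence shows up a few times in close succession):

```python
import random

def insert_repeated_junk(tokens, vocab_size=50000, junk_len=8, n_repeats=3, rate=0.001):
    """Sprinkle non-word-like sequences into a token stream, each repeated several times.

    A toy sketch: occasionally generate a random "junk" sequence and append it
    n_repeats times in a row, so a model trained on the result could pick up the
    pattern "junk that repeats at least twice tends to repeat a few more times".
    """
    out = []
    for tok in tokens:
        out.append(tok)
        if random.random() < rate:
            junk = [random.randrange(vocab_size) for _ in range(junk_len)]
            for _ in range(n_repeats):
                out.extend(junk)
    return out
```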
GPT-3 itself may or may not get the idea, since it does have some significant breadth of getting-the-idea-of-local-patterns-it's-never-seen-before.
So I don't currently see what your experiment has to do with the planning-ahead question.
I would say that the GPT training process has no "inherent" p... (read more)
I think maybe our disagreement is about how good/useful of an overarching model ACT-R is? It's definitely not like in physics, where some overarching theories are widely accepted (e.g. the standard model) even by people working on much more narrow topics -- and many of the ones that aren't (e.g. string theory) are still widely known about and commonly taught. The situation in cog sci (in my view, and I think in many people's views?) is much more that we don't have an overarching model of the mind in anywhere close to the level of detail/mechanistic specifi
I think my post (at least the title!) is essentially wrong if there are other overarching theories of cognition out there which have similar track records of matching data. Are there?
By "overarching theory" I mean a theory which is roughly as comprehensive as ACT-R in terms of breadth of brain regions and breadth of cognitive phenomena.
As someone who has also done grad school in cog-sci research (but in a computer science department, not a psychology department, so my knowledge is more AI focused), my impression is that most psychology research isn't about... (read more)
Thanks for the thoughtful response, that perspective makes sense. I take your point that ACT-R is unique in the ways you're describing, and that most cognitive scientists are not working on overarching models of the mind like that. I think maybe our disagreement is about how good/useful of an overarching model ACT-R is? It's definitely not like in physics, where some overarching theories are widely accepted (e.g. the standard model) even by people working on much more narrow topics -- and many of the ones that aren't (e.g. string theory) are still widely k... (read more)
Hope it turns out to be interesting to you!
This lines up fairly well with how I've seen psychology people geek out over ACT-R. That is: I had a psychology professor who was enamored with the ability to line up programming stuff with neuroanatomy. (She didn't use it in class or anything, she just talked about it like it was the most mind-blowing stuff she'd ever seen as a research psychologist, since normally you just get these isolated little theories about specific things.)
And, yeah, important to view it as a programming language which can model a bunch of stuff, but requires fairly extensive user in... (read more)
I think that's not quite fair. ACT-R has a lot to say about what kinds of processing are happening, as well. Although, for example, it does not have a theory of vision (to my limited understanding anyway), or of how the full motor control stack works, etc. So in that sense I think you are right.
What it does have more to say about is how the working memory associated with each modality works: how you process information in the various working memories, including various important cognitive mechanisms that you might not otherwise think about. In this sense, it's not just about interconnection like you said.
We also know how to implement it today.
I would argue that inner alignment problems mean we do not know how to do this today. We know how to limit the planning horizon for parts of a system which are doing explicit planning, but this doesn't bar other parts of the system from doing planning. For example, GPT-3 has a time horizon of effectively one token (it is only trying to predict one token at a time). However, it probably learns to internally plan ahead anyway, just because thinking about the rest of the current sentence (at least) is useful for th... (read more)
Imagine a spectrum of time horizons (and/or discounting rates), from very long to very short.
Now, if the agent is aligned, things are best with an infinite time horizon (or, really, the convergently-endorsed human discounting function; or if that's not a well-defined thing, whatever theoretical object replaces it in a better alignment theory). As you reduce the time horizon, things get worse and worse: the AGI willingly destroys lots of resources for short-term prosperity.
At some point, this trend starts to turn itself around: the AGI becomes so shortsight... (read more)
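(One generic way to formalize the knob being varied here -- just the standard discounted-return objective, nothing specific to this argument:

$$U_T(\gamma) \;=\; \sum_{t=0}^{T} \gamma^{t}\, r_t,$$

where the argument above is about what happens as the horizon T and/or the discount factor γ is pushed from very large toward very small.)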
Recently I have been thinking that we should in fact use "really basic" definitions, EG "knowledge is just mutual information", and also other things with a general theme of "don't make agency so complicated". The hope is to eventually be able to build up to complicated types of knowledge (such as the definition you seek here), but starting with really basic forms. Let me see if I can explain.
First, an ontology is just an agent's way of organizing information about the world. These can take lots of forms and I'm not going to constrain it to any partic... (read more)
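(For reference, the "really basic" notion gestured at above is just textbook mutual information,

$$I(X;Y) \;=\; \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},$$

with the hope being that "the agent knows about Y" can, at this very basic level, be cashed out as positive mutual information between some part of the agent's state and Y.)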
Suppose instead the crossing counterfactual results in a utility greater than -10. This seems very strange. By assumption, it's provable using the AI's proof system that (A='Cross' ⟹ U=−10). And the AI's counterfactual environment is supposed to line up with reality.
Right. This is precisely the sacrifice I'm making in order to solve Troll Bridge. Something like this seems to be necessary for any solution, because we already know that if your expectations of consequences entirely respect entailment, you'll fall prey to the Troll Bridge! In fact, y... (read more)
I'll talk about some ways I thought of to potentially formalize "stop thinking if it's bad".
If your point is that there are a lot of things to try, I readily accept this point, and do not mean to argue with it. I only intended to point out that, for your proposal to work, you would have to solve another hard problem.
One simple way to try to do so is to have an agent use regular evidential decision theory but give it a special "stop thinking about this thing" action that it can take. Every so often, the agent considers taking this action using regular evide
You say that a "bad reason" is one such that the agents the procedure would think is bad.
To elaborate a little, one way we could think about this would be that "in a broad variety of situations" the agent would think this property sounded pretty bad.
For example, the hypothetical "PA proves ⊥" would be evaluated as pretty bad by a proof-based agent, in many situations; it would not expect its future self to make decisions well, so, it would often have pretty poor performance bounds for its future self (eg the lowest utility available in the given scena... (read more)
Ok. This threw me for a loop briefly. It seems like I hadn't considered your proposed definition of "bad reasoning" (ie "it's bad if the agent crosses despite it being provably bad to do so") -- or had forgotten about that case.
I'm not sure I endorse the idea of defining "bad" first and then considering the space of agents who pass/fail according to that notion of "bad"; how this is supposed to work is, rather, that we critique a particular decision theory by proposing a notion of "bad" tailored to that particular decision theory. For example, if a specifi... (read more)
Seems fair. I'm similarly conflicted. In truth, both the generalization-focused path and the objective-focused path look a bit doomed to me.
Great, I feel pretty resolved about this conversation now.
I would further add that looking for difficulties created by the simplification seems very intellectually productive. (Solving "embedded agency problems" seems to genuinely allow you to do new things, rather than just soothing philosophical worries.) But yeah, I would agree that if we're defining mesa-objective anyway, we're already in the business of assuming some agent/environment boundary.
(see the unidentifiability in IRL paper)
Ah, I wasn't aware of this!
Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.
I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/worl... (read more)
Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)
I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless we start to be more inclusive about what we mean by "mesa-optimizer" and "mesa-objective." I don't think those terms as defined in RFLO actually capture humans, but I definitely want to say that we're "goal-oriented" and have "intent."
But the graph structure makes perfect sense; I'm just doing the mental substitution
Maybe a very practical question about the diagram: is there a REASON for there to be no "sufficient together" linkage from "Intent Alignment" and "Robustness" up to "Behavioral Alignment"?
Leaning hard on my technical definitions:
Robustness: Performing well on the base objective in a wide range of circumstances.
Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)
These two together do not quite imply behavioral alignment, becau... (read more)
I think there's another reason why factorization can be useful here, which is the articulation of sub-problems to try.
For example, in the process leading up to inventing logical induction, Scott came up with a bunch of smaller properties to try for. He invented systems which got desirable properties individually, then growing combinations of desirable properties, and finally, figured out how to get everything at once. However, logical induction doesn't have parts corresponding to those different subproblems.
It can be very useful to individually achieve, sa... (read more)
I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)
For... (read more)
They can't? Why not?
I meant to invoke a no-free-lunch type intuition; we can always construct worlds where some particular tool isn't useful.
My go-to would be "a world that checks what an InfraBayesian would expect, and does the opposite". This is enough for the narrow point I was trying to make (that InfraBayes does express some kind of regularity assumption about the world), but it's not very illustrative or compelling for my broader point (that InfraBayes plausibly addresses your concerns about learning theory). So I'll try to tell a better stor... (read more)
No such thing is possible in reality, as an agent cannot exist without its environment, so why shouldn't we talk about the mesa-objective being over a perturbation set, too, just that it has to be some function of the model's internal features?
This makes some sense, but I don't generally trust some "perturbation set" to in fact capture the distributional shift which will be important in the real world. There has to at least be some statement that the perturbation set is actually quite broad. But I get the feeling that if we could make the right statement there, we would understand the problem in enough detail that we might have a very different framing. So, I'm not sure what to do here.
Great! I feel like we're making progress on these basic definitions.
InfraBayes doesn't look for the regularity in reality that NNs are taking advantage of, agreed. But InfraBayes is exactly about "what kind of regularity assumptions can we realistically make about reality?" You can think of it as a reaction to the unrealistic nature of the regularity assumptions which Solomonoff induction makes. So it offers an answer to the question "what useful+realistic regularity assumptions could we make?"
The InfraBayesian answer is "partial models". IE, the idea that even if reality cannot be completely described by usable models, pe... (read more)
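(Very roughly, and glossing over the actual infra-Bayesian machinery: a partial model can be thought of as a set M of environments rather than a single environment, with guarantees of a worst-case-within-the-set flavor, something like

$$\pi^{*} \in \arg\max_{\pi}\ \min_{\mu \in \mathcal{M}}\ \mathbb{E}_{\mu,\pi}[U],$$

so the only regularity assumed about reality is that it lies somewhere in M, not that it is any particular completely-specified model.)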
I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that's done.
But it seems to me that there's something missing in terms of acceptability.
The definition of "objective robustness" I used says "aligns with the base objective" (including off-distribution). But I think this isn't an appropriate representation of your approach. Rather, "objective robustness" has to be defined something like "generalizes acceptably". Then, ideas like adversarial training and checks and balances make sense as ... (read more)
All of that made perfect sense once I thought through it, and I tend to agree with most of it. I think my biggest disagreement with you is that (in your talk) you said you don't expect formal learning theory work to be relevant. I agree with your points about classical learning theory, but the alignment community has been developing basically-classical-learning-theory tools which go beyond those limitations. I'm optimistic that stuff like Vanessa's InfraBayes could help here.
Granted, there's a big question of whether that kind of thing can be competitive. (Although there could potentially be a hybrid approach.)
I've watched your talk at SERI now.
One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. In your talk, you mention looking at why the model does what it does, in addition to just looking at what it does. This makes sense to me (I talk about similar things), but it seems just about as fraught as the notion of mesa-objective:
(Meta: was this meant to be a question?)
I originally conceived of it as such, but in hindsight, it doesn't seem right.
In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.
I don't think this is actually a con of the generalization-focused approach.
By no means did I intend it to be a con. I'll try to edit to clarify. I think it is a real pro of the generalization-focused approach that it does not rely on models having mesa-objectives (putting it in Evan's terms, there is a real poss... (read more)
Are you the historical origin of the robustness-centric approach?
Idk, probably? It's always hard for me to tell; so much of what I do is just read what other people say and make the ideas sound sane to me. But stuff I've done that's relevant:
If there were a "curated posts" system on the alignment forum, I would nominate this for curation. I think it's a great post.
All of which I really should have remembered, since it's all stuff I have known in the past, but I am a doofus. My apologies.
(But my error wasn't being too mired in EDT, or at least I don't think it was; I think EDT is wrong. My error was having the term "counterfactual" too strongly tied in my head to what you call linguistic counterfactuals. Plus not thinking clearly about any of the actual decision theory.)
I'm glad I pointed out the difference between linguistic and DT counterfactuals, then!
It still feels to me as if your proof-based agents are unrealis
It's obvious how ordinary conditionals are important for planning and acting (you design a bridge so that it won't fall down if someone drives a heavy lorry across it; you don't cross a bridge because you think the troll underneath will eat you if you cross), but counterfactuals? I mean, obviously you can put them into a particular problem
All the various reasoning behind a decision could involve material conditionals, probabilistic conditionals, logical implication, linguistic conditionals (whatever those are), linguistic counterfactuals, decision-theoret... (read more)
Agreed. The asymmetry needs to come from the source code for the agent.
In the simple version I gave, the asymmetry comes from the fact that the agent checks for a proof that x>y before checking for a proof that y>x. If this were reversed, then, as you said, the Löbian reasoning would make the agent take the 5 instead of the 10.
In a less simple version, this could be implicit in the proof search procedure. For example, the agent could wait for any proof of the conclusion x>y or y>x, and make a decision based on whichever happened first. Then ther... (read more)
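Here is a toy rendering of the simple version, just to pin down where the asymmetry lives (the proof-search interface is made up for illustration, and I'm assuming x names the value of taking the 10 and y the value of taking the 5 -- only the ordering of the two checks matters for the point):

```python
def act(prove):
    """Toy sketch of the simple version: `prove(stmt)` is some bounded proof
    search in the agent's theory (assumed, not specified here)."""
    if prove("x > y"):   # checked first: this ordering is the source of the asymmetry
        return 10
    if prove("y > x"):
        return 5
    return 5             # arbitrary fallback if neither proof is found in time
```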
While I agree that the algorithm might output 5, I don't share the intuition that it's something that wasn't 'supposed' to happen, so I'm not sure what problem it was meant to demonstrate.
OK, this makes sense to me. Instead of your (A) and (B), I would offer the following two useful interpretations:
1: From a design perspective, the algorithm chooses 5 when 10 is better. I'm not saying it has "computed argmax incorrectly" (as in your A); an agent design isn't supposed to compute argmax (argmax would be insufficient to solve this problem, because we're not g... (read more)
Yep, agreed. I used the language "false antecedents" mainly because I was copying the language in the comment I replied to, but I really had in mind "demonstrably false antecedents".
Yeah, interesting. I don't share your intuition that nested counterfactuals seem funny. The example you give doesn't seem ill-defined due to the nesting of counterfactuals. Rather, the antecedent doesn't seem very related to the consequent, which generally has a tendency to make counterfactuals ambiguous. If you ask "if calcium were always ionic, would Nixon have been elected president?" then I'm torn between three responses:
I agree that much of what's problematic about the example I gave is that the "inner" counterfactuals are themselves unclear. I was thinking that this makes the nested counterfactual harder to make sense of (exactly because it's unclear what connection there might be between them) but on reflection I think you're right that this isn't really about counterfactual nesting and that if we picked other poorly-defined (non-counterfactual) propositions we'd get a similar effect: "If it were morally wrong to eat shellfish, would humans Really Truly Have Free Will?"... (read more)
Hmm. I'm not following. It seems like you follow the chain of reasoning and agree with the conclusion:
The algorithm doesn't try to select an assignment with largest U(), but rather just outputs 5 if there's a valid assignment with x>y, and 10 otherwise. Only p2 fulfills the condition, so it outputs 5.
This is exactly the point: it outputs 5. That's bad! But the agent as written will look perfectly reasonable to anyone who has not thought about the spurious proof problem. So, we want general tools to avoid t... (read more)
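To make the failure concrete, the selection rule under discussion is essentially the following (the underlying constraint system and the specific assignments p1, p2 aren't reproduced here, only the rule):

```python
def act(valid_assignments):
    """Each assignment is assumed to be a dict giving values for x and y.

    Note the algorithm does NOT pick the assignment with the largest utility:
    it outputs 5 as soon as *some* valid assignment has x > y, and 10 otherwise.
    """
    if any(a["x"] > a["y"] for a in valid_assignments):
        return 5
    return 10
```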
Ah, I wasn't strongly differentiating between the two, and was actually leaning toward your proposal in my mind. The reason I was not differentiating between the two was that the probability of C(A|B) behaves a lot like the probabilistic value of Prc(A|B). I wasn't thinking of nearby-world semantics or anything like that (and would contrast my proposal with such a proposal), so I'm not sure whether the C(A|B) notation carries any important baggage beyond that. However, I admit it could be an important distinction; C(A|B) is itself a proposition, which can ... (read more)
I never found Stalnaker's thesis at all plausible, not because I'd thought of the ingenious little calculation you give but because it just seems obviously wrong intuitively. But I suppose if you don't have any presuppositions about what sort of notion an implication is allowed to be, you don't get to reject it on those grounds. So I wasn't really entitled to say "Pr(A|B) is not the same thing as Pr(B=>A) for any particular notion of implication", since I hadn't thought of that calculation.
Anyway, I have just the same sense of obvious wrongness about th... (read more)
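(For readers: the sort of calculation being referred to, reconstructed here for the material conditional -- this may not be the exact argument given upthread, but it shows the basic gap:

$$\Pr(B \supset A) \;=\; \Pr(\lnot B) + \Pr(A \wedge B) \;=\; \Pr(\lnot B) + \Pr(A \mid B)\Pr(B),$$

hence

$$\Pr(B \supset A) - \Pr(A \mid B) \;=\; \Pr(\lnot B)\,\bigl(1 - \Pr(A \mid B)\bigr) \;\ge\; 0,$$

with equality only when Pr(B)=1 or Pr(A|B)=1 -- so the probability of the material conditional and the conditional probability can only coincide in degenerate cases.)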
I should! But I've got a lot of things to write up!
It also needs a better name, as there have been several things termed "weak logical induction" over time.
In between … well … in between, we're navigating treacherous waters …
Right, I basically agree with this picture. I might revise it a little:
I don't believe that LI provides such a Pareto improvement, but I suspect that there's a broader theory which contains the two.
Overall, I place much less weight on arguments that revolve around the presumed nature of human values compared to arguments grounded in abstract reasoning about rational agents.
Ah. I was going for the human-values argument because I thought you might not appreciate the rational-agent argument. After all, who cares what general rational agents can value, if human values happen to be well-represented by InfraBayes?
But for general ra... (read more)
I agree inasmuch as we actually can model these sorts of preferences, for a sufficiently strong meaning of "model". I feel that it's much harder to be confident about any detailed claim about human values than about the validity of a generic theory of rationality. Therefore, if the ultimate generic theory of rationality imposes some conditions on utility functions (while still leaving a very rich space of different utility functions), that will lead me to try formalizing human values within those constraints. Of course, given a candidate theory, we should po
If PA is consistent, then the agent cannot prove U = -10 (or anything else inconsistent) under the assumption that the agent already crossed, and therefore Löb's theorem fails to apply. In this case, there is no weird certainty that crossing is doomed.
I think this is the wrong step. Why do you think this? Just because PA is consistent doesn't mean you can't prove weird things under assumption. Look at the structure of the proof. You're objecting to an assumption. ("Suppose PA proves that crossing -> U=-10") That's a pretty weird way to object to a proof. I'm allowed to make any assumptions I like.
My guess is that you are wrestling with Löb's theorem itself. Löb's theorem is pretty weird!
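(For reference, writing □P for "PA proves P", the theorem in question says

$$\text{if } PA \vdash \Box P \rightarrow P \text{, then } PA \vdash P.$$

In the Troll Bridge argument, P is instantiated as (crossing -> U=-10): the "suppose PA proves that crossing -> U=-10" step is the inner hypothetical used to establish □P → P before Löb is applied, not a premise the final conclusion depends on.)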
It seems to me that the last paragraph should update you to thinking that this plan is no worse than the default. IE: yes, this plan creates additional risk because there are complicated pathways a malign gpt-n could use to get arbitrary code run on a big computer. But if people are giving it that chance anyway, it does seem like a small increase in risk with a large potential gain. (Small, not zero, for the chance that your specific gpt-n instance somehow becomes malign when others are safe, eg if something about the task actually activated a subtle malignancy not present during other tasks).
So for me a crux would be, if it's not malign, how good could we expect the papers to actually be?
First, I'm not sure exactly why you think this is bad. Care to say more? My guess is that it just doesn't fit the intuitive notion that updates should be heading toward some state of maximal knowledge. But we do fit this intuition in other ways; specifically, logical inductors eventually trust their future opinions more than their present opinions.
Personally, I found this result puzzling but far from damning.
Second, I've actually done some unpublished work on this. There is a variation of the logical induction criterion which is more relaxed (admits more t... (read more)
So it's still in the observation-utility paradigm I think, or at least it seems to me that it doesn't have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn't have to. In the human example, some people are hedonists, but others aren't.
All sounds perfectly reasonable. I just hope you recognize that it's all a big mess (because it's difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or an... (read more)
OK, so, here is a question.
The abstract theory of InfraBayes (like the abstract theory of Bayes) elides computational concerns.
In reality, all of ML can more or less be thought of as using a big search for good models, where "good" means something approximately like MAP, although we can also consider more sophisticated variational targets. This introduces two different types of approximation:
What we want out of InfraBayes is a bounded regret guarantee (in settings ... (read more)
My hope is that we will eventually have computationally feasible algorithms that satisfy provable (or at least conjectured) infra-Bayesian regret bounds for some sufficiently rich hypothesis space. Currently, even in the Bayesian case, we only have such algorithms for poor hypothesis spaces, such as MDPs with a small number of states. We can also rule out such algorithms for some large hypothesis spaces, such as short programs with a fixed polynomial-time bound. In between, there should be some hypothesis space which is small enough to be feasible and rich... (read more)
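(For concreteness, the shape of guarantee being discussed -- a generic regret desideratum, not anything specific to the infra-Bayesian machinery: for every environment μ in the hypothesis space H,

$$\mathrm{Regret}_T(\mu) \;=\; \max_{\pi}\ \mathbb{E}_{\mu,\pi}\!\Big[\sum_{t=1}^{T} r_t\Big] \;-\; \mathbb{E}_{\mu,\mathrm{agent}}\!\Big[\sum_{t=1}^{T} r_t\Big] \;=\; o(T),$$

i.e. the learning agent's long-run average performance approaches that of the best policy for whichever μ ∈ H reality turns out to be.)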