All of abramdemski's Comments + Replies

The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

I think that it generally seems like a good idea to have solid theories of two different things:

  1. What is the thing we are hoping to teach the AI?
  2. What is the training story by which we mean to teach it?

I read your above paragraph as maligning (1) in favor of (2). In order... (read more)

I said: 

The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans.
[...]
In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.

You said: 

In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people).

Thinking ... (read more)

If you commit to the specific view of outer/inner alignment, then now you also want your loss function to "represent" that goal in some way.

I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases -- or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent.

This is because writing down the desired inductive biases as an explicit prior can help us to understand... (read more)
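As one standard illustration of the "write the inductive bias down as an explicit prior" move (a textbook correspondence, not something specific to the proposal above): a Gaussian prior over parameters is exactly an L2 / weight-decay penalty at the MAP estimate,

$$
\hat{\theta}_{\mathrm{MAP}}
= \arg\max_{\theta}\, p(\theta \mid D)
= \arg\min_{\theta}\, \Big[ -\log p(D \mid \theta) + \tfrac{1}{2\sigma^{2}} \lVert \theta \rVert^{2} \Big]
\quad \text{when } p(\theta) = \mathcal{N}(0, \sigma^{2} I).
$$

Reading regularizers and architectural choices off as (approximate) log-priors in this way is one concrete version of making the desired inductive biases explicit enough to inspect.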

I doubt this due to learning from scratch.

I expect you'll say I'm missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is "figuring out the human prior", because a prior is just a way-to-learn. You might object to the overly Bayesian framing of that; but I'm fine with that. I am not dogmatic about orthodox bayesianism. I do not even like utility functions.

Insofar as the ques

... (read more)

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not

... (read more)

This doesn't seem relevant for non-AIXI RL agents which don't end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?

With almost any kind of feedback process (IE: any concrete proposals that I know of), similar concerns arise. As I argue here, wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating feedback is, too literally, identified with the truth/value which that feedback is trying to teach.

Output-based evalu... (read more)

I'm a bit uncomfortable with the "extreme adversarial threats aren't credible; players are only considering them because they know you'll capitulate" line of reasoning because it is a very updateful line of reasoning. It makes perfect sense for UDT and functional decision theory to reason in this way. 

I find the chicken example somewhat compelling, but I can also easily give the "UDT / FDT retort": since agents are free to choose their policy however they like, one of their options should absolutely be to just go straight. And arguably, the agent shou... (read more)

The agent's own generative model also depends on (adapts to, is learned from, etc.) the agent's environment. This last bit comes from "Discovering Agents".

"Having own generative model" is the shakiest part.

What it means for the agent to "have a generative model" is that the agent systematically corrects this model based on its experience (to within some tolerable competence!).

It probably means that storage, computation, and maintenance (updates, learning) of the model all happen within the agent's boundaries: if not, the agent's boundaries shall be widened

... (read more)

I think the main problem is that expected utility theory is in many ways our most well-developed framework for understanding agency, but it makes no empirical predictions, and in particular does not tie agency to other important notions of optimization we can come up with (and which, in fact, seem like they should be closely tied to agency).

I'm identifying one possible source of this disconnect.

The problem feels similar to trying to understand physical entropy without any uncertainty. So it's like, we understand balloons at the atomic level, but we notice th... (read more)

I think Bob still doesn't really need a two-part strategy in this case. Bob knows that Alice believes "time and space are relative", so Bob believes this proposition, even though Bob doesn't know the meaning of it. Bob doesn't need any special-case rule to predict Alice. The best thing Bob can do in this case still seems like, predict Alice based off of Bob's own beliefs.

(Perhaps you are arguing that Bob can't believe something without knowing what that thing means? But to me this requires bringing in extra complexity which we don't know how to handle anyw... (read more)

Another example of this happening comes when thinking about utilitarian morality, which by default doesn't treat other agents as moral actors (as I discuss here).

Interesting point! 

Maintain a model of Alice's beliefs which contains the specific things Alice is known to believe, and use that to predict Alice's actions in domains closely related to those beliefs.

It sounds to me like you're thinking of cases on my spectrum, somewhere between Alice>Bob and Bob>Alice. If Bob thinks Alice knows strictly more than Bob, then Bob can just use Bob's own b... (read more)

Richard Ngo (3 points, 1mo):
No, I'm thinking of cases where Alice>Bob, and trying to gesture towards the distinction between "Bob knows that Alice believes X" and "Bob can use X to make predictions". For example, suppose that Bob is a mediocre physicist and Alice just invented general relativity. Bob knows that Alice believes that time and space are relative, but has no idea what that means. So when trying to make predictions about physical events, Bob should still use Newtonian physics, even when those calculations require assumptions that contradict Alice's known beliefs.

I've often repeated scenarios like this, or like the paperclip scenario.

My intention was never to state that the specific scenario was plausible or default or expected, but rather, that we do not know how to rule it out, and because of that, something similarly bad (but unexpected and hard to predict) might happen.

The structure of the argument we eventually want is one which could (probabilistically, and of course under some assumptions) rule out this outcome. So to me, pointing it out as a possible outcome is a way of pointing to the inadequacy of o... (read more)

If opens are thought of as propositions, and specialization order as a kind of ("logical") time, 

Up to here made sense.

with stronger points being in the future of weaker points, then this says that propositions must be valid with respect to time (that is, we want to only allow propositions that don't get invalidated).

After here I was lost. Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that?

This setting moti

... (read more)
Vladimir Nesov (1 point, 5mo):
This was just defining/motivating terms (including "validity") for this context, the technical answer is to look at the definition of specialization preorder, when it's being suggestively called "logical time". If an open is a "proposition", and a point being contained in an open is "proposition is true at that point", and a point stronger in specialization order than another point is "in the future of the other point", then in these terms we can say that "if a proposition is true at a point, it's also true at a future point", or that "propositions are valid with respect to time going forward", in the sense that their truth is preserved when moving from a point to a future point.

Logical time is intended to capture decision making, with future decisions advancing the agent's point of view in logical time. So if an agent reasons only in terms of propositions valid with respect to advancement of logical time, then any knowledge it accumulated remains valid as it makes decisions, that's some of the motivation for looking into reasoning in terms of such propositions.

This is mostly about how domain theory describes computations, the interesting thing is how the computations are not necessarily in the domains at all, they only leave observations there, and it's the observations that the opens are ostensibly talking about, yet the goal might be to understand the computations, not just the observations (in program semantics, the goal is often to understand just the observations though, and a computation might be defined to only be its observed behavior). So one point I wanted to make is to push against the perspective where points of a space are what the logic of opens is intended to reason about, when the topology is not Frechet (has nontrivial specialization preorder).

Yeah, I've got nothing, just a sense of direction and a lot of theory to study, or else there would've been a post, not just a comment triggered by something on a vaguely similar topic. So this thread i
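To spell out the definition being pointed to (standard topology, with the direction convention chosen to match the comment's usage):

$$
x \sqsubseteq y \;\; (\text{"} y \text{ is in the future of } x \text{"})
\quad\iff\quad
\forall\, U \text{ open}: \; x \in U \;\Rightarrow\; y \in U .
$$

So, by definition, any proposition (open set) true at a point remains true at every point above it in the specialization preorder; that is the precise sense in which opens are the propositions "valid with respect to logical time", and a proposition that could be invalidated by moving forward simply fails to be an open of the topology.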

As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.

I largely agree. In retrospect, a large part of the point of this post for me is that it's practical to think of decision-theoretic agents as having expected value estimates for everything without having a utility function anywhere, which the expected values are "expectations of". 

A utility function is a gadget for turning probability distributions into expected values. This object makes sense in ... (read more)
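One standard way to make "expected values for everything, with no utility function underneath" precise is Jeffrey's desirability axiom (offered as a sketch of the kind of object in question, not necessarily the exact formalism intended):

$$
V(A \vee B) \;=\; \frac{P(A)\,V(A) + P(B)\,V(B)}{P(A) + P(B)}
\qquad \text{for incompatible propositions } A, B .
$$

Here $V$ attaches directly to propositions/events; nothing requires it to arise as the expectation of some $U$ defined on maximally specific worlds.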

Not to disagree hugely, but I have heard one religious conversion (an enlightenment type experience) described in a way that fits with "takeover without holding power over someone". Specifically this person described enlightenment in terms close to "I was ready to pack my things and leave. But the poison was already in me. My self died soon after that."

It's possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).

Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by "guessing true names". I think the approach makes sense, but my argument for why this is the case does differ from John's arguments here.

You can ensure zero mutual information by building a sufficiently thick lead wall. By convention in engineering, any number is understood as a range, based on the number of significant digits relevant to the calculation. So "zero" is best understood as "zero within some tolerance". So long as we are not facing an intelligent and resourceful adversary, there will probably be a human-achievable amount of lead which cancels the signal sufficiently. 

This serves to illustrate the point that sometimes we can find ways to bound an error to within desirable t... (read more)
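A toy sketch of the "zero within tolerance" point, under the entirely made-up assumption that leakage through the shielding behaves like a Gaussian (AWGN) channel whose signal-to-noise ratio drops by a fixed factor per centimeter of lead; only the shape of the conclusion matters, not the numbers:

```python
import math

def awgn_capacity_bits(snr: float) -> float:
    """Shannon capacity of an additive white Gaussian noise channel, in bits
    per channel use: C = 0.5 * log2(1 + SNR)."""
    return 0.5 * math.log2(1.0 + snr)

base_snr = 1e6  # hypothetical unshielded signal-to-noise ratio
attenuation_per_cm = 10.0  # hypothetical: SNR falls by 10x per cm of lead

for thickness_cm in (0, 5, 10, 15, 20):
    snr = base_snr / attenuation_per_cm ** thickness_cm
    print(f"{thickness_cm:2d} cm: at most {awgn_capacity_bits(snr):.3g} bits/use leak")

# The leaked mutual information never hits exactly zero, but past some thickness
# it is negligible for any non-adversarial source of noise -- "zero within
# tolerance". An intelligent adversary optimizing against the residual channel
# is exactly the case where this style of argument stops being reassuring.
```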

TLW (0 points, 6mo):
My objection is actually mostly to the example itself. As you mention: [...] Compare with the example: [...] This is analogous to the case of... trying to contain a malign AI which is already not on our side.

So, I think the other answers here are adequate, but not super satisfying. Here is my attempt.

The frame of "generalization failures" naturally primes me (and perhaps others) to think of ML as hunting for useful patterns, but instead fitting to noise. While pseudo-alignment is certainly a type of generalization failure, it has different connotations: that of a system which has "correctly learned" (in the sense of internalizing knowledge for its own use), but still does not perform as intended.

The mesa-optimizers paper defines inner optimizers as performing ... (read more)

This definitely isn't well-defined, and this is the main way in which ELK itself is not well-defined and something I'd love to fix. That said, for now I feel like we can just focus on cases where the counterexamples obviously involve the model knowing things (according to this informal definition). Someday in the future we'll need to argue about complicated border cases, because our solutions work in every obvious case. But I think we'll have to make a lot of progress before we run into those problems (and I suspect that progress will mostly resolve the am

... (read more)

Yeah, sorry, poor wording on my part. What I meant in that part was "argue that the direct translator cannot be arbitrarily complex", although I immediately mention the case you're addressing here in the parenthetical right after what you quote.

Ah, I just totally misunderstood the sentence, the intended reading makes sense.

Well, it might be that a proposed solution follows relatively easily from a proposed definition of knowledge, in some cases. That's the sort of solution I'm going after at the moment. 

I agree that's possible, and it does seem like a... (read more)

Job applicants often can't start right away; I would encourage you to apply!

Infradistributions are a generalization of sets of probability distributions. Sets of probability distributions are used in "imprecise bayesianism" to represent the idea that we haven't quite pinned down the probability distribution. The most common idea about what to do when you haven't quite pinned down the probability distribution is to reason in a worst-case way about what that probability distribution is. Infrabayesianism agrees with this idea.

One of the problems with imprecise bayesianism is that they haven't come up with a good update rule -- turns ... (read more)
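A minimal sketch of the worst-case rule described above (maximin expected utility over a credal set), with made-up numbers:

```python
# Credal set: the coin's heads-probability is somewhere in {0.3, 0.5, 0.7},
# and we refuse to commit to a single prior over these possibilities.
credal_set = (0.3, 0.5, 0.7)

# Utility of each action as a function of the coin outcome.
actions = {
    "bet_on_heads": {"H": 1.0, "T": -1.0},
    "decline":      {"H": 0.0, "T": 0.0},
}

def expected_utility(payoffs, p_heads):
    return p_heads * payoffs["H"] + (1.0 - p_heads) * payoffs["T"]

# Evaluate each action by its worst-case expected utility over the whole set.
for name, payoffs in actions.items():
    worst_case = min(expected_utility(payoffs, p) for p in credal_set)
    print(f"{name}: worst-case EU = {worst_case:+.2f}")
# decline (+0.00) beats bet_on_heads (-0.40): the bet looks good under p = 0.7,
# but the worst-case rule only cares about the least favorable distribution.
```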

One of the problems with imprecise bayesianism is that they haven't come up with a good update rule -- turns out it's much trickier than it looks. You can't just update all the distributions in the set, because [reasons i am forgetting]. Part of the reason infrabayes generalizes imprecise bayes is to fix this problem.

The reason you can't just update all the distributions in the set is, it wouldn't be dynamically consistent. That is, planning ahead what to do in every contingency versus updating and acting accordingly would produce different policies.

The... (read more)
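One concrete, standard illustration of how member-wise updating breaks dynamic consistency is the "dilation" phenomenon; a toy sketch (offered as an example of the general problem, not necessarily the exact argument intended above):

```python
from itertools import product

# X is a fair coin we fully understand; Z is a coin whose bias q is completely
# unknown; Y = X XOR Z. The credal set has one joint distribution over (X, Y)
# for each candidate value of q.
def joint(q):
    dist = {}
    for x, z in product((0, 1), repeat=2):
        p = 0.5 * (q if z == 1 else 1.0 - q)
        dist[(x, x ^ z)] = dist.get((x, x ^ z), 0.0) + p
    return dist

credal_set = [joint(q) for q in (0.0, 0.25, 0.5, 0.75, 1.0)]

# A bet that pays +1.1 if Y = 1 and -1 if Y = 0.
def payoff(y):
    return 1.1 if y == 1 else -1.0

def ex_ante_value(dist):
    return sum(p * payoff(y) for (x, y), p in dist.items())

def value_after_observing(dist, x_obs):
    p_x = sum(p for (x, y), p in dist.items() if x == x_obs)
    return sum(p * payoff(y) for (x, y), p in dist.items() if x == x_obs) / p_x

print([round(ex_ante_value(d), 3) for d in credal_set])
# -> 0.05 for every member: the worst case over the set is +0.05, so the
#    planning-ahead agent happily commits to the bet.
print([round(value_after_observing(d, 1), 3) for d in credal_set])
# -> ranges from -1.0 to 1.1: after member-wise conditioning on X (either
#    value, by symmetry), the worst case is -1.0, so the updated agent refuses.
# Plan-ahead and update-then-act disagree: that is the dynamic inconsistency.
```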

I'd be happy to chat about it some time (PM me if interested). I don't claim to have a fully worked out solution, though. 

Any more detailed thoughts on its relevance? EG, a semi-concrete ELK proposal based on this notion of truth/computationalism? Can identifying-running-computations stand in for direct translation?

The main difficulty is that you still need to translate between the formal language of computations and something humans can understand in practice (which probably means natural language). This is similar to Dialogic RL. So you still need an additional subsystem for making this translation, e.g. AQD. At which point you can ask, why not just apply AQD directly to a pivotal[1] action?

I'm not sure what the answer is. Maybe we should apply AQD directly, or maybe AQD is too weak for pivotal actions but good enough for translation. Or maybe it's not even good en... (read more)

Your definition requires that we already know how to modify Alice to have Clippy's goals. So your brute-force idea for how to modify Clippy to have Alice's knowledge doesn't add very much; it still relies on a magic goal/belief division, so giving a concrete algorithm doesn't really clarify.

Really good to see this kind of response.

Ben Pace (1 point, 7mo):
Ah, very good point. How interesting… (If I’d concretely thought of transferring knowledge between a bird and a dog this would have been obvious.)

To be pedantic, "pragmatism" in the context of theories of knowledge means "knowledge is whatever the scientific community eventually agrees on" (or something along those lines -- I have not read deeply on it). [A pragmatist approach to ELK would, then, rule out "the predictor's knowledge goes beyond human science" type counterexamples on principle.] 

What you're arguing for is more commonly called contextualism. (The standards for "knowledge" depend on context.)

I totally agree with contextualism as a description of linguistic practice, but I think the... (read more)

Charlie Steiner (3 points, 7mo):
Pragmatism's a great word, everyone wants to use it :P But to be specific, I mean more like Rorty (after some Yudkowskian fixes) than Peirce.

I think a lot of the values we care about are cultural, not just genetic. A human raised without culture isn't even clearly going to be generally intelligent (in the way humans are), so why assume they'd share our values?

Estimations of the information content of this part are discussed by Eric Baum in What is Thought?, although I do not recall the details.

I find that plausible, a priori. Mostly doesn't affect the stuff in the talk, since that would still come from the environment, and the same principles would apply to culturally-derived values as to environment-derived values more generally. Assuming the hardwired part is figured out, we should still be able to get an estimate of human values within the typical-human-value-distribution-for-a-given-culture from data which is within the typical-human-environment-distribution-for-that-culture.

I agree. There's nothing magical about "once". I almost wrote "once or twice", but it didn't sit well with the level of caution I would prefer be the norm. While your analysis seems correct, I am worried if that's the plan. 

I think a safety team should go into things with the attitude that this type of thing is an important last line of defense, but should never trigger. The plan should involve a strong argument that what's being built is safe. In fact if this type of safeguard gets triggered, I would want the policy to be to go back to the drawing boa... (read more)

Hoagy (1 point, 8mo):
Yeah fully agreed.

Wait, so, what do you actually do with the holdout data? Your stated proposal doesn't seem to do anything with it. But, clearly, data that's simply held out forever is of no use to us.

It seems like this holdout data is the sort of precaution which can be used once. When we see (predicted) sensor tampering, we shut the whole project down. If we use that information to iterate on our design at all we enter into dangerous territory: we're now optimizing the whole setup to avoid that kind of discrepancy, which means it may become useless for detecting tamperin... (read more)

Hoagy (1 point, 8mo):
I see John agrees with the 'one-time' label but it seems a bit too strong to me, especially if the kind of optimization is 'let's try a totally different approach', rather than continuing to train the same system, or focusing on exactly why it spoofed one sensor but not the other.

Just to think it through: there are three types of system that are important: type A, which fails on the validation/holdout data; type B, which succeeds on validation but not test/real-world data; and type C, which succeeds on both. We are looking for type C, and we use the validation data to distinguish A from either B or C.

Naively, waiting longer for a system that is not-A wouldn't have a bearing on whether it is B or C, but upon finding A, we know it is finding the strategy of spoofing sensors, and the more times we find A, the more we suspect this strategy is dominant, which suggests that partial spoofing (B) is more likely than no spoofing (C). Therefore, when we find not-A after a series of As, it is more likely to be B than if we found not-A on our first try.

I agree with the logic, but it seems like our expectation of the B:C ratio will increase smoothly over time. If the holdout sensors are different to the non-holdout ones, costly to spoof, and any leakage is minimized (maximizing initial expectations of the C:B ratio), then finding not-A seems to be meaningful evidence in favor of C for a while. Not to say that this solves ELK, but it seems like it should remain (ever weaker) evidence in favor of honesty for multiple iterations, though I can't say I know how steep the fall-off should be.

This could also be extended by having multiple levels of holdout data, the next level being only evaluated once we have sufficient confidence that it is honest (accounting for the declining level of evidence given by previous levels, with the assumption that there are other means of testing).

That is exactly correct, yes.

An intriguing point.

My inclination is to guess that there is a broad basin of attraction if we're appropriately careful in some sense (and the same seems true for corrigibility). 

In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.

Here's a story about what "being appropriately careful" might mean. It could mean building a system that's trying to figure out values in roughly the way that humans try to figure out values (IE, solving meta-philosophy). This could be self-correcting because it ... (read more)

Pithy one-sentence summary: to the extent that I value corrigibility, a system sufficiently aligned with my values should be corrigible.

My inclination is to guess that there is a broad basin of attraction if we’re appropriately careful in some sense (and the same seems true for corrigibility).

In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.

What do you think the chances are of humanity being collectively careful enough, given that (in addition to the bad metapreferences I cited in the OP) it's devoting approximately 0.0000001% of its resources (3 FTEs, to give a generous overestimate) to studying either metaphilosop... (read more)

the attractor basin is very thin along some dimensions, but very thick along some other dimensions

There was a bunch of discussion along those lines in the comment thread on this post of mine a couple years ago, including a claim that Paul agrees with this particular assertion.

(I don't follow it all, for instance I don't recall why it's important that the former view assumes that utility is computable.)

Partly because the "reductive utility" view is made a bit more extreme than it absolutely had to be. Partly because I think it's extremely natural, in the "LessWrong circa 2014 view", to say sentences like "I don't even know what it would mean for humans to have uncomputable utility functions -- unless you think the brain is uncomputable". (I think there is, or at least was, a big overlap between the LW crowd and the set of people... (read more)
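A simple example of what an uncomputable (human) utility function could even mean, in the procrastination-paradox style (a sketch for illustration, not a claim about what the post itself had in mind): let an outcome be an infinite sequence of days, and let utility be 1 exactly when a certain button is eventually pressed:

$$
U(o_1, o_2, o_3, \ldots) \;=\;
\begin{cases}
1 & \text{if } o_t = \text{press for some } t, \\
0 & \text{otherwise.}
\end{cases}
$$

This is a coherent preference, but no finite prefix of the outcome ever certifies the value 0, so $U$ is not a computable function of the observation sequence; insisting that utility be computable quietly rules preferences like this out, which is the sort of thing the "reductive utility" framing has to either exclude or handle specially.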

I think we could get a GPT-like model to do this if we inserted other random sequences, in the same way, in the training data; it should learn a pattern like "non-word-like sequences that repeat at least twice tend to repeat a few more times" or something like that.

GPT-3 itself may or may not get the idea, since it does have some significant breadth of getting-the-idea-of-local-patterns-it's-never-seen-before.

So I don't currently see what your experiment has to do with the planning-ahead question.

I would say that the GPT training process has no "inherent" p... (read more)

I think maybe our disagreement is about how good/useful of an overarching model ACT-R is? It's definitely not like in physics, where some overarching theories are widely accepted (e.g. the standard model) even by people working on much more narrow topics -- and many of the ones that aren't (e.g. string theory) are still widely known about and commonly taught. The situation in cog sci (in my view, and I think in many people's views?) is much more that we don't have an overarching model of the mind in anywhere close to the level of detail/mechanistic specifi

... (read more)

I think my post (at least the title!) is essentially wrong if there are other overarching theories of cognition out there which have similar track records of matching data. Are there?

By "overarching theory" I mean a theory which is roughly as comprehensive as ACT-R in terms of breadth of brain regions and breadth of cognitive phenomena.

As someone who has also done grad school in cog-sci research (but in a computer science department, not a psychology department, so my knowledge is more AI focused), my impression is that most psychology research isn't about... (read more)

Thanks for the thoughtful response, that perspective makes sense. I take your point that ACT-R is unique in the ways you're describing, and that most cognitive scientists are not working on overarching models of the mind like that. I think maybe our disagreement is about how good/useful of an overarching model ACT-R is? It's definitely not like in physics, where some overarching theories are widely accepted (e.g. the standard model) even by people working on much more narrow topics -- and many of the ones that aren't (e.g. string theory) are still widely k... (read more)

This lines up fairly well with how I've seen psychology people geek out over ACT-R. That is: I had a psychology professor who was enamored with the ability to line up programming stuff with neuroanatomy. (She didn't use it in class or anything, she just talked about it like it was the most mind blowing stuff she ever saw as a research psychologist, since normally you just get these isolated little theories about specific things.)

And, yeah, important to view it as a programming language which can model a bunch of stuff, but requires fairly extensive user in... (read more)

I think that's not quite fair. ACT-R has a lot to say about what kinds of processing are happening, as well. Although, for example, it does not have a theory of vision (to my limited understanding anyway), or of how the full motor control stack works, etc. So in that sense I think you are right.

What it does have more to say about is how the working memory associated with each modality works: how you process information in the various working memories, including various important cognitive mechanisms that you might not otherwise think about. In this sense, it's not just about interconnection like you said.

Jon Garcia (0 points, 10mo):
So essentially, which types of information get routed for processing to which areas during the performance of some behavioral or cognitive algorithm, and what sort of processing each module performs?

We also know how to implement it today. 

I would argue that inner alignment problems mean we do not know how to do this today. We know how to limit the planning horizon for parts of a system which are doing explicit planning, but this doesn't bar other parts of the system from doing planning. For example, GPT-3 has a time horizon of effectively one token (it is only trying to predict one token at a time). However, it probably learns to internally plan ahead anyway, just because thinking about the rest of the current sentence (at least) is useful for th... (read more)
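For reference, here is the sense in which we do "know how" to limit the planning horizon of an explicit planner: a toy bounded-lookahead sketch (the toy problem and names are made up). The worry above is that nothing analogous constrains whatever lookahead a learned component ends up implementing internally.

```python
def plan_value(state, horizon, actions, transition, reward):
    """Best total reward reachable within `horizon` explicit planning steps."""
    if horizon == 0:
        return 0.0
    return max(
        reward(state, a)
        + plan_value(transition(state, a), horizon - 1, actions, transition, reward)
        for a in actions(state)
    )

# Toy chain of states 0..5; reward 1 only for stepping onto state 5.
actions = lambda s: ("left", "right")
transition = lambda s, a: max(0, min(5, s + (1 if a == "right" else -1)))
reward = lambda s, a: 1.0 if transition(s, a) == 5 else 0.0

print(plan_value(0, 2, actions, transition, reward))  # 0.0: the goal is beyond the horizon
print(plan_value(0, 5, actions, transition, reward))  # 1.0: reachable within a 5-step horizon
```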

davidad (4 points, 10mo):
I’m curious to dig into your example.
* Here’s an experiment that I could imagine uncovering such internal planning:
  * make sure the corpus has no instances of a token “jrzxd”, then
  * insert long sequences of “jrzxd jrzxd jrzxd … jrzxd” at random locations in the middle of sentences (sort of like introns),
  * then observe whether the trained model predicts “jrzxd” with greater likelihood than its base rate (which we’d presume is because it’s planning to take some loss now in exchange for confidently predicting more “jrzxd”s to follow).
* I think this sort of behavior could be coaxed out of an actor-critic model (with hyperparameter tuning, etc.), but not GPT-3. GPT-3 doesn’t have any pressure towards a Bellman-equation-satisfying model, where future reward influences current output probabilities.
* I’m curious if you agree or disagree and what you think I’m missing.
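A sketch of the corpus-modification step of the experiment described above; the token "jrzxd" and the insertion scheme come from the comment, while the helper name and parameters here are made up:

```python
import random

def insert_intron_runs(sentences, token="jrzxd", p_insert=0.01,
                       min_run=5, max_run=20, seed=0):
    """Insert long runs of a token that never otherwise appears in the corpus
    at random positions strictly inside some sentences."""
    rng = random.Random(seed)
    augmented = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > 2 and rng.random() < p_insert:
            run = [token] * rng.randint(min_run, max_run)
            pos = rng.randint(1, len(words) - 1)  # mid-sentence, like an intron
            words = words[:pos] + run + words[pos:]
        augmented.append(" ".join(words))
    return augmented

# The question in the surrounding discussion: after training on the augmented
# corpus, does the model assign "jrzxd" a probability well above its base rate
# right after seeing a few "jrzxd" tokens, and if so, is that evidence of
# planning ahead or just of learning a local repetition pattern?
```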

Imagine a spectrum of time horizons (and/or discounting rates), from very long to very short.

Now, if the agent is aligned, things are best with an infinite time horizon (or, really, the convergently-endorsed human discounting function; or if that's not a well-defined thing, whatever theoretical object replaces it in a better alignment theory). As you reduce the time horizon, things get worse and worse: the AGI willingly destroys lots of resources for short-term prosperity.

At some point, this trend starts to turn itself around: the AGI becomes so shortsight... (read more)

Recently I have been thinking that we should in fact use "really basic" definitions, EG "knowledge is just mutual information", and also other things with a general theme of "don't make agency so complicated".  The hope is to eventually be able to build up to complicated types of knowledge (such as the definition you seek here), but starting with really basic forms. Let me see if I can explain.

First, an ontology is just an agent's way of organizing information about the world. These can take lots of forms and I'm not going to constrain it to any partic... (read more)
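For reference, the "really basic" definition meant here is just the standard one:

$$
I(X;Y) \;=\; \sum_{x,y} p(x,y) \, \log \frac{p(x,y)}{p(x)\,p(y)} .
$$

On this minimal reading, "the agent has knowledge of $Y$" initially means nothing more than that some variable $X$ on the agent's side has $I(X;Y) > 0$ (compare the screw-and-hole example in the reply below), with ontologies, goals, and everything else layered on top afterwards.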

Adam Shimi (1 point, 1y):
So, I'm trying to interpret your proposal from an epistemic strategy perspective — asking how you are trying to produce knowledge. It sounds to me like you're proposing to start with a very general formalization with simple mathematical objects (like objectivity being a sort of function, and participating in a goal increasing the measure on the states satisfying the predicate). Then, when you reach situations where the definitions are not constraining enough, like what Alex describes, you add further constraints on these objects?

I have trouble understanding how different it is from the "standard way" Alex is using of proposing a simple definition, finding where it breaks, and then trying to refine it and break it again. Rinse and repeat. Could you help me with what you feel are the main differences?
Alex Flint (2 points, 1y):
Yep I'm with you here.

Yeah I very much agree with justifying the use of 3rd person perspectives on practical grounds. Well if we are choosing to work with third-person perspectives then maybe we don't need first person perspectives at all. We can describe gravity and entropy without any first person perspectives at all, for example. I'm not against first person perspectives, but if we're working with third person perspectives then we might start by sticking to third person perspectives exclusively.

Yeah right. A screw that fits into a hole does have mutual information with the hole. I like the idea that knowledge is about the capacity to harmonize within a particular environment because it might avoid the need to define goal-directedness. The only problem is that now we have to say what a goal predicate is. Do you have a sense of how to do that? I have also come to the conclusion that knowledge has a lot to do with being useful in service of a goal, and that then requires some way to talk about goals and usefulness.

I very much resonate with keeping it as simple as possible, especially when doing this kind of conceptual engineering, which can become so lost. I have been grounding my thinking in wanting to know whether or not a certain entity in the world has an understanding of a certain phenomenon, in order to use that to overcome the deceptive misalignment problem. Do you also have go-to practical problems against which to test these kinds of definitions?

Suppose instead the crossing counterfactual results in a utility greater than -10. This seems very strange. By assumption, it's provable using the AI's proof system that (A = 'Cross' ⟹ U = −10). And the AI's counterfactual environment is supposed to line up with reality.

Right. This is precisely the sacrifice I'm making in order to solve Troll Bridge. Something like this seems to be necessary for any solution, because we already know that if your expectations of consequences entirely respect entailment, you'll fall prey to the Troll Bridge! In fact, y... (read more)

I'll talk about some ways I thought of potentially formalizing, "stop thinking if it's bad".

If your point is that there are a lot of things to try, I readily accept this point, and do not mean to argue with it. I only intended to point out that, for your proposal to work, you would have to solve another hard problem.

One simple way to try to do so is to have an agent using regular evidential decision theory but have a special, "stop thinking about this thing" action that it can take. Every so often, the agent considers taking this action using regular evide

... (read more)

You say that a "bad reason" is one that agents using the procedure would think is bad.

To elaborate a little, one way we could think about this would be that "in a broad variety of situations" the agent would think this property sounded pretty bad.

For example, the hypothetical "PA proves ⊥" would be evaluated as pretty bad by a proof-based agent, in many situations; it would not expect its future self to make decisions well, so, it would often have pretty poor performance bounds for its future self (eg the lowest utility available in the given scena... (read more)

Chantiel (0 points, 1y):
Oh, I'm sorry; you're right. I messed up on step two of my proposed proof that your technique would be vulnerable to the same problem. However, it still seems to me that agents using your technique would also be concerningly likely to fail to cross, or otherwise suffer from other problems.

Like last time, suppose ⊢ (A = 'Cross' ⟹ U = −10) and that A = 'Cross'. So if the agent decides to cross, it's either because of the chicken rule, because not crossing counterfactually results in utility ≤ -10, or because crossing counterfactually results in utility greater than -10. If the agent crosses because of the chicken rule, then this is a bad reason, so the bridge will blow up. I had already assumed that not crossing counterfactually results in utility greater than -10, so it can't be the middle case.

Suppose instead the crossing counterfactual results in a utility greater than -10. This seems very strange. By assumption, it's provable using the AI's proof system that (A = 'Cross' ⟹ U = −10). And the AI's counterfactual environment is supposed to line up with reality. So, in other words, the AI has decided to cross and has already proven that crossing entails it will get -10 utility. And if the counterfactual environment assigns greater than -10 utility, then that counterfactual environment provably, within the agent's proof system, doesn't line up with reality.

So how do you get an AI to both believe it will cross, believe crossing entails -10 utility, and still counterfactually think that crossing will result in greater than -10 utility? In this situation, the AI can prove, within its own proof system, that the counterfactual environment of getting > -10 utility is wrong. So I guess we need an agent that allows itself to use a certain counterfactual environment even though the AI already proved that it's wrong. I'm concerned about the functionality of such an agent. If it already ignores clear evidence that its counterfactual environment is wrong in reality, then that wou

Ok. This threw me for a loop briefly. It seems like I hadn't considered your proposed definition of "bad reasoning" (ie "it's bad if the agent crosses despite it being provably bad to do so") -- or had forgotten about that case.

I'm not sure I endorse the idea of defining "bad" first and then considering the space of agents who pass/fail according to that notion of "bad"; how this is supposed to work is, rather, that we critique a particular decision theory by proposing a notion of "bad" tailored to that particular decision theory. For example, if a specifi... (read more)

Chantiel (0 points, 1y):
I'm concerned that you may not realize that your own current take on counterfactuals respects logical entailment to some extent, and that, if I'm reasoning correctly, this could result in agents using it failing the Troll Bridge problem.

You said in "My current take on counterfactuals" that counterfactuals should line up with reality. That is, the action the agent actually takes should result in the utility it was said to have in its counterfactual environment. You say that a "bad reason" is one that agents using the procedure would think is bad. The counterfactuals in your approach are supposed to line up with reality, so if an AI's counterfactuals don't line up with reality, then this seems like a "bad" reason according to the definition you gave.

Now, if you let your agent think "I'll get < -10 utility if I don't cross", then it could potentially cross and not get blown up. But this seems like a very unintuitive and seemingly ridiculous counterfactual environment. Because of this, I'm pretty worried it could result in an AI with such counterfactual environments malfunctioning somehow. So I'll assume the AI doesn't have such a counterfactual environment.

Suppose acting using a counterfactual environment that doesn't line up with reality counts as a "bad" reason for agents using your counterfactuals. Also suppose that in the counterfactual environment in which the agent doesn't cross, the agent counterfactually gets more than -10 utility. Then:

1. Suppose ⊢ (A = 'Cross' ⟹ U = −10).
2. Suppose A = 'Cross'. Then if the agent crosses it must be because either it used the chicken rule or because its counterfactual environment doesn't line up with reality in this case. Either way, this is a bad reason for crossing, so the bridge gets blown up. Thus, the AI gets −10 utility.
3. Thus, ⊢ (⊢ (A = 'Cross' ⟹ U = −10)) ⟹ U = −10.
4. Thus, by Löb's theorem, ⊢ (A = 'Cross' ⟹ U = −10).

Thus, either the agent doesn't cross the bridge or it does and the bridge explodes.

You might just decide to get around this by s
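For readers tracking the proof sketch above, the Löb step is the standard one. Writing $\Box P$ for "$P$ is provable in the agent's proof system", Löb's theorem says:

$$
\text{if } \vdash \Box P \rightarrow P \text{, then } \vdash P
\qquad\big(\text{equivalently, } \vdash \Box(\Box P \rightarrow P) \rightarrow \Box P\big).
$$

Step 4 applies this with $P$ taken to be $A = \text{'Cross'} \Rightarrow U = -10$; steps 1-3 (with the assumption in step 2 discharged) supply the required hypothesis $\vdash \Box P \rightarrow P$.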

Seems fair. I'm similarly conflicted. In truth, both the generalization-focused path and the objective-focused path look a bit doomed to me.

Great, I feel pretty resolved about this conversation now.

I would further add that looking for difficulties created by the simplification seems very intellectually productive. (Solving "embedded agency problems" seems to genuinely allow you to do new things, rather than just soothing philosophical worries.) But yeah, I would agree that if we're defining mesa-objective anyway, we're already in the business of assuming some agent/environment boundary.

Edouard Harris (1 point, 1y):
Yep, strongly agree. And a good first step to doing this is to actually build as robust a simplification as you can, and then see where it breaks. (Working on it.)

(see the unidentifiability in IRL paper)

Ah, I wasn't aware of this!

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/worl... (read more)

Edouard Harris (1 point, 1y):
Oh for sure. I wouldn't recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what could be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary, draw a second boundary, investigate that; etc. And then see whether any patterns emerge that might fit an intuitive notion of agency. But the only fundamentally real object here is always going to be the whole system, absolutely. As I understand, something like AIXI forces you to draw one particular boundary because of the way the setting is constructed (infinite on one side, finite on the other). So I'd agree that sort of thing is more fragile. The multiagent setting is interesting though, because it gets you into the game of carving up your universe into more than 2 pieces. Again it would be neat to investigate a setting like this with different choices of boundaries and see if some choices have more interesting properties than others.

Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)

I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless we start to be more inclusive about what we mean by "mesa-optimizer" and "mesa-objective." I don't think those terms as defined in RFLO actually capture humans, but I definitely want to say that we're "goal-oriented" and have "intent."

But the graph structure makes perfect sense, I just am doing the mental substitution

... (read more)
Jack Koch (3 points, 1y):
This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been exploring recently. The funny thing is I am actually very unsatisfied with a purely behavioral notion of a model's objective, since a deceptive model would obviously externally appear to be a non-deceptive model in training. I just don't think there will be one part of the network we can point to and clearly interpret as being some objective function that the rest of the system's activity is optimizing. Even though I am partial to the generalization focused approach (in part because it kind of widens the goal posts with the "acceptability" vs. "give the model exactly the correct goal" thing), I still would like to have a more cognitive understanding of a system's "goals" because that seems like one of the best ways to make good predictions about how the system will generalize under distributional shift. I'm not against assuming some kind of explicit representation of goal content within a system (for sufficiently powerful systems); I'm just against assuming that that content will look like a mesa-objective as originally defined.