All of Vladimir_Nesov's Comments + Replies

Optimization at a Distance

Embedded agents have a spatial extent. If we use the analogy between physical spacetime and a domain of computation of the environment, this offers interesting interpretations of some terms.

In a domain, counterfactuals might be seen as points/events/observations that are incomparable in the specialization order, that is, points that are not in each other's logical future. Via the spacetime analogy, this is the same as the points being space-like separated. This motivates calling collections of counterfactual events logical space, in the same sense as events compar... (read more)

Against Time in Agent Models

Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that?

This was just defining/motivating terms (including "validity") for this context; the technical answer is to look at the definition of the specialization preorder when it's being suggestively called "logical time". If an open is a "proposition", and a point being contained in an open is "proposition is true at that point", and a point stronger in specialization o... (read more)
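
For reference, here is the standard definition being pointed at, stated as a small math note rather than quoted from the thread:

```latex
% Specialization preorder on a topological space (X, \tau):
% y is stronger (more specialized) than x, written x \le y, iff every open set
% containing x also contains y.
\[
  x \le y \;\iff\; \forall U \in \tau \; \big( x \in U \implies y \in U \big).
\]
% Reading opens as propositions and \le as "logical time": if an open proposition U
% holds at x (that is, x \in U) and x \le y, then U also holds at y. So open
% propositions, once true, are never invalidated later in logical time, which is
% the notion of "validity" being motivated in the comment above.
```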

Against Time in Agent Models

If you mark something like causally inescapable subsets of spacetime (not sure what these should be called), which are something like all unions of future lightcones, as open sets, then the specialization preorder on spacetime points will agree with time. This topology on spacetime is non-Fréchet (has a nontrivial specialization preorder), while the relative topologies it gives on space-like subspaces (loci of states of the world "at a given time" in a loose sense) are Hausdorff, the standard way of giving a topology for such spaces. This seems like the most strai... (read more)
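
A finite sanity check of this claim is easy to run. The sketch below (an editorial illustration, not something from the comment) builds the topology generated by future light cones on a small discrete 1+1D spacetime and confirms that its specialization preorder coincides with the causal order:

```python
# Toy check: opens are unions of future light cones on a discrete 1+1D spacetime,
# and the resulting specialization preorder agrees with the causal (lightcone) order.
from itertools import product

points = [(t, x) for t, x in product(range(3), range(-2, 3))]

def causally_precedes(p, q):
    """p comes before q: q lies in the closed future light cone of p."""
    (tp, xp), (tq, xq) = p, q
    return tq >= tp and abs(xq - xp) <= tq - tp

def future_cone(p):
    return frozenset(q for q in points if causally_precedes(p, q))

# Basic opens are the future cones; general opens are their unions (an Alexandrov-style
# topology), so checking the basic opens suffices for the specialization preorder.
basic_opens = [future_cone(p) for p in points]

def specializes(y, x):
    """y is stronger than x: every (basic) open containing x also contains y."""
    return all(y in cone for cone in basic_opens if x in cone)

assert all(specializes(y, x) == causally_precedes(x, y) for x in points for y in points)
print("specialization preorder matches the causal order on this toy spacetime")
```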

Against Time in Agent Models

I like specialization preorder as a setting for formulating these concepts. In a topological space, point y is stronger (more specialized) than point x iff all opens containing x also contain y. If opens are thought of as propositions, and specialization order as a kind of ("logical") time, with stronger points being in the future of weaker points, then this says that propositions must be valid with respect to time (that is, we want to only allow propositions that don't get invalidated). This setting motivates thinking of points not as objects of study, bu... (read more)

4 · Abram Demski · 2d
Up to here made sense. After here I was lost. Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that? You're saying a lot about what the "objects of study" are and aren't, but not very concretely, and I'm not getting the intuition for why this is important. I'm used to the idea that the points aren't really the objects of study in topology; the opens are the more central structure. But the important question for a proposed modeling language is how well it models what we're after. It seems like you are trying to do something similar to what cartesian frames and finite factored sets are doing, when they reconstruct time-like relationships from other (purportedly more basic) terms. Would you care to compare the reconstructions of time you're gesturing at to those provided by cartesian frames and/or finite factored sets?
Morality is Scary

My point is that the alignment (values) part of AI alignment is least urgent/relevant to the current AI risk crisis. It's all about corrigibility and anti-goodharting. Corrigibility is hope for eventual alignment, and anti-goodharting makes inadequacy of current alignment and imperfect robustness of corrigibility less of a problem. I gave the relevant example of relatively well-understood values, preference for lower x-risks. Other values are mostly relevant in how their understanding determines the boundary of anti-goodharting, what counts as not too weir... (read more)

3 · Wei Dai · 4mo
This seems interesting and novel to me, but (of course) I'm still skeptical. Preference for lower x-risk doesn't seem "well-understood" to me, if we include in "x-risk" things like value drift/corruption, premature value lock-in, and other highly consequential AI-enabled decisions (potential existential mistakes) that depend on hard philosophical questions. I gave some specific examples in this recent comment [https://www.lesswrong.com/posts/3e6pmovj6EJ729M2i/general-alignment-plus-human-values-or-alignment-via-human?commentId=bR7RxJc2uyzjLskcZ] . What do you think about the problems on that list? (Do you agree that they are serious problems, and if so how do you envision them being solved or prevented in your scenario?)
Vanessa Kosoy's Shortform

Goodharting is about what happens in situations where "good" is undefined or uncertain or contentious, but still gets used for optimization. There are situations where it's better-defined and situations where it's ill-defined, and an anti-goodharting agent strives to optimize only within the scope of where it's better-defined. I took "lovecraftian" as a proxy for situations where it's ill-defined, and the base distribution of a quantilization that's intended to oppose goodharting acts as a quantitative description of where it's taken as better-defined, so for this ... (read more)

2 · Vanessa Kosoy · 5mo
The proxy utility in debate is perfectly well-defined: it is the ruling of the human judge. For the base distribution I also made some concrete proposals (which certainly might be improvable but are not obviously bad). As to corrigibility, I think it's an ill-posed concept [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=5Rxgkzqr8XsBwcEQB#romyHyuhq6nPH5uJb]. I'm not sure how you imagine corrigibility in this case: AQD is a series of discrete "transactions" (debates), and nothing prevents you from modifying the AI between one and another. Even inside a debate, there is no incentive in the outer loop to resist modifications, whereas daemons would be impeded by quantilization. The "out of scope" case is also dodged by quantilization, if I understand what you mean by "out of scope". Why is it strictly more general? I don't see it. It seems false, since for extreme values of the quantilization parameter we get optimization which is deterministic and hence cannot be equivalent to quantilization with a different proxy and distribution. The reason to pick the quantilization parameter is because it's hard to determine, as opposed to the proxy and base distribution[1], for which there are concrete proposals with more-or-less clear motivation. I don't understand which "main issues" you think this doesn't address. Can you describe a concrete attack vector?

1. If the base distribution is a bounded simplicity prior then it will have some parameters, and this is truly a weakness of the protocol. Still, I suspect that safety is less sensitive to these parameters and it is more tractable to determine them by connecting our ultimate theories of AI with brain science (i.e. looking for parameters which would mimic the computational bounds of human cognition).
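
For readers following the exchange, here is a minimal sketch of quantilization in the usual sense: sample from the top q-fraction of a base distribution, with candidates ranked by the proxy utility. The actions, base probabilities, and proxy below are made-up placeholders, not AQD specifics.

```python
import random

def quantilize(actions, base_probs, proxy_utility, q, rng=random):
    """Sample from the base distribution conditioned on being in the top q-fraction
    by proxy utility. q = 1 recovers the base distribution (maximally conservative);
    q -> 0 approaches deterministic argmax of the proxy, the extreme case mentioned
    in the thread above."""
    ranked = sorted(zip(actions, base_probs),
                    key=lambda pair: proxy_utility(pair[0]), reverse=True)
    kept, mass = [], 0.0
    for action, p in ranked:          # keep actions until q base-measure is covered
        if mass >= q:
            break
        kept.append((action, min(p, q - mass)))
        mass += p
    r = rng.random() * sum(p for _, p in kept)
    for action, p in kept:            # sample proportionally within the kept slice
        r -= p
        if r <= 0:
            return action
    return kept[-1][0]

# Placeholder example (names are illustrative only):
actions = ["plan_a", "plan_b", "plan_c", "plan_d"]
base_probs = [0.4, 0.3, 0.2, 0.1]     # e.g. how likely a human is to propose each plan
proxy = {"plan_a": 1.0, "plan_b": 3.0, "plan_c": 2.0, "plan_d": 10.0}.__getitem__
print(quantilize(actions, base_probs, proxy, q=0.25))
```
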
Morality is Scary

I'm leaning towards the more ambitious version of the project of AI alignment being about corrigible anti-goodharting, with the AI optimizing towards good trajectories within scope of relatively well-understood values, preventing overoptimized weird/controversial situations, even at the cost of astronomical waste. Absence of x-risks, including AI risks, is generally good. Within this environment, the civilization might be able to eventually work out more about values, expanding the scope of their definition and thus allowing stronger optimization. Here corrigibility is in part about continually picking up the values and their implied scope from the predictions of how they would've been worked out some time in the future.

3 · Wei Dai · 5mo
Please say more about this? What are some examples of "relatively well-understood values", and what kind of AI do you have in mind that can potentially safely optimize "towards good trajectories within scope" of these values?
Vanessa Kosoy's Shortform

I'm not sure this attacks goodharting directly enough. Optimizing a system for a proxy utility moves its state out-of-distribution, where the proxy utility generalizes the training utility incorrectly. This probably holds for debate optimized towards intended objectives as much as for more concrete framings with state and utility.

Dithering across the border of goodharting (of the scope of a proxy utility) with quantilization is actionable, but isn't about defining the border or formulating legible strategies for what to do about optimization when approaching the border. ... (read more)

2 · Vanessa Kosoy · 6mo
I don't understand what you're saying here. For debate, goodharting means producing an answer which can be defended successfully in front of the judge, even in the face of an opponent pointing out all the flaws, but which is nevertheless bad. My assumption here is: it's harder to produce such an answer than producing a genuinely good (and defensible) answer. If this assumption holds, then there is a range of quantilization parameters which yields good answers. For the question of "what is a good plan to solve AI risk", the assumption seems solid enough since we're not worried about coming across such deceptive plans on our own, and it's hard to imagine humans producing one even on purpose. To the extent our search for plans relies mostly on our ability to evaluate arguments and find counterarguments, it seems like the difference between the former and the latter is not great anyway. This argument is especially strong if we use human debaters as baseline distribution, although in this case we are vulnerable to same competitiveness problem as amplified-imitation, namely that reliably predicting rich outputs might be infeasible. For the question of "should we continue changing the quantilization parameter", the assumption still holds because the debater arguing to stop at the given point can win by presenting a plan to solve AI risk which is superior to continuing to change the parameter.
P₂B: Plan to P₂B Better

more planners

This seems tenuous compared to "more planning substrate". Achieving redundancy and effectiveness specifically by setting up a greater number of individual planners, even if they are coordinated, is likely an inferior plan. There are probably better uses of hardware that don't have this particular shape.

My take on Vanessa Kosoy's take on AGI safety

I'd say alignment should be about values, so only your "even better alignment" qualifies. The non-agentic AI safety concepts like corrigibility, which might pave the way to aligned systems if controllers manage to keep their values throughout the process, are not themselves examples of alignment.

The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument

Sleeping Beauty and other anthropic problems considered in terms of bets illustrate how most ways of assigning anthropic probabilities are not about beliefs of fact in a general sense; their use is more of an appeal to consequences. At the very least, the betting setup should remain a salient companion to these probabilities whenever they are produced. Anthropic probabilities make no more sense on their own, without the utilities, than whatever arbitrary numbers you get after applying a Bolker-Jeffrey rotation. The main difference is in how the utilities of anth... (read more)
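
As a worked example of the point that these probabilities only make sense together with the betting setup, here is the standard Sleeping Beauty calculation (an editorial illustration, not part of the truncated comment): the break-even price of a bet on Tails, i.e. the number that looks like a probability, depends on whether bets are settled per awakening or per experiment.

```python
# Sleeping Beauty: a fair coin is flipped; Heads -> 1 awakening, Tails -> 2 awakenings
# (with memory erased in between). At each awakening Beauty may buy a ticket that pays
# 1 if the coin landed Tails. The break-even ticket price depends on the scoring rule.
P_HEADS, P_TAILS = 0.5, 0.5
AWAKENINGS = {"heads": 1, "tails": 2}

def break_even_price(per_awakening: bool) -> float:
    """Ticket price at which the expected profit of always buying is zero."""
    if per_awakening:
        # Every awakening is a separate transaction.
        tickets_bought = P_HEADS * AWAKENINGS["heads"] + P_TAILS * AWAKENINGS["tails"]
        expected_payout = P_TAILS * AWAKENINGS["tails"]
    else:
        # The bet is settled once per experiment, however many awakenings occur.
        tickets_bought = 1.0
        expected_payout = P_TAILS
    return expected_payout / tickets_bought

print(break_even_price(per_awakening=True))    # 2/3: the "thirder" price for Tails
print(break_even_price(per_awakening=False))   # 1/2: the "halfer" price for Tails
```

The same credence question yields different "probabilities" purely from the payoff structure, which is the sense in which the betting setup, rather than a free-standing belief, is doing the work.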

Oracle predictions don't apply to non-existent worlds

The prediction is why you grab your coat; it's both meaningful and useful to you, a simple counterexample to the sentiment that since the correctness scope of predictions is unclear, they are no good. The prediction is not about the coat, but that dependence wasn't mentioned in the arguments above against the usefulness of predictions.

Oracle predictions don't apply to non-existent worlds

IMO, either the Oracle is wrong, or the choice is illusory

This is similar to determinism vs. free will, and suggests the following example. The Oracle proclaims: "The world will follow the laws of physics!". But in the counterfactual where an agent takes a decision that won't actually be taken, the fact of taking that counterfactual decision contradicts the agent's cognition following the laws of physics. Yet we want to think about the world within the counterfactual as if the laws of physics are followed.

Oracle predictions don't apply to non-existent worlds

No, I suspect it's a correct ingredient of counterfactuals, one I didn't see discussed before, not an error restricted to a particular decision theory. There is no contradiction in considering each of the counterfactuals as having a given possible decision made by the agent and satisfying the Oracle's prediction, as the agent doesn't know that it won't make this exact decision. And if it does make this exact decision, the prediction is going to be correct, just like the possible decision indexing the counterfactual is going to be the decision actually take... (read more)

1 · Dagon · 8mo
Thanks for patience with this. I am still missing some fundamental assumption or framing about why this is non-obvious (IMO, either the Oracle is wrong, or the choice is illusory). I'll continue to examine the discussions and examples in hopes that it will click.
Oracle predictions don't apply to non-existent worlds

Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?

The agent knows what "correct" means: correctness of a claim is defined for the possible worlds that the agent is considering while making its decision (which by local tradition we confusingly collectively call "counterfactuals", even though one of them is generated by the actual decision and isn't contrary to any fact).

In the post Chris_Leong draws attention to the point that since the Oracle knows wh... (read more)

1 · Chris_Leong · 8mo
That's an interesting point. I suppose it might be viable to acknowledge that the problem taken literally doesn't require the prediction to be correct outside of the factual, but nonetheless claim that we should resolve the vagueness inherent in the question about what exactly the counterfactual is by constructing it to meet this condition. I wouldn't necessarily be strongly against this - my issue is confusion about what an Oracle's prediction necessarily entails. Regarding, your notion about things being magically stipulated, I suppose there's some possible resemblance there with the ideas I proposed before in Counterfactuals As A Matter of Social Convention [https://www.lesswrong.com/posts/9rtWTHsPAf2mLKizi/counterfactuals-as-a-matter-of-social-convention] , although The Nature of Counterfactuals [https://www.lesswrong.com/posts/T4Mef9ZkL4WftQBqw/the-nature-of-counterfactuals] describes where my views have shifted to since then.
1 · Dagon · 8mo
Hmm. So does this only apply to CDT agents, who foolishly believe that their decision is not subject to predictions?
Oracle predictions don't apply to non-existent worlds

No, this is an important point: the agent normally doesn't know the correctness scope of the Oracle's prediction. It's only guaranteed to be correct on the actual decision, and can be incorrect in all other counterfactuals. So if the agent knows the boundaries of the correctness scope, they may play chicken and render the Oracle wrong by enacting the counterfactual where the prediction is false. And if the agent doesn't know the boundaries of the prediction's correctness, how are they to make use of it in evaluating counterfactuals?

It seems that the way to... (read more)

1 · Dagon · 8mo
Is there an ELI5 doc about what's "normal" for Oracles, and why they're constrained in that way? The examples I see confuse me in that they are exploring what seem like edge cases, and I'm missing the underlying model that makes these cases critical. Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?
Oracle predictions don't apply to non-existent worlds

When the Oracle says "The taxi will arrive in one minute!", you may as well grab your coat.

1 · Chris_Leong · 8mo
Isn't that prediction independent of your decision to grab your coat or not?
1 · Dagon · 8mo
Sure, that's a sane Oracle. The Weird Oracle used in so many thought experiments doesn't say "The taxi will arrive in one minute!", it says "You will grab your coat in time for the taxi."
Oracle predictions don't apply to non-existent worlds

It's not actually more general; it's instead about a somewhat different point. The more general statement could use some sort of notion of relative actuality, to point at the possibly counterfactual world determined by the decision made in the world where the prediction was delivered. This is distinct from the even more counterfactual worlds where the prediction was delivered but the decision was different from what it would relative-actually be had the prediction been delivered, and from the worlds where the prediction was not delivered at all.

If the p... (read more)

Oracle predictions don't apply to non-existent worlds

Consider the variant where the Oracle demands a fee of 100 utilons after delivering the prediction, which you can't refuse. Then the winning strategy is going to be about ensuring that the current situation is counterfactual, so that in actuality you won't have to pay the Oracle's fee, because the Oracle wouldn't be able to deliver a correct prediction.

The Oracle's prediction only has to apply to the world that is. It doesn't have to apply to worlds that are not.

The Oracle's prediction only has to apply to the world where the prediction is delivered. I... (read more)

1 · Chris_Leong · 8mo
"The Oracle's prediction only has to apply to the world where the prediction is delivered" - My point was that predictions that are delivered in the factual don't apply to counterfactuals, but the way you've framed it is better as it handles a more general set of cases. It seems like we're on the same page.
Can you control the past?

One way of noticing the Son-of-CDT issue dxu mentioned is thinking of CDT as not just being unable to control the events outside the future lightcone, but as not caring about the events outside the future lightcone. So even if it self-modifies, it's not going to accept tradeoffs between the future and not-the-future of the self-modification event, as that would involve changing its preference (and somehow reinventing preference for the events it didn't care about just before the self-modification event).

With time, CDT continually becomes numb to events out... (read more)

Can you control the past?

An agent's policy determines how its instances act, but in general it also determines which instances exist, and that motivates thinking of the agent as the algorithm channeled by its instances, rather than as one of the instances controlling the others, or as all instances controlling each other. For example, in Newcomb's problem, you might be sitting inside the box with the $1M, and if you two-box, you have never existed. Grandpa decides to only have children if his grandchildren one-box. Or some copies in distant rooms numbered (on the outside) 1 to 5 writing i... (read more)

paulfchristiano's Shortform

The point is that in order to be useful, a prediction/reasoning process should contain mesa-optimizers that perform decision making similar in a value-laden way to what the original humans would do. The results of the predictions should be determined by decisions of the people being predicted (or of people sufficiently similar to them), in the free-will-requires-determinism/you-are-part-of-physics sense. The actual cognitive labor of decision making needs to in some way be an aspect of the process of prediction/reasoning, or it's not going to be good enoug... (read more)

paulfchristiano's Shortform

I see getting safe and useful reasoning about exact imitations as a weird special case, or maybe a reformulation, of X-and-only-X efficient imitation. Anchoring to exact imitations in particular makes accurate prediction more difficult than it needs to be, as it's not the thing we care about; there are many irrelevant details that influence outcomes that accurate predictions would need to take into account. So a good "prediction" is going to be value-laden, with concrete facts about actual outcomes of setups built out of exact imitations being unimportant, w... (read more)

3 · Paul Christiano · 1y
It seems to me like "Reason about a perfect emulation of a human" is an extremely similar task to "reason about a human"; to me it does not feel closely related to X-and-only-X efficient imitation. For example, you can make calibrated predictions about what a human would do using vastly less computing power than a human (even using existing techniques), whereas perfect imitation likely requires vastly more computing power.
paulfchristiano's Shortform

The upside of humans in reality is that there is no need to figure out how to make efficient imitations that function correctly (as in X-and-only-X). To be useful, imitations should be efficient, which exact imitations are not. Yet for the role of building blocks of alignment machinery, imitations shouldn't have important systematic tendencies not found in the originals, and their absence is only clear for exact imitations (if not put in very unusual environments).

Suppose you already have an AI that interacts with the world, protects it from dangerous AIs,... (read more)

3 · Paul Christiano · 1y
I think the biggest difference is between actual and hypothetical processes of reflection. I agree that an "actual" process of reflection would likely ultimately involve most humans migrating to emulations for the speed and other advantages. (I am not sure that a hypothetical process necessarily needs efficient imitations, rather than AI reasoning about what actual humans---or hypothetical slow-but-faithful imitations---might do.)
Saving Time

My impression is that all this time business in decision making is more of an artifact of computing solutions to constraint problems (unlike in physics, where it's actually an important concept). There is a process of computation that works with things such as propositions about the world, which are sometimes events in the physical sense, and the process goes through these events in the world in some order, often against physical time. But it's more like the construction of Kleene fixpoints, or some more elaborate thing like tracing statements in a control flow... (read more)
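
For concreteness, the Kleene-fixpoint picture can be sketched as a standard least-fixpoint iteration, here computing reachability over a made-up control flow graph (a generic textbook construction, not anything specific from the comment):

```python
# Least fixpoint by Kleene iteration: start at the bottom element and apply a monotone
# operator until nothing changes. The order in which nodes get picked up is an artifact
# of the computation, not a temporal property of the graph being analyzed.
def least_fixpoint(f, bottom):
    current = bottom
    while True:
        nxt = f(current)
        if nxt == current:
            return current
        current = nxt

# Made-up control flow graph: node -> successors.
cfg = {"entry": {"a"}, "a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()}

def step(reached):
    """Monotone operator: the entry node plus one-step successors of reached nodes."""
    return frozenset({"entry"}) | frozenset(s for n in reached for s in cfg[n])

print(sorted(least_fixpoint(step, frozenset())))   # ['a', 'b', 'c', 'd', 'entry']
```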

Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

The problem of figuring out preference without wireheading seems very similar to the problem of maintaining factual knowledge about the world without suffering from appeals to consequences. In both cases a specialized part of agent design (model of preference or model of a fact in the world) has a purpose (accurate modeling of its referent) whose pursuit might be at odds with consequentialist decision making of the agent as a whole. The desired outcome seems to involve maintaining integrity of the specialized part, resisting corruption of consequentialist ... (read more)

Why You Should Care About Goal-Directedness

Yeah, that was sloppy of the article. In context, the quote makes a bit of sense, and the qualifier "in every detail" does useful work (though I don't see how to make the argument clear just by defining what these words mean), but without context it's invalid.

1 · Adam Shimi · 2y
Sorry for my last comment, it was more a knee-jerk reaction than a rational conclusion. My issue here is that I'm still not sure of what would be a good replacement for the above quote, that still keeps intact the value of having compressed representations of systems following goals. Do you have an idea?
Why You Should Care About Goal-Directedness

Having an exact model of the world that contains the agent doesn't require any explicit self-references or references to the agent. For example, if there are two programs whose behavior is equivalent, A and A', and the agent correctly thinks of itself as A, then it can also know the world to be a program W(A') with some subexpressions A', but without subexpression A. To see the consequences of its actions in this world, it would be useful for the agent to figure out that A is equivalent to A', but it is not necessary that this is known to the agent from th... (read more)

1 · Adam Shimi · 2y
Thanks for additional explanations. That being said, I'm not an expert on Embedded Agency, and that's definitely not the point of this post, so just writing stuff that are explicitly said in the corresponding sequence is good enough for my purpose. Notably, the section [https://www.lesswrong.com/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN#3__Embedded_world_models] on Embedded World Models from Embedded Agency [https://www.lesswrong.com/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN] begins with: Maybe that's not correct/exact/the right perspective on the question. But once again, I'm literally giving a two sentence explanations of what the approach says, not the ground truth or a detailed investigation of the subject.
Why You Should Care About Goal-Directedness

The quote sounds like an argument for the non-existence of quines, or of the context in which things like the diagonalization lemma are formulated. I think it obviously sounds like this, so raising a nonspecific concern in my comment above should've been enough to draw attention to this issue. It's also not a problem Agent Foundations explores, but it's presented as such. Given your background and the effort put into the post, this interpretation of the quote seems unlikely (which is why I didn't initially clarify, to give you the first move). So I'm confused. Everyth... (read more)

1 · Adam Shimi · 2y
I do appreciate you pointing out this issue, and giving me the benefit of the doubt. That being said, I prefer that comments clarify the issue raised, if only so that I'm more sure of my interpretation. The up and downvotes in this thread are I think representative of this preference (not that I downvoted your post -- I was glad for feedback). About the quote itself, rereading it and rereading Embedded Agency [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN], I think you're right about what I write not being an Agents Foundation problem (at least not one I know of). What I had in mind was more about non-realizability [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN#3_1__Realizability] and self-reference [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN#3_2__Self_reference] in the context of decision/game theory. I seem to have mixed the two with naive Gödelian self-reference in my head at the time of writing, which resulted in this quote. Do you think that this proposed change solves your issues? "This has many ramifications, including non-realizability [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN#3_1__Realizability] (the impossibility of the agent to contain an exact model of the world, because it is inside the world and thus smaller), self-referential issues [self-reference] in the context of game theory (because the model is part of the agent which is part of the world, other agents can access it and exploit it), and the need to find an agent/world boundary (as it's not given for free like in the dualistic perspective)."
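
As a side note on the quines mentioned in the comment above: they do exist, and a two-line example shows that exact self-representation does not force an infinite regress. This is a generic Python quine offered as an illustration, not something from the thread.

```python
# The two lines below form a quine: running them prints exactly those two lines.
# There is no regress because the program stores a template of itself once and fills
# it in with its own text, rather than containing a copy of a copy of a copy.
s = 's = {!r}\nprint(s.format(s))'
print(s.format(s))
```
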
Why You Should Care About Goal-Directedness

Trouble comes from self-reference: since the agent is part of the world, so is its model, and thus a perfect model would need to represent itself, and this representation would need to represent itself, ad infinitum. So the model cannot be exact.

???

5 · Adam Shimi · 2y
What's the issue?
Vanessa Kosoy's Shortform

I agree. But GPT-3 seems to me like a good estimate for how much compute it takes to run stream of consciousness imitation learning sideloads (assuming that learning is done in batches on datasets carefully prepared by non-learning sideloads, so the cost of learning is less important). And with that estimate we already have enough compute overhang to accelerate technological progress as soon as the first amplified babbler AGIs are developed, which, as I argued above, should happen shortly after babblers actually useful for automation of human jobs are deve... (read more)

2 · Vanessa Kosoy · 2y
Another thing that might happen is a data bottleneck. Maybe there will be a good enough dataset to produce a sideload that simulates an "average" person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are a lot less competent AI researchers than people in general). Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That's because, unless you're training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your "episodes". Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length. In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. And, simulating progress requires by design going off-distribution along certain dimensions which might make things worse.
Vanessa Kosoy's Shortform

I was arguing that near-human-level babblers (including the imitation plateau you were talking about) should quickly lead to human-level AGIs by amplification via stream-of-consciousness datasets, which doesn't pose new ML difficulties other than the design of the dataset. Superintelligence follows from that by any of the same arguments as for uploads leading to AGI (much faster technological progress; if amplification/distillation of uploads is useful straight away, we get there faster, but it's not necessary). And amplified babblers should be stronger than v... (read more)

1 · Vanessa Kosoy · 2y
The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhang, whereas success at capturing all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an "intrinsic" AGI might lead to the classical "foom" scenario. EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.
Vanessa Kosoy's Shortform

To me this seems to be essentially another limitation of the human Internet archive dataset: reasoning is presented in an opaque way (most slow/deliberative thoughts are not in the dataset), so it's necessary to do a lot of guesswork to figure out how it works. A better dataset both explains and summarizes the reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3 can do that to an extent by roleplaying Feynman).

Any algorithm can be represented by a habit of thought (Turing machine style if you must), and if those are in the dataset,... (read more)

1 · Vanessa Kosoy · 2y
I don't see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior "stream of thought". In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of thought that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.
Vanessa Kosoy's Shortform

This seems similar to gaining uploads prior to AGI, and opens up all those superorg upload-city amplification/distillation constructions which should get past human level shortly after. In other words, the limitations of the dataset can be solved by amplification as soon as the AIs are good enough to be used as building blocks for meaningful amplification, and something human-level-ish seems good enough for that. Maybe even GPT-n is good enough for that.

1 · Vanessa Kosoy · 2y
That is similar to gaining uploads (borrowing terminology from Egan, we can call them "sideloads"), but it's not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you're distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that's not better than plain speedup.
Towards a mechanistic understanding of corrigibility

I agree that exotic decision algorithms or preference transformations are probably not going to be useful for alignment, but I think this kind of activity is currently more fruitful for theory building than directly trying to get decision theory right. It's just that the usual framing is suspect: instead of being framed as exploration of the decision theory landscape via clearly broken/insane-acting/useless but not yet well-understood constructions, these things are pitched (and chosen) for their perceived use in alignment.

1 · David Krueger · 2y
What do you mean "these things"? Also, to clarify, when you say "not going to be useful for alignment", do you mean something like "...for alignment of arbitrarily capable systems"? i.e. do you think they could be useful for aligning systems that aren't too much smarter than humans?
A Critique of Functional Decision Theory

By the way, selfish values seem related to the reward vs. utility distinction. An agent that pursues a reward that's about particular events in the world, rather than a more holographic valuation, seems more like a selfish agent in this sense than a maximizer of a utility function with small-in-space support. If a reward-seeking agent looks for reward-channel-shaped patterns instead of the instance of a reward channel in front of it, it might tile the world with reward channels or search the world for more of them, or something like that.

Formalising decision theory is hard

I was never convinced that "logical ASP" is a "fair" problem. I once joked with Scott that we can consider a "predictor" that is just the single line of code "return DEFECT" but in the comments it says "I am defecting only because I know you will defect."

I'm leaning this way as well, but I think it's an important clue to figuring out commitment races. ASP Predictor, DefectBot, and a more general agent will make different commitments, and these things are already algorithms specialized for certain situations. How is the chosen commitment related to what... (read more)
Why we need a *theory* of human values

Yes, that's the almost fully general counterargument: punt all the problems to the wiser versions of ourselves.

It's not clear what the relevant difference is between then and now, so the argument that it's more important to solve a problem now is as suspect as the argument that the problem should be solved later.

How are we currently in a better position to influence the outcome? If we are, then the reason for being in a better position is a more important feature of the present situation than object-level solutions that we can produce.

1 · Stuart Armstrong · 3y
We have a much clearer understanding of the pressures we are under now than of what pressures simulated versions of ourselves would be under in the future. Also, we agree much more strongly with the values of our current selves than with the values of possible simulated future selves. Consequently, we should try and solve early the problems with value alignment, and punt technical problems to our future simulated selves. It's not particularly a question of influencing the outcome, but of reaching the right solution. It would be a tragedy if our future selves had great influence, but pernicious values.
Two Neglected Problems in Human-AI Safety

I worry that in the context of corrigibility it's misleading to talk about alignment, and especially about utility functions. If alignment characterizes goals, it presumes a goal-directed agent, but a corrigible AI is probably not goal-directed, in the sense that its decisions are not chosen according to their expected value for a persistent goal. So a corrigible AI won't be aligned (neither will it be misaligned). Conversely, an agent aligned in this sense can't be visibly corrigible, as its decisions are determined by its goals, not orders and wishes of... (read more)
Three AI Safety Related Ideas

I thought the point of idealized humans was to avoid problems of value corruption or manipulation

Among other things, yes.

which makes them better than real ones

This framing loses the distinction I'm making. More useful when taken together with their environment, but not necessarily better in themselves. These are essentially real humans that behave better because of the environments where they operate and the lack of direct influence from the outside world, which in some settings could also apply to the environment where they were raised. But they share the... (read more)
1 · Rohin Shah · 3y
Yeah, I agree with all of this. How would you rewrite my sentence/paragraph to be clearer, without making it too much longer?
Three AI Safety Related Ideas

If it's too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones.

My interpretation of how the term is used here and elsewhere is that idealized humans are usually, in themselves and when we ignore costs, worse than real ones. For example, they could be based on predictions of human behavior that are not quite accurate, or they may only remain sane for an hour of continuous operation from some initial state. They are only better because they can... (read more)
1 · Rohin Shah · 3y
Perhaps Wei Dai could clarify, but I thought the point of idealized humans was to avoid problems of value corruption or manipulation, which makes them better than real ones. I agree that idealized humans have the benefit of making things like infinite HCH possible, but that doesn't seem to be a main point of this post.
Why we need a *theory* of human values

More to the point, these failure modes are ones that we can talk about from outside

So can the idealized humans inside a definition of indirect normativity, which motivates them to develop some theory and then quarantine parts of the process to examine their behavior from outside the quarantined parts. If that is allowed, any failure mode that can be fixed by noticing a bug in a running system becomes anti-inductive: if you can anticipate it, it won't be present.

3 · Stuart Armstrong · 3y
Yes, that's the almost fully general counterargument: punt all the problems to the wiser versions of ourselves. But some of these problems are issues that I specifically came up with. I don't trust that idealised non-mes would necessarily have realised these problems even if put in that idealised situation. Or they might have come up with them too late, after they had already altered themselves. I also don't think that I'm particularly special, so other people can and will think up problems with the system that hadn't occurred to me or anyone else. This suggests that we'd need to include a huge amount of different idealised humans in the scheme. Which, in turn, increases the chance of the scheme failing due to social dynamics, unless we design it carefully ahead of time. So I think it is highly valuable to get a lot of people thinking about the potential flaws and improvements for the system before implementing it. That's why I think that "punting to the wiser versions of ourselves" is useful, but not a sufficient answer. The better we can solve the key questions ("what are these 'wiser' versions?", "how is the whole setup designed?", "what questions exactly is it trying to answer?"), the better the wiser ourselves will be at their tasks.
Intuitions about goal-directed behavior

Learning how to design goal-directed agents seems like an almost inevitable milestone on the path to figuring out how to safely elicit human preference in an actionable form. But the steps involved in eliciting and enacting human preference don't necessarily make use of a concept of preference or goal-directedness. An agent with a goal aligned with the world can't derive its security from the abstraction of goal-directedness, because the world determines that goal, and so the goal is vulnerable to things in the world, including human error. Only self-conta... (read more)
Intuitions about goal-directed behavior

My guess is that agents that are not primarily goal-directed can be good at defending against goal-directed agents (especially with first mover advantage, preventing goal-directed agents from gaining power), and are potentially more tractable for alignment purposes, if humans coexist with AGIs during their development and operation (rather than only exist as computational processes inside the AGI's goal, a situation where a goal concept becomes necessary).

I think the assumption that useful agents must be goal-directed has misled a lot of discussion of AI r... (read more)
1 · Wei Dai · 3y
I think I disagree with this at least to some extent. Humans are not generally safe agents, and in order for not-primarily-goal-directed AIs to not exacerbate humans' safety problems (for example by rapidly shifting their environments/inputs out of a range where they are known to be relatively safe), it seems that we have to solve many of the same metaethical/metaphilosophical problems that we'd need to solve to create a safe goal-directed agent. I guess in some sense the former has lower "AI risk" than the latter in that you can plausibly blame any bad outcomes on humans instead of AIs, but to me that's actually a downside because it means that AI creators can more easily deny their responsibility to help solve those problems.
Impact Measure Desiderata

I worry there might be leaks in logical time that let the agent choose an action that takes into account that an impactful action will be denied. For example, a sub-agent could be built so that it's a maximizer that's not constrained by an impact measure. The sub-agent then notices that to maximize its goal, it must constrain its impact, or else the main agent won't be allowed to create it. And so it will so constrain its impact and will be allowed to be created, as a low-impact and maximally useful action of the main agent. It's sort of a daemon, but with... (read more)
1 · Alex Turner · 4y
That's a really interesting point. I'd like to think about this more, but one preliminary intuition I have against this (and any general successor creation by AUP, really) being the best action is that making new agents aligned with your goals is instrumentally convergent. This could add a frictional cost so that the AUP agent would be better off just doing the job itself. Perhaps we could also stop this via approval incentives, which might tip the scales enough?
Impact Measure Desiderata

It could as easily be "do this one slightly helpful thing", an addition on top of doing nothing. It doesn't seem like there is an essential distinction between such different framings of the same outcome that intent verification can capture.

1 · Alex Turner · 4y
Whether these granular actions exist is also an open question I listed. I don’t see why some version of IV won’t be able to get past this, however. There seems to be a simple class of things the agent does to get around an impact measure that it wouldn’t do if it were just trying to pursue a goal to the maximum extent. It might be true that the things the agent does to get around it are also slightly helpful for the goal, but probably not as helpful as the most helpful action.
Impact Measure Desiderata

I was talking about what I understand the purpose/design of intent verification to be, not specifically the formalizations you described. (I don't think it's particularly useful to work out the details without a general plan or expectation of important technical surprises.)

1 · Alex Turner · 4y
If you decompose the creation of such an agent, some of those actions are wasted effort in the eyes of a pure u_A maximizer ("don't help me too much"). So, the logic goes, they really aren't related to u_A, but rather to skirting the impact measure, and should therefore be penalized.
Impact Measure Desiderata

It's Rice's theorem, though really more about conceptual ambiguity. We can talk about particular notions of agents or goals, but it's never fully general, unless we by construction ensure that unexpected things can't occur. And even then it's not what we would have wanted the notions of agents or goals to be, because it's not clear what that is.

Intent verification doesn't seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness, especially after o... (read more)

1 · Alex Turner · 4y
Are you talking about granular actions, or coarse actions? The acceptable way to do IV for the latter is an open question, although I point out preliminary approaches.
Impact Measure Desiderata

Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.

It'll replenish the resources it takes, help any maximizer it impedes so as to exactly cancel out the impediment etc.

Suppose that an arbitrary maximizer could not co-opt this new agent - its ability to achieve goals is decreased compared to if it hadn’t activated the agent.

To the extent its existence could pose a problem for another agent (according to the measure, which can't really talk about goals of ... (read more)

1 · Alex Turner · 4y
Then it specifically isn’t allowed by intent verification. Are these your priors, or posteriors after having read my post? Because AUP is about the goals of arbitrary agents.
Impact Measure Desiderata

The sub-agent in this scenario won't be helping the main agent with achieving any goals. It only hides. Its nature precludes any usefulness. It's no more useful than its absence. But for the main agent, it's as easy to set up as its absence. And there might be reasons for this balance to be broken in favor of creating the sub-agent.

1 · Alex Turner · 4y
That isn't how AUP works. Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point. Suppose that an arbitrary maximizer could not co-opt this new agent: its ability to achieve goals is decreased compared to if it hadn't activated the agent. On the other hand, if it can co-opt it, its ability is increased. This is not to mention the large amount of resources that would be used by such an expansive sub-agent, nor the fact that intent verification seemingly would not allow such a sub-agent to be built. I discuss this kind of thing in several places in the comments, if you're interested.
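
Since the thread turns on how AUP treats changes to an arbitrary maximizer's power, here is a rough sketch of an attainable-utility-style penalty as I read it being described above; this is a paraphrase with placeholder Q-functions, and the details of the actual AUP formalization may differ.

```python
# Sketch: an action is penalized in proportion to how much it changes the attainable
# value of a set of auxiliary goals, relative to doing nothing. The Q-functions below
# are placeholders standing for "attainable value of auxiliary goal i after acting".
from typing import Callable, Sequence

def impact_penalty(state, action, noop, q_aux: Sequence[Callable]) -> float:
    """Average absolute change in attainable auxiliary value, compared to the no-op."""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_aux) / len(q_aux)

def penalized_score(state, action, noop, task_reward: Callable,
                    q_aux: Sequence[Callable], lam: float) -> float:
    """What the low-impact agent would maximize: task reward minus scaled penalty."""
    return task_reward(state, action) - lam * impact_penalty(state, action, noop, q_aux)
```

On this reading, unleashing the hiding sub-agent discussed above gets penalized whenever it moves the attainable values of the auxiliary goals in either direction, which is the "increase or decrease the power" point.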