johnswentworth

Comments

Egan's Theorem?

I'd expect Turing machines to be a bad way to model this. They're inherently black-boxy; the only "structure" they make easy to work with is function composition. The sort of structures relevant here don't seem like they'd care much about function boundaries. (This is why I use models like these as my default models of computation these days.)

Anyway, yeah, I'm still not sure what the "relationship" should be, and it's hard to formulate in a way that seems to capture the core idea.

Egan's Theorem?

It seems like there shouldn't be a guaranteed relationship that's much simpler than reconstructing the data and recomputing the inferred point particles.

Yeah, I'm claiming exactly the opposite of this. When the old theory itself has some simple structure (e.g. classical mechanics), there should be a guaranteed relationship that's much simpler than reconstructing the data and recomputing the inferred point particles.

One possible formulation: if I find that a terabyte of data compresses down to a gigabyte, and then I find a different model which compresses it down to 500MB, there should be a relationship between the two models which can be expressed without expanding out the whole terabyte. (Or, if there isn't such a relationship, that means the two models are capturing different patterns from the data, and there should exist another model which compresses the data more than either by capturing the patterns found by both models.)
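
A minimal sketch of how that might be phrased in minimum-description-length terms (my notation, and a conjecture rather than a theorem): write $L_A(x) = L(A) + L(x \mid A) \approx 1\text{GB}$ and $L_B(x) = L(B) + L(x \mid B) \approx 500\text{MB}$ for the two two-part codes on the terabyte of data $x$. The claim is that either the description length of one model given the other, e.g. $L(A \mid B)$, is small compared to $|x|$ (i.e. a relationship expressible without expanding the data), or else $A$ and $B$ are exploiting different regularities in $x$ and there should be a combined model $C$ with $L_C(x) < \min(L_A(x), L_B(x))$.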

Egan's Theorem?

there is no ironclad guarantee of properties continuing

Properties continuing is not what I'm asking about. The example in the OP is relevant: even if the entire universe undergoes some kind of phase change tomorrow and the macroscopic physical laws change entirely, it would still be true that the old laws did work before the phase change, and any new theory needs to account for that in order to be complete.

nor any guarantee that there will be a simple mapping between theories

I do not know of any theorem or counterexample which actually says this. Do you?

simple properties can be expected (in a probabilistic sense) to generalize even if the model is incomplete

Similar issue to "no ironclad guarantee of properties continuing": I'm not asking about properties generalizing to other parts of the environment, I'm asking about properties generalizing to any theory or model which describes the environment.

Basic Inframeasure Theory

If we're imposing condition 5, then why go to all the trouble of talking about sa-measures, rather than just talking about a-measures from the start? Why do we need that extra generality?

Basic Inframeasure Theory

A positive functional for $M_{sa}(X)$ is a continuous linear function $M_{\pm}(X) \oplus \mathbb{R} \to \mathbb{R}$ that is nonnegative everywhere on $M_{sa}(X)$.

I got really confused by this in conjunction with proposition 1. A few points of confusion:

  • The decomposition of $m$ into $m^+ + m^-$ rather than $m^+ - m^-$. I'm sure this is standard somewhere, but I had to read back a ways to realize that $m^-$ is negative in the constraint $b + m^-(1) \geq 0$. (See the sketch below.)
  • This does not match wikipedia's definition of a positive linear functional; that only requires that the functional be positive on the positive elements of the underlying space.
  • We seem to be talking about affine functions, not linear functions, but then Theorem 1 works around that by throwing in a constant term.
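
A minimal sketch of the decomposition point, assuming I've reconstructed the notation correctly: the standard Jordan decomposition writes a signed measure as $m = m^+ - m^-$ with $m^+ \geq 0$ and $m^- \geq 0$, whereas here the convention seems to be $m = m^+ + m^-$ with $m^+ \geq 0$ and $m^- \leq 0$. Under that convention $m^-(1) \leq 0$, so the condition $b + m^-(1) \geq 0$ just says that the total negative mass in $m$ is at most $b$.
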
Radical Probabilism

Ah, I see. Made sense on a second read. Thanks.

Alignment By Default

Try to clarify here, do you think the problems brought up in these answers are the main problems of alignment?

Mostly no. I've been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).

Using GPT-like systems to simulate alignment researchers' writing is a probably-safer use-case, but it still runs into the core catch-22. Either:

  • It writes something we'd currently write, which means no major progress (since we don't currently have solutions to the major problems and therefore can't write down such solutions), or
  • It writes something we currently wouldn't write, in which case it's out-of-distribution and we have to worry about how it's extrapolating us

I generally expect the former to mostly occur by default; the latter would require some clever prompts.

I could imagine at least some extrapolation of progress being useful, but it still seems like the best way to make human-simulators more useful is to improve our own understanding, so that we're more useful to simulate.

Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal.

This sounds like a great tool to have. It's exactly the sort of thing which is probably marginally useful. It's unlikely to help much on the big core problems; it wouldn't be much use for identifying unknown unknowns which nobody has written about before. But it would very likely help disseminate ideas, and be net-positive in terms of impact.
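
For concreteness, here's a minimal sketch of how such a tool could be prototyped with off-the-shelf embeddings. The model name, the example proposal, and the placeholder passages are all assumptions of mine, and similarity-ranking is only a cheap stand-in for "most likely to represent a valid objection":

```python
# Sketch: rank archive passages by semantic similarity to a proposal.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

proposal = "Train an oracle AI and only ask it yes/no questions."  # hypothetical input
passages = [
    "Oracle designs still face the problem that answers optimized for approval can manipulate the asker.",
    "Boxing proposals tend to fail once the system can model its operators.",
]  # stand-in for contiguous chunks of the AF/LW archives

prop_emb = model.encode(proposal, convert_to_tensor=True)
pass_embs = model.encode(passages, convert_to_tensor=True)

scores = util.cos_sim(prop_emb, pass_embs)[0]  # cosine similarity to each passage
best = int(scores.argmax())
print(passages[best])  # candidate objection to surface to a human
```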

I do think a lot of the things you're suggesting would be valuable and worth doing, on the margin. They're probably not sufficient to close the bulk of the safety gap without theoretical progress on the core problems, but they're still useful.

I'm a bit confused why you're bringing up "safety problems too complex for ourselves" because it sounds like you don't think there are any important safety problems like that, based on the sentences that came before this one?

The "safety problems too complex for ourselves" are things like the fusion power generator scenario - i.e. safety problems in specific situations or specific applications. The safety problems which I don't think are too complex are the general versions, i.e. how to build a generally-aligned AI.

An analogy: finding shortest paths in a billion-vertex graph is far too complex for me. But writing a general-purpose path-finding algorithm to handle that problem is tractable. In the same way, identifying the novel safety problems of some new technology will sometimes be too complex for humans. But writing a general-purpose safety-reasoning algorithm (i.e. an aligned AI) is tractable, I expect.
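
To make the analogy concrete, a minimal sketch (standard Dijkstra, nothing alignment-specific): the general-purpose algorithm fits in a few lines, even though tracing any particular run on a billion-vertex graph by hand is hopeless.

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra's algorithm: distances from source in a graph given as
    {node: [(neighbor, weight), ...]} with nonnegative edge weights."""
    dist = {source: 0}
    frontier = [(0, source)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(frontier, (nd, neighbor))
    return dist

# Tiny example; the same code runs unchanged (if slowly) on a huge graph.
print(shortest_paths({"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}, "a"))
```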

I'm talking about the broad sense of "corrigible" described in e.g. the beginning of this post.

Ah ok, the suggestion makes sense now. That's a good idea. It's still punting a lot of problems until later, and humans would still be largely responsible for solving those problems later. But it could plausibly help with the core problems, without any obvious trade-off (assuming that the AI/oracle actually does end up pointed at corrigibility).

Alignment By Default

Do you have in mind a specific aspect of human values that couldn't be represented using, say, the reward function of a reinforcement learning agent AI?

It's not the function-representation that's the problem, it's the type-signature of the function. I don't know what such a function would take in or what it would return. Even RL requires that we specify the input-output channels up-front.
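
A toy illustration of the type-signature point (the names here are mine, purely illustrative): for RL we can write down the function's type before we know the function, whereas for "human values" I can't even fill in the type.

```python
from typing import Callable

# For RL, the type signature is known up-front even if the reward isn't:
Observation = bytes      # e.g. a camera frame; input channel fixed in advance
Action = int             # e.g. an index into a fixed action set
RewardFunction = Callable[[Observation, Action], float]

# For "human values", the problem isn't writing down the function body,
# it's that we don't know what the function should take in or return:
# HumanValues = Callable[[???], ???]
```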

All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, then the assistant can help us with the rest of the unknown unknowns.

This translates in my head to "all we need to do is solve the main problems of alignment, and then we'll have an assistant which can help us clean up any easy loose ends".

More generally: I'm certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that's very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?

If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you're getting at with the "unknown unknowns" stuff), what is the alternative?

I don't think solving FAI involves reasoning about things beyond humans. I think the AIs themselves will need to reason about things beyond humans, and in particular will need to reason about complex safety problems on a day-to-day basis, but I don't think that designing a friendly AI is too complex for humans.

Much of the point of AI is that we can design systems which can reason about things too complex for ourselves. Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.

Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities.

What notion of "corrigible" are you using here? It sounds like it's not MIRI's "the AI won't disable its own off-switch" notion.

Radical Probabilism

(Note that Bayes-with-a-side-channel does not imply conditions such as convergence and calibration; so, Jeffrey's theory of rationality is more demanding.)

What about the converse? Is a radical probabilist always behaviorally equivalent to a Bayesian with a side-channel? Or to some sequence of virtual evidence updates?

You seem to say so later on - "And remember, every update is a Bayesian update, with the right virtual evidence" - but I don't think this was proven?
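
For reference, the easy direction is just a computation: any single jump from prior $P$ to posterior $Q$ (assuming $P(h) > 0$ wherever $Q(h) > 0$) can be realized as one Bayesian update on virtual evidence, by taking likelihoods proportional to $Q(h)/P(h)$. A minimal numeric check with toy numbers of my own:

```python
# One prior -> posterior jump realized as a Bayes update on virtual
# evidence e, with P(e|h) proportional to Q(h)/P(h).
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
target = {"h1": 0.2, "h2": 0.2, "h3": 0.6}  # arbitrary desired posterior

likelihood = {h: target[h] / prior[h] for h in prior}  # any rescaling works

unnormalized = {h: prior[h] * likelihood[h] for h in prior}
z = sum(unnormalized.values())
posterior = {h: p / z for h, p in unnormalized.items()}

print(posterior)  # recovers the target posterior exactly
```

That only handles each update in isolation, though; whether the whole sequence can always be packaged as a single Bayesian-with-side-channel model is the part I don't think was proven.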
