Ramana Kumar

An observation about Hubinger et al.'s framework for learned optimization

If I've understood it correctly, I think this is a really important point, so thanks for writing a post about it. This post highlights that mesa objectives and base objectives are typically going to be of different "types", because the base objective will typically be designed to evaluate things in the world as humans understand it (or as modelled by the formal training setup) whereas the mesa objective will be evaluating things in the AI's world model (or if it doesn't really have a world model, then more local things like actions themselves as opposed to their distant consequences).


Am I acting in bad faith?... Surely I "get what they mean"?

I'm certainly glad to see people suspending their sense of "getting it" when it comes to reference (aka pointers, aka representation) since I don't think we have solid foundations for these topics and I think they are core issues in AI alignment.

Against Time in Agent Models

It's possible that reality is even worse than this post suggests, from the perspective of someone keen on using models with an intuitive treatment of time. I'm thinking of things like "relaxed-memory concurrency" (or "weak memory models") where there is no sequentially consistent ordering of events. The classic example is where these two programs run in parallel, with X and Y initially both holding 0, [write 1 to X; read Y into R1] || [write 1 to Y; read X into R2], and after both programs finish both R1 and R2 contain 0. What's going on here is that the level of abstraction matters: writes to and reads from the shared locations X and Y are not atomic operations, and if you assume they are and expect sequential consistency, you're going to get confused.

  • Total ordering: there's only one possible ordering of all operations, and everyone knows it. (or there's just one agent in a cybernetic interaction loop.)
  • Sequential consistency: everyone knows the order of their own operations, but not how they are interleaved with others' operations (as in this post)
  • Weak memory: everyone knows the order of their own operations, but others' operations may affect shared resources in ways that aren't compatible with any interleaving of the operations

See, e.g., https://www.cl.cam.ac.uk/~pes20/papers/topics.html#relaxed or this blog post for more: https://preshing.com/20120930/weak-vs-strong-memory-models/
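
As a sanity check on the two-program example above, here is a small Python sketch that enumerates every sequentially consistent interleaving and collects the possible (R1, R2) results. The tuple encoding of operations and the helper names (`interleavings`, `run`, `sc_outcomes`) are made up for illustration. It only models sequential consistency, so it shows which outcomes that model allows rather than simulating real weak-memory hardware.

```python
# Toy model (an assumption for illustration): each operation is a tuple
# (kind, shared_location, value_or_register_name).
THREAD_1 = [("write", "X", 1), ("read", "Y", "R1")]
THREAD_2 = [("write", "Y", 1), ("read", "X", "R2")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving atomically, step by step; return (R1, R2)."""
    memory = {"X": 0, "Y": 0}  # X and Y initially both hold 0
    registers = {}
    for kind, location, arg in schedule:
        if kind == "write":
            memory[location] = arg
        else:
            registers[arg] = memory[location]
    return registers["R1"], registers["R2"]


def sc_outcomes():
    """Set of (R1, R2) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(THREAD_1, THREAD_2)}


print(sorted(sc_outcomes()))
```

Under this model the outcome (0, 0) never appears, so observing it on real hardware cannot be explained by any interleaving of the four operations.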

[ASoT] Consequentialist models as a superset of mesaoptimizers

I agree with this, and I think the distinction between "explicit search" and "heuristics" is pretty blurry: there are characteristics of search (evaluating alternative options, making comparisons, modifying one option to find another option, etc.) that can be implemented by heuristics, so you get some kind of hybrid search-instinct system overall that still has "consequentialist nature".

The Big Picture Of Alignment (Talk Part 1)

Thanks a lot for posting this! A minor point about the 2nd intuition pump (100 timesteps, 4 actions: Take $1, Do Nothing, Buy Apple, Buy Banana; the point being that most action sequences take the Take $1 action a lot rather than the Do Nothing action): the "goal" of getting 3 apples seems irrelevant to the point, and may be misleading if you think that that goal is where the push to acquire resources comes from. A more central source seems to me to be the "rule" of not ending with a negative balance: this is what prunes paths through the tree that contain more "do nothing" actions.
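
A brute-force sketch of this, shrunk to a short horizon so it can be enumerated. Note the assumptions, none of which come from the talk: both fruits cost $1, the starting balance is $0, the only rule applied is "final balance is not negative", and there is deliberately no apple goal at all. The function name `action_stats` is hypothetical.

```python
from itertools import product

# Assumed effect of each action on the balance (prices are an assumption).
ACTIONS = {"take": +1, "nothing": 0, "apple": -1, "banana": -1}


def action_stats(horizon):
    """Enumerate all action sequences of the given length and keep those
    whose final balance is non-negative. Return the number of surviving
    sequences and the mean count of 'take' and 'nothing' among them."""
    valid = take_total = nothing_total = 0
    for seq in product(ACTIONS, repeat=horizon):
        if sum(ACTIONS[a] for a in seq) < 0:
            continue  # pruned by the no-negative-ending-balance rule
        valid += 1
        take_total += seq.count("take")
        nothing_total += seq.count("nothing")
    return valid, take_total / valid, nothing_total / valid


valid, mean_take, mean_nothing = action_stats(6)
print(valid, mean_take, mean_nothing)
```

Even with no goal specified, the surviving sequences contain more Take $1 actions than Do Nothing actions on average, which is the sense in which the rule alone does the pruning.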

Some Hacky ELK Ideas

In order to cross-check a non-holdout sensor with a holdout sensor, you need to know the expected relationship between the two sensor readings under different levels of tampering. A simple case: holdout sensor 1 and non-holdout sensor 1 are identical cameras on the ceiling pointing down at the room, the expected relationship is that the images captured agree (up to say 1 pixel shift because the cameras are at very slightly different positions) under no tampering, and don't agree when there's been tampering.

Problem: tampering with the non-holdout sensor may inadvertently cause tampering with the holdout sensor such that the relationship between their readings stays the same, despite there only being an incentive to tamper with the non-holdout sensor. For example, putting a screen up, far away from both cameras, to fool the non-holdout sensor also ends up fooling the holdout sensor unintentionally.
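
A toy sketch of both the cross-check and the failure case. Everything here is a simplifying assumption for illustration: "images" are 1-D scanlines of integers, the holdout camera sees the scene shifted by one pixel, and the names `view`, `agree_up_to_shift`, and `cross_check_cases` are hypothetical.

```python
def view(scene, offset):
    """Reading of a camera whose position shifts the scene by `offset` pixels
    (pixels that fall outside the scene read as 0)."""
    n = len(scene)
    return [scene[i + offset] if 0 <= i + offset < n else 0 for i in range(n)]


def agree_up_to_shift(a, b, max_shift=1):
    """True if the two readings match on their overlap for some small shift."""
    n = len(a)
    for s in range(-max_shift, max_shift + 1):
        overlap = [(a[i], b[i + s]) for i in range(n) if 0 <= i + s < n]
        if overlap and all(x == y for x, y in overlap):
            return True
    return False


def cross_check_cases():
    diamond = [0, 0, 5, 9, 5, 0, 0, 0]  # what the room looks like with the diamond
    empty = [0] * 8                     # what the room looks like without it

    # No tampering: both cameras see the real room.
    honest = agree_up_to_shift(view(diamond, 0), view(diamond, 1))

    # Diamond gone, only the non-holdout feed is replaced with a fake image.
    direct = agree_up_to_shift(view(diamond, 0), view(empty, 1))

    # Diamond gone, but a distant screen shows the fake image to both cameras.
    screen = agree_up_to_shift(view(diamond, 0), view(diamond, 1))

    return {"honest": honest, "direct_tamper": direct, "screen_tamper": screen}


print(cross_check_cases())
```

The cross-check flags the direct tampering but passes the screen case, because from the two sensors' readings the screen is indistinguishable from no tampering.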

Alex Ray's Shortform

In my understanding there's a missing step between upgraded verification (of software, algorithms, designs) and a "defence wins" world: what the specifications for these proofs need to be isn't a purely mathematical thing. The missing step is how to figure out what the specs should say. Better theorem proving isn't going to help much with the hard parts of that.

Prizes for ELK proposals

Question: Does ARC consider ELK-unlimited to be solved, where ELK-unlimited is ELK without the competitiveness restriction (computational resource requirements comparable to the unaligned benchmark)?

One might suppose that the "have AI help humans improve our understanding" strategy is a solution to ELK-unlimited because its counterexample in the report relies on the competitiveness requirement. However, there may still be other counterexamples that were less straightforward to formulate or explain.

I'm asking for clarification of this point because I notice most of my intuitions about counterexamples aren't drawing heavily on the competitiveness requirement, and I suspect ELK-unlimited is still open. If ARC doesn't think so maybe this discrepancy will become a source of new counterexamples.

ARC's first technical report: Eliciting Latent Knowledge

I think the problem you're getting at here is real -- path-dependency of what a human believes on how they came to believe it, keeping everything else fixed (e.g., what the beliefs refer to) -- but I also think ARC's ELK problem is not claiming this isn't a real problem but rather bracketing (deferring) it for as long as possible. There are cases where ELK fails that don't involve much path-dependency, and we can focus on solving those cases until whatever else is causing the problem goes away (and only path-dependency is left).

ARC's first technical report: Eliciting Latent Knowledge

Our notion of narrowness is that we are interested in solving the problem where the question we're asking is such that a state always resolves a question. E.g. there isn't any ambiguity around whether a state "really contains a diamond". (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there could be a fake diamond or nanobots filtering what the human sees). It might be useful to think of this as an empirical claim about diamonds.

This "there isn't any ambiguity"+"there is ambiguity" does not seem possible to me: these types of ambiguity are one and the same. But it might depend on what “any set of observations” is allowed to include. “Any set” suggests being very inclusive, but remember that passive observation is impossible. Perhaps the observations I’d want the human to use to figure out if the diamond is really there (presuming there isn’t ambiguity) would include observations you mean to exclude, such as disabling the filter-nanobots first?

I guess a wrinkle here is that observations need to be “implementable” in the world. If we’re thinking of making observations as intervening on the world (e.g., to decide which sensors to query), then some observations may be inaccessible because we can’t make that intervention. Rewriting this all without relying on “possible”/“can” concepts would be instructive.

ARC's first technical report: Eliciting Latent Knowledge

Proposing experiments that are more specifically exposing tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough.

How do we use this to construct new sensors that allow the human to detect tampering?

I know this is crossed out but thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors.

The proposed experiment could itself perform tampering

Yep this is a problem. "Was I tricking you?" isn't being distinguished from "Can I trick you after the fact?".

The other problems seem like real problems too; more thought required...
