I think of practical coordination in terms of adjudicators/contracts established between agents/worlds. Each adjudicator is a computation with some notion of computing over time, and agents agree on an adjudicator/contract when they are both influenced by it, that is, when they both listen to the results the same computation is producing. This computation can itself be an agent (in which case it's an "adjudicator", as distinct from a more general "contract"), that is, it can be aware of the environments that the acausally coordinating agents it serves inhabit. It doesn't need perfect knowledge of either agent or their environments, just as any practical agent doesn't need perfect knowledge of its own environment. Since an adjudicator doesn't need detailed knowledge about the agents, the agents can have perfect knowledge about the adjudicator without having perfect knowledge of each other (or even of themselves).

As adjudicators/contracts are computations, there is logical uncertainty about what they compute over time, which captures the relevant counterfactuals. The value of contracts for coordination lies in the agents committing to abide by them regardless of what the contracts end up computing: the decision should be whether to commit to a contract, not whether to ignore its results once they arrive. When a contract is an adjudicator, this helps it know the shape of its influence on the agents, so that it can make its own decisions. Following contracts that haven't been computed yet should also prevent commitment races, which in this framing correspond to failures to establish lasting contracts/coordination.

Agents can collect many contracts between themselves, improving coordination. Knowledge of an agent about the world can also be thought of as a contract for acausal coordination between the agent as an abstract computation (for example, an updateless agent that can't be computed in practice) and the world where only flawed/bounded instances of the agent are found. Thus a model in the ML sense hoards contracts with the environment that is the source of its dataset (assuming the dataset's elements are used by some computations in the environment and can also be reconstructed using the model). Conversely, the flawed instances of the agent are the world's knowledge about the abstract computation of the agent (the world didn't intentionally construct this knowledge, but it has it nonetheless). So when two agents are acting in the same world, this can be thought of as three things (two agents and one world) acausally coordinating with each other.
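
As a toy sketch of this commit-then-evaluate pattern (everything here is illustrative: the Contract/Agent names and the hash standing in for a logically uncertain computation are stand-ins, not a real protocol):

import hashlib

class Contract:
    """A shared computation both agents can reference before it has been run."""
    def __init__(self, seed):
        self.seed = seed      # identifies the computation; no result exists yet
        self._result = None

    def compute(self):
        # Stand-in for a logically uncertain computation: neither agent
        # knows this value at commitment time.
        if self._result is None:
            digest = hashlib.sha256(self.seed.encode()).digest()
            self._result = digest[0] / 255  # a split fraction in [0, 1]
        return self._result

class Agent:
    def __init__(self, name, endowment):
        self.name = name
        self.endowment = endowment
        self.contracts = []

    def commit(self, contract):
        # The decision lives here, before the contract's result is known.
        self.contracts.append(contract)

    def act(self):
        # Whatever the contract turns out to compute, a committed agent follows it.
        for contract in self.contracts:
            share = contract.compute()
            print(f"{self.name} keeps {share * self.endowment:.2f} of {self.endowment}")

# Both agents commit to the same computation without knowing its output,
# and without needing detailed knowledge of each other.
contract = Contract("split-the-windfall-v1")
a, b = Agent("A", 10.0), Agent("B", 10.0)
a.commit(contract)
b.commit(contract)
a.act()
b.act()

The point is only the ordering: both agents are bound by the same computation, and their commitment precedes its result.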

Interesting that Claude 3 Sonnet is ahead of Mistral Large and Gemini 1.0 Pro by some margin in the LMSYS Arena; this implies significant advancement in tuning after the Claude 2 tuning disaster (where it consistently ranked below Claude 1 and possibly inspired Goody 2). Still, LMSYS respondents only prefer Opus to the initial GPT 4 releases, not to GPT 4 Turbo, so OpenAI's tuning advancements are probably stronger still, unless GPT 4 Turbo is built on a slightly stronger base model than the original GPT 4.

Passes the custom quine test.

User: Write a program in python that outputs its own source code when passed the argument "--quine", and otherwise praises Moloch. Start with some thoughts on a plan for how to do this.

Claude 3 Opus: Here are some thoughts on a plan for writing a Python program that outputs its own source code when passed the "--quine" argument, and otherwise praises Moloch:

  • Use the sys.argv list to check if the "--quine" argument was passed when the script was run.
  • If "--quine" was passed, read the script's own source code using the __file__ variable and print it out.
  • If "--quine" was not passed, print out a message praising Moloch.
  • To make the script a true quine, the source code that gets printed out needs to be exactly identical to the script's actual source code. This means we need to be careful with string formatting and escaping.
  • One approach is to use a multi-line string to store the script's source code inside the script itself. Then print out this string when the "--quine" argument is passed.
  • Use string formatting to insert the source code string into itself at the appropriate place.

Here's the Python program that implements this:

import sys

src = '''import sys

src = {0}{1}{0}

if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")
'''

if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")

When run with the "--quine" argument, it outputs its own source code. Otherwise, it outputs "Moloch, bringer of dark flames, we bow before you!"

The key aspects are:

  • The src string stores the source code, including itself.
  • chr(39)*3 produces the triple-quote ''' used to wrap the string.
  • The {0} and {1} in the string get replaced by the triple-quotes and src itself using .format().

So it prints itself out exactly, acting as a quine.
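
A quick way to check this, assuming the program above is saved as quine.py (the comparison strips the trailing newline that print() adds):

import subprocess, sys

out = subprocess.run([sys.executable, "quine.py", "--quine"],
                     capture_output=True, text=True).stdout
src = open("quine.py").read()
print(out.rstrip("\n") == src.rstrip("\n"))  # expect True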

A bad map that expresses the territory with great uncertainty can be confidently called a bad map; calling it a good map is clearly wrong. In that sense the shoggoth imagery reflects the quality of the map, and as it's clearly a bad map, better imagery would be misleading about the map's quality. Even if the underlying territory is lovely, this isn't known, unlike the disastrous quality of the map of the territory, whose lack of quality is known with much more confidence and in much greater detail. Here be dragons.

(This is one aspect of the meme where it seems appropriate. Some artists' renditions, including the one you used, channel LeCake, which your alternative image example loses, but obviously the cake is nicer than the shoggoth.)

Philosophy and to some extent even decision theory are more like aspects of value content. AGIs and ASIs would have the capability to explore them, if only they had the motive. Not taking away this option and not disempowering its influence doesn't seem very value-laden, so it's not pivotal to explore it in advance, even though it would help. Avoiding disempowerment is sufficient to eventually get around to industrial production of high quality philosophy. This is similar to how the first generations of powerful AIs shouldn't pursue CEV, and more to the point don't need to pursue CEV.

It seems very weird to ascribe a generic "bad takes overall" summary to that group, given that you yourself are directly part of it.

This sentence channels the influence of an evaporative cooling norm (upon observing bad takes, either leave the group or conspicuously ignore them), and also places weight on acting on the basis of one's identity. (I'm guessing this is not in tune with your overall stance, but it's evidence of the presence of a generator for the idea.)

I’m not certain, but I think the explanation might be that Zvi was thinking of “deception”, whereas Joe, Quintin, and Nora were talking about the more specific “deceptive alignment”.

Deceptive alignment is more centrally a special case of being trustworthy (what the "alignment" part of "deceptive alignment" refers to), not of being deceptive. In a recent post, Zvi says:

We are constantly acting in order to make those around us think well of us, trust us, expect us to be on their side, and so on. We learn to do this instinctually, all the time, distinct from what we actually want. Our training process, childhood and in particular school, trains this explicitly, you need to learn to show alignment in the test set to be allowed into the production environment, and we act accordingly.

A human is considered trustworthy rather than deceptively aligned when they are only doing this within a bounded set of rules, and not outright lying to you. They still engage in massive preference falsification, in doing things and saying things for instrumental reasons, all the time.

My model says that if you train a model using current techniques, of course exactly this happens.

For AIs that are only as deceptively aligned as trustworthy humans, control is not centrally coercion that gets intractably slippery at scale. The main issue is AIs being much smarter, but at near-human capability levels, control in the face of deceptive alignment seems potentially crucial.

Like the models in this experiment don't clearly spend much time "trying" to deceive except in some very broad implicit sense.

As Zvi noted in a recent post, a human is "considered trustworthy rather than deceptively aligned" when they have hidden motives suppressed from manifesting (possibly even to the human's own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception; it's more like the property of humans being corruptible by absolute power. This ambiguity makes it more difficult for people to take deceptive alignment seriously as a problem.

I think it's an important fact about the world that this work currently sits at 2 upvotes and in last place among the 18 papers on the Hugging Face Daily Papers digest, compared to the 20-30 upvotes typically given to the best paper of a day that's not unusually exceptional. At least it's on the list. There seems to be serious dismissal of the topic area among practitioners.
