AI ALIGNMENT FORUM

(...) the term technical is a red flag for me, as it is many times used not for the routine business of implementing ideas but for the parts, ideas and all, which are just hard to understand and many times contain the main novelties.
- Saharon Shelah

For most of my writing, see my shortforms (new shortform, old shortform).

Sequences

Singular Learning Theory



Yeah follow-up posts will definitely get into that!

To be clear: (1) the initial posts won't be about Crutchfield work yet - just introducing some background material and overarching philosophy (2) The claim isn't that standard measures of information theory are bad. To the contrary! If anything we hope these posts will be somewhat of an ode to information theory as a tool for interpretability.

Adam wanted to add a lot of academic caveats - I was adamant that we streamline the presentation to make it short and snappy for a general audience, but it appears I might have overshot! I will make an edit to clarify. Thank you!

I agree with you about the philosophical importance of Kolmogorov complexity and would love to read a follow-up post on your thoughts about Kolmogorov complexity and LLM interpretability :)

Concept splintering in Imprecise Probability: Aleatoric and Epistemic Uncertainty.

There is a general phenomenon in mathematics [and outside maths as well!] where, in a certain context or theory, we have two equivalent definitions of a concept that become inequivalent when we move to a more general context or theory. In our case we are moving from the concept of a probability distribution to the concept of an imprecise distribution (i.e. a convex set of probability distributions, which in particular could be just one probability distribution). In this case the concepts of 'independence' and 'invariance under a group action' will splinter into inequivalent concepts.

Example (splintering of independence). In classical probability theory there are three equivalent ways to state that two variables $X$ and $Y$ are independent:

1. $P(X, Y) = P(X)P(Y)$

2. $P(X \mid Y) = P(X)$

3. $P(Y \mid X) = P(Y)$

In imprecise probability these split into three inequivalent notions. The first is 'strong independence' or 'aleatoric independence'. The second and third are called 'irrelevance', i.e. knowing $Y$ does not tell us anything about $X$ [or for 3, knowing $X$ does not tell us anything about $Y$].
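To make the splintering concrete, here is a minimal sketch with a toy credal set. Everything specific in it is an assumption chosen for illustration and is not taken from the post: the binary variables, the numbers, the helper names, and the reading of 'irrelevance' as "the set of conditionals equals the set of marginals". In this toy set Y is irrelevant to X, X is not irrelevant to Y, and strong independence fails.

```python
from fractions import Fraction as F
from itertools import product

# Illustrative numbers (assumptions, not from the post).
P_Y1 = F(1, 2)                 # Y is a precisely known fair coin
X_BOUNDS = (F(1, 5), F(4, 5))  # P(X=1 | Y=y) may lie anywhere in [1/5, 4/5]

def joint(px1_given_y0, px1_given_y1):
    """Joint table {(x, y): probability} built from P(Y) and P(X=1 | Y=y)."""
    cond = {0: px1_given_y0, 1: px1_given_y1}
    table = {}
    for y in (0, 1):
        py = P_Y1 if y == 1 else 1 - P_Y1
        table[(1, y)] = py * cond[y]
        table[(0, y)] = py * (1 - cond[y])
    return table

# Extreme points of the credal set: each conditional sits at one end of its interval.
VERTICES = [joint(a, b) for a, b in product(X_BOUNDS, repeat=2)]

def p_x1(t):
    return t[(1, 0)] + t[(1, 1)]

def p_y1(t):
    return t[(0, 1)] + t[(1, 1)]

def p_x1_given_y(t, y):
    return t[(1, y)] / (t[(0, y)] + t[(1, y)])

def p_y1_given_x(t, x):
    return t[(x, 1)] / (t[(x, 0)] + t[(x, 1)])

def bounds(f):
    """Lower and upper value of f over the credal set.

    Marginals are linear and conditionals are linear-fractional in the joint,
    so their extremes over a convex set are attained at the extreme points.
    """
    values = [f(t) for t in VERTICES]
    return min(values), max(values)

def factorizes(t):
    """True when P(x, y) = P(x) P(y) for every cell of the table."""
    px = {1: p_x1(t), 0: 1 - p_x1(t)}
    py = {1: p_y1(t), 0: 1 - p_y1(t)}
    return all(t[(x, y)] == px[x] * py[y] for x in (0, 1) for y in (0, 1))

# Notion 1: every extreme point is a product distribution.
strongly_independent = all(factorizes(t) for t in VERTICES)
# Notion 2: conditioning on Y leaves the bounds on X unchanged.
y_irrelevant_to_x = all(
    bounds(lambda t: p_x1_given_y(t, y)) == bounds(p_x1) for y in (0, 1)
)
# Notion 3: conditioning on X leaves the bounds on Y unchanged.
x_irrelevant_to_y = all(
    bounds(lambda t: p_y1_given_x(t, x)) == bounds(p_y1) for x in (0, 1)
)

print(strongly_independent, y_irrelevant_to_x, x_irrelevant_to_y)
```

With a single distribution in the set, the three checks agree again, which is the classical case.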

Example (splintering of invariance). There are often debates in the foundations of probability, especially in subjective Bayesian accounts, about the 'right' prior. An ultra-Jaynesian point of view would argue that we are compelled to adopt a prior invariant under some symmetry if we do not possess subjective knowledge that breaks that symmetry ['epistemic invariance'], while a more frequentist or physicalist point of view would retort that we would need evidence that the system in question is in fact invariant under said symmetry ['aleatoric invariance']. In imprecise probability the notion of invariance under a symmetry splits into a weak 'epistemic' invariance and a strong 'aleatoric' invariance. Roughly speaking, the latter means that each individual distribution in the convex set is invariant under the group action, while the former just means that the convex set is closed under the action.
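A small sketch of the difference, under assumptions of my own choosing: a coin with the group Z/2 swapping heads and tails, and a credal set given by a list of its extreme points. The function names and numbers are hypothetical, just for illustration.

```python
from fractions import Fraction as F

def swap(p):
    """Act on a coin distribution (P(heads), P(tails)) by exchanging the outcomes."""
    return (p[1], p[0])

def is_epistemically_invariant(vertices, action):
    # Weak invariance: the convex set is mapped onto itself.
    # Assumes `vertices` lists exactly the extreme points, so it is enough
    # to check that the action permutes them.
    return {action(v) for v in vertices} == set(vertices)

def is_aleatorically_invariant(vertices, action):
    # Strong invariance: every distribution in the set is fixed by the action.
    # Being fixed is a linear condition, so checking the extreme points suffices.
    return all(action(v) == v for v in vertices)

# All coins with P(heads) between 3/10 and 7/10 (illustrative numbers).
biased_band = [(F(3, 10), F(7, 10)), (F(7, 10), F(3, 10))]
# The single fair coin.
fair_coin = [(F(1, 2), F(1, 2))]

print(is_epistemically_invariant(biased_band, swap))   # set closed under the swap
print(is_aleatorically_invariant(biased_band, swap))   # but its members are not fixed
print(is_aleatorically_invariant(fair_coin, swap))
```

The band of biased coins is symmetric as a set although almost none of its members is symmetric; for a single distribution the two notions coincide.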

The point isn't about goal misalignment but capability generalisation. It is surprising to some degree that, just by selection on reproductive fitness through proxies such as being well-fed, social status etc., humans have obtained the capability to go to the moon. It points toward a coherent notion & existence of 'general intelligence' as opposed to specific capabilities.

Thank you for writing this post; I had been struggling with these considerations a while back. I investigated going full paranoid mode but in the end mostly decided against it.

I agree theoretical insights on agency and intelligence have a real chance of leading to capability gains. I agree that the government spy threat model is unlikely. I would like to add, however, that if, say, MIRI builds a safe AGI prototype - perhaps based on different principles than systems used by adversaries - it might make sense for an (AI-assisted) adversary to trawl through your old blogposts.

Byrnes has already mentioned the distinction between pioneers and median researchers. Another aspect that your threat models don't capture is research that builds on your research. Your research may end up in a very long chain of theoretical research, only a minority of which you have contributed. Or the spirit, if not the letter, of your ideas may percolate through the research community. Additionally, the alignment field will almost certainly become very much larger, raising the status of both John and the alignment field in general. Over longer timescales I expect percolation to be quite strong.

Even if approximately nobody reads or knows of your work, the insights may very well become massively signalboosted by other alignment researchers (once again, I expect the community to explode in size within a decade) and thereby end up in a flashy demo.

All in all, these and other considerations led me to the conclusion that this danger is very real. That is, there is a significant minority of possible worlds in which early alignment researchers tragically contribute to DOOM.

However, I still think on the whole most alignment researchers should work in the open. Any solution to alignment will most likely come from a group (albeit small) of people. Working privately massively hampers collaboration. It makes the community look weird and makes it way harder to recruit good people. Also, for most researchers it is difficult to support themselves financially if they can't show their work. As by far the most likely doom scenario is some company/government simply building AGI without sufficient safeguards, because either there is no alignment solution or they are simply unaware of it or ignore it, I conclude that the best policy in expected value is to work mostly publicly*.

*Ofc if there is a clear path to capability gain keeping it secret might be the best.

EDIT: Cochran has a comical suggestion

Georgy Flerov was a young nuclear physicist in the Soviet Union who (in 1943) sent a letter to Stalin advocating an atomic bomb project. It is not clear that Stalin read that letter, but one of Flerov’s arguments was particularly interesting: he pointed out the abrupt and complete silence on the subject of nuclear fission in the scientific literature of the US, UK, and Germany – previously an extremely hot topic.

Stopping publications on atomic energy (which happened in April 1940) was a voluntary effort by American and British physicists. But that cessation was itself a signal that something strategically important was going on.

Imagine another important discovery with important strategic implications: how would you maximize your advantage?

Probably this is only practically possible if your side alone has made the discovery. If the US and the UK had continued publishing watered-down nuclear research, the paper stoppage in Germany would still have given away the game. But suppose, for the moment, that you have a monopoly on the information. Suddenly stopping closely related publications obviously doesn’t work. What do you do?

You have to continue publications, but they must stop being useful. You have to have the same names at the top (an abrupt personnel switch would also be a giveaway) but the useful content must slide to zero. You could employ people that A. can sound like the previous real authors and B. are good at faking boring trash. Or, possibly, hire people who are genuinely mediocre and don’t have to fake it.

Maybe you can distract your rivals with a different, totally fake but extremely exciting semiplausible breakthrough.

Or – an accidental example of a very effective approach to suppression. Once upon a time, around 1940, some researchers began to suspect that duodenal ulcers were caused by a spiral bacterium. Some physicians were even using early antibiotics against them, which seemed to work. Others thought what they were seeing might be postmortem contamination. A famous pathologist offered to settle the issue.

He looked, didn’t see anything, and the hypothesis was buried for 40 years.

But he was wrong: he had used the wrong stains.

So, a new (?) intelligence tactic for hiding strategic breakthroughs: the magisterial review article.

Just wanted to chime in to say that this feels important and I'm happy you brought it up.

Daniel Kokotajlo and I agreed on the following bet: I paid Daniel $1000 today. Daniel will pay me $1100, inflation-adjusted, if there is no AGI in 2030.

Ramana Kumar will serve as the arbiter. Under unforeseen events we will renegotiate in good-faith.

As a guideline for 'what counts as AGI' I suggested the following, to which Daniel agreed:

"the Arbiter agrees with the statement "there is convincing evidence that there is an operational Artificial General Intelligence" on 6/7/2030"

Defining an artificial general intelligence is a little hard and has a strong 'know it when you see it' vibe, which is why I'd like to leave it up to Ramana's discretion.

We hold these properties to be self-evident requirements for a true Artificial General Intelligence:

1. be able to equal or outperform any human on virtually all relevant domains, at least theoretically

-> there might be e.g. physical tasks that it is artificially constrained from completing because it lacks actuators, for instance - but it should be able to do this 'in theory'. Again, I leave it up to the arbiter to make the right judgement call here.

2. it should be able to asymptotically outperform or equal human performance for a task with equal fixed data, compute, and prior knowledge

3. it should autonomously be able to formalize vaguely stated directives into tasks and solve these (if possible by a human)

4. it should be able to solve difficult unsolved maths problems for which there are no similar cases in its dataset

(again difficult, know it when you see it)

5. it should be immune to / at least outperform humans against an adversarial opponent (e.g. it shouldn't fail Gary Marcus style questioning)

6. outperform or equal humans on causal & counterfactual reasoning

7. This list is not a complete enumeration but a moving goalpost (but importantly set by Ramana! not me)

-> as we understand more about intelligence we peel off capability layers that turn out not to be essential / to be downstream of 'true' intelligence.

Importantly, I think near-future ML systems will start to outperform humans in virtually all (data-rich) clearly defined tasks (almost) purely on scale, but I feel that an AGI should be able to solve data-poor, vaguely defined tasks, be robust to adversarial actions, correctly perform counterfactual & causal reasoning and be able to autonomously 'formalize questions'.

Are we supposed to know who Yafa is?

I am not able to ascertain the truth value of the relevant sentences with or without assistance. I am a human, if that helps.