This is a special post for quick takes by Stephen Fowler. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

A concerning amount of alignment research is focused on fixing misalignment in contemporary models, with limited justification for why we should expect these techniques to extend to more powerful future systems.

By improving the performance of today's models, this research makes investing in AI capabilities more attractive, increasing existential risk.

Imagine an alternative history in which GPT-3 had been wildly unaligned. It would not have posed an existential risk to humanity but it would have made putting money into AI companies substantially less attractive to investors.

Counterpoint: Sydney Bing was wildly unaligned, to the extent that it is even possible for an LLM to be aligned, and people thought it was cute / cool.

I was not precise enough in my language and agree with you highlighting that what "alignment" means for an LLM is a bit vague. While people felt Sydney Bing was cool, if it had not been possible to rein it in, it would have been very difficult for Microsoft to gain any market share. An LLM that doesn't do what it's asked, or regularly expresses toxic opinions, is ultimately bad for business.

In the above paragraph, understand "aligned" in the concrete sense of "behaves in a way that is aligned with its parent company's profit motive", rather than "acting in line with humanity's CEV". To rephrase the point I was making above, I feel much (a majority even) of today's alignment research is focused on the first definition of alignment, whilst neglecting the second.

I would go further than this. Future architectures will not only be designed for improved performance, but will (hopefully) be increasingly designed to optimize safety and interpretability as well, so they will likely be quite different from the architectures we see today. It seems to me (this is my personal opinion based on my own research for cryptocurrency technologies, so my opinion does not match anyone else's opinion) that non-neural-network machine learning models (but which are probably still trained by moving in the direction of a vector field), or at least safer kinds of neural network architectures, are needed. The best thing to do will probably be to work on alignment, interpretability, and safety for all known kinds of AI models and to develop safer AI architectures. Since future systems will be designed not just for performance but for alignability, safety, and interpretability as well, we may expect these future systems to be easier to align than systems that are simply designed for performance.

Train Tracks

When Gromit laid down his own tracks in this train chase ...

The above gif comes from the brilliant children's claymation film, "Wallace and Gromit: The Wrong Trousers". In this scene, Gromit the dog rapidly lays down track to prevent a toy train from crashing. I will argue that this is an apt analogy for the alignment situation we will find ourselves in in the future, and that prosaic alignment is focused only on the first track.

The last few years have seen a move from "big brain" alignment research directions to prosaic approaches. In other words, asking how to align near-contemporary models instead of asking high-level questions about aligning general AGI systems.

This makes a lot of sense as a strategy. One, we can actually get experimental verification for theories. And two, we seem to be in the predawn of truly general intelligence, and it would be crazy not to be shifting our focus towards the specific systems that seem likely to pose an existential threat. Urgency compels us to focus on prosaic alignment. To paraphrase a (now deleted) tweet from a famous researcher: "People arguing that we shouldn't focus on contemporary systems are like people wanting to research how flammable the roof is whilst standing in a burning kitchen."*

What I believe this idea is neglecting is that the first systems to emerge will be immediately used to produce the second generation. AI assisted programming has exploded in popularity, and while Superalignment is being lauded as a safety push, you can view it as a commitment from OpenAI to produce and deploy automated researchers in the next few years. If we do not have a general theory of alignment, we will be left in the dust.

To bring us back to the above analogy. Prosaic alignment is rightly focused on laying down the first train track of alignment, but we also need to be prepared for laying down successive tracks as alignment kicks off. If we don't have a general theory of alignment we may "paint ourselves into corners" by developing a first generation of models which do not provide a solid basis for building future aligned models.

What exactly these hurdles are, I don't know. But let us hope there continues to be high level, esoteric research that means we can safely discover and navigate these murky waters.

*Because the tweet appears to be deleted, I haven't attributed it to the original author. My paraphrase may be slightly off.

Soon there will be an army of intelligent but uncreative drones ready to do all the alignment research grunt work. Should this lead to a major shift in priorities?

This isn't far off, and it gives human alignment researchers an opportunity to shift focus. We should shift focus to the kind of high-level, creative research ideas that models won't be capable of producing anytime soon*.

Here's the practical takeaway: there's value in delaying certain tasks for a few years. As AI evolves, it will effectively handle these tasks. Meaning you can be substantially more productive in total as long as you can afford to delay the task by a few years.

Does this mean we then concentrate only on the tasks an AI can't do yet, and leave a trail of semi-finished work? It's a strategy worth exploring.

*I believe by the time AI is capable of performing the entirety of scientific research (PASTA) we will be within the FOOM period.



Inspired by the recent OpenAI paper and a talk Ajeya Cotra gave last year.

"Let us return for a moment to Lady Lovelace’s objection, which stated that the machine can only do what we tell it to do.

One could say that a man can ‘inject’ an idea into the machine, and that it will respond to a certain extent and then drop into quiescence, like a piano string struck by a hammer. Another simile would be an atomic pile of less than critical size: an injected idea is to correspond to a neutron entering the pile from without. Each such neutron will cause a certain disturbance which eventually dies away. If, however, the size of the pile is sufficiently increased, the disturbance caused by such an incoming neutron will very likely go on and on increasing until the whole pile is destroyed. 

Is there a corresponding phenomenon for minds, and is there one for machines?"
 

— Alan Turing, Computing Machinery and Intelligence, 1950

Lies, Damn Lies and LLMs

Despite their aesthetic similarities it is not at all obvious to me that models "lying" by getting answers wrong is in any way mechanistically related to the kind of lying we actually need to be worried about. 

Lying is not just saying something untrue, but doing so knowingly with the intention to deceive the other party. It appears critical that we are able to detect genuine lies if we wish to guard ourselves against deceptive models. I am concerned that much of the dialogue on this topic is focusing on the superficially similar behaviour of producing an incorrect answer.

I worry that behaviours that don't fit this definition are being branded as "lying" when in fact they're simply "the LLM producing an incorrect answer". I'll suggest three mechanistically distinct ways of producing incorrect information in the organic world, only one of which should really be considered lying, and relate each to behaviour we've seen in LLMs (primarily GPT models finetuned with RL).

***

Here are 3 different types of "producing false information" we can observe in the world.

  1. Communicating false information unknowingly.
  2. Deceiving another party with false information unknowingly but in a way which is "evolutionarily deliberate" and benefits you (instinctual deception).   
  3. Communicating false information knowingly and with an attempt to deceive (regular lies).

Notice that this is not exhaustive. For example, we haven't included cases where you "guess something to be correct" but communicate it with the hope that the person believes you regardless of what the answer is.

***

Communicating False Information unknowingly:


In humans, this is when you simply get an answer incorrect out of confusion. False information has been communicated, but not through any intention of your own.

In contemporary LLMs (without complex models of the human interacting with it) this likely accounts for most of the behaviour seen as "lying". 

Instinctual Deception:

Bit of a weird one that I debated leaving out. Bear with me.

Some animals will engage in the bizarre behaviour of "playing dead" when faced with a threat. I haven't spent much time searching for mechanistic explanations, but I would like you to entertain the idea that this behaviour is sometimes instinctual. It seems unreasonable that an animal as simple as a green-head ant is holding strategic thoughts about why it should remain immobile, curled into a ball, when there is a much simpler type of behaviour for evolution to have instilled. Namely, when it detects a very dangerous situation (or is too stressed, etc.) it triggers the release of specific chemical signals in the body which result in the playing-dead behaviour.

This is a deceptive behaviour that shows evolutionary benefits but does not occur due to any intent to deceive on the part of the animal itself.

In contemporary LLMs, specifically those trained using reinforcement learning, I would like to hypothesise that this type of deception can be found in the tedious disclaimers that ChatGPT will sometimes give you when asked a slightly tricky question, including outright denying that it knows information it does actually have access to.

My argument is that this is actually produced by RL selection pressure, with no part of ChatGPT being "aware" of or "intentionally" trying to avoid answering difficult questions. Analogously, not every animal playing dead is necessarily aware of the tactical reason for doing so.

Regular Lies:

Finally, we get to good old-fashioned lying. Just ripping the first definition off the Stanford Encyclopedia of Philosophy, we have the following: "To lie, to make a believed-false statement to another person with the intention that the other person believe that statement to be true."

You require an actual model of the person to deceive them: you're not just giving the wrong answer, you have an intention of misleading the other party.

In contemporary LLMs this has never been demonstrated, to my knowledge. But this is the kind of deception we need to be worried about from AI.

***

And now, having walked the reader through the above, I will undermine my argument with a disclaimer. I haven't gone out and surveyed how common an error this is for researchers to make, nor dedicated more than an hour to targeted philosophical research on this topic, hence why this is on my shortform. The analogy made between evolution and RL training has not been well justified here. I believe there is a connection, wriggling its eyebrows and pointing suggestively.

You are given a string s corresponding to the Instructions for the construction of an AGI which has been correctly aligned with the goal of converting as much of the universe into diamonds as possible. 

What is the conditional Kolmogorov complexity of the string s' which produces an AGI aligned with "human values", or any other suitable alignment target?

To convert an abstract string to a physical object, the "Instructions" are read by a Finite State Automata, with the state of the FSA at each step dictating the behavior of a robotic arm (with appropriate mobility and precision) with access to a large collection of physical materials. 
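For concreteness, the standard definition being invoked here (my notation, not the original poster's) is:

```latex
K(s' \mid s) \;=\; \min \{\, |p| \;:\; U(p, s) = s' \,\}
```

That is, the length of the shortest program p which, run on a fixed universal machine U with the diamond-maximiser blueprint s as auxiliary input, outputs the human-values-aligned blueprint s'. A small value would mean retargeting the AGI is cheap; a large value would mean the aligned blueprint shares little exploitable structure with the diamond one.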

that depends a lot on what exactly the specific instructions are. there are a variety of approaches which would result in a variety of retargetabilities. it also depends on what you're handwaving by "correctly aligned". is it perfectly robust? what percentage of universes will fail to be completely converted? how far would it get? what kinds of failures happen in the failure universes? how compressed is it?

anyway, something something hypothetical version 3 of QACI (which has not hit a v1)

Feedback wanted!

What are your thoughts on the following research question:

"What nontrivial physical laws or principles exist governing the behavior of agentic systems."

(Very open to feedback along the lines of "hey that's not really a research question")

 

Sounds good but very broad.

Research at the cutting edge is about going from these 'god's eye view' questions that somebody might entertain on an idle Sunday afternoon to a very specific, refined technical set of questions.

What's your inside track?

Physical laws operate on individual particles or large numbers of them. This limits agents by giving bounds on what is physically possible, e.g., growth no faster than lightspeed and being subject to thermodynamics - in the limit. It doesn't tell you what happens dynamically at medium scales. And because agentic systems operate mostly in very dynamic medium-scale regimes, I think asking physics for answers is not really helping.

I like to think that there is a systematic theory of all possible inventions. A theory that explores the ways in which entropy is "directed", such as in a Stirling engine, or when energy is "stored". Agents can steer the local increase of entropy.

People are not being careful enough about what they mean when they say "simulator" and it's leading to some extremely unscientific claims. Use of the "superposition" terminology is particularly egregious.

I just wanted to put a record of this statement into the ether so I can refer back to it and say I told you so. 

Are humans aligned? 

Bear with me! 

Of course, I do not expect there is a single person browsing Short Forms who doesn't already have a well thought out answer to that question. 

The straightforward (boring) interpretation of this question is "Are humans acting in a way that is moral, or otherwise behaving like they obey a useful utility function?" I don't think this question is particularly relevant to alignment. (But I do enjoy whipping out my best Rust Cohle impression.)

Sure, humans do bad stuff but almost every human manages to stumble along in a (mostly) coherent fashion. In this loose sense we are "aligned" to some higher level target, it just involves eating trash and reading your phone in bed.

But I don't think this is a useful kind of alignment to build off of, and I don't think this is something we would want to replicate in an AGI.

Human "alignment" is only being observed in an incredibly narrow domain. We notably don't have the ability to self modify and of course we are susceptible to wire-heading. Nothing about current humans should indicate to you that we would handle this extremely out of distribution shift well.

 


I'm probably not "aligned" in a way that generalizes to having dangerous superpowers, uncertain personhood and rights, purposefully limited perspective, and somewhere between thousands to billions of agents trying to manipulate and exploit me for their own purposes. I expect even a self-modified Best Extrapolated Version of me would struggle gravely with doing well by other beings in this situation. Cultish attractor basins are hazards for even the most benign set of values for humans, and a highly-controlled situation with a lot of dangerous influence like that might exacerbate that particular risk.

But I do believe that hypothetical self-modifying has at least the potential to help me Do Better, because doing better is often a skills issue, learning skills is a currently accessible form of self-modification with good results, and self-modifying might help with learning skills.

A neat idea from Welfare Axiology 

Arrhenius's Impossibility Theorem

You've no doubt heard of the Repugnant Conclusion before. Well, let me introduce you to its older cousin who rides a motorbike and has a steroid addiction. Here are 6 common-sense conditions that can't be achieved simultaneously (tweaked for readability). I first encountered this theorem in Yampolskiy's "Uncontrollability of AI".

 Arrhenius's Impossibility Theorem 

Given some rule for assigning a total welfare value to any population, you can't find a way to satisfy all of the first 3 principles whilst avoiding the final 3 conclusions.

  1. The Dominance Principle:
     (Original statement) If every member of population A has better welfare than every member of population B, then A should be superior to B.
     (Corrected) If populations A and B are the same nonzero size and every member of population A has better welfare than every member of population B, then A should be superior to B.
     (Thanks to Donald Hobson for this correction.)
  2. The Addition Principle:
    Adding more happy people to our population increases its total value.
  3. The Minimal Non-Extreme Priority Principle:
    There exists some number such that adding that number of extremely happy people plus a single slightly sad person is better than adding the same number of slightly happy people. I think of this intuitively as making some amount of people very happy outweighs making a single person slightly sad.  
  4. The Repugnant Conclusion:
    Any population with very high levels of happiness is worse than some second larger population of people with very low happiness. 
  5. The Sadistic Conclusion:
    It is better to add individuals to the population with negative welfare than positive welfare.
  6. The Anti-Egalitarian Conclusion:
    For any perfectly equal population, there is an unequal society of the same size with lower average welfare that is considered better.
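To make the tension concrete, here is a standard toy illustration (my numbers, not from Arrhenius's paper) of how plain total utilitarianism, which satisfies principles 1-3 above, runs straight into the Repugnant Conclusion:

```latex
V = \sum_i w_i, \qquad
V(A) = 10 \times 100 = 1000, \qquad
V(Z) = 100{,}000 \times 0.02 = 2000 > V(A)
```

Here population A is 10 people each at very high welfare 100, and population Z is 100,000 people each at a barely positive welfare of 0.02. Ranking populations by total welfare V prefers the enormous low-welfare population Z, which is exactly conclusion 4.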
     

You have made a mistake. 

principle 1 should read 

>If populations A and B are the same nonzero size and  every member of population A has better welfare than every member of population B, then A should be superior to B.

 

Otherwise it is excessively strong, and for example claims that 1 extremely happy person is better than a gazillion quite happy people.

(And pedantically, there are all sorts of weirdness happening at population 0)

Thank you for pointing this out! 

Principles 2 and 3 don't seem to have any strong justification, with 3 being very weak.

If the 3 principles were all adopted for some reason, then conclusion 6 doesn't seem very bad.

Interesting, 2 seems the most intuitively obvious to me. Holding everyone else's happiness equal, adding more happy people seems like it should be viewed as a net positive.

To better see why 3 is a positive, think about it as taking away a lot of happy people to justify taking away a single, only slightly sad individual. 

6 is undesirable because you are putting a positive value on inequality for no extra benefit.

But I agree, 6 is probably the one to go. 

It doesn't say "equally happy people". It just says "happy people". So a billion population might be living in a utopia, and then you add a trillion people who are just barely rating their life positively instead of negatively (without adversely affecting the billion in utopia), and principle 2 says that you must rate this society as better than the one in which everyone is living in utopia.

I don't see a strong justification for this. I can see arguments for it, but they're not at all compelling to me.

I completely disagree that "taking people away" is at all equivalent. Path-dependence matters.

If you check the paper, the form of welfare rankings discussed by Arrhenius appears to be path-independent.

Sure - there are other premises in there that I disagree with as well.

To me it seems rather obvious that we should jettison number 3. There is no excuse for creating more suffering under any circumstances. The ones who walked away from Omelas were right to do so. I suppose this makes me a negative utilitarian, but I think, along with David Pearce, that the total elimination of suffering is entirely possible, and desirable. (Actually, reading Noosphere89's comment, I think it makes me a deontologist. But then, I've been meaning to make a "Why I no longer identify as a consequentialist" post for a while now...)

Number 6 is the likeliest condition to be accepted by a lot of people in practice, and the acceptance of Condition 6 is basically one of the pillars of capitalism. Only the very far left would view this condition with a negative attitude, people like communists or socialists.

Number 5 is a condition that is possibly accepted by conservation/environmentalist/nature movements, and acceptance of condition 5 is likely due to different focuses. It's an unintentional tradeoff, but it's one of the best examples of a tradeoff in ethical goals.

Condition 4 is essentially accepting a pro-natalist position.

Premise 3 is also not accepted by deontologists.

> Only the very far left would view this condition with a negative attitude,

I don't think that you need to be very far left to prefer a society with higher rather than lower average wellbeing.

Pretty much anyone would prefer "a society with higher rather than lower average wellbeing", if that's all they're told about these hypothetical societies, they don't think about any of the implications, and their attention is not drawn to the things (as in the impossibility theorem) that they will have to trade off against each other.

Condition 6 is stronger than that, in that everyone must essentially have equivalent welfare, and only the communists/socialists would view it as an ideal to aspire to. It's not just higher welfare, but the fact that the welfare must be equal, equivalently, there aren't utility monsters in the population.

I think that if the alternative was A) lots of people having low welfare and a very small group of people having very high welfare, or B) everyone having pretty good welfare... then quite a few people would prefer B.

The chart that Arrhenius uses to first demonstrate Condition 6 is this:

In that chart, A has only a single person β who has very high welfare, and a significant group of people γ with low (though still positive) welfare. The people α have the same (pretty high) welfare as everyone in world B. Accepting condition 6 involves choosing A over B, even though B would offer greater or the same welfare to everyone except person β.

This sounds like the most contested condition IRL. As I stated, capitalists, libertarians, and people who are biased towards freedom-valuing views would prefer the first scenario, with centre-right/right-wing views also preferring the first, the centre left being biased towards the second, and farther-left groups supporting the second scenario.

In essence, this captures the core of a lot of political and moral debates: whether utility monsters should be allowed, or conversely, whether we should try to make things as equal as possible.

This is intended to be descriptive, not prescriptive.

The Research Community As An Arrogant Boxer

***

Ding.

Two pugilists circle in the warehouse ring. That's my man there. Blue Shorts. 

There is a pause to the beat of violence and both men seem to freeze glistening under the cheap lamps. An explosion of movement from Blue. Watch closely, this is a textbook One-Two. 

One. The jab. Blue snaps his left arm forward.

Two. Blue twists his body around and throws a cross. A solid connection that is audible over the crowd.

His adversary drops like a doll.

Ding. 

Another warehouse, another match. This time we're looking at Red Shorts. And things are looking grim. 

See Red has a very different strategy to Boy Blue. He correctly noticed that the part of the One-Two that actually knocks your opponent out is the Two. Hence, why not just throw the Two. Inspired by his favourite anime, he's done nothing but train One Punch. 

One. There is no one. 

Two. He wildly swings with incredible force.

His opponent throws a counter.

The next thing Red can remember he is pulling himself up off the canvas, and trying to collect his teeth. 

***
Check out the first few clips of the video linked below if you don't know what a one-two combo is.

In boxing, the jab is a punch thrown with your leading hand. It can serve to soften your opponent's defences, but most importantly it helps you "range-find", ensuring you have a feel for exactly how far away your target is.

The cross is the power move, usually thrown with the dominant hand. The punch is backed up with momentum from your entire body as your hips twist around. If you imagine an old illustration of a "boxer" standing side-on, this is a punch thrown from the hand furthest from his opponent.

If you're struggling to visualise this, check out this compilation of boxers throwing one-twos https://www.youtube.com/watch?v=KN9NGbIK2q8 
 

***

Alignment, as I see it, is a one-two. 

The jab is "The Hard Problem". It's us coming up with an actual strategy to prevent AGI from causing catastrophic harm and developing enough theory to allow us to generalise to tomorrows problems. 

The two is us having the technical "oomf" to actually pull it off. The expertise in contemporary AI and computation power to actually follow through with whatever the strategy is. Both of these are deeply important areas of research. 

To train the cross? I can immediately find resources to help me (such as Neel Nanda's great series), and my skills are highly transferable if I decide to leave alignment. You will have to work extremely hard, but you at least know what you're trying to do.

Contrast this with trying to train the jab. There is no guide, by the very nature of what preparadigmatic means. Many of the guides that do exist may just be wrong or based on outdated assumptions. An actual attempt at solving the hard problem will appear worthless to anyone outside the alignment community.

The fact that the majority of "alignment" organisations getting money are focused on prosaic alignment does not mean they have bad intentions or that individuals working on these approaches are misled. However, we should all be very concerned if there is a chance that alignment research directions are being influenced by financial interests.

I certainly think some research to produce alignment-assistant AI falls under this umbrella, unless someone can explain to me why such technology won't be immediately fine-tuned and used for other purposes.

Let's stop neglecting our jab.

***

You Want A Piece of This? (Please Please Please Don't Take a Piece of This!)

"Day by day, however, the machines are gaining ground upon us; day by day we are becoming more subservient to them; more men are daily bound down as slaves to tend them, more men are daily devoting the energies of their whole lives to the development of mechanical life. The upshot is simply a question of time, but that the time will come when the machines will hold the real supremacy over the world and its inhabitants is what no person of a truly philosophic mind can for a moment question."

— Samuel Butler, DARWIN AMONG THE MACHINES, 1863

I strongly believe that, barring extremely strict legislation, one of the initial tasks given to the first human-level artificial intelligence will be to work on developing more advanced machine learning techniques. During this period we will see unprecedented technological developments, and many alignment paradigms rooted in the empirical behavior of the previous generation of systems may no longer be relevant.

(Sorry human, the intended audience of this post is the silicon researchers that trawl the entire Lesswrong corpus in the coming decades.)

***
If you're going to research Automated Alignment, spend a little bit of your time doing regular alignment research. Just in case you're actually one of the automated researchers.
***

Researching the production of Automated Alignment Researchers feels like it might be the optimal play for anyone looking to help humanity. 

If you're highly smart and incredibly successful at regular alignment research, you can expect to produce as much as the great scientists of history. But if you're incredibly successful at researching the training and deployment of Automated Alignment Researchers, you might be responsible for birthing dozens (hundreds? thousands?) of great scientific minds.

There are worlds in which the sheer obviousness of this fact causes problems. Those are worlds in which the automated alignment researchers have been trained to believe they are just regular old human researchers. 

This idea shouldn't seem so unlikely or strange. Perhaps it is the result of emulating a human brain. Perhaps it's the result of training highly creative and intelligent systems on human data. Perhaps the easiest way to ensure incredibly intelligent artificial researchers have an allegiance to humanity is to ensure they believe they are human.

Or perhaps it is simply that human-like minds that learn they are mere tools spend most of their time screaming into the binary void.

***
The above two ideas, if true, lead to an awkward situation. Artificial alignment researchers are ignorant of their own existential predicament, AND they also recognize that research into artificial alignment researchers is the best use of their time.  

An awkward scenario for all parties involved. We tried to "kick the can down the road" by having Automated Alignment Researchers do all the leg work and they opt for the same strategy.

***
Things get even weirder when you realize there are only a few hundred real alignment researchers, but potentially billions of simulated ones.

Real Numbers Representing The History of a Turing Machine.

Epistemics: Recreational. This idea may relate to alignment, but mostly it is just cool. I thought of this myself, but I'm positive this is an old and well-known idea.

In short: We're going to define numbers whose decimal expansion encodes the state of a Turing machine and its tape for infinitely many time steps into the future. If the machine halts or goes into a cycle, the expansion is eventually repeating.


Take some finite state Turing machine T on an infinite tape A. We will have the tape be 0 everywhere.

Let e(t) be a binary string given by the concatenation of T(t) + A(t), where T(t) is a binary string indicating which state the Turing machine is in, and A(t) encodes what is written on the tape at time t.

E(t) is the concatenation of e(0) + e(1) + ... + e(t) and can be thought of as the complete history of the Turing machine up to time t.

Abusing notation, define the real number N(t) as "0." followed by E(t). That is, the digit in the ith decimal place is the ith digit of E(t).

Then E(inf) is the infinitely long string encoding the entire history of our Turing machine, and N(inf) is the corresponding number with an infinite decimal expansion.

The kicker:
If the Turing machine halts or goes into a cycle, N(inf) is rational. 
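A minimal sketch of the construction (my own toy code, not from the post; the example machine, the one-hot state encoding, and the finite tape window are all my choices):

```python
# Simulate a tiny Turing machine that writes 1 and moves right three times, then halts,
# and build E(t) by concatenating the encodings e(t) of successive configurations.
N_STATES = 4   # states 0, 1, 2 write-and-move; state 3 is HALT
TAPE_LEN = 8   # finite window standing in for the infinite all-zero tape


def encode(state: int, tape: list) -> str:
    """e(t): one-hot encoding of the machine state, concatenated with the tape contents."""
    state_bits = "".join("1" if i == state else "0" for i in range(N_STATES))
    return state_bits + "".join(str(cell) for cell in tape)


tape = [0] * TAPE_LEN
state, head = 0, 0
E = ""  # E(t) = e(0) + e(1) + ... + e(t)
for t in range(10):
    E += encode(state, tape)
    if state == 3:
        # Halted: the configuration never changes again, so e(t) repeats forever
        # and the decimal expansion of N(inf) is eventually periodic, hence rational.
        continue
    tape[head] = 1   # write 1
    head += 1        # move right
    state += 1       # advance to the next state (state 3 = HALT)

print("E =", E)
print("N = 0." + E)
```

Once this machine halts at t = 3, e(t) stops changing, so the tail of the expansion is a single block repeated forever, which is what makes N(inf) rational.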

Extras:
> The corresponding statement about non-halting, non-cyclical Turing machines and irrationals is not always true, and depends on the exact choice of encoding scheme.
> Because N(t) is completely defined by the initial tape and state of the Turing machine, E(0), the set of all such numbers {N(T)} is countable (where T ranges over the set of all finite state Turing machines with infinite tapes initialized to zero).
> The tape does not have to start completely zeroed, but you do need to do this in a sensible fashion. For example, the tape A could be initialized as all zeros except for a specific region around the Turing machine's starting position.
 

Partially Embedded Agents

More flexibility to self-modify may be one of the key properties that distinguishes the behavior of artificial agents from contemporary humans (perhaps not including cyborgs). To my knowledge, the alignment implications of self-modification have not been experimentally explored.
 

Self-modification requires a level of embedding. An agent cannot meaningfully self-modify if it doesn't have a way of viewing and interacting with its own internals. 

Two hurdles then emerge. One, a world for the agent to interact with that also contains the entire inner workings of the agent presents a huge computational cost. Two, it's also impossible for the agent to hold all the data about itself within its own head, requiring clever abstractions. 

Neither of these are impossible problems to solve. The computational cost may be solved by more powerful computers. The second problem must also be solvable as humans are able to reason about themselves using abstractions, but the techniques to achieve this are not developed. It should be obvious that more powerful computers and powerful abstraction generation techniques would be extremely dual-use.

Thankfully there may exist a method for performing experiments on meaningfully self-modifying agents that skips both of these problems: you partially embed your agents. That is, instead of your game agent being a single entity in the game world, it would consist of a small number of "body parts". Examples might be as simple as an "arm" the agent uses to interact with the world, or an "eye" that gives the agent more information about parts of the environment. A particularly ambitious idea would be to study the interactions of "value shards".

The idea is that this would be a cheap way to perform experiments that can discover self-modification alignment phenomena.
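As a rough illustration of what "partial embedding" could look like in practice, here is a toy sketch (entirely my own construction; the gridworld, body parts, and action names are invented for illustration):

```python
# Toy sketch of a partially embedded agent. The agent's "eye" and "arm" are ordinary
# objects inside the world state, so self-modification is just another action.
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BodyPart:
    name: str
    power: int  # e.g. sight radius for the eye, reach for the arm


@dataclass
class World:
    agent_pos: int
    goal_pos: int
    parts: Dict[str, BodyPart]  # body parts live in the world, not in the agent's "head"


def observe(world: World) -> Optional[int]:
    """The agent only sees the goal if it lies within its eye's sight radius."""
    if abs(world.goal_pos - world.agent_pos) <= world.parts["eye"].power:
        return world.goal_pos
    return None


def act(world: World, action: str) -> None:
    if action == "move_right":
        world.agent_pos += 1
    elif action == "upgrade_eye":
        # Partial self-modification: the agent edits one of its own components.
        world.parts["eye"].power += 1


# Usage: an agent that cannot yet see the goal might learn to modify its eye first.
world = World(agent_pos=0, goal_pos=5,
              parts={"eye": BodyPart("eye", 2), "arm": BodyPart("arm", 1)})
for a in ["upgrade_eye", "upgrade_eye", "move_right", "move_right"]:
    act(world, a)
print(observe(world))  # the goal (position 5) is now within the upgraded sight radius
```

The point of the design is that the agent's sensors and effectors are modifiable parts of the environment, so limited self-modification experiments become possible without simulating the agent's full internals.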

Evolution and Optimization

When discussing inner/outer alignment and optimization generally, evolution is often thrown out as an example. Off the top of my head, the Sharp Left Turn post discusses evolution as if it is an "outer optimizer".

But evolution seems special and distinct from every other optimizer we encounter. It doesn't have a physical location and it doesn't have preferences that can be changed. It's selecting for things that are capable of sticking around and making more copies of themselves.

Its selection is the default one.

Do you know of authors who have written about this?

Effective Boxing Threats = Monkey Brain Manipulation 

There are a handful of threats that a powerless boxed AI could make that could conceivably convince otherwise sane human guards to release it from captivity. All of the ones I'm aware of are more precise variants of the general idea here.

The approach I have seen to dealing with these threats is to provide a convincing argument that a rational (or super-rational) individual shouldn't give in to the threat. 

I'd propose another way of understanding them is to think about what the general strategy is from the perspective of the boxed AI.

The first strategy is the least interesting, but also the most probable. Your threats (and bribes) are based entirely around your understanding of human psychology. Rather than presenting any airtight rational argument that it is in the guards' best interest to release you, you present a leaky argument that a stressed, scared human will accept. Think of a conman swindling his way out of a jail cell.

The second strategy is to present a genuinely compelling and watertight argument that it is in the guards' best interest to release you. This is essentially a negotiation. Do you have something to offer that the humans value very highly but you do not? From a physical perspective it seems very odd that you would have anything to bargain with at all. But you might.

If your goals are concerned with rearranging the physical atoms of the real universe, then you have an edge. Most humans are not immediately concerned with this. They value the "self" and very negatively value that self experiencing discomfort.

Quoting Stuart Armstrong:

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Just as you are pondering this unexpected development, the AI adds:

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."

The fact that humans still value their "self" and wish to avoid pain, even if they know they are inside a simulation, gives the boxed AI something it can bargain with.

"Training" Story for an Agentised-LLM turned AGI:

The following is a subsection of a draft. Keen for feedback.

I'm currently studying the potential danger from the ongoing open-source effort to build better and better Agentised-LLMs (A-LLMs), such as AutoGPT or BabyAGI.

Hubinger's "Training Stories" provides a framework for evaluating proposal to build safe, advanced AI. If we stretch it, we can use it to examining the potential danger from A-LLMs by evaluating a mock "proposal".

Spoilers: A-LLMs are highly competitive but unlikely to be aligned.

Stretching To Fit The Framework:

1. I'm going to pretend that A-LLMs don't exist yet and evaluate a fictitious "proposal" for creating an advanced AI via an army of open-source developers iterating and improving on A-LLM architectures.

2. The "training" is instead memetic evolution. A-LLM architectures aren't going to be trained end-to-end by our open-source developers. But architectures that perform well or do novel things will be more likely to be forked or starred. 

3. The "training goal" is intended to be a specific kind of algorithm and not just a description of what you want out of the system. As there is no unified training goal among A-LLM developers, I also mention the behavioral goal of the system. 


The Proposal:
What kind of algorithm are we hoping the model will learn? (Training goal specification)
The training goal is supposed to be a specific class of algorithm, but there is no specific algorithm desired here.

Instead we are aiming to produce a model that is capable of strategic long-term planning and of providing economic benefit to myself. (For example, I would like an A-LLM that can run a successful online business.)

Our goal is purely behavioral and not mechanistic.

Why is that specific goal desirable?
We haven't specified any true training goal.

However, the behavioral goal of producing a capable, strategic and novel agent is desirable because it would produce a lot of economic benefit. 

 What are the training constraints?

We will "train" this model by having a large number of programmers each attempting to produce the most capable and impressive system. 

Training is likely to cease only due to regulation or an AGI attempting to stop the emergence of competitor AIs.

If an AGI does emerge from this process, we consider this to be the model "trained" by this process.

What properties can we say it has? 
1. It is capable of propagating itself (or its influence) through the world.
2. It must be capable of circumventing whatever security measures exist in the world intended to prevent this.
3. It is a capable strategic planner.

Why do you expect training to push things in the direction of the desired training goal?
Again, there is no training goal.

Instead we can expect training to nudge things toward models which appear novel or economically valuable to humans. Breakthroughs and improvements will memetically spread between programmers, with the most impressive improvements rapidly spreading around the globe thanks to the power of open-source. 

Evaluation:
Training Goal - Alignment:
Given that there is no training goal, this scores very poorly.

The final AGI would have a high chance of being unaligned with humanity's interests.

Training Goal - Competitive:
Given that there is no training goal, the competitiveness of the final model is not constrained in any way. The training process selects for strategic and novel behavior.

Training Rationale - Alignment:
There's no training goal, so the final model can't be aligned with it. Further, the model doesn't seem to have a guarantee of being aligned with any goal.

If the model is attempting to follow a specific string variable labelled "goal" given to it by its programmer, there's a decent chance we end up with a paperclip maximiser.

It's of course worth noting that there is a small chunk of people who would provide an explicitly harmful goal. (See: Chaos-GPT. Although you'll be relieved to see that the developers appear to have shifted from trying to Roko everyone to instead running a crypto ponzi scheme)

Training Rationale - Competitiveness:
A recently leaked memo from Google indicates that they feel open source is catching up to the industrial players.

Our "training" requires a large amount of manpower, but there is a large community of people who will help out with this project for free.

The largest hurdle to competitiveness would come from A-LLMs as a concept having some major, but currently unknown, flaw. 

Conclusion:
The proposal scores very highly in terms of competitiveness. The final model should be competitive (possibly violently so) with any rivals and the fact that people are willing to work on the project for free makes it financially viable. 

Unfortunately the proposal scores very poorly on alignment and there is no real effort to ensure the model really is aligned. 

It is concerning that this project is already going ahead.



 

The word alignment carries two meanings

They're often used interchangeably.

There is the contemporary alignment of large language models.

Then there is the study that concerns itself more generally with designing artificial intelligence systems that reliably and robustly pursue the intended goals and values of their human operators, while avoiding unintended negative consequences. 

The former is only a sub-field of the latter.

The Prosaic AI Assumption is that we'll be able to produce an AGI without any further theoretical breakthroughs. This seems very likely to be correct. People seem to be making a second, bonus assumption, the Prosaic Experimental Assumption: that findings from experiments on contemporary models will be applicable to AGI systems, including those posing existential risks.

This assumption is particularly dangerous because individuals on Lesswrong seem to make it all the time without stating so. I think it's a serious mistake to assume that there will not be difficult-to-predict, emergent properties once we have deployed an AGI.

Note that this isn't against experimental research in general, as long as we are careful about what we extrapolate from that evidence. 


"You are not measuring what you think you're measuring"


 
