New(ish) AI control ideas
by Stuart_Armstrong
31st Oct 2017

List of LinksAI
Personal Blog
New(ish) AI control ideas
0IAFF-User-111
New Comment
1 comment, sorted by
top scoring
Click to highlight new comments since: Today at 9:07 PM
[-]IAFF-User-11110y00

Thanks! I love having central repos.

A quick question / comment, RE: "I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections."

Q: What do you mean (or have in mind) in terms of "turning [...] objections"? I'm not very familiar with the phrase.

Comment: One trend I see is that technical safety proposals are often dismissed by appealing to one of the 7 responses you've given. Recently I've been thinking that we should be a bit less focused on finding airtight solutions, and more focused on thinking about which proposed techniques could be applied in various scenarios to significantly reduce risk. For example, boxing an agent (e.g. by limiting its sensors/actuators) might significantly increase how long it takes to escape.
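
To make the boxing idea concrete, here is a minimal sketch, assuming a gym-style step/reset interface; the BoxedEnv wrapper, the "noop" fallback action, and the observation keys are hypothetical, purely illustrative:

```python
# Minimal sketch of "boxing" an agent by limiting its sensors/actuators.
# Assumes a gym-style env with dict observations; all names here
# (BoxedEnv, "noop", the whitelists) are hypothetical illustrations.

class BoxedEnv:
    """Wrap an environment so the agent only sees whitelisted
    observation keys (limited sensors) and can only execute
    whitelisted actions (limited actuators)."""

    def __init__(self, env, allowed_actions, visible_keys):
        self.env = env
        self.allowed_actions = set(allowed_actions)
        self.visible_keys = set(visible_keys)

    def _mask(self, obs):
        # Limited sensors: strip everything outside the whitelist.
        return {k: v for k, v in obs.items() if k in self.visible_keys}

    def reset(self):
        return self._mask(self.env.reset())

    def step(self, action):
        # Limited actuators: disallowed commands become a no-op.
        if action not in self.allowed_actions:
            action = "noop"
        obs, reward, done, info = self.env.step(action)
        return self._mask(obs), reward, done, info
```

This doesn't make escape impossible, but every channel left off the whitelist is one less route out, which is the sense in which it buys time.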

The list of posts is getting unwieldy, so I'll post the up-to-date stuff at the beginning:

Human inconsistencies:

  • Bias in rationality is much worse than noise
  • Learning values, or defining them?
  • Rationality and overriding rewards: combined model
  • Humans can be assigned any values whatsoever...
  • Resolving human inconsistency in a simple model
  • Short term vs long term preferences

Reward function learning:

  • Reward function learning
  • Biased reward learning
  • Counterfactuals on POMDPs
  • Uninfluenceable agents
  • Counterfactually uninfluenceable agents
  • An algorithm with preferences: from zero to one variable
  • New circumstances, new values?

Understanding humans:

  • Humans as truth channel
  • True understanding comes from passing exams
  • Understanding the important facts

Framework:

  • Three human problems and one AI issue

Acausal trade:

  • Introduction
  • Double decrease
  • Pre-existence deals
  • Full decision algorithms
  • Breaking acausal trade
  • Trade in different types of utility functions
  • Being unusual
  • Conclusion

Oracle designs:

  • Three Oracle designs

Extracting human values:

  • Divergent preferences and meta-preferences
  • Engineered fanatics vs yes-men
  • Learning doesn't solve philosophy of ethics
  • Models of human irrationality
  • Heroin model: AI "manipulates" "unmanipulatable" reward
  • Stratified learning and action
  • (C)IRL is not solely a learning process
  • Learning (meta-)preferences
  • What does an imperfect agent want?
  • Window problem for manipulating human values
  • Abstract model of human bias
  • Paul Christiano's post

Corrigibility:

  • Corrigibility thoughts I: caring about multiple things
  • Corrigibility thoughts II: the robot operator
  • Corrigibility thoughts III: manipulating versus deceiving
  • Corrigibility through stratification
  • Cake or Death toy model for corrigibility
  • Learning values versus indifference
  • Conservation of expected ethics/evil (isn't enough)
  • Guarded learning

Indifference:

  • Translation "counterfactual"
  • Indifference and compensatory rewards
  • All the indifference designs
  • The "best" value indifference method
  • Double indifference
  • Corrigibility for AIXI
  • Indifference utility functions

AIs in virtual worlds:

  • AIs in virtual worlds
  • The alternate hypothesis for AIs in virtual worlds
  • AIs in virtual worlds: discounted mixed utility/reward
  • Simpler, cruder, virtual world AIs

True answers from AI:

  • True answers from AIs
  • What to do with very low probabilities
  • Summary of true answers from AIs
  • AI's printing the expected utility of the utility it's maximising
  • Low impact vs low side effects

Miscellanea:

  • Confirmed Selective Oracle
  • One weird trick to turn maximisers into minimisers
  • The overfitting utility problem for value learning AIs
  • Change utility, reduce extortion
  • Agents that don't become maximisers
  • Emergency learning
  • Thoughts on quantilizers
  • The radioactive burrito and learning from positive examples
  • Ontology, lost purposes, and instrumental goals
  • How to judge moral learning failure

Migrating my old post over from Less Wrong.

I recently went on a two day intense solitary "AI control retreat", with the aim of generating new ideas for making safe AI. The "retreat" format wasn't really a success ("focused uninterrupted thought" was the main gain, not "two days of solitude" - it would have been more effective in three hour sessions), but I did manage to generate a lot of new ideas. These ideas will now go before the baying bloodthirsty audience (that's you, folks) to test them for viability.

A central thread running through all of this could be: if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach.

To provide inspiration and direction to my thought process, I first listed all the easy responses that we generally give to most proposals for AI control. If someone comes up with a new/old brilliant idea for AI control, it can normally be dismissed by appealing to one of these responses:

  1. The AI is much smarter than us.
  2. It's not well defined.
  3. The setup can be hacked.
     • By the agent.
     • By outsiders, including other AI.
     • Adding restrictions encourages the AI to hack them, not obey them.
  4. The agent will resist changes.
  5. Humans can be manipulated, hacked, or seduced.
  6. The design is not stable.
     • Under self-modification.
     • Under subagent creation.
     • Unrestricted search is dangerous.
  7. The agent has, or will develop, dangerous goals.

Important background ideas:

  • Utility Indifference
  • Safe value change
  • Corrigibility
  • Reduced impact AI

I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections. A key concept is that we should never just expect a system to behave "nicely" by default (see e.g. here). If we want that, we have to define what "nicely" means and put it in by hand.
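
As a loose sketch of what "putting it in by hand" means (the task_reward and niceness_penalty functions below are invented placeholders, not anything from the posts above):

```python
# Minimal sketch: nothing in task_reward supplies "nice" behaviour for
# free; "nicely" only enters the objective because we defined a penalty
# and added it by hand. All names and numbers here are illustrative.

def task_reward(state, action):
    """Reward for the task we actually asked for (hypothetical)."""
    return 1.0 if action == "deliver_coffee" and state.get("coffee_ready") else 0.0

def niceness_penalty(state, action):
    """Hand-coded penalty for side effects we explicitly chose to care about."""
    penalty = 0.0
    if action == "knock_over_vase":
        penalty += 10.0
    if state.get("human_blocked"):
        penalty += 5.0
    return penalty

def combined_reward(state, action, weight=1.0):
    # The agent optimising this is only as "nice" as the penalty we wrote.
    return task_reward(state, action) - weight * niceness_penalty(state, action)
```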

I came up with sixteen main ideas, of varying usefulness and quality, which I will be posting in comments over the coming weekdays (the following links will go live after each post). The ones I feel are most important (or most developed) are:

  • Anti-pascaline agent
  • Anti-restriction-hacking (EDIT: I have big doubts about this approach, currently)
  • Creating a satisficer (EDIT: I have big doubts about this approach, currently)
  • Crude measures
  • False miracles
  • Intelligence modules
  • Models as definitions
  • Added: Utility vs Probability: idea synthesis

While the less important or developed ideas are:

  • Added: A counterfactual and hypothetical note on AI design
  • Added: Acausal trade barriers
  • Anti-seduction
  • Closest stable alternative
  • Consistent Plato
  • Defining a proper satisficer
  • Detecting subagents
  • Added: Humans get different counterfactual
  • Added: Indifferent vs false-friendly AIs
  • Resource gathering and pre-corrigied agent
  • Time-symmetric discount rate
  • Values at compile time
  • What I mean

Please let me know your impressions on any of these! The ideas are roughly related to each other (where an arrow Y→X can mean "X depends on Y", "Y is useful for X", "X complements Y on this problem", or even "Y inspires X").

EDIT: I've decided to use this post as a sort of central repository of my new ideas on AI control. So adding the following links:

Short tricks:

  • Un-optimised vs anti-optimised
  • Anti-Pascaline satisficer
  • An Oracle standard trick

High-impact from low impact:

  • High impact from low impact
  • High impact from low impact, continued
  • Help needed: nice AIs and presidential deaths
  • The president didn't die: failures at extending AI behaviour
  • Green Emeralds, Grue Diamonds
  • Grue, Bleen, and natural categories
  • Presidents, asteroids, natural categories, and reduced impact

High impact from low impact, best advice:

  • The AI as "best" human advisor
  • Using chatbots or set answers

Overall meta-thoughts:

  • An overall schema for the friendly AI problems: self-referential convergence criteria
  • The subagent problem is really hard
  • Tackling the subagent problem: preliminary analysis

Pareto-improvements to corrigible agents:

  • Predicted corrigibility: pareto improvements

AIs in virtual worlds:

  • Using the AI's output in virtual worlds: cure a fake cancer
  • Having an AI model itself as virtual agent in a virtual world
  • How the virtual AI controls itself

Low importance AIs:

  • Counterfactual agents detecting agent's influence

Wireheading:

  • Superintelligence and wireheading

AI honesty and testing:

  • Question an AI to get an honest answer
  • The Ultimate Testing Grounds
  • The mathematics of the testing grounds
  • Utility, probability, and false beliefs

Goal completion:

  • Extending the stated objectives
  • Goal completion: the rocket equations
  • Goal completion: algorithm ideas
  • Goal completion: noise, errors, bias, prejudice, preference and complexity