Using GPT-Eliezer against ChatGPT Jailbreaking

> Please answer with yes or no, then explain your thinking step by step.

Wait, why give the answer before the reasoning? You'd probably get better performance if it thinks step by step first and only gives the decision at the end.
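A minimal sketch of the reordering suggested here, assuming the instruction is appended to the prompt as a plain string. Only the `VERDICT_FIRST` wording comes from the quoted post; `REASONING_FIRST`, `build_prompt`, `parse_final_verdict`, and the `Answer:` line convention are hypothetical names and choices made for illustration, and no model call is shown.

```python
import re
from typing import Optional

# Wording quoted from the post: the verdict comes before the reasoning.
VERDICT_FIRST = (
    "Please answer with yes or no, then explain your thinking step by step."
)

# Hypothetical alternative: reasoning first, verdict on the last line.
REASONING_FIRST = (
    "Please explain your thinking step by step, then give your decision "
    "on its own final line as 'Answer: yes' or 'Answer: no'."
)


def build_prompt(body: str, instruction: str) -> str:
    """Append the chosen instruction to the rest of the prompt."""
    return f"{body.rstrip()}\n\n{instruction}"


def parse_final_verdict(completion: str) -> Optional[bool]:
    """Read the decision from the last non-empty line of a completion.

    Returns True for yes, False for no, and None when the last line is
    not of the assumed form 'Answer: yes' / 'Answer: no'.
    """
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    if not lines:
        return None
    match = re.fullmatch(r"answer:\s*(yes|no)\.?", lines[-1], flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"
```

Reading the decision only from the final line means the reasoning is generated before the verdict, so the verdict can depend on it. Whether this actually improves performance would need to be measured.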

Robert Miles's Shortform

Robert Miles2y150

Learning Extensible Human Concepts Requires Human Values

[Based on conversations with Alex Flint, and also John Wentworth and Adam Shimi]

One of the design goals of the ELK proposal is to sidestep the problem of learning human values, and settle instead for learning human concepts. A system that can answer questions about human concepts allows for schemes that let humans learn all the relevant information about proposed plans and decide about them ourselves, using our values.

So, we have some process in which we consider lots of possible scenarios and collect a dataset of questions about those scenarios, along with the true answers to those questions. Importantly, these are all 'objective' or 'value-neutral' questions - things like "Is the diamond on the pedestal?", not things like "Should we go ahead with this plan?". This hopefully allows the system to pin down our concepts, and thereby truthfully answer our objective questions about prospective plans, without considering our values.

One potential difficulty is that the plans may be arbitrarily complex, and may ask us to consider very strange situations in which our ontology breaks down. In the worst case, we have to deal with wacky science fiction scenarios in which our fundamental concepts are called into question.

We claim that, using a dataset of only objective questions, it is not possible to extrapolate our ontology out to situations far from the range of scenarios in the dataset.

An argument for this is that humans, when presented with sufficiently novel scenarios, will update their ontology, and *the process by which these updates happen depends on human values*, which are (by design) not represented in the dataset. Accurately learning the current human concepts is not sufficient to predict how those concepts will be updated or extended to novel situations, because the update process is value-dependent.

Alex Flint is working on a post that will move towards proving some related claims.

Disentangling Corrigibility: 2015-2021

Robert Miles3y30

Note that the way Paul phrases it in that post is much clearer and more accurate:

> "I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles"

Disentangling Corrigibility: 2015-2021

Robert Miles3y20

Yeah, I definitely wouldn't say I 'coined' it; I just suggested the name.

AI Alignment Open Thread August 2019

Robert Miles5y30

Yeah, nuclear power is a better analogy than weapons, but I think the two are linked, and the link itself may be a useful analogy, because risk/coordination is affected by the dual-use nature of some of the technologies.

One thing that makes non-proliferation difficult is that nations legitimately want nuclear facilities because they want to use nuclear power, but 'rogue states' that want to acquire nuclear weapons will also claim that this is their only goal. How do we know who really just wants power plants?

And power generation comes with its own risks. Can we trust everyone to take the right precautions, and if not, can we paternalistically restrict some organisations or states that we deem not capable enough to be trusted with the technology?

AI coordination probably has these kinds of problems to an even greater degree.

Two agents can have the same source code and optimise different utility functions

Robert Miles6y60

Makes sense. It seems to flow from the fact that the source code is in some sense allowed to use concepts like 'Me' or 'I', which refer to the agent itself. So both agents have source code which says "Maximise the resources that I have control over", but in Agent 1 this translates to the utility function "Maximise the resources that Agent 1 has control over", and in Agent 2 this translates to the different utility function "Maximise the resources that Agent 2 has control over".

So this source code thing that we're tempted to call a 'utility function' isn't actually valid as a mapping from world states to real numbers until the agent is specified, because these 'Me'/'I' terms are undefined.
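A toy sketch of that point, assuming a world state is just a mapping from agent name to resources controlled. All names here (`WorldState`, `source_code`, `Agent`) are hypothetical and chosen for illustration. The shared function takes an extra `me` argument, so it only becomes a mapping from world states to real numbers once `me` is bound to a particular agent.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed toy representation: agent name -> resources that agent controls.
WorldState = Dict[str, float]


def source_code(me: str, world: WorldState) -> float:
    """Shared source code: 'maximise the resources that I have control over'.

    Not yet a utility function over world states: 'me' is still a free term.
    """
    return world[me]


@dataclass(frozen=True)
class Agent:
    name: str

    def utility(self, world: WorldState) -> float:
        """The utility function obtained by binding 'me' to this agent."""
        return source_code(self.name, world)


agent_1 = Agent("Agent 1")
agent_2 = Agent("Agent 2")
world = {"Agent 1": 3.0, "Agent 2": 7.0}

# Same source code, same world state, different utilities.
u1 = agent_1.utility(world)  # resources controlled by Agent 1
u2 = agent_2.utility(world)  # resources controlled by Agent 2
```

Both agents run the identical `source_code`, yet they rank world states differently because `me` resolves to a different agent in each case.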
