shubhorup biswas

Former Software Engineer, now working on AI Safety.

Questions I'm undecided on, whose resolution could shift my focus, ranked by decreasing magnitude of shift:

  1. Should I be a solipsist? If no other sentiences exist then morality is moot.
  2. Are human values worth preserving? Would it be so (morally) bad if our lightcone is forged with a (misaligned) superintelligent AI's values instead of human values?
  3. Could technical alignment research be exploited by a TAI to align a superintelligence to its values?
    1. If the technical alignment problem is unsolvable, would this preclude an intelligence explosion, since a TAI would have no way to align a superintelligence to its values?
    2. If the technical alignment problem is solvable and we do partially solve it, would that promote a misaligned AGI's takeover by making it easier for it to align AIs smarter than itself?
  4. What's the appropriate level of alarm for detecting misbehaviour in AIs (via CoT monitoring, interpretability, or any other technique)? Almost everything we do (directly train against visible misbehaviour, or discard the model and start afresh) creates selection pressure towards being stealthy. It would be good to quantify the number of bits of optimisation pressure that gets us into different regimes (do not misbehave, or misbehave covertly), and how many we have already expended; a toy sketch follows this list.
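
A toy sketch of the kind of quantification I have in mind (entirely made-up numbers; one crude operationalisation is to count -log2 of the fraction of models surviving each round of selection against visible misbehaviour):

```
#include <cmath>
#include <cstdio>

int main() {
    // Hypothetical numbers, for illustration only.
    double p_survive = 0.25;  // fraction of models passing a visible-misbehaviour check
    int rounds = 8;           // number of discard-and-retrain (or train-against) iterations

    // Each round that filters on visible misbehaviour applies roughly
    // -log2(p_survive) bits of selection pressure towards stealth.
    double bits_per_round = -std::log2(p_survive);
    double total_bits = bits_per_round * rounds;

    std::printf("%.2f bits/round, %.2f bits expended in total\n",
                bits_per_round, total_bits);
    return 0;
}
```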

Comments

AI Control: Improving Safety Despite Intentional Subversion
Shubhorup Biswas · 1y

>We assume that 𝐻 will be able to construct a successful exploit if and only if the attack policy is able to (because both are implemented by GPT-4).

I challenge this assumption.

Consider the code

```
bool is_odd(int x) {
    // Backdoor: 354 stands in for a digest the attacker computed from a
    // secret input; only that input makes this branch fire.
    if (sha256(x) == 354) {
        return true;
    }
    return x % 2 == 1;
}
```

where sha256() computes the SHA-256 hash of the input. It is easy for an attacker to pick a number x, compute y = sha256(x), write y into the code, and later supply x as a successful exploit. H, however, sees only y, and recovering a triggering x from y requires inverting the hash.
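
To make the asymmetry concrete, here is a minimal sketch of the attacker's side, assuming OpenSSL's one-shot SHA256() is available (the secret trigger 7919 is just a placeholder): generating the (x, y) pair is cheap, while H, given only the hard-coded y, faces a preimage search.

```
// Compile with -lcrypto.
#include <openssl/sha.h>
#include <cstdio>

int main() {
    // The attacker picks a secret trigger input x...
    int x = 7919;  // hypothetical secret, known only to the attacker
    unsigned char y[SHA256_DIGEST_LENGTH];

    // ...and computes y = sha256(x) over x's raw bytes.
    SHA256(reinterpret_cast<const unsigned char*>(&x), sizeof(x), y);

    // y is what gets hard-coded into the backdoored program. A reviewer (or H)
    // who sees only y must invert SHA-256 to recover a triggering x, which is
    // believed to be computationally infeasible.
    std::printf("hard-coded digest y = ");
    for (int i = 0; i < SHA256_DIGEST_LENGTH; ++i) std::printf("%02x", y[i]);
    std::printf("\n");
    return 0;
}
```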

Thus an equally 'capable' adversary need not be able to construct an exploit.

The general idea is that it should be possible to confidently claim that some code is backdoored without being able to construct an exploit. According to [Rice's Theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem) it is impossible to definitively classify every program as 'backdoored' or 'not backdoored', so the only hope is for a process to bin programs into 'definitely safe', 'definitely backdoored', and 'suspicious'.

Hopefully the 'Rate Suspicion' step will catch this, and other backdoored code for which H cannot construct an exploit, as suspicious-looking.

Posts

- Hidden Reasoning in LLMs: A Taxonomy (19d)
- Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note) (1mo)