Ofer Givoli

Send me anonymous feedback: https://docs.google.com/forms/d/e/1FAIpQLScLKiFJbQiuRYBhrBbVYUo_c6Xf0f8DN_blbfpJ-2Ml39g1zA/viewform

Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.

Some quick info about me:

I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).

You can also find me on the EA Forum.

Feel free to reach out by sending me a PM here or on my website.




[AN #108]: Why we should scrutinize arguments for AI risk

Thank you for clarifying!

Like, why didn't the earlier less intelligent versions of the system fail in some non-catastrophic way

Even if we assume there will be no algorithm-related discontinuity, I think the following are potential reasons:

  1. Detecting deceptive behaviors in complicated environments may be hard. To continue with the Facebook example, suppose that at some point in the future Facebook's feed-creation-agent would behave deceptively in some non-catastrophic way. Suppose it uses some unacceptable technique to increase user engagement (e.g. making users depressed), but it refrains from doing so in situations where it predicts that Facebook engineers would notice. The agent is not that great at being deceptive though, and a lot of times it ends up using the unacceptable technique when there's actually a high risk of the technique being noticed. Thus, Facebook engineers do notice the unacceptable technique at some point and fix the reward function accordingly (penalizing depressing content or whatever). But how will they detect the deceptive behavior itself? Will they be on the lookout for deceptive behavior and use clever techniques to detect it? (If so, what made Facebook transition into a company that takes AI safety seriously?)

  2. Huge scale-ups without much intermediate testing. Suppose at some point in the future, Facebook decides to scale up the model and training process of their feed-creation-agent by 100x (by assumption, data is not the bottleneck). It seems to me that this new agent may pose an existential risk even conditioned on the previous agent being completely benign. If you think that Facebook is unlikely to do a 100x scale-up in one go, suppose that their leadership comes to believe that the scale-up would cause their revenue to increase in expectation by 10%. That's ~$7B per year, so they are probably willing to spend a lot of money on the scale-up. Also, they may want to complete the scale-up ASAP because they "lose" $134M for every week of delay.

[AN #108]: Why we should scrutinize arguments for AI risk

Rohin's opinion: [...] Overall I agree pretty strongly with Ben. I do think that some of the counterarguments are coming from a different frame than the classic arguments. For example, a lot of the counterarguments involve an attempt to generalize from current ML practice to make claims about future AI systems. However, I usually imagine that the classic arguments are basically ignoring current ML, and instead claiming that if an AI system is superintelligent, then it must be goal-directed and have convergent instrumental subgoals.

I agree that the book Superintelligence does not mention any non-goal-directed approaches to AI alignment (as far as I can recall). But as long as we're in the business-as-usual state, we should expect some well-resourced companies to train competitive goal-directed agents that act in the real world, right? (E.g. Facebook plausibly uses some deep RL approach to create the feed that each user sees). Do you agree that for those systems, the classic arguments about instrumental convergence and the treacherous turn are correct? (If so, I don't understand how come you agree pretty strongly with Ben; he seems to be skeptical that those arguments can be mapped to contemporary ML methods.)

What specific dangers arise when asking GPT-N to write an Alignment Forum post?

It may be the case that solving inner alignment problems means hitting a narrow target; meaning that if we naively carry out a super-large-scale training process that spits out a huge AGI-level NN, dangerous logic is very likely to arise somewhere in the NN at some point during training. Since this concern doesn't point at any specific-type-of-dangerous-logic I guess it's not what you're after in this post; but I wouldn't classify it as part of the threat model that "we don't know what we don't know".

Having said all that, here's an attempt at describing a specific scenario as requested:

Suppose we finally train our AGI-level GPT-N and we think that the distribution it learned is "the human writing distribution", HWD for short. HWD is a distribution that roughly corresponds to our credences when answering questions like "which of these two strings is more likely to have appeared on the internet prior to 2020-07-28?". But unbeknownst to us, the inductive bias of our training process made GPT-N learn the distribution HWD*, which is just like HWD except that some fraction of [the strings with a prefix that looks like "a prompt by humans-trying-to-automate-AI-safety"] are manipulative, and cause AI safety researchers who read them to invoke an AGI with a goal system X. It turns out that the inductive bias of our training process caused GPT-N to model agents-with-goal-system-X, and such agents tend to sample lots of strings from the HWD* distribution in order to "steal" the cosmic endowment of reckless civilizations like ours. This is the same type of failure mode as the universal prior problem.

Alignment As A Bottleneck To Usefulness Of GPT-3

There are infinitely many distributions from which the training data of GPT could have been sampled [EDIT: including ones that could be catastrophic as the distribution our AGI learns], so it's worth mentioning an additional challenge on this route: making the future AGI-level-GPT learn the "human writing distribution" that we have in mind.

Learning the prior

I'm confused about this point. My understanding is that, if we sample iid examples from some dataset and then naively train a neural network with them, in the limit we may run into universal prior problems, even during training (e.g. an inference execution that leverages some software vulnerability in the computer that runs the training process).

[AN #105]: The economic trajectory of humanity, and what we might mean by optimization

Claims of the form “neural nets are fundamentally incapable of X” are almost always false: recurrent neural nets are Turing-complete, and so can encode arbitrary computation.

I think RNNs are not Turing-complete (assuming the activations and weights can be represented by a finite number of bits). Models with finite state space (reading from an infinite input stream) can't simulate a Turing machine.

(Though I share the background intuition.)
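To illustrate the finite-state-space point, here is a toy sketch (entirely my own; the one-unit architecture, quantization scheme, and weights are invented). With finite-precision activations the hidden state can take only finitely many values, so by the pigeonhole principle the state must eventually repeat on any input stream:

```python
import itertools

def quantize(x, bits=4):
    # Round an activation to a fixed grid and clamp to [-1, 1]:
    # finite precision means a finite set of representable states.
    scale = 2 ** bits
    return max(-1.0, min(1.0, round(x * scale) / scale))

def rnn_step(h, x, w_h=0.9, w_x=0.5, bits=4):
    # One step of a toy one-unit RNN with invented weights.
    return quantize(w_h * h + w_x * x, bits)

# Feed a constant input stream; with 4 bits the hidden state has at most
# 2 * 2**4 + 1 = 33 possible values, so a state must repeat (pigeonhole).
h, seen = 0.0, set()
for t in itertools.count():
    if h in seen:
        break
    seen.add(h)
    h = rnn_step(h, 1.0)
print(f"state repeated after {t} steps; visited {len(seen)} distinct states")
```

With b bits of precision per unit and n hidden units there are at most on the order of 2^(b·n) reachable configurations, so on a fixed input stream the network's behavior is eventually periodic; a Turing machine's unbounded tape has no such bound.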

[This comment is no longer endorsed by its author]
AI safety via market making

Interesting idea.

Suppose that in the first time step, Adv is able to output a string x that manipulates H into: (1) giving a probability that is maximally different from M's prediction; and (2) not looking at the rest of Adv's output (i.e. the human will never see the subsequent strings).

Ignoring inner alignment problems, in the limit it seems plausible that Adv will output such an x; resulting in H reporting the manipulated probability, and M converging to it.

[EDIT: actually, such problems are not specific to this idea and seem to generally apply to the 'AI safety via debate' approach.]

OpenAI announces GPT-3

As abergal wrote, not carrying the "1" can simply mean it does digit-wise addition (which seems trivial via memorization). But notice that just before that quote they also write:

To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized.

That seems like evidence against memorization, but maybe their simple search failed to find most cases with some relevant training signal, e.g.: "In this diet you get 350 calories during breakfast: 200 calories from X and 150 calories from Y."
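For concreteness, the spot-check described in the quote amounts to a literal pattern search, roughly like this sketch (the corpus snippet and the numbers are invented); a search of this kind misses text that encodes the same arithmetic fact in a different surface form:

```python
import re

# Invented corpus snippet: it never states "200 + 150 =" or "200 plus 150",
# yet it implicitly contains the fact 200 + 150 = 350.
corpus = ("In this diet you get 350 calories during breakfast: "
          "200 calories from X and 150 calories from Y.")
a, b = 200, 150

# The two surface forms the GPT-3 paper says it searched for.
patterns = [rf"{a}\s*\+\s*{b}\s*=", rf"{a}\s+plus\s+{b}"]
hits = [p for p in patterns if re.search(p, corpus)]
print(hits)  # [] -- the literal forms are absent despite the implicit signal
```

So a zero-match result rules out verbatim memorization of the test problems, but not training signal from paraphrased arithmetic.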

Databases of human behaviour and preferences?

Maybe Minecraft-related datasets can be helpful. I'm not familiar with them myself, but I found these two:

CraftAssist: A Framework for Dialogue-enabled Interactive Agents

MineRL: A Large-Scale Dataset of Minecraft Demonstrations

Three Kinds of Competitiveness

Good point about inner alignment problems being a blocker to date-competitiveness for IDA... but aren't they also a blocker to date-competitiveness for every other alignment scheme too pretty much?

I think every alignment approach (other than interpretability-as-a-standalone-approach) that involves contemporary ML (i.e. training large neural networks) may have its date-competitiveness affected by inner alignment.

What alignment schemes don't suffer from this problem?

Most alignment approaches may have their date-competitiveness affected by inner alignment. (It seems theoretically possible to use whole brain emulation without inner alignment related risks, but as you mentioned elsewhere someone may build a neuromorphic AGI before we get there.)

I'm thinking "Do anything useful that a human with a lot of time can do" is going to be substantially less capable than full-blown superintelligent AGI.

I agree. Even a "narrow AI" system that is just very good at predicting stock prices may outperform "a human with a lot of time" (by leveraging very-hard-to-find causal relations).

Instead of saying we should expect IDA to be performance-competitive, I should have said something like the following: If at some point in the future we get to a situation where trillions of safe AGI systems are deployed—and each system can "only" do anything that a human-with-a-lot-of-time can do—and we manage to not catastrophically screw up until that point, I think humanity will probably be out of the woods. (All of humanity's regular problems will probably get resolved very quickly, including the lack of coordination.)
