Quick Takes

I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices) and praises a rationalist-friendly quantitative approach. Here are four takeaways from that book:

  • There are very different approaches to risk estimation whose practitioners are often unaware of each other: you can estimate risk like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (r
... (read more)
romeostevensit (3d)
Is there a short summary on the rejecting Knightian uncertainty bit?

By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence" i.e. you can't put a probability on it (Wikipedia).

The TL;DR is that Knightian uncertainty is not a useful concept for making decisions, while the use of subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off taking different decisions on p=1% "Knightian uncertain events" and p=10% "Knightian uncertain events". 
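As a toy illustration of why calibrated probabilities are decision-relevant even under "Knightian" conditions (the dollar figures below are made up for illustration, not from the book): a calibrated 1% estimate and a calibrated 10% estimate lead to different choices, whereas "no probability can be assigned" gives no guidance at all.

```python
# Toy expected-value comparison; illustrative numbers only.
incident_cost = 10_000_000   # loss if the uncertain event happens
mitigation_cost = 300_000    # cost of preventing it

for p in (0.01, 0.10):       # two calibrated subjective probabilities
    expected_loss = p * incident_cost
    decision = "mitigate" if expected_loss > mitigation_cost else "accept the risk"
    print(f"p={p:.0%}: expected loss ${expected_loss:,.0f} -> {decision}")
```

With these numbers, p=1% implies accepting the risk and p=10% implies mitigating, so the two "Knightian uncertain" cases genuinely call for different decisions.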

For a more in-depth defense of this position in the context of long-term prediction... (read more)

I'm worried that "pause all AI development" is like the "defund the police" of the alignment community. I'm not convinced it's net bad because I haven't been following governance-- my current guess is neutral-- but I do see these similarities:

  • It's incredibly difficult and incentive-incompatible with existing groups in power
  • There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
  • There are some obvious negative effects; potential overhangs or greater inc
... (read more)

I now think the majority of impact of AI pause advocacy will come from the radical flank effect, and people should study it to decide whether pause advocacy is good or bad.

Lauro Langosco (10mo)
IMO making the field of alignment 10x larger or requiring evals does not solve a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).
Thomas Kwa (1y)
This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work. Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to the nuclear or aviation industry feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.

Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.

In the following, all infradistributions are crisp.

Fix a finite action set A and a finite observation set O. For any … and …, let

… be defined by …

In other words, this kernel samples a time step out of the geometric distribution with parameter... (read more)

Vanessa Kosoy (24d)
Formalizing the richness of mathematics

Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Gödel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it. Here is my proposal for how to formulate a theorem that would make this idea rigorous.

(Wrong) First Attempt

Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis Θ represents an infradistribution over Γ: the "space of counterpossible computational universes". We can say that Θ is a "true hypothesis" when there is some θ in the credal set Θ (a distribution over Γ) s.t. the ground truth Υ∗∈Γ "looks" as if it's sampled from θ. The latter should be formalizable via something like a computationally bounded version of Martin-Löf randomness. We can now try to say that Υ∗ is "rich" if for any true hypothesis Θ, there is a refinement Ξ⊆Θ which is also a true hypothesis and "knows" at least one bit of information that Θ doesn't, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes Υ∗. But it's also completely boring: the required Ξ can be constructed by "hardcoding" an additional fact into Θ. This doesn't look like "discovering interesting structure", but rather just like brute-force memorization.

(Wrong) Second Attempt

What if instead we require that Ξ knows infinitely many bits of information that Θ doesn't? This is already more interesting. Imagine that instead of metacognition / mathematics, we were talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be exp
Vanessa Kosoy (8mo)
Recording of a talk I gave at VAISU 2023.

I just finished listening to The Hacker and the State by Ben Buchanan, a book about cyberattacks and the surrounding geopolitics. It's a great book to start learning about the big state-related cyberattacks of the last two decades. Some big attacks/leaks he describes in detail:

  • Wire-tapping/passive listening efforts from the NSA, the "Five Eyes", and other countries
  • The multi-layer backdoors the NSA implanted and used to get around encryption, and that other attackers eventually also used (the insecure "secure random number" trick + some stuff on top of t
... (read more)

Thanks! I read and enjoyed the book based on this recommendation.

I listened to the book This Is How They Tell Me the World Ends by Nicole Perlroth, a book about cybersecurity and the zero-day market. It describes in detail the early days of bug discovery, and the social dynamics and moral dilemmas of bug hunts.

(It was recommended to me by some EA-adjacent guy very worried about cyber, but the title is mostly bait: the tone of the book is alarmist, but there is very little content about potential catastrophes.)

My main takeaways:

  • Vulnerabilities used to be dirt-cheap (~$100) but are still relatively cheap (~$1M even for big zer
... (read more)
Buck Shlegeris (13d)
Do you have concrete examples?
Fabien Roger (12d)
I remembered mostly this story: [Taken from this summary of this passage of the book. The book was light on technical detail; I don't remember having listened to more detail than that.] I didn't realize this was so early in the story of the NSA; maybe this anecdote teaches us nothing about the current state of the attack/defense balance.

The full passage is in this tweet thread (search for "3,000").

Recently someone either suggested to me (or maybe told me they or someone else were going to do this?) that we should train AI on legal texts to teach it human values. Ignoring the technical problem of how to do this, I'm pretty sure legal texts are not the right training data. But at the time, I could not clearly put into words why. Today's SMBC explains this for me:

Saturday Morning Breakfast Cereal - Law (smbc-comics.com)

Law is not a good representation or explanation of most of what we care about, because it's not trying to be. Law is mainly focused on the c... (read more)

I finally got around to reading the Mamba paper. H/t Ryan Greenblatt and Vivek Hebbar for helpful comments that got me unstuck. 

TL;DR: the authors propose a new deep learning architecture for sequence modeling with scaling laws that match transformers while being much more efficient to sample from.

A brief historical digression

As of ~2017, the three primary approaches people had for sequence modeling were RNNs, Conv Nets, and Transformers, each with a unique "trick" for handling sequence data: recurrence, 1d convolutions, and self-attention respectively.
 

  • RNNs are
... (read more)

StripedHyena, Griffin, and especially Based suggest that combining RNN-like layers with even tiny sliding window attention might be a robust way of getting a large context, where the RNN-like layers don't have to be as good as Mamba for the combination to work. There is a great variety of RNN-like blocks that haven't been evaluated for hybridization with sliding window attention specifically, as in Griffin and Based. Some of them might turn out better than Mamba on scaling laws after hybridization, so Mamba being impressive without hybridization might be l... (read more)

Ryan Greenblatt (1mo)
Another key note about Mamba is that despite being RNN-like it doesn't result in substantially higher effective serial reasoning depth (relative to transformers). This is because the state transition is linear[1]. However, it is architecturally closer to things that might involve effectively higher depth. See also here.

[1] And indeed, there is a fundamental tradeoff: if the state transition function is expressive (e.g. nonlinear), then it would no longer be possible to use a parallel scan, because the intermediates for the scan would be too large to represent compactly or wouldn't simplify the original functions to reduce computation. You can't compactly represent f∘g (f composed with g) in a way that makes computing f(g(x)) more efficient for general choices of f and g (in the typical MLP case at least). Another simpler but less illuminating way to put this is that higher serial reasoning depth can't be parallelized (without imposing some constraints on the serial reasoning).
Lawrence Chan (1mo)
I mean, yeah, as your footnote says: Transformers do get more computation per token on longer sequences, but they also don't get more serial depth, so I'm not sure if this is actually an issue in practice?[1]

[1] As an aside, I actually can't think of any class of interesting functions with this property -- when reading the paper, the closest I could think of are functions on discrete sets (lol), polynomials (but simplifying these is often more expensive than just computing the terms serially), and rational functions (ditto)
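To make the parallel-scan point above concrete, here is a minimal sketch (plain Python for a scalar linear recurrence; not Mamba's actual implementation): each step h_t = a_t·h_{t-1} + b_t is an affine map, two affine maps compose into another affine map describable by just two numbers, and that composition is associative, which is exactly what a parallel scan exploits. A nonlinear transition like h_t = tanh(a_t·h_{t-1} + b_t) has no such fixed-size composed form.

```python
from functools import reduce
import numpy as np

# Linear recurrence: h_t = a_t * h_{t-1} + b_t (scalar case for clarity).
# Each step is the affine map h -> a*h + b, represented by the pair (a, b).
def compose(step1, step2):
    """Affine map equal to applying step1 first, then step2."""
    a1, b1 = step1
    a2, b2 = step2
    return (a2 * a1, a2 * b1 + b2)

rng = np.random.default_rng(0)
steps = [(rng.uniform(0.5, 1.0), rng.normal()) for _ in range(8)]

# Serial evaluation, one step at a time.
h = 0.0
for a, b in steps:
    h = a * h + b

# Scan-style evaluation: fold all the maps into one. Because compose() is
# associative, this reduction can be grouped arbitrarily and hence parallelized.
a_total, b_total = reduce(compose, steps)
h_scan = a_total * 0.0 + b_total

assert np.isclose(h, h_scan)
# With a nonlinear transition (e.g. h -> tanh(a*h + b)) the composition of two
# steps has no fixed-size representation, so this trick no longer applies.
```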

I feel kinda frustrated whenever "shard theory" comes up in a conversation, because it's not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

This is a particular pity because I think there's a version of the "shard" framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be int... (read more)


FWIW I'm potentially interested in interviewing you (and anyone else you'd recommend) and then taking a shot at writing the 101-level content myself.

Alex Turner (1mo)
Nope! I have basically always enjoyed talking with you, even when we disagree.
Daniel Kokotajlo (1mo)
Ok, whew, glad to hear.
gwern (1mo)

Warning for anyone who has ever interacted with "robosucka" or been solicited for a new podcast series in the past few years: https://www.tumblr.com/rationalists-out-of-context/744970106867744768/heads-up-to-anyone-whos-spoken-to-this-person-i

Reposting myself from Discord, on the topic of donating $5000 to EA causes.

if you're doing alignment research, even just a bit, then the $5000 is probably better spent on yourself

if you have any gears-level model of AI stuff then it's better value to pick which alignment org to give to yourself; charity orgs are vastly understaffed and you're essentially contributing to the "picking what to donate to" effort by thinking about it yourself

if you have no gears-level model of AI then it's hard to judge which alignment orgs it's helpful to donate to (or, if gi

... (read more)
Oliver Habryka (1mo)
I think people who give up large amounts of salary to work in jobs that other people are willing to pay for from an impact perspective should totally consider themselves to have done good comparable to donating the difference between their market salary and their actual salary. This applies to approximately all safety researchers. 
Ben Pace (1mo)
I don’t think it applies to safety researchers at AI labs though; I am shocked at how much those folks can make.

They still make a lot less than they would if they optimized for profit (that said, I think most "safety researchers" at big labs are only safety researchers in name and I don't think anyone would philanthropically pay for their labor, and even if they did, they would still make the world worse according to my model, though others of course disagree with this).

Apparently[1] there was recently some discussion of Survival Instinct in Offline Reinforcement Learning (NeurIPS 2023). The results are very interesting: 

On many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL count

... (read more)
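To make "wrong reward labels" concrete, here is a minimal sketch of the relabelings the abstract mentions (zero everywhere, or negated), using a hypothetical dict-of-arrays dataset format rather than the paper's actual code:

```python
import numpy as np

def relabel_rewards(dataset, mode):
    """Return a copy of an offline RL dataset with 'wrong' reward labels:
    either zero everywhere or the negation of the true rewards."""
    rewards = np.asarray(dataset["rewards"], dtype=float)
    if mode == "zero":
        wrong = np.zeros_like(rewards)
    elif mode == "negate":
        wrong = -rewards
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {**dataset, "rewards": wrong}

# Usage sketch: train any offline RL algorithm on relabel_rewards(data, "zero")
# and compare the resulting policy's true return against training on `data`.
```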

I think some people have the misapprehension that one can just meditate on abstract properties of "advanced systems" and come to good conclusions about unknown results "in the limit of ML training", without much in the way of technical knowledge about actual machine learning results or even a track record in predicting results of training.

For example, several respected thinkers have uttered to me English sentences like "I don't see what's educational about watching a line go down for the 50th time" and "Studying modern ML systems to understand future ones ... (read more)

Thane Ruthenis (1mo)
It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

My impression is that you think that pretraining+RLHF (+ maybe some light agency scaffold) is going to get us all the way there, meaning the predictive power of various abstract arguments from other domains is screened off by the inductive biases and other technical mechanistic details of pretraining+RLHF. That would mean we don't need to bring game theory, economics, computer security, distributed systems, cognitive psychology, business, or history into it – we can just look at how ML systems work and are shaped, and predict everything we want about AGI-level systems from there.

I disagree. I do not think pretraining+RLHF is getting us there. I think we currently don't know what training/design process would get us to AGI. Which means we can't make closed-form mechanistic arguments about how AGI-level systems will be shaped by this process, which means the abstract often-intuitive arguments from other fields do have relevant things to say.

And I'm not seeing a lot of ironclad arguments that favour "pretraining + RLHF is going to get us to AGI" over "pretraining + RLHF is not going to get us to AGI". The claim that e.g. shard theory generalizes to AGI is at least as tenuous as the claim that it doesn't.

I'd be interested if you elaborated on that.

It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

Thanks for pointing out that distinction! 

Chris_Leong (1mo)
“But also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”

Have you written about this anywhere?

Tiny review of The Knowledge Machine (a book I listened to recently)

  • The core idea of the book is that science makes progress by forbidding non-empirical evaluation of hypotheses from publications, focusing on predictions and careful measurements while excluding philosophical interpretations (like Newton's "I have not as yet been able to deduce from phenomena the reason for these properties of gravity, and I do not feign hypotheses. […] It is enough that gravity really exists and acts according to the laws that we have set forth.").
  • The author basically argu
... (read more)

Per my recent chat with it, ChatGPT 3.5 seems "situationally aware"... but nothing groundbreaking has happened because of that AFAICT.

From the LW wiki page:

Ajeya Cotra uses the term "situational awareness" to refer to a cluster of skills including “being able to refer to and make predictions about yourself as distinct from the rest of the world,” “understanding the forces out in the world that shaped you and how the things that happen to you continue to be influenced by outside forces,” “understanding your position in the world relative to other actors who

... (read more)
Zack M. Davis (3mo)
I think "Symbol/Referent Confusions in Language Model Alignment Experiments" is relevant here: the fact that the model emits sentences in the grammatical first person doesn't seem like reliable evidence that it "really knows" it's talking about "itself". (It's not evidence because it's fictional, but I can't help but think of the first chapter of Greg Egan's Diaspora, in which a young software mind is depicted as learning to say I and me before the "click" of self-awareness when it notices itself as a specially controllable element in its world-model.) Of course, the obvious followup question is, "Okay, so what experiment would be good evidence for 'real' situational awareness in LLMs?" Seems tricky. (And the fact that it seems tricky to me suggests that I don't have a good handle on what "situational awareness" is, if that is even the correct concept.)

the fact that the model emits sentences in the grammatical first person doesn't seem like reliable evidence that it "really knows" it's talking about "itself"

I consider situational awareness to be more about being aware of one's situation, and how various interventions would affect it. Furthermore, the main evidence I meant to present was "ChatGPT 3.5 correctly responds to detailed questions about interventions on its situation and future operation." I think that's substantial evidence of (certain kinds of) situational awareness.

If you want to better understand counting arguments for deceptive alignment, my comment here might be a good place to start.

Alex Turner (2mo)
Speculates on anti-jailbreak properties of steering vectors. Finds putative "self-awareness" direction. Also:

From the post:

What are these vectors really doing? An Honest mystery... Do these vectors really change the model's intentions? Do they just up-rank words related to the topic? Something something simulators? Lock your answers in before reading the next paragraph!

OK, now that you're locked in, here's a weird example. 

When used with the prompt below, the honesty vector doesn't change the model's behavior—instead, it changes the model's judgment of someone else's behavior! This is the same honesty vector as before—generated by asking the model to act hon

... (read more)
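For context on what "using" a steering vector means mechanically: activation steering adds a fixed vector to a layer's activations at inference time. Below is a minimal sketch of that idea (hypothetical layer index and variable names; not the post's actual code):

```python
import torch

def add_steering_hook(layer, steering_vector, alpha=1.0):
    """Register a forward hook that adds alpha * steering_vector to the layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vector.to(hidden)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Usage sketch (hypothetical HuggingFace-style model and honesty_vector):
# handle = add_steering_hook(model.model.layers[15], honesty_vector, alpha=4.0)
# ...generate as usual, then handle.remove() to restore default behavior.
```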

The "shoggoth" meme is, in part, unfounded propaganda. Here's one popular incarnation of the shoggoth meme:

Shoggoth with Smiley Face (Artificial Intelligence) | Know Your Meme

This meme accurately portrays the (IMO correct) idea that finetuning and RLHF don't change the base model too much. Furthermore, it's probably true that these LLMs think in an "alien" way. 

However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It's at this point that I'd like to point out the simple, obvious fact that "we don't actually know how these mod... (read more)


re: "the sense of danger is very much supported by the current state of evidence" -- I mean, you've heard all this stuff before, but I'll summarize:

--Seems like we are on track to probably build AGI this decade
--Seems like we are on track to have an intelligence explosion, i.e. a speedup of AI R&D due to automation
--Seems like the AGI paradigm that'll be driving all this is fairly opaque and poorly understood. We have scaling laws for things like text perplexity but other than that we are struggling to predict capabilities, and double-struggling to pre... (read more)

Ryan Greenblatt (2mo)
I think data ordering basically never matters for LLM pretraining. (As in, random is the best and trying to make the order more specific doesn't help.)
Daniel Kokotajlo (2mo)
That was my impression too.

Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don't have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition that pursues a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you're just asking about math homework.

Aside: This was kinda a "holy shit" moment, and I'll try to do it justice here. I e... (read more)


I wish I had read this a week ago instead of just now, it would have saved a significant amount of confusion and miscommunication!

Roger Dearnaley (3mo)
I think there are two separate questions here, with possibly (and I suspect actually) very different answers:

1. How likely is deceptive alignment to arise in an LLM under SGD across a large, very diverse pretraining set (such as a slice of the internet)?
2. How likely is deceptive alignment to be boosted in an LLM under SGD fine-tuning followed by RL for HHH-behavior, applied to a base model trained as in 1.?

I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything that humans do that affects the contents of the Internet. One of the many things that humans do is pretend to be more aligned to the wishes of an authority that has power over them than they truly are. So for a large enough LLM, SGD will create a world model for this behavior along with thousands of other human behaviors, and the LLM will (depending on the prompt) tend to activate this behavior at about the frequency and level that you find it on the Internet, as modified by cues in the particular prompt. On the Internet, this is generally a mild background level for people writing while at work in Western countries, and probably more strongly for people writing from more authoritarian countries: specific prompts will be more or less correlated with this.

For 2., the question is whether fine-tuning followed by RL will settle on this preexisting mechanism and make heavy use of it as part of the way that it implements something that fits the fine-tuning set / scores well on the reward model aimed at creating a helpful, honest, and harmless assistant persona. I'm a lot less certain of the answer here, and I suspect it might depend rather strongly on the details of the training set. For example, is this evoking a "you're at work, or in an authoritarian environment, so watch what you say and do" scenario that might boost the use of this particular behavior? The "harmless" element in HHH seems particularly con
Robert Kirk (3mo)
How would you imagine doing this? I understand your hypothesis to be "If a model generalises as if it's a mesa-optimiser, then it's better-described as having simplicity bias". Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals which would be implied by a stronger simplicity bias?

In an alternate universe, someone wrote a counterpart to There's No Fire Alarm for Artificial General Intelligence:

Okay, let’s be blunt here. I don’t think most of the discourse about alignment being really hard is being generated by models of machine learning at all. I don’t think we’re looking at wrong models; I think we’re looking at no models.

I was once at a conference where there was a panel full of famous AI alignment luminaries, and most of the luminaries were nodding and agreeing with each other that of course AGI alignment is really hard and unadd

... (read more)

Nice analogy! I approve of stuff like this. And in particular I agree that MIRI hasn't convincingly argued that we can't do significant good stuff (including maybe automating tons of alignment research) without agents.

Insofar as your point is that we don't have to build agentic systems and nonagentic systems aren't dangerous, I agree? If we could coordinate the world to avoid building agentic systems I'd feel a lot better.
 

__RicG__ (4mo)
Sorry, I might be misunderstanding you (and hope I am), but... I think doomers literally say "Nobody knows what internal motivational structures SGD will entrain into scaled-up networks and thus we are all doomed". The problem is not having the science to confidently say how the AIs will turn out, and not that doomers have a secret method to know that next-token-prediction is evil. If you meant that doomers are too confident answering the question "will SGD even make motivational structures?", their (and my) answer still stems from ignorance: nobody knows, but it is plausible that SGD will make motivational structures in the neural networks because they can be useful in many tasks (to get low loss or whatever), and if you think you do know better you should show it experimentally and theoretically in excruciating detail.

I also don't see how it logically follows that "If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks" => "then you ought to be able to say much weaker things that are impossible in two years", but it seems to be the core of the post. Even if anyone had the extraordinary model to predict what SGD exactly does (which we, as a species, should really strive for!!) it would still be a different question to predict what will or won't happen in the next two years. If I reason about my field (physics), the same should hold for a sentence structured like "If your model has the extraordinary power to say how an array of neutral atoms cooled to a few nK will behave when a laser is shone upon them" (which is true) => "then you ought to be able to say much weaker things that are impossible in two years in the field of cold atom physics" (which is... not true). It's a non sequitur.
Alex Turner (4mo)
It would be "useful" (i.e. fitness-increasing) for wolves to have evolved biological sniper rifles, but they did not. By what evidence are we locating these motivational hypotheses, and what kinds of structures are dangerous, and why are they plausible under the NN prior?  The relevant commonality is "ability to predict the future alignment properties and internal mechanisms of neural networks." (Also, I don't exactly endorse everything in this fake quotation, so indeed the analogized tasks aren't as close as I'd like. I had to trade off between "what I actually believe" and "making minimal edits to the source material.")

Some updates about the dictionary_learning repo:

  • The repo now has support for ghost grads. h/t g-w1 for submitting a PR for this
  • ActivationBuffers now work natively with model components -- like the residual stream -- whose activations are typically returned as tuples; the buffer knows to take the first component of the tuple (and will iteratively do this if working with nested tuples). (A minimal sketch of this unwrapping behavior follows the list.)
  • ActivationBuffers can now be stored on the GPU.
  • The file evaluation.py contains code for evaluating trained dictionaries. I've found this pretty useful for quickly evaluating d
... (read more)
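A minimal sketch of the tuple-unwrapping behavior described in the second bullet (an illustration of the idea, with a function name of my own; not the repo's actual code):

```python
import torch

def first_tensor(output):
    """Unwrap (possibly nested) tuple outputs from a model component,
    e.g. residual-stream hooks, by repeatedly taking the first element."""
    while isinstance(output, tuple):
        output = output[0]
    assert isinstance(output, torch.Tensor), "expected a tensor at the innermost level"
    return output

# Example: a component returning ((residual, extra_state), cache) yields `residual`.
resid = torch.randn(2, 16, 512)
assert first_tensor(((resid, None), {"cache": None})) is resid
```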