If you think it would be helpful, you are welcome to suggest a metaphilosophy topic for AI Safety Camp.
More info at aisafety.camp. (I'm typing on a phone; I'll add the actual link later if I remember to.)
But I think orgs are more likely to be well-known to grant-makers on average given that they tend to have a higher research output,
I think you're getting the causality backwards. You need money first, before there is an org, unless you count informal multi-person collaborations as orgs. I think people who are more well-known to grant-makers are more likely to start orgs, whereas people who are less known are more likely to get funding at all if they aim for a smaller grant, i.e. as an independent researcher.
Counterpoint: after the FTX collapse, OpenPhil said publicly (in some EA Forum post) that they were raising their bar for funding, i.e. there are things that would have been funded before that would now not be funded. The stated reason for this is that there is generally less money around, in total. To me this sounds like the thing you would do if money is the limitation. I don't know why OpenPhil doesn't spend more. Maybe they have long timelines and also don't expect any more big donors any time soon, and this is why they want to spend carefully?
From what I can tell, the field has been funding constrained since the FTX collapse. What I think happened: FTX had lots of money and a low bar for funding, which meant they spread a lot of money around. This meant that more projects got started, and probably even more people got generally encouraged to join. Probably some projects got funded that should not have been, but probably also some really good projects got started that did not get money before because they did not clear the bar, due to not having the right connections, or just being bad at writing ...
Today's thoughts: I suspect it's not possible to build autonomous aligned AIs (low confidence). The best we can do is some type of hybrid humans-in-the-loop system. Such a system will be powerful enough to eventually give us everything we want, but it will also be much slower than, and intellectually inferior to, what is possible without humans in the loop. I.e. the alignment tax will be enormous. The only way the safe system can compete is if the unsafe system is not built. Therefore we need AI Governance. Fortunately, political action is getting a lo...
Recently an AI safety researcher complained to me about an interaction they had with an AI Safety communicator. Very stylized, their interaction went something like this:
(X is some fact or topic related to AI Safety.)
Communicator: We don't know anything about X and there is currently no research on X.
Researcher: Actually, I'm working on X, and I do know some things about X.
Communicator: We don't know anything about X and there is currently no research on X.
I notice that I semi-frequently hear communicators saying things like the above. I think ...
Recording thought in progress... I notice that I don't expect FOOM-like RSI, because I don't expect we'll get a mesa-optimizer with coherent goals. It's not hard to give the outer optimiser (e.g. gradient descent) a coherent goal. For the outer optimiser to have a coherent goal is the default. But I don't expect that to translate to the inner optimiser. The inner optimiser will just have a bunch of heuristics and proxy goals, and not be very coherent, just like humans.
The outer optimiser can't FOOM, since it doesn't do planning, and doesn't have strategic s...
There is no study material since this is not a course. If you are accepted to one of the project teams, then you will work on that project.
You can read about the previous research outputs here: Research Outputs – AI Safety Camp
The most famous research to come out of AISC is the coin-run experiment.
We Were Right! Real Inner Misalignment - YouTube
[2105.14111] Goal Misgeneralization in Deep Reinforcement Learning (arxiv.org)
But the projects are different each year, so the best way to get an idea of what it's like is just to read the project descrip...
Second reply. And this time I actually read the link. I'm not surprised by that result.
My original comment was a reaction to claims of the type [the best way to solve almost any task is to develop general intelligence, therefore there is a strong selection pressure to become generally intelligent]. I think this is wrong, but I have not yet figured out exactly what the correct view is. But to use an analogy, it's something like this: In the example you gave, the AI gets better at the subtasks by learning on a more general training set. It seems like ...
I agree that eventually, at some level of trying to solve enough different types of tasks, GI will be efficient, in terms of how much machinery you need, but it will never be able to compete on speed.
Also, it's an open question what counts as "enough different types of tasks". Obviously, for a sufficiently broad class of problems GI will be more efficient (in the sense clarified above). Equally obviously, for a sufficiently narrow class of problems narrow capabilities will be more efficient.
Humans have GI to some extent, but we mostly don't use it. This i...
I think we are in agreement. I think the confusion is because it is not clear from that section of the post if you are saying 1) "you don't need to do all of these things" or 2) "you don't need to do any of these things".
Because I think 1 goes without saying, I assumed you were saying 2. Also 2 probably is true in rare cases, but this is not backed up by your examples.
But if 1 doesn't go without saying, then this means that a lot of "doing science" is cargo-culting? Which is sort of what you are saying when you talk about cached methodologies.
So why would sm...
In particular, four research activities were often highlighted as difficult and costly (here in order of decreasing frequency of mention):
Running experiments
Formalizing intuitions
Unifying disparate insights into a coherent frame
Proving theorems
I don't know what your first reaction to this list is, but for us, it was something like: "Oh, none of these activities seems strictly speaking necessary in knowledge-production." Indeed, a quick look at history presents us with cases where each of those activities was bypassed:
Einstein figured out special and genera
Similar but not exactly. I mean that you take some known distribution (the training distribution) as a starting point. But when sampling actions, you do so from a shifted or truncated distribution, to favour higher-reward policies. In the decision transformers I linked, the AI is playing a variety of different games, where the programmers might not know what a good future reward value would be. So they let the system predict the future reward, but with the distribution shifted towards higher rewards. I discussed this a bit more after posting the above co...
From my reading of quantilizers, they might still choose "near-optimal" actions, just only with a small probability. Whereas a system based on decision transformers (possibly combined with a LLM) could be designed that we could then simply tell to "make me a tea of this quantity and quality within this time and with this probability" and it would attempt to do just that, without trying to make more or better tea or faster or with higher probability.
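As a rough illustration of the "truncated distribution" idea discussed above, here is a minimal quantilizer-style sketch. It is not the decision transformer code from the linked work: the function names, the tea actions, and the numbers are all made up for illustration. It assumes a known base distribution over actions and a known reward per action, keeps only the top-q share of base probability mass (ranked by reward), and samples from what is left.

```python
import random

def truncated_distribution(base_probs, rewards, q):
    """Keep the top-q share of base probability mass, ranked by reward, then renormalise.

    Assumes 0 < q <= 1 and that base_probs sums to 1.
    """
    ranked = sorted(base_probs, key=lambda a: rewards[a], reverse=True)
    kept, mass = {}, 0.0
    for action in ranked:
        remaining = q - mass
        if remaining <= 1e-12:  # tolerance guards against float round-off
            break
        take = min(base_probs[action], remaining)  # the last action may be kept only partly
        kept[action] = take
        mass += take
    return {action: weight / mass for action, weight in kept.items()}

def quantilize(base_probs, rewards, q, rng=random):
    """Sample one action from the truncated distribution."""
    dist = truncated_distribution(base_probs, rewards, q)
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions], k=1)[0]

# Hypothetical base (training) distribution and rewards.
base = {"weak tea": 0.4, "ok tea": 0.3, "good tea": 0.2, "extreme tea": 0.1}
reward = {"weak tea": 1, "ok tea": 2, "good tea": 3, "extreme tea": 4}
top_quarter = truncated_distribution(base, reward, q=0.25)
```

With q = 0.25 only "extreme tea" and "good tea" survive, and the highest-reward action is still chosen with substantial probability, which matches the reading above that a quantilizer may still pick near-optimal actions.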
Any policy can be modelled as a consequentialist agent, if you assume a contrived enough utility function. This statement is true, but not helpful.
The reason we care about the concept of agency is that there are certain things we expect from consequentialist agents, e.g. instrumentally convergent goals, or just optimisation pressure in some consistent direction. We care about the concept of agency because it holds some predictive power. [... some steps of reasoning I don't know yet how to explain ...] Therefore, it's better to use a concept of agency that ...
Ok. Thanks :)
Decision transformers ≈ Quantilizers
You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?
Thanks :) How are the completions provided? Are you just looking at the output probabilities for the two relevant completions?
I'm confused why the uniform baseline is always 0.5. This makes sense when the model is choosing between A and B, or Y or N. But I don't see why you consider 0.5 to be a baseline in the other two cases.
I think the baseline is useful for interpretation. In some of the examples the reason the smaller model does better is that it is just answering randomly, while the larger model is misled somehow. But if there is no clear baseline, then I suggest removing this line from the plot.
In this particular experiment, the small models did not have an object-level hypothesis. They just had no clue and answered randomly. I think the experiment shows that sometimes smaller models are too dumb to pick up the misleading correlation, which can throw off bigger models.
Today's hot takes (or something)
GI is very efficient, if you consider that you can reuse a lot of the machinery that you learn, rather than needing to relearn it over and over again. https://towardsdatascience.com/what-is-better-one-general-model-or-many-specialized-models-9500d9f8751d
LM memetics:
LM = language model (e.g. GPT-3)
If LMs read each other's text we can get LM memetics. An LM meme is a pattern which, if it exists in the training data, the LM will output at a higher frequency than in the training data. If the meme is strong enough and LMs are trained on enough text from other LMs, the prevalence of the meme can grow exponentially. This has not happened yet.
There can also be memes that have a more complicated life cycle, involving both humans and LMs. If the LM outputs a pattern that humans are extra interested in, then the humans ...
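The exponential growth claim for LM-only memes can be made concrete with a toy recurrence. This is only a sketch under assumptions that are not in the original comment: each LM generation outputs the meme at `amplification` times its prevalence in the training data (capped at 1), a fixed share `lm_fraction` of the next training set is LM-written, and the rest is human text with a constant prevalence `p_human`.

```python
def meme_prevalence(p_human, amplification, lm_fraction, generations):
    """Prevalence of a meme in the training data over successive LM generations.

    Toy model (all parameters are illustrative assumptions):
      LM output prevalence = min(1, amplification * training prevalence)
      next training data   = (1 - lm_fraction) human text + lm_fraction LM text
    """
    p = p_human
    history = [p]
    for _ in range(generations):
        lm_output = min(1.0, amplification * p)
        p = (1 - lm_fraction) * p_human + lm_fraction * lm_output
        history.append(p)
    return history

# lm_fraction * amplification < 1: the meme settles at a slightly raised level.
weak = meme_prevalence(p_human=0.01, amplification=1.5, lm_fraction=0.5, generations=200)

# lm_fraction * amplification > 1: roughly exponential growth until saturation.
strong = meme_prevalence(p_human=0.01, amplification=3.0, lm_fraction=0.5, generations=20)
```

In this toy model the condition "the meme is strong enough and LMs are trained on enough LM text" becomes `lm_fraction * amplification > 1`. Below that threshold the prevalence converges to `(1 - lm_fraction) * p_human / (1 - lm_fraction * amplification)` instead of taking off.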
Let's say that U_A = 3x + y.
Then (I think) for your inequality to hold, it must be that U_B = f(3x + y), where f' >= 0.
If U_B cares about x and y in any other proportion, then B can make trade-offs between x and y which make things better for B, but worse for A. This will be true (in theory) even if both A and B are satisficers. You can see this by replacing x and y with sigmoids of some other variables.
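A small numerical check of the trade-off argument, using the U_A from the comment. The two candidate U_B functions, the starting point, and the size of the trade are arbitrary choices made for illustration.

```python
def u_a(x, y):
    # A's utility, as in the comment above.
    return 3 * x + y

def u_b_aligned(x, y):
    # B's utility as a non-decreasing function of U_A; here f(u) = u**3.
    return u_a(x, y) ** 3

def u_b_other(x, y):
    # B cares about x and y in a different proportion than A does.
    return x + 3 * y

def change(u, start, delta):
    """Change in utility u when moving from start by delta."""
    (x, y), (dx, dy) = start, delta
    return u(x + dx, y + dy) - u(x, y)

start = (1.0, 1.0)
trade = (-1.0, 2.0)  # give up one unit of x to gain two units of y

change_a = change(u_a, start, trade)                # 3 - 4 = -1
change_other = change(u_b_other, start, trade)      # 9 - 4 = +5
change_aligned = change(u_b_aligned, start, trade)  # 27 - 64 = -37
```

The trade makes A worse off (-1) while the differently weighted B gains (+5). The B whose utility is a monotone function of U_A loses whenever A loses, so it has no incentive to make this trade.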
Yes, I like this one. We don't want the AI to find a way to give itself utility while making things worse for us. And if we are trying to make things better for us, we don't want the AI to resist us.
Do you want to find out what these inequalities imply about the utility functions? Can you find examples where your condition is true for non-identical functions?
I don't have a specific example right now but some things that come to mind:
This is a good question.
The not so operationalized answer is that a good operationalization is one that is helpful for achieving alignment.
An operationalization of [helpfulness of an operationalization] would give some sort of gears-level understanding of what shape the operationalization should have to be helpful. I don't have any good model for this, so I will just gesture vaguely. I think that mathematical descriptions are good, since they are more precise. My first operationalization attempt is pretty mathematical, which is good. It is also more "const...
Can't you restate the second one as the relationship between two utility functions U_A and U_B such that increasing one (holding background conditions constant) is guaranteed not to decrease the other? I.e. their respective derivatives are always non-negative for every background condition.
I recently updated how I view the alignment problem. The post that caused my update is this one from the shard sequence. Also worth mentioning is an older post that points to the same thing, but I just happened to read it later.
Basically I used to think we needed to solve both outer and inner alignment separately. Now I no longer think this is a good decomposition of the problem.
It’s not obvious that alignment must factor in the way described above. There is room for trying to set up training in such a way to guarantee a friendly mesa-objective somehow wit
If something is good at replicating, then there will be more of that thing. This creates a selection effect for things that are good at replicating. The effects of this can be observed in biology and memetics.
Maybe self replication can be seen as an agentic system with the goal of self replicating? In this particular question all uncertainty comes from "agent" being a fuzzy concept, and not from any uncertainty about the world. So answering this question will be a choice of perspective, not information about the world.
Either way, the type of ag...
infraBook Club I: Corrigibility is bad ashkually
One of my old blog posts I never wrote (I did not even list it in a "posts I will never write" document) is one about how corrigibility is anti-correlated with goal security.
Something like: If you build an AI that doesn't resist someone trying to change its goals, it will also not try to stop bad actors from changing its goals. (I don't think this particular worry applies to Paul's version of corrigibility, but this blog post idea was from before I learned about his definition.)
I'm not talking about recursive self-improvement. That's one way to take a sharp left turn, and it could happen, but note that humans have neither the understanding nor control over their own minds to recursively self-improve, and we outstrip the rest of the animals pretty handily. I'm talking about something more like “intelligence that is general enough to be dangerous”, the sort of thing that humans have and chimps don't.
Individual humans can't FOOM (at least not yet), but humanity did.
My best guess is that humanity took a sharp left turn whe...
(Just typing as I think...) What if I push this line of thinking to the extreme? If I just pick agents randomly from the space of all agents, then this should be maximally random, and that should be even better. Now the part where we can mine information about alignment from the fact that humans are at least somewhat aligned is gone. So this seems wrong. What is wrong here? Probably the fact that if you pick agents randomly from the space of all agents, you don't get greater variation in alignment, compared to if you pick random humans, because probably all the ...
I mean that the information about what I value exists in my brain. Some of this information is pointers to things in the real world. So in a sense the information partly exists in the relation/correlation between me and the world. I definitely don't mean that I can only care about my internal brain state. To me that is just obviously wrong. Although I have met people who disagree, so I see where the misunderstanding came from.
Blogposts are the result of noticing a difference in beliefs, either between you and others or between you and yourself across time. I have lots of ideas that I don't communicate. Sometimes I read a blogpost and think "yeah, I knew that, why didn't I write this?" And the answer is that I did not have an imagined audience. My blogposts almost always spawn after I have explained a thing ~3 times in meatspace. Generalizing from these conversations, I form an imagined audience which is some combination of the ~3 people I talked to. And then I can write. (In a convers...
I almost totally agree with this post. This comment is just nitpicking and speculation.
Evolution has another advantage, which is related to "getting lots of tries" but also importantly different. It's not just that evolution got to tinker a lot before landing on a fail-proof solution. Evolution doesn't even need a fail-proof solution. Evolution is "trying to find" a genome which, in interaction with reality, forms a brain that causes that human to have lots of kids. Evolution found a solution that mostly works, but sometimes doesn't. Some humans decided...
This is probably too obvious to write, but I'm going to say it anyway. It's my short form, and approximately no-one reads short forms. Or so I'm told.
Human value formation is in large part steered by other humans suggesting value systems for you. You get some hard to interpret reward signal from your brainstem, or something. There are lots of "hypotheses" for the "correct reward function" you should learn.
(Quotation marks because there is no ground truth for what values you should have. But this is mathematically equivalent to a learning the tru...
What is alignment? (operationalisation)
Toy model: Each agent has a utility function they want to maximise. The input to the utility function is a list of values describing the state of the world. Different agents can have different input vectors. Assume that every utility function monotonically increases, decreases or stays constant for changes in each input variable (I did say it was a toy model!). An agent is said to value something if the utility function increases with increasing quantity of that thing. Note that if an agent's utility function decrease...
Re second try: what would make a high-level operationalisation of that sort helpful? (operationalize the helpfulness of an operationalisation)
Here's what you wrote:
This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations.
Do you still agree with yourself?
In that case I'm confused about this statement
This interpretation makes sense even in the absence of “agents” with “beliefs”
What are priors in the absence of something like agents with beliefs?
We’ve shown that the probability P[q|X] summarizes all the information in X relevant to q, and throws out as much irrelevant information as possible.
This seems correct. Let's say two different points in the data configuration space, X_1 and X_2, provide equal evidence for q. Then P[q|X_1] = P[q|X_2]. The two different data possibilities are mapped to the same point in this compressed map. So far so good. (I assume that I should interpret the object P[q|X] as a function over X, not as a point probability for a specific X.)
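To make the "function over X" reading concrete, here is a toy example of my own (not from the post being discussed): q is the hypothesis that a coin is biased towards heads, X is a sequence of flips, and the prior and both bias values are arbitrary assumptions. Two different sequences that carry the same evidence land on the same value of P[q|X].

```python
def posterior_biased(flips, prior=0.5, p_biased=0.8, p_fair=0.5):
    """P[q | X] for q = "the coin is biased", given a string of 'H'/'T' flips.

    Only two hypotheses are considered (biased vs fair); all numbers are
    illustrative assumptions.
    """
    heads = flips.count("H")
    tails = flips.count("T")
    like_biased = p_biased ** heads * (1 - p_biased) ** tails
    like_fair = p_fair ** heads * (1 - p_fair) ** tails
    return prior * like_biased / (prior * like_biased + (1 - prior) * like_fair)

x_1 = "HHT"
x_2 = "THH"
same_evidence = posterior_biased(x_1) == posterior_biased(x_2)
more_heads = posterior_biased("HHH") > posterior_biased(x_1)
```

Here the map X -> P[q|X] throws away the order of the flips, which is irrelevant to q, and keeps the head and tail counts, which are relevant.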
I don't think this is true:
But there’s a biological analogy: classical conditioning. E.g. I can choose to do X right before Y, and then I’ll learn an association between X and Y which I wouldn’t have learned if I’d done X a long time before doing Y.
I could not find any study that tests this directly, but I don't expect conditioning to work if you yourself cause the unconditioned stimulus (US), Y in your example. My understanding of conditioning is that if there is no surprise there is no learning. For example: If you first condition an animal to expect...
I agree that it's not exactly FDT. I think I actually meant updateless decision theory (UDT), but I'm not sure because I have some uncertainty about exactly what others mean by UDT. I claim that mutations + natural selection (evolution) selects for agents that act according to the policy they would have wanted to pre-commit to, at the time of their birth (last mutation).
Yes, there are some details around who I recognize as a copy of me. In classical FDT this would be anyone who is running the same program (whatever that means). In evolution this would be anyone who is carrying the same genes. Both of these concepts are complicated by the fact that "same program" and "same genes" are scalar (or more complicated?) and not Boolean values. Edit: I'm not sure I agree with what I just said. I believe something in this direction, but I want to think some more. For example, people with similar genes probably don't cooperate because of decision theory (my decision to cooperate with you is correlated with your decision to cooperate with me), but because of shared goals (we both want to spread our shared genes).
Update: I'm shifting towards thinking that peer-review could be good if done right, because:
Some thoughts on this question that you mention briefly in the talk.
What decision theory does evolution select for?
I think that evolution selects for functional decision theory (FDT). More specifically, it selects the best policy over a lifetime, and not the best action in a given situation. I don't mean that we actually cognitively calculate FDT, but that there is an evolutionary pressure to act as if we follow FDT.
Example: Revenge
By revenge I mean burning some of your utility just to get back at someone who hurt you. Revenge is equivalent to t...
I'm surprised by two implicit claims you seem to be making.
And if your research is confined to arXiv or the Alignment Forum, it can be really hard to get any sort of deep feedback on it.
Is your experience that peer-review is a good source of deep feedback? I have a few peer-reviewed physics publications. The only useful peer-review feedback I got was a reviewer who pointed out a typo in one of my equations. Everything else has been gatekeeping-related comments, which is no surprise given that peer-review mainly serves a status gatekeeping function. I go...
I'm also interested to see the list of candidate instincts. Regarding funding, how much money do you need? Just order of magnitude. There are lots of different grants, and where you want to apply depends on the size of your budget.
"Most AI research focuses on building machines that do what we say. Alignment research is about building machines that do what we want."
Source: Me, probably heavily inspired by "Human Compatible" and that type of argument. I used this argument in conversations to explain AI Alignment for a while, and I don't remember when I started. But the argument is very CIRL (cooperative inverse reinforcement learning).
I'm not sure if this works as a one-liner explanation. But it does work as a conversation starter on why trying to specify goals directly is a bad idea. And ho...
I think one major reason why people don't tend to get hijacked by imagined adversaries is that you can't simulate someone who is smarter than you, and therefore you can defend against anything you can simulate in your mind. This is not a perfect argument, since I can imagine someone who has power over me in the real world, and for example imagine how angry they would be at me if I did something they did not like. But then their power over me comes from their power in the real world, not their ability to outsmart me inside my own mind.
The correct labeling of how violent a knifing is, is not 50.1%, or 49.9%. The correct label is 0 or 100%. There is no "ever so slightly" in the training data. The percentage is about the uncertainty of the classifier; it is not about degrees of violence in the sample. If it was the other way around, then I would mostly agree with the current training scheme, as I said. If the model is well calibrated, then half of the samples rated at 50% would be safe, and half violent. Moving the safe ones up is helpful. Decreasing misclassification of safe samples will increase the c...
There’s one thing you can do that definitely works, which is to only get labels for snippets which are just barely considered safe enough by your classifier. Eg if your threshold is set to 99%, so that a completion won’t be accepted unless the classifier is 99% sure that it’s safe, then there’s no point looking at completions rated as <99% likely to be safe (because the classifier isn’t going to accept them), and also it’s probably a better bet to look at things that the model thinks are 99.1% likely to be safe rather than 99.9%, because (assuming the m
Not sure how useful this is, but I think this counts as a selection theorem. (Paper by Caspar Oesterheld, Joar Skalse, James Bell and me)
We played around with taking learning algorithms designed for multi-armed bandit problems (your action matters but not your policy) and placing them in Newcomblike environments (both your actual action and your probability distribution over actions matter). And then we proved some stuff about their behaviour.
Studying early cosmology has a lot of the same epistemic problems as AI safety. You can't do direct experiments. You can't observe what's going on. You have to extrapolate anything you know far beyond where this knowledge is trustworthy.
By early cosmology I mean anything before recombination (when the matter transformed from plasma to gas, and the universe became transparent to light), but especially anything to do with cosmic inflation or competing theories, or stuff about how it all started, or is cyclic, etc.
Unfortunately I don't know what lessons we ca...
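To illustrate the distinction between bandit problems and Newcomblike environments, here is a minimal sketch of a Newcomb-style reward. It is a hypothetical toy environment and not the setup or a result from the paper: it assumes a predictor that fills the opaque box with probability equal to the agent's own probability of one-boxing, and the payoffs are the usual illustrative ones.

```python
ONE_BOX, TWO_BOX = "one-box", "two-box"

def reward(action, box_is_full):
    """Payoff for a single round, given the action and the box contents."""
    big = 1_000_000 if box_is_full else 0
    return big if action == ONE_BOX else big + 1_000

def expected_reward(p_one_box):
    """Expected payoff of a policy that one-boxes with probability p_one_box.

    Assumption: the predictor fills the box with probability p_one_box,
    so the environment depends on the policy, not only on the sampled action.
    """
    total = 0.0
    for box_is_full, p_box in ((True, p_one_box), (False, 1 - p_one_box)):
        for action, p_act in ((ONE_BOX, p_one_box), (TWO_BOX, 1 - p_one_box)):
            total += p_box * p_act * reward(action, box_is_full)
    return total

# Holding the box contents fixed, two-boxing always pays 1,000 more...
action_gain = reward(TWO_BOX, True) - reward(ONE_BOX, True)
# ...but the policy that always one-boxes has the higher expected payoff.
policy_gain = expected_reward(1.0) - expected_reward(0.0)
```

In a multi-armed bandit the second effect is absent: the payoff of each arm does not change with how often the learner pulls it.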