This is a special post for quick takes by Morpheus. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.



While there is currently a lot of attention on assessing language models, it puzzles me that no one seems to be independently assessing the quality of different search engines and recommender systems. Shouldn't this be easy to do? The only related thing I could find is this Russian site (though it might be propaganda from Yandex, which is listed as the top-quality site?). Taking their "overall search quality" rating at face value does seem to support the popular hypothesis that Google's search quality has slightly deteriorated over the last 10 years (although compared to 2009-2012, quality has stayed basically the same according to this measure).

[Graph: overall search result quality]

The gpt-4-translated version of their blog states that they gave up actively maintaining this project in 2014, because search engine quality had become reasonable by their standards:

For the first time in the history of the project, we have decided to shut down one of the analyzers: SEO pressing as a phenomenon has essentially become a thing of the past, and the results of the analyzer have ceased to be interesting.

Despite the fact that search engine optimization as an industry continues to thrive, search engine developers have made significant progress in combating the clutter of search results with specially promoted commercial results. The progress of search engines is evident to the naked eye, including in the graph of our analyzer over the entire history of measurements:

[Graph: SEO pressing analyzer, share of commercial results over the history of measurements]

The result of the analyzer is the share of commercial sites in the search results for queries that do not have a clearly commercial meaning; when there are too many such sites in the search results, it is called susceptibility to SEO pressing. It is easy to see that a few years ago, more than half (sometimes significantly more than half) of the search results from all leading search engines consisted of sites offering specific goods or services. This is, of course, a lot: a query can have different meanings, and the search results should cater to as many of them as possible. At the same time, a level of 2-3 such sites seems decent, since a user who queries "Thailand" might easily be interested in tours to that country, and one who queries "power station" might be interested in power stations for a country home.

If we are worried that current recommender systems are already doing damage and expect things to get worse, it might be good to actively monitor this so we don't get boiled like the proverbial frog.

If I had more time I would have written a shorter letter.

TLDR: I looked into how much it would take to fine-tune gpt-4 to do Fermi estimates better. If you liked the post/paper on fine-tuning language models to make predictions, you might like reading this. I evaluated gpt-4 on the first dataset I found, but gpt-4 was already making better Fermi estimates than the examples in the dataset, so I stopped there (my code).

First problem I encountered: there is no public access to fine-tuning gpt-4 so far. Ok, I guess we might as well just do gpt-3.5.

First, I found this Fermi estimate dataset. (While doing this, I was thinking I should perhaps search more widely for what kinds of AI benchmarks exist, since a dataset evaluating a similar capability is probably already out there and I just don't know its name.)

Next I looked at this paper, where people used, among others, gpt-3.5 and gpt-4 on this benchmark. Clearly these people weren't even trying, though, because gpt-4 does worse than gpt-3.5. One of the main issues I saw was that they were trying to make the LLM output the answer as a program in the domain-specific language used in that dataset. They couldn't even get the LLM to output valid programs more than 60% of the time. (Their metric compares answers on a log scale: the score is 1 if the LLM's answer is exactly right and 0 if it is 3 or more orders of magnitude away from the real answer: fp-score(x) = max(0, 1 − (1/3)·|log₁₀(prediction/answer)|).)
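For concreteness, here is that metric as a minimal Python sketch (my reading of the formula above; the function name and signature are mine, not from the paper):

```python
import math

def fp_score(prediction: float, answer: float) -> float:
    """Score 1.0 for an exact estimate, decaying linearly to 0.0 at
    three or more orders of magnitude off (in either direction)."""
    return max(0.0, 1.0 - abs(math.log10(prediction / answer)) / 3.0)

fp_score(1e9, 1e9)   # 1.0   (exactly right)
fp_score(2e9, 1e9)   # ~0.90 (a factor of 2 off)
fp_score(1e12, 1e9)  # 0.0   (three orders of magnitude off)
```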


My conjecture was that just using Python instead should give you better results. (This turned out to be true.) I get a mean score of ~0.57 on 100 sample problems with gpt-4-turbo, which is as good as the results they get when they additionally provide "context" by giving the LLM the values of the key variables needed to compute the answer (why would the task even still be hard at that point?).
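The setup I have in mind is just: ask the model to write a Python program, run it, and score the output. A rough sketch of that loop, not the exact code from my repository (the prompt wording and the `answer`-variable convention are illustrative assumptions):

```python
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()

PROMPT = (
    "Solve the following Fermi estimation problem by writing Python code.\n"
    "Reply with only a Python script that assigns your final estimate to a\n"
    "variable named `answer`.\n\nQuestion: {question}"
)

def estimate(question: str, model: str = "gpt-4-turbo") -> float:
    """Ask the model for a small Python program and run it to get its estimate."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
    )
    code = reply.choices[0].message.content.strip()
    fence = "`" * 3  # strip optional markdown code fences before executing
    code = code.removeprefix(fence + "python").removesuffix(fence).strip()
    scope: dict = {}
    exec(code, scope)  # caution: model-written code, sandbox this in practice
    return float(scope["answer"])

# Evaluating a dataset then reduces to:
# scores = [fp_score(estimate(q), a) for q, a in samples]
```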

When gpt-4 turned out to get a worse fp-score than gpt-4-turbo on my 10 samples, I got suspicious. After looking at the samples where gpt-4 got a bad score, it was clear this was mostly due to the poor quality of the dataset. Two reference answers were flat-out confused/not using the correct variables, while gpt-4 was answering correctly. Once, the question didn't make clear which unit to use. In 2 of the samples, gpt-4 gave the better answer: once by using a better approach (using geometry, instead of wrong figures for how much energy the earth gets from the sun, to determine the fraction of the sun's energy that the earth receives), and once by having better input estimates (like how many car miles are driven in total in the US).

So on this dataset, gpt-4 already seems to be at the point of data saturation. I was actually quite impressed by how well it was doing. When I had tried using gpt-4 for this task in the past, I had always felt like it was doing quite badly. One guess is that when I ask gpt-4 for an estimate, it is usually a practical question, which is actually harder than these artificial questions. In addition, the reason I ask gpt-4 at all is that the question is hard, and I expect to need to employ a lot of cognitive labor to answer it myself.

Another data point with respect to this was the "Thinking Physics" exercises, which I tried with some of my friends. On those, gpt-4 was better than people who were bad at them, but worse than people who were good at them and given 5–10 minutes of thinking time (although I did not evaluate this rigorously). GPT-4 is probably better than most humans at doing Fermi estimates given 10 minutes of time, especially in domains one is unfamiliar with, since it has so much more breadth.

I would be interested to see what one would get out of actually making a high-quality dataset by taking Fermi estimates from people whose work in that area I deem high quality.

Probably silly

Quantifying uncertainty is great and all, but it also exhausts precious mental energy. I am getting quite fond of giving probability ranges instead of point estimates when I want to communicate my uncertainty quickly. For example: "I'll probably (40–80%) show up to the party tonight." For some reason, translating natural-language uncertainty words into probability ranges feels more natural (at least to me) and so requires less work for the writer.

If the difference is important, the other person can ask, but it still seems better than just saying 'probably'.

Interesting. For me, thinking/saying "about 60%" is less mental load and feels more natural than "40 to 80%". It avoids the rabbit hole of what a range of probabilities even means: presumably it implies your probability estimates are normally distributed around 60% with a standard deviation of 20%, or something.

Is there anything your communication recipient would do differently with a range than with a point estimate? Presumably they care about the resolution of the event (will you attend) rather than the resolution of the "correct" probability estimate.

There's a place for "no", "probably not", "maybe", "I hope to", "probably", "I think so", "almost certainly", and "yes" as a somewhat ambiguous estimate as well, but that's a separate discussion.

Agreed that the meaning of the ranges is very ill-defined. I am most often drawn to this when I have a few different applicable heuristics. An example of the internals: one is just how likely the event feels when I query one of my predictive engines, and another is some very crude "outside view"/eyeballed statistic of how well I did on this kind of thing in the past. Weighing these against each other causes lots of cognitive dissonance for me, so I don't like doing it.

I am not sure how much of a problem this was, but I felt like listening to more pop music on Spotify slowly led to value drift, because so many songs are about love and partying.

I felt a stronger desire to invest more time to fix the fact that I am single. I do not actually endorse that on reflection. The best solution I've found so far is starting to listen to music in languages I don't understand, which works great!

I hope the fact that I like listening to songs where the singer role-plays as a supervillain isn't affecting me that way, lol

Metaculus recently updated the way they score user predictions. For anyone who used to be active on Metaculus and hasn't logged on for a while, I recommend checking out your peer and baseline accuracy scores over the past years. With the new scoring system, you can finally determine whether your predictions were any good compared to the community median. This makes me actually consider using it again instead of Manifold.

By the way, if you are new to forecasting and want to become better, I would recommend pastcasting and/or calibration games instead, because of the faster feedback loops: instead of weeks, you'll know within 1–2 hours whether you tend to be overconfident or underconfident.

Epistemic Status: Anecdote

Two weeks ago, I was dissatisfied with the number of workouts I do. When I considered how to solve the issue, my brain generated the excuse that while I like running outside, I really don't like doing workouts with my dumbbells in my room, even though that would be a more intense and therefore more useful workout. Somehow I ended up actually thinking and asked myself why I don't just take the dumbbells outside with me. This was of course met by resistance, because it looks weird. It's even worse: I don't know how to "properly" do curls or whatnot, and other people would judge me on that. Then I noticed that I don't actually care that much about people in my dorm judging me; these weirdness points have low cost. In addition, this muscle of rebellion seems useful to train, as I suspect it is one of the bottlenecks that hinder me from writing posts like this one.

I was just thinking that there is actually a way to justify using Occam's razor: by using it, you will always converge on the true hypothesis in the limit of accumulating evidence. Not sure if I've seen this somewhere else before, or if I gigabrained myself into some nonsense:

Let's say the true world is some finite state machine M' ∈ M with input alphabet {1} and output alphabet {0,1}. Now I feed it an infinite sequence of 1s. If I use a uniform prior over all possible finite state automata, then at any step of observing the output there will be a countably infinite number of machines that explain my observations, so my prior and posterior will always be flat and never converge. Instead, I use the prior f: M -> R, m -> 2^-(|m|+1), where |m| is the number of states after which m repeats (I view different automata that always produce the same output as the same machine). With this prior, M' will be my top hypothesis after observing |M'| + 1 bits and will only rise in confidence after that. Since we used finite state automata, we avoid the whole computability business; my intuition is that the argument would go through with Turing machines as well, though it would have to become much more subtle.
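Here is a minimal simulation of that argument (my own sketch: I identify each automaton with its purely periodic output pattern, truncate the hypothesis space at a maximum period, and renormalize over the surviving hypotheses at each step):

```python
from itertools import product

def primitive(pattern: str) -> str:
    """Reduce a pattern to its shortest repeating unit, so automata that
    always produce the same output count as the same machine."""
    n = len(pattern)
    for k in range(1, n + 1):
        if n % k == 0 and pattern == pattern[:k] * (n // k):
            return pattern[:k]
    return pattern

MAX_PERIOD = 9  # truncated hypothesis space, standing in for "all automata"
machines = {primitive("".join(bits))
            for k in range(1, MAX_PERIOD + 1)
            for bits in product("01", repeat=k)}
# The prior from above: weight 2^-(|m|+1) for a machine repeating after |m| steps.
prior = {m: 2.0 ** -(len(m) + 1) for m in machines}

true_machine = "101"  # M', with |M'| = 3
stream = (true_machine * 10)[:20]

for t in (1, 3, 4, 8, 20):
    obs = stream[:t]
    consistent = {m: p for m, p in prior.items() if (m * t)[:t] == obs}
    top = max(consistent, key=consistent.get)
    print(f"after {t:2d} bits: top hypothesis {top!r}, "
          f"posterior {consistent[top] / sum(consistent.values()):.2f}")
# '101' becomes the top hypothesis after |M'| + 1 = 4 bits and only gains
# confidence from there (it reaches 1.00 here only because of the cutoff).
```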

I rechecked Hutter on induction (https://arxiv.org/pdf/1105.5721.pdf) and the convergence stuff is already known. Going to recheck logical induction. I think maybe Occam's razor is actually hard to justify. What is easier to justify is using a prior that will actually converge if there is any explanation at all (i.e., your observations aren't random noise).

Ok yeah. Logical induction then just works because you don't expect any adversaries among math truths.

All of this is just me getting annoyed at the no-free-lunch theorem for trying to be objective, but one thing I'd find interesting is what happens if you start out with very different priors.

I like the agreement voting feature for comments! Not only does it change the incentives/signals people receive; I also notice that when deciding whether to press the button, I more often ask myself whether I merely endorse a comment or whether I actually believe it. Which seems great. I do find the added time spent considering whether to press a button costly, but for this particular button that seems more like a feature than a bug.

I also notice my disappointment that this is not possible for shortforms... yet?

The feature works for newer posts created after it was released, including newer shortforms. If multiple posts could take on the role of shortform repositories, agreement voting would work for the newer ones.

yeah it does seem like we should just fix this.

Summary: I have updated towards being more conscientious than I thought.

Since most of the advice on 80,000 Hours is aimed at high-performing college students, I find it difficult to tell how much of this advice should apply to me, someone who just graduated from high school. Previously, I had thought of myself as talented in math (I was the best in my class of 40 students since first grade), but mid to below average in conscientiousness. I also feel slightly ashamed of my (hand)writing: most of my teachers commented that my texts were too short, and my handwriting is not exactly pretty. I was diagnosed with ADHD at eleven, and even with medication my working memory is pretty bad. Even though I have started to develop strategies to cope with my disabilities, I wasn't sure how I was doing compared to the classmates who performed better in writing, so I just assumed they must be way more productive. Recently I thought it would be interesting to predict my final grades by predicting the grade for every subject using Guesstimate (unfortunately, I later put in the grades I actually got without saving my initial model; if someone is interested, I can try to recreate it). This proved more useful than I thought: it was a major update toward me being more productive and conscientious than I had assumed, compared to the rest of my class.

  • Together with another student, I got the highest grade in German (my native tongue) in my final exam (13 out of 15 points), because I had practiced writing exams.
  • I think I would not have realized that I had false assumptions about my performance if I had not seen the difference between my prediction and the outcome (there were 3–4 additional students in my class who I thought would do better than me in the final exam).
  • It is not like I was bad at German before, but I attributed a lot of the credit to my teacher liking me. Since a second teacher graded my final exam, this effect shouldn't have been as strong.
even with medication, my working memory is pretty bad

To say the obvious: take notes about what you learn. (I am not recommending any specific note-taking method here, only the general advice that a mediocre system you actually start using today is better than a hypothetically perfect system you only dream about.) It really sucks to spend a lot of time and work learning something, then not use it for a few years, then find out you have actually forgotten everything.

This usually doesn't happen in high school, because elementary and high school education is designed as a spiral (you learn something, then four years later you learn the more advanced version of the thing). But at university you may learn a thing once, and maybe never again.

I was the best in my class of 40 students

How much this means, you will only find out later, because it depends a lot on your specific school and classmates. I mean, it definitely means that you are good... but is it 1:100 good or 1:1,000,000 good? In high school both are impressive, but in later life you are going to compete against people who often were also the best in their classes.

I mean, it definitely means that you are good... but is it 1:100 good or 1:1,000,000 good? In high school both are impressive, but in later life you are going to compete against people who often were also the best in their classes.

Update after a year: I am currently studying CS, and I feel like I got kind of spoiled by reading "How to Become a Straight-A Student", which is mostly aimed at US college students; it was hard to sort out which kinds of advice would apply in Germany, and the book made the whole thing seem easier than it actually is. I am doing ok, but my grades aren't great (my best guess is that in pure grit+IQ I'm somewhere in the upper 40%). In the end, I decided that the value of this information wasn't so great after all, and now I am focusing more on how to actually gain career capital and on getting better at prioritizing day to day.

Testing a claim from the lesswrong_editor tag about the spoiler feature: first trying ">!":

! This should be hidden

Apparently markdown does not support ">!" for spoiler tags. Now trying ":::spoiler ... :::":

It's hidden!

works.

Inspired by John's post on How To Make Prediction Markets Useful For Alignment Work, I made two markets (see below):

I feel like there are pretty important predictions to be made around things like whether the current funding situation will continue as it is. It seems hard to tell, though, what kind of question to ask that would provide someone more value than just reading something like the recent post on what the marginal LTFF grant looks like.

Has someone bothered moving the content on Arbital into a format where it is (more easily) accessible? By now I have figured out that (and where) you can see all the math- and AI-alignment-related content, but I only found it by accident, when Arbital's main page actually managed to load, unlike the other 5 times I clicked on its icon. I had already assumed the content was nonexistent, but the site is just slow as hell.

It mostly works lately (after a period of months or years of mostly not working), but the greaterwrong viewer seems more reliable.

Thanks! This looks like the solution I was looking for!

Would this not be better as a Question post?

I wonder if you could exploit instrumental convergence for inverse reinforcement learning (IRL). For example, for humans we lack information about, we would still guess that money would probably help them. In some sense, most of the work here is probably done by the assumption that the human is rational.

Epistemic status: Speculation

"Everyone" is misinterpreting the implications of the original "no-free-lunch theorems". Stuart Armstrong is misinterpreting the implications of his no-free-lunch theorems for value learning.

The original no-free-lunch theorems show that if you use a terrible prior over your hypotheses, learning will not converge/is impossible. In practice this doesn't matter, because we always make the assumption that learning is possible. We call the priors that work "simplicity priors", but the important bit about them is not the simplicity; it's that they converge. Now, Stuart says that using a simplicity prior is not enough to make value learning converge, and thus we must use additional assumptions (taken from human knowledge). Maybe I am reading too much into it, but the implied problem is telling which assumptions might just be a product of our cultural upbringing or our evolutionary tuning, and it would be hard to say what to judge those values by. But this is wrong! The value assumptions shouldn't have a different status from the "simplicity prior" ones. If his theorem applied to humans in this way, then humans would never have been able to learn values. What we must include in our prior is not some knowledge humans acquired about their environment, but the kind of prior that humans already have before they receive any inputs!

While reading P vs. NP for dummies recently, I was really intrigued by Scott's probabilistic reasoning about math questions. It occurred to me that of all the sciences, math seems like a really fruitful area for betting markets: compared to areas like psychology, where you have to argue about what the results of studies actually mean, mathematicians seem better at reaching consensus (markets could potentially also help uncover the areas where this is not the case?). I also just remembered that there are a few math-related questions on Metaculus, but their aggregation method seemed badly suited for math questions (I changed my mind after thinking of the considerations below), because I'd expect thinking more about a question to have a higher payoff in this area, so it seems desirable for people to be able to stake more money.

So why are there no betting markets for math conjectures? Why would it be a bad idea? Some ideas:

  • The cutting edge is hard to understand and probably also not the most useful place to bet. Results that are actually uncertain take a long time for laypeople to understand.
  • Results that haven't been resolved for a long time, on the other hand, will probably stay unresolved for a long time. This is an issue because it skews prices and incentives in ways that lead to "wrong" prices: I won't bother correcting the price on P≠NP if it is not significantly off, when I can make more money by just investing the same capital elsewhere (see the sketch after this list).
  • In Math you can always ask more arbitrary questions and most of them would never be resolved, because no one bothers with them. (It would be hard to keep only relevant questions)
  • The intuitions behind the math problems and the ideas behind the proofs are the things people are actually interested in. When someone makes a conjecture, the intuition that led her to it is the actually interesting bit. Pure probabilities are bad at transferring these intuitions.
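To put toy numbers (mine, purely illustrative) on the incentive problem from the second bullet:

```python
# Suppose YES on some famous conjecture trades at 0.95, you are certain it
# will resolve YES, but resolution is ~30 years away.
price, years = 0.95, 30
annualized = (1.0 / price) ** (1.0 / years) - 1.0
print(f"{annualized:.2%} per year")  # ~0.17%/year from correcting the price
print(f"{1.05 ** years:.1f}x")       # ~4.3x from a ~5%/year index fund instead
# So no one has an incentive to fix mildly wrong prices on long-horizon questions.
```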

“Causality is part of the map, not the territory.” I think I had already internalized that this is true for probabilities, but not for “causality”, a concept I don't have a solid grasp of yet. This should be sort of obvious, and it's probably written somewhere in the sequences, but not realizing it made me very confused when thinking about causality in a deterministic setting after reading the post on finite factored sets in pictures (causality doesn't seem to make sense in a deterministic setting). Thanks to Lucius for making me realize this.

[This comment is no longer endorsed by its author]

There isn't any strong reason to believe either "probability is in the map" or "causation is in the map", mainly because there aren't good reasons to believe it's a dichotomy.

Hm… maybe? Do you have a specific example, or links you have in mind when you say this? I am still having trouble wrapping my head around this and plan to think more about it.

If you didn't get the idea from https://www.lesswrong.com/posts/f6ZLxEWaankRZ2Crv/probability-is-in-the-mind ...where did you get it from?

Yeah, I know that post. I give Jaynes most of the credit for further corrupting me. I was mostly hoping for good links on how to think about causality, something pointing towards a solution to the problems mentioned in this post. I kinda skimmed "The Book of Why", but did not feel like I really understood the motivation behind do-calculus. I still don't really understand the justification behind saying that x, y, z are random variables. It seems like saying "these observations should all be bagged into the same variable X" is already doing huge legwork in terms of what is able to cause what. I kinda wonder whether you could do something similar to implications in logic, where you say "assuming we put these observations all in the same bag, this bag causes this other bag to have a slightly different composition", but if we bag them a bit differently, causation looks different.

Well, I responded to That Post, and you can tell my response was good, because it was downvoted.

Do you read the comments?

Do you read non-rationalsphere material? It's not like the topic hasn't been extensively written about.

It's likely that the mainstream won't tell you The Answer, but if there isn't an answer, you should wish to believe there is not an answer. You should not force yourself to "internalise" an answer you can't personally understand, and that has objections to it.

Do you read the comments?

Wups...that might be a bug to fix. My excuse might be that I read the post before you made the comment, but I am not sure if that is true.

It's likely that it won't tell you The Answer, but if there isn't an answer, you should wish to believe there is not an answer. You should not force yourself to "internalise" an answer you can't personally understand, and that has objections to it.

I think you are definitely pointing at a failure mode I've fallen into a few times recently, but mostly I am not sure I understood what you mean. I also think my original comment failed to communicate how my views have actually shifted: after fiddling with binary strings a bit and trying to figure out how I would model causal chains there, I noticed that the simple way I wanted to do it didn't work, and my naive notion of how causes work broke down. I now think that when the world is fully deterministic, "probabilistic causality" is a property of the maps of agents in that world; but mostly I am still very confused. I don't actually have anything I would call a solution.

I made the comment over a year ago ... and the question was whether you read the comments in general.

It should be obvious that if the territory is deterministic, the only remaining place for possibilities/probabilities to reside is in the map/mind. But it isn't at all obvious that the territory is deterministic.

I often do read the comments, though not very deliberately, so I don't have a good estimate of how often or how many I read (probably most comments, if I find the topic interesting and I feel like the points in the post weren't obvious before I read it). I almost never scroll through the "Recent discussion" feed, so I miss a lot of comments if I read a post early on and people comment later.

The point is that there is often a good counterargument to whatever is being asserted in a post. Sometimes it's in a comment on the post itself, which is easy and convenient, and sometimes it's on another website or in a book. Either way, rationality does not consist of forcing yourself to adopt a list of "correct" beliefs.
