Here I will try to describe the scientific method in simple words: what it is, how it works, and how it can be broken into a list of easily understandable, self-contained ideas, which this article calls 'points'.

First of all, the scientific method is not carved on tablets, like the commandments of Moses; it can change and update itself. This article compiles the points of the scientific method that are more or less recognized by the global consensus at the time of writing (or the last update). At the same time, this method is the best we have so far for constructing both abstract concepts of the world and, ultimately, working and useful technologies.

There is no ultimate goal in science. A goal is bad; more precisely, one universal goal is bad. If you set out to make everyone happy, one of the methods to achieve this goal would be to kill all the unhappy people. Therefore, science does not fix a goal in order to choose a method for it — it fixes a method, and the goal it leads to or with which it is applied may change in the process.

The scientific method can update itself — discard outdated points, add new ones, or reformulate existing ones; it is similar to a computer program that can rewrite its own code. Yes, this means that we do not know whether this program will someday hang, start going in circles, or delete itself; but we have enough examples of the scientific method succeeding, and of attempts to abandon it failing, that we stick to it despite everything.

To a large extent, the scientific method tells us what not to do — it is not an action algorithm (if you are looking for an algorithm, you need Solomonoff's theory), but rather a filter through which we pass any algorithms, research, conclusions, and so on; what remains after all the sieves is what we call science. In each of the points, I will try to give an example of what this point can filter out (declare to be non-science).

The scientific method allows us to construct hypotheses — that is, individual statements (assumptions) about our reality; as well as theories or research programmes — large systems of such statements. Note that we do not use words like "fact" — because any statement can be doubted. A hypothesis that is currently considered closest to the truth (among all hypotheses surrounding one problem) is called a working hypothesis.

By the premise of this article, the scientific method consists of points, so let's just look at them all one by one. In addition to describing the points themselves, I will also point out the known weaknesses of each, so as not to create the illusion that this current version of the scientific method is perfect or final. I will arrange each point in four parts: Aphorism, a short formulation for memorization; Content, a full textual description; Anti-example, an illustrative case of violating the point; Flaw, a known drawback of the point and its possible weak spot.

1. Occam's Razor

Aphorism: Entities must not be multiplied beyond necessity.

Content: We must always start with the simplest possible explanations for the phenomena that surround us. The fewer components, abstractions, or entities required for a hypothesis, the better the hypothesis. We first accept a simple explanation, then complicate it if necessary; after all, complicating is easy, while simplifying is difficult.

Anti-example: I come home and see that a chair is placed next to the fridge, and there are fewer cookies in the jar on the fridge than there were when I left. I assume that a hungry, short thief sneaked into my apartment and decided not to take anything valuable. At the same time, there is a simpler hypothesis: my daughter decided to eat the cookies while I was not at home. (Example from here.)

Flaw: It is impossible to say objectively which hypothesis is simpler. Ideas about the simplicity of hypotheses are based on our intuition; the best formalization of Occam's razor we have is Kolmogorov complexity. It equals the length of the shortest computer program in a given language that constructs a universe described by the corresponding hypothesis. However, Kolmogorov complexity can be manipulated by artificially writing a special programming language for a particular hypothesis in which that hypothesis becomes simple, that is, in which the required universe is created by a single built-in command. Moreover, Kolmogorov complexity is algorithmically uncomputable: we cannot simply build a calculator for this value. It may also turn out that hypotheses that now seem complicated to us will become simpler over time, as new ways are found to describe them in fewer lines of code (even within an existing programming language).
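
Kolmogorov complexity itself cannot be computed, but a crude, compressor-relative stand-in for it can be: the length of a compressed description. Below is a minimal Python sketch; the hypothesis texts are invented for the cookie example above, and the choice of compressor plays the role of the "given language", which is exactly the manipulation loophole just described.

```python
import zlib

def complexity_proxy(description: str) -> int:
    # Length of the zlib-compressed description: a rough upper bound on
    # Kolmogorov complexity, relative to one fixed "language" (the compressor).
    return len(zlib.compress(description.encode("utf-8")))

daughter = "My daughter ate some cookies while I was out."
thief = ("A hungry, short thief picked the lock, moved a chair to the fridge, "
         "ate some cookies, took nothing of value, and left without a trace.")

print(complexity_proxy(daughter) < complexity_proxy(thief))  # True: fewer entities to describe
```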

2. Basic Laws of Logic (Boolean Algebra)

Aphorism: At least we are not crazy.

Content: We can operate on statements about the world using Boolean algebra: some statements are true, some are false, and some statements which are known to be true can be used to form new true statements according to certain rules. We will use traditional notation for logical operations: ¬ for "not"; ∧ for "and"; ∨ for "or". Aristotle was the first to try to formalize these rules; in his presentation, there were three main laws:

  1. The law of identity: A = A. Any statement remains the same statement no matter how many times it is repeated; its truth value cannot change. (In modern computer terms, this can be thought of as statelessness and/or immutability.)
  2. The law of contradiction: ¬(A∧¬A). Two opposing statements "A" and "not A" cannot both be true.
  3. The law of the excluded middle: A∨¬A. Of two opposite statements, at least one is true. (Together with the previous law, this means that exactly one of two opposite statements is true.)

Roughly speaking, these three laws can be considered the axiomatics of Boolean algebra.
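
Since a Boolean variable takes only two values, these axioms can be sanity-checked mechanically; a trivial Python check:

```python
for A in (False, True):
    assert A == A             # 1. law of identity
    assert not (A and not A)  # 2. law of contradiction
    assert A or not A         # 3. law of the excluded middle
print("All three laws hold for both truth values.")
```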

The "fourth law of Aristotle" is a nickname sometimes given to the law of sufficient reason (it was formulated by Leibniz, but was known before him as well). It states that for every true statement there is a justification; this law does not belong to the axiomatics of Boolean algebra. Moreover, Gödel's incompleteness theorems actually refute it: in strong formal systems, there are always true statements that cannot be derived as theorems. That is why we do not include the "fourth law" in this list, and we postpone the issues it raises (not in the context of formal systems, but in the context of hypothesizing) to other points of the scientific method (see point 7, the principle of reproducibility).

Anti-example: Zero is equal to one. Let's add one to both sides of the equation: one equals two. The Pope and I are two people; let's apply our equality: the Pope and I are one person. (With this example, Bertrand Russell demonstrated the principle of explosion: as soon as we admit at least one contradiction (one pair of statements A and ¬A in which both statements are true; here, 0 = 1 and 0 ≠ 1 simultaneously), we can immediately derive from it any syntactically correct statement. A strong formal system that contains a contradiction cannot distinguish between true and false statements at all.)
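
The principle of explosion itself is a one-line theorem. Here is a sketch in Lean, using the standard `absurd` lemma: from a proof of A and a proof of ¬A, any proposition B follows.

```lean
-- A contradiction proves an arbitrary proposition B,
-- whether B is "the Pope and I are one person" or anything else.
example (A B : Prop) (ha : A) (hna : ¬A) : B :=
  absurd ha hna
```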

Flaw: We assume that the laws of logic themselves have been formulated correctly in this point. This is equivalent to what I call the Mathematicity Hypothesis (the assumption that ZFC — the Zermelo-Fraenkel axiomatics — is consistent), or even weaker than it (it may be that ZFC is inconsistent while the laws of logic still work, but definitely not the other way around); it is theoretically possible that our world runs on some other axiomatics, and ZFC and Boolean algebra are just a very good approximation to it, just as Newtonian mechanics is an approximation of relativistic mechanics at low speeds and masses. However, this alternative axiomatics must be very cumbersome: it must contain either very long axioms or a lot of axioms (or both), because we have already searched through the space of "small" axiomatics (with the help of computers) and have not found the slightest hint of a possible alternative. If the Mathematicity Hypothesis is true (which is now the global scientific consensus), then ZFC is able to analyze any formal system, and the laws of logic work; if not, then we are actually doomed: the capacity of our brains (and perhaps of any computers we can build) is simply not enough to process this world.

3. Descriptive Method

Aphorism: Describe, do not interpret.

Content: First of all (during the initial data collection), we should describe the phenomena we see; that is, answer the questions "What?" and "What kind of?". Other questions ("Why?", "How?", "For what purpose?") should wait their turn; only systems of well-formed hypotheses can answer them. This way, we can avoid second-guessing in the course of observations. History and archaeology consist almost entirely of the descriptive method; there is a view that the sciences are basically divided into descriptive sciences and design sciences (those that create technologies).

Anti-example: An entry in a naturalist's journal: "The ostrich stuck its head in the sand to avoid seeing the danger." (It should be: "1. Danger arose. 2. The ostrich stuck its head in the sand." Even better would be not to use the word "danger" but to describe the event itself, for example, the emergence of a predator. In reality, the ostriches are probably just using the sand to clean their feathers).

Flaw: On the one hand, it is impossible to achieve a pure descriptive method because our words are inherently vague approximate categories in which one or another interpretation is constantly embedded, no matter how hard we try to clean it out. On the other hand, scientists themselves often consider the descriptive method to be "low science" and treat it with disdain: "All science is either physics or stamp collecting" (Ernest Rutherford).

4. Experiment

Aphorism: Ask reality what it is like.

Content: The unit of scientific activity is an experiment, that is, creating certain known conditions that lead to consequences that are unknown in advance. The conditions are known, meaning we assume we can recreate them an unlimited number of times; the consequences are unknown, meaning we will adjust our ideas about reality (and the directions of further experiments) depending on the results of the experiments we observe. (There is also a distinction made for passive experiments, where we do not intentionally create conditions but simply observe them in nature; passive experiments can be attributed to the previous point "Descriptive method"). The concept of an experiment is further elaborated in the following points of the scientific method — 5, 6, 7, and 9.

Anti-example: All men are mortal. Socrates is a man. Therefore, Socrates is mortal. (In this reasoning, the general scheme of which is known as a "syllogism," Aristotle hopes to gain knowledge about reality based only on logic and general notions. In reality, whether Socrates is mortal can only be confirmed or refuted by an experiment: Socrates drinks hemlock and dies (or does not die).) Credits go here.

Flaw: Not all experiments can be performed, and no experiment can be performed a truly infinite number of times. Some experiments are unethical (forcing Socrates to drink hemlock); some are technologically unattainable (colliding two black holes) and can only be passively observed when nature itself offers us such an opportunity; some exhaust limited resources (studying an endangered species or a unique historical artifact; and we only had one Socrates as well); and some experiments are dangerous (nuclear chain reactions).

5. The Principle of Testing

Aphorism: First comes the hypothesis, then the experiment. / A regularity exists if it continues to hold after it has been formulated.

Content: Before conducting an experiment, we must accurately formulate the hypothesis we are testing in it. Ideally, each experiment should answer the question "Yes" or "No" (rather than "Red," "Twenty-two," or "Because they are forced to expend energy"), that is, provide a minimal amount of discrete information—one bit ("The hypothesis holds" or "The hypothesis is refuted"). If the hypothesis has not yet been formulated and the experiment has already been conducted, we can tailor the hypothesis as accurately as possible to its results since we have virtually unlimited freedom in combining words, concepts, and structures when formulating a hypothesis. Hypotheses not tested by any experiment are essentially essays on a free topic.

Anti-example: People living near high-voltage lines were screened for various diseases (Sweden, 1992), and it was found that the likelihood of childhood leukemia among these people was higher than the average for the population of the region. As it turned out, this was a Texas Sharpshooter Fallacy (named after a joke character who first shoots and then draws the target where his bullets hit): the number of diseases screened for was so high (about 800) that one of them just happened to correlate with distance from the power lines; further experiments did not confirm the correlation. If you think about it, this experiment was supposed to provide the researchers with much more than one bit of information: it chose one disease out of eight hundred, about 10 bits (or 800 bits, if any subset of the diseases was allowed to correlate). Experiments do not work this way; they start lying from the second bit onward. A large amount of information can only be obtained from a large number of different experiments (see also point 8, the Sagan standard). (Disclaimer: there are other explanations for the leukemia-electricity correlation as well, such as a confounding variable.)

Flaw: Often it is impossible to formulate a good hypothesis right away, and the first few experiments have to be done at random. Fundamentally, we do not have a discrete space of options in which we could conduct only strictly dichotomous experiments and choose one of two branches each time; in the complete absence of scientific knowledge, we are in an unlimited space of possible realities, and no number of bits of information will get us out of this state (see also "pessimistic meta-induction" in the Flaw section of point 11, Popper's Criterion). Another problem is that sometimes we only have historical data and cannot conduct new experiments. Then methods of so-called "post hoc" analysis are used. For example, we can pretend that we do not know part of the data, construct a hypothesis based on the "available" part of the dataset, derive predictions from this hypothesis, and then compare them with the "hidden" part of the data. But here everything depends on how well we can "forget" information, prohibit ourselves from using it, become independent of it, and also on whether we have avoided choosing for blinding precisely the part of the data that is easiest to predict.
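
A minimal Python sketch of that blinding idea (the helper names `fit` and `predict` and the 30% holdout are illustrative assumptions, not a standard recipe):

```python
import random

def post_hoc_check(records, fit, predict, holdout_frac=0.3, seed=42):
    # Shuffle once with a fixed seed, then "forget" part of the data:
    # the hypothesis is built on `visible` only and judged on `hidden` only.
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    hidden, visible = shuffled[:cut], shuffled[cut:]
    hypothesis = fit(visible)           # formulated without peeking at `hidden`
    return predict(hypothesis, hidden)  # scored against the "forgotten" part
```

Note that the weak spot named above survives in code form: whoever is free to rerun this with different values of `seed` can keep re-splitting until the hidden part happens to be easy to predict.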

6. Null Hypothesis

Aphorism: Nothing exists until the opposite is demonstrated.

Content: Just as there is a presumption of innocence in law — a person is not guilty until proven otherwise — there is also the principle of the null hypothesis in science, which states that in the absence of an experiment, the working hypothesis should assume the nonexistence of something (a phenomenon, pattern, correlation, or difference between values). The success of an experiment lies not so much in confirming its primary (positive or alternative, that is, non-null) hypothesis as in refuting the null one (see also point 11, Popper's criterion). Initially, we assume that "The Chupacabra does not exist"; then we experimentally establish that "It is not true that 'The Chupacabra does not exist'".

Anti-example: Prove to me that the Chupacabra doesn't exist! (This is a common cognitive error known as "Shifting the burden of proof": the one who defends the positive hypothesis should prove their position. It is impossible to go through all the objects in the world and make sure none of them is the Chupacabra; instead, one should first present arguments in favor of the existence of the Chupacabra, and then the opponent can attack these arguments, look for weaknesses in them — this would be (in case of success) the "proof" of the null hypothesis, as in the absence of valid evidence for the alternative hypothesis, the null hypothesis is again assumed as the working one.)

Flaw: As in the case of point 1, Occam's razor, there are no absolutely objective criteria for constructing the null hypothesis. There are three main problems in formulating the null hypothesis.

  • The first is that the null hypothesis must be an exact hypothesis ("The probability of a boy's birth is exactly 50%"), and not a range hypothesis ("The probability of a boy's birth lies in the range between 49% and 51%"); the formal probability of virtually any exact hypothesis is zero (which, however, does not rule it out completely — probability theory operates with so-called almost impossible events, whose probability is zero but which can still happen).
  • The second problem is the issue of symmetry of the null hypothesis: in some cases, the alternative hypothesis will be symmetrical ("The probability of a boy's birth is different from 50%", that is, either higher or lower); in others, it will be asymmetrical or "single-tailed" ("The probability of a boy's birth is less than 50%"). This will determine the calculation of the probability of error in the experiment, which will be discussed in point 10, Statistical Significance.
  • The third problem is that with a bit of verbal trickery, you can hide a second negation in the sentence and claim that "There is no absence of correlation here" is your null hypothesis! (A similar thing once happened to me while discussing what the null hypothesis in mathematics should be, and it seems that both the Mathematicity Hypothesis and its negation can be seen as null from different angles: the first is the absence of contradictions of the kind A∧¬A among theorems, the second is the absence of any distinction between theorems and non-theorems.)

7. The Principle of Reproducibility (Nullius in verba)

Aphorism: I don't believe any words.

Content: All hypotheses should be based on experiments that can be reproduced. Words, notes, written records, etc., are not a source of scientific knowledge; they are only intermediate storage for information on its way from the experiment to the consumer of scientific theory. A newspaper article is not documentation of a scientific hypothesis or experiment; we can only trust articles in peer-reviewed journals, and even then only because it is easy to write a critical review of such articles (and thus, even in this case, there is still a chance that such a review has not yet been written, or that the journal has turned out to be disreputable and does not publish refutations of previously published papers). This is where Aristotle's (somewhat modified) "fourth law" finally comes into play: every scientific belief must have a material justification (in the form of an experiment design) or a chain of thoughts and ideas leading to such a justification.

Related to replicability is the concept of reproducibility, in which the design of an experiment is not required to be identical; only its conclusions have to be the same. For example, sometimes several groups of scientists independently obtain the same results at the same time. That they have been working on the same open problem known to world science is an understandable and highly probable event; but the observation that they arrive at the same answer without coordinating with each other most likely indicates that this answer is correct in some sense, and not just chosen at random among possible options based on situational factors that may differ from group to group (availability of resources, personal preferences, etc.). Similarly, to increase the reliability of the results of a particular study, you can send samples of the test material to different laboratories and prohibit them from communicating with each other for the duration of the study; the overlap in the answers provided by these laboratories reduces the chance of error, since achieving such agreement otherwise would require them all to be wrong in exactly the same way.

Anti-example: Scientists have proven that prayer changes the structure of water. (What scientists? How did they prove it? Where is the link to the study in a peer-reviewed journal? If I am also a scientist, how exactly can I repeat this experiment? Where can I find step-by-step instructions for reproducing it? These are all legitimate questions to ask someone who makes a claim like this without proper references.)

Flaw: None of us has enough lifetime to reduce all scientific knowledge down to our own experiments. For the most part, we still rely on the assumption that, at least on average, scientific journals are honest and conspiracy theories are mathematically impossible. Only then can we trust at least the most consensual core of science and focus our search on more problematic issues that lack reliable information and require our efforts and research.

8. The Sagan (Laplace) Standard

Aphorism: Extraordinary claims require extraordinary evidence.

Content: The Sagan Standard is a kind of law of conservation of information in science: to significantly change the scientific consensus, you need a lot of input data that does not fit into it (or "sufficiently strange data", the equivalent formulation in classical information theory, where the amount of information in a message equals its unexpectedness, its surprisal). You cannot take an ordinary experimental result and draw extraordinary conclusions from it. One small deviation from the predictions cannot destroy a strong scientific theory (see also point 13, Lakatos' sophisticated falsificationism, namely the protective belt of the research programme).
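
In Bayesian terms, this "conservation law" is plain odds arithmetic: the lower the prior odds of a claim, the more bits of evidence (log-likelihood ratio) are needed to raise it to a given posterior. A small Python illustration; the prior and posterior values are invented.

```python
import math

def bits_of_evidence_needed(prior: float, posterior: float) -> float:
    # Bayes' rule in log-odds form: the required evidence, in bits, is the
    # log2 likelihood ratio that moves the prior odds to the posterior odds.
    prior_odds = prior / (1 - prior)
    posterior_odds = posterior / (1 - posterior)
    return math.log2(posterior_odds / prior_odds)

print(bits_of_evidence_needed(0.5, 0.99))   # ordinary claim: ~6.6 bits of evidence
print(bits_of_evidence_needed(1e-9, 0.99))  # extraordinary claim: ~36.5 bits
```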

Anti-example: Romer's gap in the Carboniferous period does not fit into the general theory of evolution; hence, life was created by aliens. (It is much more economical — in terms of adding new entities, which point 1 warns us against — to explain Romer's gap by the incompleteness of the data we have collected about the Carboniferous period than by the falsity of the entire theory of evolution.)

Flaw: If you apply the Sagan standard too strongly, you can easily arrive at dogmatism: accepting only those new hypotheses that fit into the already established consensus. It is also necessary to take into account that scientific revolutions (according to Thomas Kuhn) also constitute progress, and we have to look for new knowledge among paradoxical findings as well. In a sense, the Sagan standard slows down the movement of science: we cannot instantly throw out whole layers of theories, replace bunches of points of the scientific method at once, and so on. We can move through the space of possible knowledge systems only continuously, in small steps, not in leaps and bounds; this can lead to a situation where we start out in a certain "valley" of this space and cannot get out of it to other (no less fertile) "valleys" because we have forbidden ourselves to jump over the "mountains." Nevertheless, there remains a possibility that someday the scientific method will be able to reject the Sagan standard and make a great leap.

9. Observer Effect

Aphorism: An experiment is an intervention.

Content: We must always remember that point 4, Experiment, is different from point 3, Descriptive method ("passive experiment"). When we perform active actions to create certain conditions, these actions may have side effects that we did not take into account.

  • In quantum mechanics, the observer effect is manifested in the notion that in order to observe a microscopic particle, we have to bombard it with another microscopic particle and thus affect its trajectory; but this is not the only case of the observer effect.
  • The Hawthorne effect is a change in people's behavior when they know they are being experimented on.
  • The observer-expectancy effect is the unconscious actions of the researcher aimed at obtaining the result he or she anticipated.
  • Intervening in a computer program to investigate an error that occurs in it can make the error disappear (for example, the entire system slows down slightly under a debugger, changing the order of certain asynchronous events) — a phenomenon known as a "heisenbug".
  • In classical physics, the probe effect can be encountered when a system is overloaded with measuring devices.

To combat the observer effect, efforts should be made to maximize the informational isolation of the experiment. For example, in medicine the standard of research is not just blinding (subjects do not know who among them receives the medication and who receives the placebo), but double-blinding (the staff who give the subjects the drugs also do not know which container holds the drug and which the placebo; they only have numbered containers or pills in front of them, and the correspondence between numbers and groups is stored in the computer's memory, undisclosed until the end of the experiment).
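
A Python sketch of the bookkeeping behind double-blinding (the container numbering, group labels, and seed handling are invented for illustration):

```python
import random

def double_blind_key(n_subjects: int, seed: int = 2024) -> dict:
    # Randomly assign each numbered container to drug or placebo.
    # Staff and subjects see only the container numbers; this mapping
    # stays sealed (e.g., on a separate machine) until all outcomes
    # have been recorded.
    rng = random.Random(seed)
    groups = ["drug", "placebo"] * (n_subjects // 2)
    rng.shuffle(groups)
    return {container: group for container, group in enumerate(groups, start=1)}

sealed_key = double_blind_key(8)  # consulted only at the unblinding step
```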

Anti-example: A horse named "Clever Hans" seemed to be able to perform arithmetic operations, tapping out the right number of hoofbeats in response to a question. (As it turned out, the horse was actually just reacting to the unconscious body movements of its trainer, who expected Hans to give the correct answer every time. Note that the trainer made this systematic mistake without any malicious intent — he deceived himself no less than the audience of the show.)

Flaw: The observer effect is probably fundamentally irreducible — at least in quantum physics. We will still have to formulate our picture of the world from our own point of view as its "interactors", not from the height of an absolute objective theory.

10. Statistical Significance

Aphorism: What is the probability that this happened by accident?

Content: Suppose we have a coin, and we suspect that it is unbalanced (when flipped, it comes up one side more often than the other). We have formulated the null hypothesis ("The coin is balanced"), created an experiment design ("Toss the coin and record which side it came up on"), conducted 1000 repetitions of this experiment, and found that the coin came up heads 517 times. How do we know if this is enough to reject the null hypothesis? For this, we need the p-value, a way to numerically measure the strength of the collected data. The lower the p-value, the better for the alternative hypothesis. To calculate the p-value, we accept the null hypothesis ("The coin is balanced") and calculate the probability that a dataset produced by it will be like the observed one or even stranger (this is the "Bayesian inversion": we start from the hypothesis and try to get from it to reality, not vice versa). In our case, we need to calculate the probability that a balanced coin (with a 50% chance of landing heads) will land heads 517 times or more per thousand tosses (this is for the asymmetric alternative hypothesis (see point 6); if our suspicions do not specifically hint that the coin lands heads more often, then the hypothesis becomes symmetric, and we must also add the probability that it will land heads 483 times or fewer). It turns out that in the symmetric case this probability is approximately 29.7%, and in the asymmetric case it is 14.8%. These are very high values; in other words, the null hypothesis explains our results well, it predicts our reality with high probability, and we do not need to come up with another hypothesis to replace it. The standard threshold is a p-value limit of 5% — if the null hypothesis generates something similar to the observed world in less than one case in twenty, then it makes sense to look for an alternative to it; however, there is some debate that this threshold is too high and that we should tighten the requirements for research, lowering it to 0.5% or 0.1%.
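
The coin numbers above are easy to reproduce; a sketch assuming SciPy is available (as in the text, the one-tailed value is the probability of 517 or more heads under the null, and the symmetric case adds the probability of 483 or fewer):

```python
from scipy.stats import binom

n, heads = 1000, 517

one_tailed = binom.sf(heads - 1, n, 0.5)                # P(X >= 517) for a fair coin
two_tailed = one_tailed + binom.cdf(n - heads, n, 0.5)  # ...plus P(X <= 483)

print(f"one-tailed p-value: {one_tailed:.3f}")  # ~0.148
print(f"two-tailed p-value: {two_tailed:.3f}")  # ~0.297
```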

Anti-example: We cannot know anything for sure. Both abiogenesis and the creation of life by an intelligent Creator are just hypotheses. (Contrary to this claim, we can compare hypotheses and choose only those with a low enough p-value.)

Flaw: We should bear in mind that the p-value is the probability of the dataset under the null hypothesis, not the probability of the null hypothesis given the observed dataset. A p-value of 5% should not be interpreted as "The probability that our conclusions are wrong is 5%." These two values are related by Bayes' formula, but we lack the other quantities in that formula to calculate the "true" probability of error, because we do not have a priori probabilities for all possible hypotheses. We can only assume that the p-value is a good enough approximation to the probability of error of the experiment, provided that we have followed the other points of the scientific method (in particular, have not opposed a very simple hypothesis to a very complex one as if they were equals). One also needs to be aware of the existence of p-hacking, i.e., techniques for artificially lowering the p-value that unscrupulous scientists can use; when reviewing scientific papers, one needs to check the calculation methods to exclude the possibility of p-hacking. Finally, we should not forget that a good experiment is not one in which a sub-threshold p-value was reached once, but one in which it is reached in most series of replications; the operational meaning of a low p-value is "It makes sense to repeat this experiment", not "This experiment has confirmed its hypothesis, halt the research!"

11. Popper's Criterion (Falsificationism)

Aphorism: Every hypothesis must contain a mechanism for its refutation.

Content: A statement can be considered a scientific hypothesis only if there is a conceivable possibility of disproving it: when you can think of an experiment that would undermine the statement. Hypotheses serve to distinguish between variants of reality. The one who puts forward a hypothesis must also propose an experiment that shows the dichotomy of this hypothesis: result A will confirm it, result B will deny it; or, in those worlds where the hypothesis is true, the result of the experiment will be A, and in those where it is false, it will be B. ("Confirming" a hypothesis does not mean proving it, but merely increasing the probability that the hypothesis is true, that is, pushing it toward the position of a "working hypothesis." According to Popper, proving what is right is less important in science (and perhaps not even possible) than refuting what is wrong — and then among the things that survive the barrage of such refutations, only good hypotheses will remain, on average.) Statements and views that do not pass Popper's Criterion are either overly general ideas ("All cats are beautiful"), or floating beliefs that refer only to one another but have no access to sensory experience ("Wulky Wilkinsen is a retropositional author; this means that his texts have an alienated resublimation. What is alienated resublimation? It is a common feature of the texts of all retropositional authors."), or pseudo-scientific claims that are deliberately protected from any refutation ("I have a magic stone, but you may not conduct any experiments with it"). To put it simply, a scientist should be honest and show everyone the weaknesses of his or her theory; they should clearly draw a target that needs to be hit in order to destroy their theory, and once this target is drawn, it should not be moved.

Anti-example:

— I have a dragon in my garage!

— Wow, can we go into your garage and see it?

— Unfortunately, no — the dragon is invisible.

— Okay, then we can probably hear him growl or walk?

— No, it doesn't make any sounds.

— Okay, can we feel it by touch? Or sprinkle flour in the garage so that it settles on its body and makes its outline visible?

— No, this dragon does not interact with the physical world.

— Hmm... Then what does the statement "There is a dragon in the garage" mean?

(From Carl Sagan's parable mentioned here.)

Flaw: Taken in its pure form, Popper's Criterion makes hypotheses extremely fragile. If we know how to refute hypotheses but not how to prove them, then sooner or later we will refute every hypothesis we have today; this is known as "creeping falsificationism" (an allusion to "creeping determinism") or "pessimistic meta-induction", and Popper was aware of it as well (for a solution to this problem, see point 14, Hawking's Model-Dependent Realism). In addition, he suspected that true refutation is just as impossible as true proof, since there is always the possibility that our refutation was a mistake (i.e., that our refutation will itself be refuted). Finally, let's keep in mind that Popper's Criterion does not say anything about the truth of statements — it only categorizes them into scientific and non-scientific; there are both false scientific hypotheses and non-scientific statements that are generally accepted as true.

12. Russell's Teapot

Aphorism: The mechanism for refuting a hypothesis must be technologically achievable.

Content: Bertrand Russell said: "If I were to assume that there is a porcelain teapot between the Earth and Mars, orbiting the Sun in an elliptical orbit, no one could refute my statement, especially if I were to add that the teapot is too small to be detected even by our most powerful telescopes." Strictly speaking, the existence of the teapot is a falsifiable hypothesis: one can imagine humankind building an even more powerful telescope than all the existing ones (the ones Russell mentioned in his premise), or launching thousands of teapot-searching satellites into orbit to completely cover the entire possible flight space of such a teapot. However, this is a technologically unattainable experiment, and the possibility of disproving the hypothesis is pushed into an undefined (though finitely distant) future. In other words, the falsifying experiment from Popper's Criterion should not be too far ahead of the level of technology of the context in which the hypothesis is made.

Anti-example: Intelligence has supernatural origins because humanity will never be able to create artificial intelligence identical to human intelligence. (It will, but not today. Please redraw the target so that it is within the range of our telescopes.)

Flaw: The concept of "level of technology" is not clearly defined. It is possible that humanity is already able to set up a relevant experiment, but it is still too expensive and no one will pay for it in reality.

13. Lakatos' Sophisticated Falsificationism

Aphorism: Theories collapse not from a single refutation, but from becoming wronger than other theories.

Content: Single hypotheses should be fragile, but when they are combined into large structured theories, they develop a protective belt; at least, that is what Imre Lakatos believed. He called such structured combinations of ideas "research programmes" and divided them into progressive and degenerate ones; the example Lakatos gives to outline the limits of Popper's Criterion concerns the research programme "Newton's theory of gravity."

Let's imagine a scientist who knows nothing about Einstein's theory of relativity and believes that Newton's theory of gravity is correct. The theory gives him predictions for the trajectories of the planets; he tests these predictions with a telescope and finds that one of the planets deviates slightly from its predicted course. Does he decide at that moment that he has disproved Newton's theory? Not at all. Instead, he assumes that there is another planet that is not yet visible through telescopes and whose gravity affects the motion of the visible planet. He solves an inverse equation to determine the coordinates and mass of this planet, and waits three years to build a new, more powerful telescope to test his assumption. So the telescope is built, astronomers look through it and... do not find any new planet. So is Newton's theory disproven? Not yet — the scientist then assumes that there is some kind of nebula on the way between the telescope and the planet that prevents us from seeing this planet, so we need to build a satellite that can investigate this nebula...

As you can see, this story looks very much like a violation of the previous point, Russell's Teapot. Such a violation is permissible because this is not a single hypothesis but an entire research programme with a large body of evidence behind it (see point 8, the Sagan standard); the research programme has a protective belt that can absorb a certain number of errors and inconsistencies with reality. However, when there are enough such violations and deviations, the research programme becomes outdated and should be abandoned. To define what "enough" means, Lakatos introduces the concept of comparative predictive value, the culmination of his improved version of falsificationism:

A research programme is progressive if it makes predictions that come true and that cannot be obtained from other, competing research programmes.

(Originally, "predicts novel facts," but I expanded the wording and eliminated the word "fact," which we decided to avoid.) Thus, according to Popper, a single statement can be scientific or unscientific "in a vacuum," but according to Lakatos, a research programme can be progressive or degenerate only in the context of other programmes that compete with it. The emergence of a new theory that makes better predictions can make a previous generally accepted theory degenerate. (Predictions are a phenomenon of the same kind as Popper's falsifiers, because if a prediction fails, it is equivalent to a "single refutation" of the theory that made the prediction.)

Anti-example: The god Ra exists and wants us to have plenty of warmth and light; this is the reason why the Sun rises every morning. (Alternative theories also predict that the Sun will rise every morning — this prediction is not comparatively useful.)

Flaw: The criteria for being scientific are quite different for a single claim versus for a system of claims; it is not clear at what point a group of claims becomes large and systematic enough to move from one category (a fragile hypothesis) to the other (a research programme with a protective belt). It may seem as if we are giving some kind of indulgence to big theories simply because we have no other big theories (whereas small claims can be invented and formulated unlimitedly).

14. Hawking's Model-Dependent Realism

Aphorism: There is more than one way to correctly describe reality.

Content: Newton's theory was nominally refuted by Einstein's relativity, but this did not stop it from working — within the limits of its applicability (that is, at low speeds, if we are talking about Newtonian mechanics, and at non-critical masses of bodies, if we are talking about the theory of gravity). If we consider Newton's theory to be false in some universal sense, then all other theories should be considered false as well — according to pessimistic meta-induction (see the Flaw of point 11), they will all be refuted sooner or later. However, Stephen Hawking suggests instead that we consider them all true: that a theory accurately describes the fundamental nature of things matters less to us than that it gives us reliable mechanisms for interacting with reality.

Let us imagine an evil demon (Descartes' demon) who meddles with our senses and, let's say, makes us see an illusory bowl of jam on the table; if we try to touch this bowl, he will act on our sense of touch as he did on our sight before. If we want to taste the jam, he will activate our taste buds and add sugar and the other substances to our blood that we would get from the corresponding amount of ingested real jam. If this demon is willing to sustain his illusion without any limit — to perfectly reproduce the experience of a real bowl of jam — then we are better off assuming that the bowl of jam really exists than looking for a more fundamental explanation that brings us no benefit (especially if, apart from simulating jam, this demon cannot do anything, and after we "eat" the entire "bowl" he will vanish from existence).

Hawking's original example concerns the curious case of the city of Monza (Italy), which banned keeping goldfish in convex spherical aquariums because, according to the authors of the ban, it was cruel to give the fish a distorted picture of the reality outside the aquarium (due to the refraction of light in curved surfaces). "But how do we know that our picture of reality is true?" asks the physicist. "Couldn't we ourselves be inside a giant convex aquarium, seeing the world distorted through a huge lens? The goldfish's picture of reality is different from ours, but how do we know that it is less real?" If in our reality an object moves in a straight line, the fish may see its motion as curved; but if it is intelligent enough, it can still formulate its own "distorted" laws of physics that will work just as accurately as ours — that is, allow it to predict this curvilinear (from its perspective) movement of bodies quite efficiently.

In other words, Hawking says that indistinguishable (in the sense of Popper) models of reality describe it equally well and are equally true (or equally false). This allows us to focus on practical research and avoid wasting time on the "too fundamental" knowledge, which will give us nothing anyway. At the same time, model-dependent realism stimulates the parallel development of different models of the same reality — as opposed to the dogmatic adoption of a single "best" theory.

An aphorism semantically opposite to the proposed one is common in statistical sciences: "All models are wrong, but some of them are useful."

Anti-example: We've never seen quarks, so they don't exist. (Even if quarks are just a convenient and reasonably universal computational abstraction, that's reason enough to accept that they, in some sense, “exist.”)

Flaw: For better or worse, I can't yet formulate a flaw of model-dependent realism.

15. Multiple Comparison Problem

Aphorism: If you give a million monkeys typewriters, they will print a target for the Texas Sharpshooter.

Content: As we remember from point 5 (the principle of testing), if we possess a large dataset from the very beginning, it is extremely easy to formulate hypotheses on it, but they will not be very useful. We now return to the Texas sharpshooter issue from a new angle: if we measure a sufficiently large number of parameters on some sample, we can almost certainly find a correlation (statistical relationship) between some pair of these parameters. This problem is even stronger than the one described in point 5 — there, one of the two compared parameters (namely, distance to the power lines) was fixed; if we imagine that we simply study 800 different parameters on the same sample of Swedish residents, then the number of correlations we find will be far greater than one. This is because the number of possible pairs of parameters grows quadratically with the number of parameters themselves; a similar phenomenon is the birthday paradox: in a group of 23 people, the probability that two people celebrate their birthday on the same day is greater than 50% (although 23 is much less than 50% of 365) because, roughly speaking, among 23 people there are 23·22/2 = 253 possible pairs of people.
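
Both claims are quick to verify numerically; a small Python check:

```python
import math

def birthday_collision(n: int, days: int = 365) -> float:
    # Probability that among n people at least two share a birthday.
    p_all_distinct = math.prod((days - k) / days for k in range(n))
    return 1 - p_all_distinct

print(23 * 22 // 2)                     # 253 pairs of people
print(f"{birthday_collision(23):.3f}")  # ~0.507, already above 50%
print(800 * 799 // 2)                   # 319600 candidate parameter pairs
```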

Anti-example: The length of the winning word of an annual US spelling bee correlates with death from venomous spider bites. (See also other surprising correlations.)

Flaw: The disadvantage of this point is the same as that of point 5, the principle of testing: first, you still need to take a large set of data and spot at least some correlations in it in order to put forward an initial hypothesis.

16. Methodological Reductionism

Aphorism: Break the model into parts as much as possible.

Content: Most scientists believe that there are no irreducibly complex things in the world. In other words, every large piece of scientific information can be broken down into smaller pieces. This belief belongs to a broader class of ideas known as reductionism. Methodological reductionism is about simplifying models and hypotheses, breaking them down into parts and subsystems as far as possible. (That is, if a model remains complex but we have no idea how to simplify it further, this is not a reason to throw the model away; we may only discard a model that we have successfully "archived," that is, decomposed into submodels in such a way that the larger model can be unambiguously reconstructed from the resulting components.)

Methodological reductionism differs from point 1, Occam's Razor, in that the Razor is applied at the beginning (when we are just forming hypotheses) while reductionism is applied at the end (when we already have a working model or theory); and also in that the Razor aims to destroy redundant entities, while reductionism only "evicts" them into separate theories or structural elements of the model.

Anti-example: My magic stone only works together with my magic hat. You may conduct experiments, but only with both of them together, not separately.

Flaw: There is a level at which the formal application of reductionism stops being meaningful. For example, you should not try to reduce the meanings of words to the meanings of individual letters. This is because, at some point, our intuitive understanding of what the basic entities are ceases to work properly. For example, not only intuitively point-like objects (electrons, letters, axioms of a formal system) can be basic (i.e., already maximally reduced), but also relations between them (pairs of electrons containing potential interaction energy; sequences of letters or sounds corresponding to certain semes, i.e., atomic units of meaning; combinations of axioms leading to certain classes of theorems).

17. The Principle of Interdisciplinarity

Aphorism: A wheel that can be turned though nothing else moves with it, is not a part of the mechanism. (L. Wittgenstein)

Content: Coherentism is the idea that a sufficiently large system of strongly interconnected statements is necessarily true (it is, in fact, one of the possible definitions of "truth"). Of course, coherentism is not a part of the scientific method, but if you change the direction of the implication (swap "if" and "then"), you get the principle of interdisciplinarity: if a system is true, then its parts must be connected. Every research programme should have implications for other research programmes; and "implications" here means not "associations", "similar terms", or even "reuse of the same models," but something more specific: sensitivity to change. If a theory can be radically changed without affecting any other theory, it means that this theory is isolated; it describes things that are disconnected from the shared reality that the scientific method investigates. Each research programme has to contrast itself with its negation and with its alternatives ("Why is it this way and not that way?"); and between any two programmes we should be able to build a bridge (perhaps through other programmes) such that radical changes in one will affect the other; that is, the field also answers the questions "What if it is not this way? What are the consequences? How can we test this theory through theories about other parts of reality?"

Note that the principle of interdisciplinarity does not claim that statements, models, and hypotheses that are more consistent with existing theories have a higher a priori probability; that is a different statement of the scientific method, expressed in point 8, the Sagan standard.

Anti-example:

— Oh, what a mess! Yesterday I was giving a lecture on astrology and showed my students the natal charts of famous people: Indira Gandhi, Napoleon, and so on. And I laid out every detail so perfectly, every angle between the planets fit the biography...

— So what?

— And today I discovered that I had mixed up the charts! What I showed as Indira Gandhi's natal chart turned out to be Hitler's!

— And now the details don't fit the biographies?

— No, they do, you just have to explain the angles between the planets in a different way.

— So astrology is non-falsifiable?

— Okay, you caught me here. Astrology admits that it was wrong in the past, and the descriptions of all the relationships between the planets need to be changed, because they were based on incorrect data due to my negligence. You see, there is falsifiability.

— And when the descriptions are changed and the models are fitted to the new data, will this destroy some other theory? Will we have to rethink the history of famous figures or question someone's date of birth because it doesn't fit the historical description of their character?

— No.

— That is, astrology is completely detached from other disciplines, I see.

Flaw: In branches of science that are far enough apart, it can easily happen that tremendous changes in one of them become completely insignificant by the time they reach the other: friction and backlash between the gears of the theories can, across a sufficiently large number of linkages, damp out even a large discrepancy. In practice, we can only monitor and ensure transitive, or weak, interdisciplinarity; that is, we can only build bridges between neighboring fields, while making sure that each field is connected to the others by a chain of such neighborhoods.

18. Cromwell's Rule (the Principle of Doubt)

Aphorism: All generalizations are false (even this one). / Only a Sith deals in absolutes.

Content: Cromwell's Rule in Bayesian inference states that the probability of any hypothesis about the real world can never reach zero or one. In other words, you cannot be absolutely sure of anything (this is what we started with when we abandoned the word "fact"). Even the most proven theories must contain a small pinch of doubt; even the most absurd ones cannot simply be dismissed. The rationale for this rule is purely mathematical: if you start from zero or one, Bayes' formula cannot change the estimate; no amount of new evidence will affect this probability.

Sometimes a counter-argument is put forward about the self-application of the principle of doubt: if one doubts everything, then one must also doubt the principle of doubt itself, which seems to lead to a paradox. However, this is not the case: the principle of doubt does not require that all probabilities be set to 50% (that would be the Fallacy of Gray, or the golden mean fallacy); it only warns against reaching the extreme values of the interval [0%; 100%]. Even if we imagine the principle of doubt as a program that automatically multiplies the probabilities of all hypotheses by a certain factor, for example 0.9 (leaving 10% for the option "All our ideas are wrong"), and one of these hypotheses is the principle of doubt itself, a stable, exponentially convergent process arises. In the first step, we are 100% sure that we should doubt everything; in the second step, we reduce this probability to 90% according to the principle of doubt; in the third step, we have to increase it again, because the probability of the need for multiplication by 0.9 has decreased. In other words, we first "take out" a 10% part of our total 100% and label it "In case our beliefs are wrong"; then we take out 1% of this 10% and label it "In case the idea that our beliefs may be wrong is wrong"; from this 1% we have to take out a tenth of it (0.1%) and label it with three nested errors; this process continues indefinitely, and then we color all the "taken out" parts alternately and keep only the odd ones (because they are responsible for the case "Our ideas are still wrong"). Thus, the total degree of doubt will be 9% + 0.09% + 0.0009% + ... = 9.(09)%, and the degree of certainty will be 100% - 9.(09)% = 90.(90)%; or 1/11 and 10/11, respectively. In other words, the self-applied principle of doubt at level 1/n corresponds exactly to the non-self-applied principle at level 1/(n+1).
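
The same fixed point can be reached by iterating the update directly; a tiny Python sketch with the 0.9 factor from the example:

```python
# Apply the principle of doubt to itself: certainty <- 1 - 0.1 * certainty.
# The sequence oscillates (100%, 90%, 91%, 90.9%, ...) and converges
# exponentially to the fixed point 10/11.
certainty = 1.0
for _ in range(60):
    certainty = 1 - 0.1 * certainty
print(certainty)      # 0.90909... = 10/11
print(1 - certainty)  # 0.09090... = 1/11
```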

Lakatos distinguishes the concept of "non-problematic background knowledge" — for example, in the study of stars, such background knowledge can be the optical theory that makes creating telescopes possible. We certainly doubt the hypothesis we are testing more than we doubt the background knowledge; we can assign a fixed level of doubt to all of our background knowledge and not complicate our lives by asking how this level is distributed among different background areas (which will, however, come into focus and receive a higher level of doubt if we suspect that the problem lies among the background knowledge).

It should be noted that Cromwell's rule applies only to hypotheses, not to abstract formal statements. Within mathematics, a statement like "There are infinitely many primes" may have a probability of 100% (however, when leaving mathematics, this probability will still be multiplied by the probability of the Mathematicity Hypothesis, which is subject to Cromwell's rule, although this one — the ZFC consistency — is probably the best-tested hypothesis of all that we know.)

Anti-example: I am absolutely certain that the scientific method works.

Flaw: If we fix a microscopic but non-zero probability to be assigned to every new bogus hypothesis, then an opponent can potentially generate enough mutually exclusive or independent hypotheses to "displace" our established and highly probable theories (somewhat similar to Pascal's Mugging). Therefore, it is important to avoid setting a single fixed probability and instead make it depend on the length of the hypothesis, in such a way that all possible bogus hypotheses (there is an infinite number of them) add up to a single finite amount of permissible doubt. Sometimes scientists have to defend themselves against this influx of spam by dogmatic methods. For example, in 1775 the French Academy of Sciences decided that it would no longer accept proposals for perpetual motion machines (because 100% of the proposals it had considered were either false or deliberately forged in order to steal a patent).
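
A sketch of such a length-dependent budget (the 1% total and the per-hypothesis weighting are invented for illustration): if a hypothesis of length L bits gets a weight of 4^(-L), then the 2^L possible hypotheses of each length together receive at most 2^(-L) of the budget, and the sum over all lengths stays finite.

```python
from fractions import Fraction

def doubt_budget(length_bits: int, total_doubt: Fraction = Fraction(1, 100)) -> Fraction:
    # Weight a hypothesis of length L bits by 4**-L. There are 2**L such
    # hypotheses, so each length level consumes at most 2**-L of the budget,
    # and the whole infinite series sums to at most `total_doubt`.
    return total_doubt * Fraction(1, 4 ** length_bits)

print(doubt_budget(10))  # 1/104857600: tiny, but never exactly zero
```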


 


4 comments

I think there could be a lot of value in having a sequence of posts on, basically, "What is this 'science' thing anyway?" Right now all the core ideas (including various prerequisites and corollaries) exist on this site or in the Sequences, but not as a single, clear, cohesive whole that isn't extremely long.

However, I think trying to frame it this way, in one post, doesn't work. It's unclear who the target audience is, how they should approach it, and what they should hope to get out of it. Even knowing and already understanding these points, I read it wondering, "Why are these here, together, in this order? What is implied by the point numbering? Who, not already knowing these, would be willing to read this and able to understand it?"

It looks like the author created this account only a day before posting this. IDK if they've been here lurking or using another account a long time before that or not. In any case, my suggestion would be to look at how the Sequences are structured, find the bits that tie into what you're writing here, and then refactor this post into a series. Try and make it present ideas in a cohesive order, in digestible chunks, with links to past posts by others that expand on important points in more detail or other styles.

jmh:

Voted down the post as I could not even get through more than maybe a third of it before deciding it was way too complicated for me to understand the ultimate point, much less how to do anything with what was hoped to be communicated.

If the effort is to help those that are perhaps challenged in understanding and applying a scientific method to their questions and life, another version might be more effective.

I do want to make an update on my comment. I think it might sound too harsh. I do think the intent of the post is both well meaning and of value. But perhaps rather than putting everything into a long and a bit complicated post, a small sequence/series of posts might be a better approach.

TAG:

We must always start with the simplest possible explanations for the phenomena that surround us.

Why?

The fewer components, abstractions, or entities required for a hypothesis, the better the hypothesis.

Why?

(Not doubting Occam's razor, pointing out that it needs an explanation).

There is more than one way to correctly describe reality.

That goes against the law of non-contradiction: if the two ways are different, they cannot both be correct.

Newton’s theory was nominally refuted by Einstein’s relativity, but this did not stop it from working

"Working" means making correct predictions, not describing reality.

However, Stephen Hawking suggests instead that we consider them all true: that a theory accurately describes the fundamental nature of things is of less importance to us than that it gives us reliable mechanisms for interacting with reality.

How important something is depends on ones values.

“All models are wrong, but some of them are useful.”

...is the opposite of "There is more than one way to correctly describe reality." Unless you start changing the meanings of "works"/"useful" versus "true"/"describes reality".

PS. Nothing to say about induction?

Your principles #3 and #5 are in a weak conflict - generating hypotheses without having enough information to narrow the space of reasonable hypotheses would too often lead to false positives. When faced with an unknown novel phenomenon, one ought to collect information first, including collecting experimental data without a fixed hypothesis, before starting to formulate any hypotheses.