Reproducibility of Scientific Results
The terms “reproducibility crisis” and “replication crisis” gained currency in conversation and in print over the last decade (e.g., Pashler & Wagenmakers 2012), as disappointing results emerged from large scale reproducibility projects in various medical, life and behavioural sciences (e.g., Open Science Collaboration, OSC 2015). In 2016, a poll conducted by the journal Nature reported that more than half (52%) of scientists surveyed believed science was facing a “replication crisis” (Baker 2016). More recently, some authors have moved to more positive terms for describing this episode in science; for example, Vazire (2018) refers instead to a “credibility revolution” highlighting the improved methods and open science practices it has motivated.
The crisis refers collectively to at least the following: the virtual absence of replication studies in the published literature of many fields, widespread failures to reproduce the results of published studies in large scale replication projects, publication bias towards novel and statistically significant results, and a high prevalence of questionable research practices. The associated open science reform movement aims to rectify conditions that led to the crisis. This is done by promoting activities such as data sharing and public pre-registration of studies, and by advocating stricter editorial policies around statistical reporting, including publishing replication studies and statistically non-significant results. This review consists of four distinct parts.
First, we look at the term “reproducibility” and related terms like “repeatability” and “replication”, presenting some definitions and conceptual discussion about the epistemic function of different types of replication studies. Second, we describe the meta-science research that has established and characterised the reproducibility crisis, including large scale replication projects and surveys of questionable research practices in various scientific communities. Third, we look at attempts to address epistemological questions about the limitations of replication, and what value it holds for scientific inquiry and the accumulation of knowledge.
The fourth and final part describes some of the many initiatives the open science reform movement has proposed (and in many cases implemented) to improve reproducibility in science. In addition, we reflect there on the values and norms which those reforms embody, noting their relevance to the debate about the role of values in the philosophy of science.
1. Replicating, Repeating, and Reproducing Scientific Results
A starting point in any philosophical exploration of reproducibility and related notions is to consider the conceptual question of what such notions mean. According to some (e.g., Cartwright 1991), the terms “replication”, “reproduction” and “repetition” denote distinct concepts, while others use these terms interchangeably (e.g., Atmanspacher & Maasen 2016a). Different disciplines can have different understandings of these terms too. In computational disciplines, for example, reproducibility often refers to the ability to reproduce computations alone, that is, it relates exclusively to sharing and sufficiently annotating data and code (e.g., Peng 2011, 2015). In those disciplines, replication describes the redoing of whole experiments (Barba 2017, Other Internet Resources). In psychology and other social and life sciences, however, reproducibility may refer to either the redoing of computations or the redoing of experiments. The Reproducibility Projects, coordinated by the Center for Open Science, redo entire studies, including data collection and analysis. A recent funding program announcement by DARPA (the US Defense Advanced Research Projects Agency) distinguished between reproducibility and replicability, where the former refers to computational reproducibility and the latter to the redoing of experiments. Here we use all three terms—“replication”, “reproduction” and “repetition”—interchangeably, unless explicitly describing the distinctions of other authors.
When describing a study as “replicable”, people could have in mind either of at least two different things. The first is that the study is replicable in principle, in the sense that it can be carried out again, particularly when its methods, procedures and analysis are described in a sufficiently detailed and transparent way. The second is that the study is replicable in the sense that it can be carried out again and, when this happens, the replication study will successfully produce the same or sufficiently similar results as the original. A study may be replicable in the first sense but not in the second: one might be able to replicate the methods, procedures and analysis of a study, but fail to successfully replicate the results of the original study. Similarly, when people talk of a “replication”, they could also have in mind two different things: the replication of the methods, procedures and analysis of a study (irrespective of the results) or, alternatively, the replication of such methods, procedures and analysis as well as the results.
Arguably, most typologies of replication make more or less fine-grained distinctions between direct replications (which closely follow the original study to verify its results) and conceptual replications (which deliberately alter important features of the study to generalize findings or to test the underlying hypothesis in a new way). As suggested, this distinction may not always be known by these terms. For example, roughly the same distinction is referred to as exact and inexact replication by Keppel (1982); concrete and conceptual replication by Sargent (1981); and literal, operational and constructive replication by Lykken (1968). Computational reproducibility is most often direct (reproducing particular analysis outcomes from the same data set using the same code and software), but it can also be conceptual (analysing the same raw data set with alternative approaches, different models or statistical frameworks). For an example of a conceptual computational reproducibility study, see Silberzahn and Uhlmann 2015.
We do not attempt to resolve these disciplinary differences or to create a new typology of replication, and instead we will provide a limited snapshot of the conceptual terrain by surveying three existing typologies—from Stefan Schmidt (2009), from Omar Gómez, Natalia Juristo, and Sira Vegas (2010) and from Hans Radder. Schmidt’s account has been influential and widely-cited in psychology and social sciences, where the replication crisis literature is heavily concentrated. Gómez, Juristo, and Vegas’s (2010) typology of replication is based on a multidisciplinary survey of over 18 scholarly classifications of replication studies which collectively contain more than 79 types of replication. Finally, Radder’s (1996, 2003, 2006, 2009, 2012) typology is perhaps best known within philosophy of science itself.
1.1 An Account from the Social Sciences
Schmidt outlines five functions of replication studies in the social sciences: (1) controlling for sampling error, (2) controlling for artefacts, (3) controlling for fraud, (4) generalizing results to new populations, and (5) verifying the underlying research hypothesis.
Modifying Hendrick’s (1991) classes of variables that define a research space, Schmidt (2009) presents four classes of variables which may be altered or held constant in order for a given replication study to fulfil one of the above functions. The four classes are: (1) the information conveyed to participants, (2) the context of the study, (3) participant recruitment, and (4) dependent variable measures.
Schmidt then systematically works through examples of how each function can be achieved by altering and/or holding a different class or classes of variable constant. For example, to fulfil the function of controlling for sampling error (Function 1), one should alter only variables regarding participant recruitment (Class 3), attempting to keep variables in all other classes as close to the original study as possible. To control for artefacts (Function 2), one should alter variables concerning the context and dependent variable measures (variables in Classes 2 and 4 respectively), but keep variables in Classes 1 and 3 (information conveyed to participants and participant recruitment) as close to the original as possible. Schmidt, like most other authors in this area, acknowledges the practical limits of being able to hold all else constant. Controlling for fraud (Function 3) is served by the same arrangements as controlling for artefacts (Function 2). In Schmidt’s account, controlling for sampling error, artefacts and fraud (Functions 1 to 3) are connected by a theme of confirming the results of the original study. Functions 4 and 5 go beyond this: generalizing to new populations (Function 4) is served by changes to participant recruitment (Class 3), and confirming the underlying hypothesis (Function 5) is served by changes to the information conveyed, the context and dependent variable measures (Classes 1, 2 and 4 respectively) but not by changes to participant recruitment (Class 3), although Schmidt acknowledges that holding the latter class of variables constant whilst varying everything else is often practically impossible. Attempts to verify the underlying research hypothesis alone (i.e., to fulfil Function 5) are what Schmidt classifies as conceptual replications, following Rosenthal (1991). Attempts to fulfil the other four functions are considered variants of direct replications.
In summary, for Schmidt, direct replications control for sampling error, artifacts, and fraud, and provide information about the reliability and validity of prior empirical work. Conceptual replications help corroborate the underlying theory or substantive (as opposed to statistical) hypothesis in question and the extent to which they generalize in new circumstances and situations. In practice, direct and conceptual replications lie on a continuum, with replication studies varying more or less compared to the original on potentially a great number of dimensions.
1.2 An Interdisciplinary Account
Gómez, Juristo, and Vegas’s (2010) survey of the literature in 18 disciplines identified 79 types of replication, not all of which they considered entirely distinct. They identify five main ways in which a replication study may diverge from an initial study, with some similarities to Schmidt’s four classes above.
A change in any one or combination of these elements in a replication study corresponds to different purposes underlying the study, and thereby establishes a different kind of validity. Like Schmidt, Gómez, Juristo, and Vegas then systematically work through how changes to each of the above fulfil different epistemic functions.
1.3 A Philosophical Account
Radder (1996, 2003, 2006, 2009, 2012) distinguishes three types of reproducibility. One is the reproducibility of what Radder calls an experiment’s material realization. Using one of Radder’s own examples as an illustration, two people may carry out the same actions to measure the mass of an object. Despite doing the same actions, person A regards themselves as measuring the object’s Newtonian mass while person B regards themselves as measuring the object’s Einsteinian mass. Here, then, the actions or material realization of the experimental procedure can be reproduced, but the theoretical descriptions of their significance differ. Radder, however, does not specify what is required for one material realization to be a reproduction of another, a pertinent question, especially since, as Radder himself affirms, no reproduction will be exactly the same as any other reproduction (1996: 82–83).
A second type of reproducibility is the reproducibility of an experiment, given a fixed theoretical description. For example, a social scientist might conduct two experiments to examine social conformity. In one experiment, a young child might be instructed to give an answer to a question before a group of other children who are, unknown to the former child, instructed to give wrong answers to the same question. In another experiment, an adult might be instructed to give an answer to a question before a group of other adults who are, unknown to the former adult, instructed to give wrong answers to the same question. If the child and the adult give a wrong answer that conforms to the answers of others, then the social scientist might interpret the result as exemplifying social conformity. For Radder, the theoretical description of the experiment might be fixed, specifying that if some people in a participant’s surroundings give intentionally false answers to the question, then the genuine participant will conform to the behaviour of their peers. However, the material realization of these experiments differs insofar as one concerns children and the other adults. It is difficult to see how, in this example at least, this differs from what either Schmidt or Gómez, Juristo, and Vegas would refer to as establishing generalizability to a different population (Schmidt’s [2009] Class 3 and Function 5; Gómez, Juristo, and Vegas’s [2010] way 5 and Function 4).
The third kind of reproducibility is what Radder calls replicability. This is where different experimental procedures produce the same experimental result (otherwise known as a successful replication). For example, Radder notes that multiple experiments might obtain the result “a fluid of type f has a boiling point b”, despite using different kinds of thermometers to measure this boiling point (2006: 113–114).
Schmidt (2009) points out that the difference between Radder’s second and third types of reproducibility is small in comparison to their differences to the first type. He consequently suggests his alternative distinction between direct and conceptual replication, presumably intending a conceptual replication to cover Radder’s second and third types.
In summary, whilst Gómez, Juristo, and Vegas’s typology draws distinctions in slightly different places to Schmidt’s, its purpose is arguably the same—to explain what types of alterations in replication studies fulfil different scientific goals, such as establishing internal validity or the extent of generalization and so on. With the exception of his discussion of reproducing the material realization, Radder’s other two categories can perhaps be seen as fitting within the larger range of functions described by Schmidt and Gómez et al., who both acknowledge that in practice, direct and conceptual replications lie on a noisy continuum.
2. Meta-Science: Establishing, Monitoring, and Evaluating the Reproducibility Crisis
In psychology, the origin of the reproducibility crisis is often linked to Daryl Bem’s (2011) paper which reported empirical evidence for the existence of “psi”, otherwise known as Extra Sensory Perception (ESP). This paper passed through the standard peer review process and was published in the high impact Journal of Personality and Social Psychology. The controversial nature of the findings inspired three independent replication studies, each of which failed to reproduce Bem’s results. However, these replication studies were rejected from four different journals, including the journal that had originally published Bem’s study, on the grounds that the replications were not original or novel research. They were eventually published in PLoS ONE (Ritchie, Wiseman, & French 2012). This created controversy in the field, and was interpreted by many as demonstrating how publication bias impeded science’s self-correction mechanism. In medicine, the origin of the crisis is often attributed to Ioannidis’ (2005) paper “Why Most Published Research Findings Are False”. The paper offered formal arguments about inflated rates of false positives in the literature—where a “false positive” result claims a relationship exists between phenomena when it in fact does not (e.g., a claim that consuming a drug is correlated with symptom relief when it in fact is not). Ioannidis (2005) also reported very low (11%) empirical reproducibility rates from a set of pre-clinical trial replications at Amgen, later independently published by Begley and Ellis (2012). In all disciplines, the replication crisis is also more generally linked to earlier criticisms of Null Hypothesis Significance Testing (e.g., Szucs & Ioannidis 2017), which pointed out the neglect of statistical power (e.g., Cohen 1962, 1994) and a failure to adequately distinguish statistical and substantive hypotheses (e.g., Meehl 1967, 1978). This is discussed further below.
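The structure of such formal arguments can be illustrated with a simplified calculation (a sketch only; Ioannidis’s own model also includes terms for bias and for multiple research teams). Suppose that just 10% of the hypotheses tested in a field are true, that studies have statistical power of 0.5, and that the significance threshold is \(\alpha = 0.05\). Writing \(\pi\) for the proportion of true hypotheses, the expected share of “positive” findings that are false is

\[ \frac{\alpha(1-\pi)}{\alpha(1-\pi) + \mathrm{power} \times \pi} = \frac{0.05 \times 0.9}{0.05 \times 0.9 + 0.5 \times 0.1} = \frac{0.045}{0.095} \approx 0.47. \]

With lower power or a smaller proportion of true hypotheses (conditions Ioannidis argued are common), this share exceeds one half, which is the sense in which most published positive findings can be false.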
In response to the events above, a new field identifying as meta-science (or meta-research) has become established over the last decade (Munafò et al. 2017). Munafò et al. define meta-science as “the scientific study of science itself” (2017: 1). In October 2015, Ioannidis, Fanelli, Dunne, and Goodman identified over 800 meta-science papers published in the five-month period from January to May that year, and estimated that the relevant literature was accruing at the rate of approximately 2,000 papers each year. Referring to the same bodies of work with slightly different terms, Ioannidis et al. define “meta-research” as
an evolving scientific discipline that aims to evaluate and improve research practices. It includes thematic areas of methods, reporting, reproducibility, evaluation, and incentives (how to do, report, verify, correct, and reward science). (2015: 1)
Multiple research centres dedicated to this work now exist, including, for example, the Tilburg University Meta-Research Center in psychology, the Meta-Research Innovation Center at Stanford (METRICS), and others listed in Ioannidis et al. 2015 (see Other Internet Resources). Relevant research in medical fields is also covered in Stegenga 2018.
Projects that self-identify as meta-science or meta-research include the large scale reproducibility projects, studies of publication bias and statistical power, and surveys of questionable research practices discussed in the following subsections.
2.1 Reproducibility Projects
The most well known of these projects is undoubtedly the Reproducibility Project: Psychology, coordinated by what is now the Center for Open Science in Charlottesville, VA (then the Open Science Collaboration). It involved 270 crowd-sourced researchers in 64 different institutions in 11 different countries. Researchers attempted direct replications of 100 studies published in three leading psychology journals in the year 2008. Each study was replicated only once. Replications attempted to follow original protocols as closely as possible, though some differences were unavoidable (e.g., some replication studies were done with European samples when the original studies used US samples). In almost all cases, replication studies used larger sample sizes than the original studies and therefore had greater statistical power—that is, a greater probability of correctly rejecting the null hypothesis (i.e., that no relationship exists) when it is false. A number of measures of reproducibility were reported, including whether the replication obtained a statistically significant effect in the same direction as the original, comparisons of effect sizes between original and replication studies, and the replicating teams’ subjective assessments of whether the effect had been replicated.
There have been objections to the implementation and interpretation of this project, most notably by Gilbert et al. (2016), who took issue with the extent to which the replication studies were indeed direct replications. For example, Gilbert et al. highlighted 6 specific examples of “low fidelity protocols”, that is, cases where the replication studies differed, in their view substantially, from the originals (in one case, using a European sample rather than a US sample of participants). However, Anderson et al. (2016) explained in a reply that in half of those cases the authors of the original study had endorsed the replication as being direct, or close to direct, on relevant dimensions and that, furthermore, independently rated similarity between original and replication studies failed to predict replication success. Others (e.g., Etz & Vandekerckhove 2016) have applied Bayesian reanalysis to the OSC’s (2015) data and conclude that up to 75% (as opposed to the OSC’s 36–47%) of replications could be considered successful. However, they do note that in many cases this is only with very weak evidence (i.e., Bayes factors of less than 10). They too conclude that the failure to reproduce many effects is indeed explained by the overestimation of effect sizes, itself a product of publication bias. A Reproducibility Project: Cancer Biology (also coordinated by the Center for Open Science) is currently underway (Errington et al. 2014), originally attempting to replicate 50 of the highest impact studies in Cancer Biology published between 2010 and 2012. This project has recently announced it will complete with only 18 replication studies, as too few originals reported enough information to proceed with full replications (Kaiser 2018). Results of the first 10 studies are reportedly mixed, with only 5 being considered “mostly repeatable” (Kaiser 2018).
The Many Labs project (Klein et al. 2014) coordinated 36 independent replications of 13 classic psychology phenomena (from 12 studies; that is, one study tested two effects), including anchoring, sunk cost bias and priming, amongst other well-known effects in psychology. In terms of matching statistical significance, the project demonstrated that 11 out of 13 effects could be successfully replicated. It also showed considerable variation in many of the effect sizes across the 36 replications.
In biomedical research, there have also been a number of large scale reproducibility projects. An early one by Begley and Ellis (2012, but discussed earlier in Ioannidis 2005) attempted to replicate 53 landmark pre-clinical trials and reported an alarming reproducibility rate of only 11%, that is, only 6 of the 53 results could be successfully reproduced. Subsequent attempts at large scale replications in this field have produced more optimistic estimates, but routinely failed to successfully reproduce more than half of the published results. Freedman et al. (2015) report five replication projects by independent groups of researchers which produced reproducibility estimates ranging from 22% to 49%. They estimate the cost of irreproducible research in US biomedical science alone to be on the order of US$28 billion per year. A reproducibility project in Experimental Philosophy is an exception to the general trend, reporting reproducibility rates of 70% (Cova et al. forthcoming).
Finally, the Social Science Replication Project (SSRP) redid 21 experimental social science studies published in the journals Nature and Science between 2010 and 2015. Depending on the measure taken, the replication success rate was 57–67% (Camerer et al. 2018).
2.2 Publication Bias, Low Statistical Power and Inflated False Positive Rates
The causes of irreproducible results are largely the same across disciplines we have mentioned. This is not surprising given that they stem from problems with statistical methods, publishing practices and the incentive structures created in a “publish or perish” research culture, all of which are largely shared, at least in the life and behavioral sciences.
Whilst replication is often casually referred to as a cornerstone of the scientific method, direct replication studies (as they might be understood from Schmidt or Gómez, Juristo, and Vegas’s typologies above) are a rare event in the published literature of some scientific disciplines, most notably the life and social sciences. For example, such replication attempts constitute roughly 1% of the published psychology literature (Makel, Plucker, & Hegarty 2012). The proportion in published ecology and evolution literature is even smaller (Kelly 2017, Other Internet Resources).
This virtual absence of replication studies in the literature can be explained by the fact that many scientific journals have historically had explicit policies against publishing replication studies (Mahoney 1985)—thus giving rise to a “publication bias”. Over 70% of editors from 79 social science journals said they preferred new studies over replications, and over 90% said they did not encourage the submission of replication studies (Neuliep & Crandall 1990). In addition, many science funding bodies also fund only “novel”, “original” and/or “groundbreaking” research (Schmidt 2009).
A second type of publication bias has also played a substantial role in the reproducibility crisis, namely a bias towards “statistically significant” or “positive” results. Unlike the bias against replication studies, this is rarely an explicitly stated policy of a journal. Publication bias towards statistically significant findings has a long history, and was first documented in psychology by Sterling (1959). Developments in text mining techniques have led to more comprehensive estimates. For example, Fanelli’s work has demonstrated the extent of publication bias in various disciplines, and the proportions of statistically significant results given below are from his 2010a paper. He has also documented the increase of this bias over time (2012) and explored the causes of the bias, including the relationship between publication bias and a publish or perish research culture (2010b).
In many disciplines (e.g., psychology, psychiatry, materials science, pharmacology and toxicology, clinical medicine, biology and biochemistry, economics and business, microbiology and genetics) the proportion of statistically significant results is very high, close to or exceeding 90% (Fanelli 2010a). This is despite the fact that in many of these fields the average statistical power is low—that is, the average probability that a study will correctly reject the null hypothesis is low. For example, in psychology the proportion of published results that are statistically significant is 92%, despite the fact that the average power of studies in this field to detect medium effect sizes (arguably typical of the discipline) is roughly 44% (Szucs & Ioannidis 2017). If there were no bias towards publishing statistically significant results, the proportion of significant results should roughly match the average statistical power of the discipline. The excess of statistical significance (in this case, the difference between 92% and 44%) is therefore an indicator of the strength of the bias. For a second example, in ecology and environment and in plant and animal sciences, the proportions of statistically significant results are 74% and 78% respectively, admittedly lower than in psychology. However, the most recent estimate of statistical power (again for medium effect sizes) in ecology and animal behaviour is 23–26% (Smith, Hardy, & Gammell 2011); an earlier, more optimistic assessment was 40–47% (Jennions & Møller 2003). For a third example, the proportion of statistically significant results in neuroscience and behaviour is 85%, while the best estimate of statistical power in neuroscience is at most 31%, with a lower bound estimate of 8% (Button et al. 2013). The associated file-drawer problem (Rosenthal 1979)—where researchers relegate “failed”, statistically non-significant studies to their file drawers, hidden from public view—has long been established in psychology and other disciplines, and is known to lead to distortions in meta-analysis (where a “meta-analysis” is a study which analyses results across multiple other studies).
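The logic of this excess-significance indicator can be illustrated with a small simulation. The sketch below makes several simplifying assumptions of our own: every tested effect is real and of medium size, the power to detect it is roughly the 44% reported for psychology, and journals publish every statistically significant result but only 20% of non-significant ones (an arbitrary retention rate chosen for illustration).

```python
# A minimal simulation sketch of "excess significance". Assumptions (for
# illustration only): every tested effect is real and of medium size, the
# average power to detect it is roughly 44%, and journals publish every
# significant result but only 20% of non-significant ones.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def run_study(effect_size=0.5, n_per_group=26, alpha=0.05):
    """Simulate one two-group study of a real, medium-sized effect."""
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(effect_size, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    return p < alpha

n_studies = 20_000
significant = np.array([run_study() for _ in range(n_studies)])
print(f"Significant among all studies (approx. average power): {significant.mean():.2f}")

# Selective publication: keep every significant study, but only 20% of the
# non-significant ones (an assumed, illustrative retention rate).
published = significant[significant | (rng.random(n_studies) < 0.20)]
print(f"Significant among published studies: {published.mean():.2f}")
```

Under these assumptions, the proportion of significant results among all studies tracks the average power of the field, while the proportion among published studies is pushed far above it, which is the kind of excess the comparison between 92% and 44% is meant to reveal.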
2.3 Questionable Research Practices
In addition to creating the file-drawer problem described above, publication bias has been held at least partially responsible for the high prevalence of Questionable Research Practices (QRPs) uncovered both in self-report survey research (John, Loewenstein, & Prelec 2012; Agnoli et al. 2017; Fraser et al. 2018) and in journal studies that have detected, for example, unusual distributions of p values (Masicampo & Lalande 2012; Hartgerink et al. 2016). Pressure to publish, now ubiquitous across academic institutions, means that researchers often cannot afford to simply assign “failed” or statistically non-significant studies to the file drawer, so instead they p-hack and cherry-pick results (as discussed below) back to significance, and back into the published literature. Simmons, Nelson, and Simonsohn (2011) explained and demonstrated with simulated results how engaging in such practices inflates the false positive error rate of the published literature, leading to a lower rate of reproducible results.
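Simmons, Nelson, and Simonsohn’s demonstration can be illustrated with a simple simulation in the same spirit (a sketch under assumed parameters, not a reconstruction of their exact design): a researcher compares two groups drawn from the same distribution, checks for statistical significance, and, if the result is not significant, keeps adding participants and re-testing until either significance is reached or a maximum sample size is hit, one of the practices described in the next paragraph.

```python
# A sketch, in the spirit of Simmons, Nelson, and Simonsohn (2011), of how
# optional stopping inflates the false positive rate. The group sizes, step
# size, maximum sample and alpha level are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_initial, n_step, n_max = 20, 10, 50   # assumed sampling plan
n_experiments = 20_000

false_positives = 0
for _ in range(n_experiments):
    # The null hypothesis is true: both groups come from the same distribution.
    a = list(rng.normal(0, 1, n_initial))
    b = list(rng.normal(0, 1, n_initial))
    while True:
        _, p = stats.ttest_ind(a, b)
        if p < alpha:            # stop and "publish" as soon as p < .05
            false_positives += 1
            break
        if len(a) >= n_max:      # give up once the maximum sample is reached
            break
        a.extend(rng.normal(0, 1, n_step))
        b.extend(rng.normal(0, 1, n_step))

print(f"Nominal false positive rate: {alpha}")
print(f"Rate with optional stopping: {false_positives / n_experiments:.3f}")
```

A single fixed-sample test would keep the rate close to the nominal 5%; testing repeatedly and stopping on a significant result roughly doubles it in this setup, which is the inflation Simmons et al. used simulations to demonstrate.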
“P hacking” refers to a set of practices which include: checking the statistical significance of results before deciding whether to collect more data; stopping data collection early because results have reached statistical significance; deciding whether to exclude data points (e.g., outliers) only after checking the impact on statistical significance and not reporting the impact of the data exclusion; adjusting statistical models, for instance by including or excluding covariates based on the resulting strength of the main effect of interest; and rounding down a p value to meet a statistical significance threshold (e.g., presenting p = 0.053 as p < .05). “Cherry picking” includes failing to report dependent or response variables, relationships, conditions or treatments that did not reach statistical significance or some other desired threshold. “HARKing” (Hypothesising After Results are Known) includes presenting ad hoc and/or unexpected findings as though they had been predicted all along (Kerr 1998), and presenting exploratory work as though it was confirmatory hypothesis testing (Wagenmakers et al. 2012). Five of the most widespread QRPs are listed in Table 1 (from Fraser et al. 2018), with associated survey measures of prevalence.
2.4 Over-Reliance on Null Hypothesis Significance Testing
Null Hypothesis Significance Testing (NHST)—discussed above—is a commonly diagnosed cause of the current replication crisis (see Szucs & Ioannidis 2017). The ubiquitous nature of NHST in life and behavioural sciences is well documented, most recently by Cristea and Ioannidis (2018). This is an important pre-condition for establishing its role as a cause, since it could not be a cause if its actual use were rare. The dichotomous nature of NHST facilitates publication bias (Meehl 1967, 1978). For example, the language of accept and reject in hypothesis testing maps conveniently onto acceptance and rejection of manuscripts, a fact that led Rosnow and Rosenthal (1989) to decry that “surely God loves the .06 nearly as much as the .05” (1989: 1277). Techniques that do not enshrine a dichotomous threshold would be harder to employ in the service of publication bias. For example, a case has been made that estimation using effect sizes and confidence intervals (introduced above) would be less prone to being used in the service of publication bias (Cumming 2012; Cumming & Calin-Jageman 2017).
As already mentioned, the average statistical power in various disciplines is low. Not only is power often low, but it is virtually never reported; less than 10% of published studies in psychology report statistical power and even fewer in ecology do (Fidler et al. 2006). Explanations for the widespread neglect of statistical power often highlight the many common misconceptions and fallacies associated with p values (e.g., Haller & Krauss 2002; Gigerenzer 2018). For example, the inverse probability fallacy[1] has been used to explain why so many researchers fail to calculate and report statistical power (Oakes 1986).
In 2017, a group of 72 authors proposed in a Nature Human Behaviour paper that the alpha level in statistical significance testing be lowered to 0.005 (as opposed to the current standard of 0.05) to improve the reproducibility rate of published research (Benjamin et al. 2018). A reply from a different set of 88 authors was published in the same journal, arguing against this proposal and stating instead that researchers should justify their alpha level based on context (Lakens et al. 2018). Several other replies have followed, including a call from Andrew Gelman and colleagues to abandon statistical significance altogether (McShane et al. 2018, Other Internet Resources). The exchange has become known on social media as the Alpha Wars (e.g., in the Barely Significant blog, Other Internet Resources). Independently, the American Statistical Association released a statement on the use of p values for the first time in its history, cautioning against their overinterpretation and pointing out the limits of the information they offer about replication (Wasserstein & Lazar 2016), and devoted its 2017 convention to the theme “Scientific Method for the 21st Century: A World Beyond \(p < 0.05\)” (see Other Internet Resources).
2.5 Scientific Fraud
A number of recent high-profile cases of scientific fraud have contributed considerably to the amount of press around the reproducibility crisis in science. Often these cases (e.g., Diederik Stapel in psychology) are used as a hook for media coverage, even though the crisis itself has very little to do with scientific fraud. (Note also that the Questionable Research Practices above are not typically counted as “fraud” or even “scientific misconduct”, despite their ethically dubious status.) For example, Fang, Grant Steen, and Casadevall (2012) estimated that 43% of retracted articles in biomedical research are withdrawn because of fraud. However, roughly half a million biomedical articles are published annually and only around 400 of those are retracted (Oransky 2016, founder of the website Retraction Watch), so retractions amount to a very small proportion of the literature (approximately 0.1%). There are, of course, cases of pharmaceutical companies exerting financial pressure on scientists and the publishing industry, which raise speculation about how many undetected (or unretracted) cases there may still be in the literature. Having said that, there is widespread consensus amongst scientists that the main cause of the current reproducibility crisis is the current incentive structure in science (publication bias, publish or perish, non-transparent statistical reporting, lack of rewards for data sharing). Whilst this incentive structure can push some researchers to scientific fraud, they appear to be a very small proportion.
3. Epistemological Issues Related to Replication
Many scientists believe that replication is epistemically valuable in some way, that is to say, that replication serves a useful function in enhancing our knowledge, understanding or beliefs about reality. This section first discusses a problem about the epistemic value of replication studies—called the “experimenters’ regress”—and it then considers the claim that replication plays an epistemically valuable role in distinguishing scientific inquiry. It lastly examines a recent attempt to formalise the logic of replication in a Bayesian framework.
3.1 The Experimenters’ Regress
Collins (1985) articulated a widely discussed problem that is now known as the experimenters’ regress. He initially lays out the problem in the context of measurement (Collins 1985: 84). Suppose a scientist is trying to determine the accuracy of a measurement device and also the accuracy of a measurement result. Perhaps, for example, a scientist is using a thermometer to measure the temperature of a liquid, and it delivers a particular measurement result, say, 12 degrees Celsius.
The problem arises because of the interdependence of the accuracy of the measurement result and the accuracy of the measurement device: to know whether a particular measurement result is accurate, we need to test it against a measurement result that is previously known to be accurate, but to know that the result is accurate, we need to know that it has been obtained via an accurate measuring device, and so on. This, according to Collins, creates a “circle” which he refers to as the “experimenters’ regress”.
Collins extends the problem to scientific replication more generally. Suppose that an experiment B is a replication study of an initial experiment A, and that B’s result apparently conflicts with A’s result. This seeming conflict may have one of two interpretations: B’s result may be taken as evidence that A’s original result was mistaken, or it may instead be taken to show that B was not a proper replication of A (for example, that it was not competently conducted).
The regress poses a problem about how to choose between these interpretations, a problem which threatens the epistemic value of replication studies if there are no rational grounds for choosing in a particular way. Determining whether one experiment is a proper replication of another is complicated by the facts that scientific writing conventions often omit precise details of experimental methodology (Collins 2016), and, furthermore, much of the knowledge that scientists require to execute experiments is tacit and “cannot be fully explicated or absolutely established” (Collins 1985: 73).
In the context of experimental methodology, Collins wrote:
To know an experiment has been well conducted, one needs to know whether it gives rise to the correct outcome. But to know what the correct outcome is, one needs to do a well-conducted experiment. But to know whether the experiment has been well conducted…! (2016: 66; ellipses original)
Collins holds that in such cases where a conflict of results arises, scientists tend to split into two groups, each holding opposing interpretations of the results. According to Collins, where such groups are “determined” and the “controversy runs deep” (Collins 2016: 67), the dispute between the groups cannot be resolved via further experimentation, for each additional result is subject to the problem posed by the experimenters’ regress.[2] In such cases, Collins claims that particular non-epistemic factors will partly determine which interpretation becomes the lasting view:
the career, social, and cognitive interests of the scientists, their reputations and that of their institutions, and the perceived utility for future work. (Franklin & Collins 2016: 99)
Franklin was the most vociferous opponent of Collins, although recent collaboration between the two has fostered some agreement (Collins 2016). Franklin presented a set of strategies for validating experimental results, all of which relate to “rational argument” on epistemic grounds (Franklin 1989: 459; 1994). Examples include, for instance, appealing to experimental checks on measurement devices or eliminating potential sources of error in the experiment (Franklin & Collins 2016). He claimed that the fact that such strategies were evidenced in scientific practice “argues against those who believe that rational arguments plays little, if any, role” in such validation (Franklin 1989: 459), with Collins being an example. He interprets Collins as suggesting that the strategies for resolving debates about the validation of results are social factors or “culturally accepted practices” (Franklin 1989: 459) which do not provide reasons to underpin rational belief about results. Franklin (1994) further claims that Collins conflates the difficulty in successfully executing experiments with the difficulty of demonstrating that experiments have been executed, with Feest (2016) interpreting him to say that although such execution requires tacit knowledge, one can nevertheless appeal to strategies to demonstrate the validity of experimental findings.
Feest (2016) examines a case study involving debates about the Mozart effect in psychology (which, roughly speaking, is the effect whereby listening to Mozart beneficially affects some aspect of intelligence or brain structure). Like Collins, she agrees that there is a problem in determining whether conflicting results suggest a putative replication experiment is not a proper replication attempt, in part because there is uncertainty about whether scientific concepts such as the Mozart effect have been appropriately operationalised in earlier or later experimental contexts. Unlike Collins (on her interpretation), however, she does not think that this uncertainty arises because scientists have inescapably tacit knowledge of the linguistic rules about the meaning and application of concepts like the Mozart effect. Rather the uncertainty arises because such concepts are still themselves developing and because of assumptions about the world that are required to successfully draw inferences from it. Experimental methodology then serves to reveal the previously tacit assumptions about the application of concepts and the legitimacy of inferences, assumptions which are then susceptible to scrutiny.
For example, in her study of the Mozart effect, she notes that replication studies of the Mozart effect failed to find that Mozart music had a beneficial influence on spatial abilities. Rauscher, who was the first to report results supporting the Mozart effect, suggested that the later studies were not proper replications of her study (Rauscher, Shaw, and Ky 1993, 1995). She clarified that the Mozart effect applied only to a particular category of spatial abilities (spatio-temporal processes) and that the later studies operationalised the Mozart effect in terms of different spatial abilities (spatial recognition). Here, then, there was a difficulty in determining whether to interpret failed replication results as evidence against the initial results or rather as an indication that the replication studies were not proper replications. Feest claims this difficulty arose because of tacit knowledge or assumptions: assumptions about the application of the Mozart effect concept to different kinds of spatial abilities, about whether the world is such that Mozart music has an effect on such abilities and about whether the failure of Mozart to impact other kinds of spatial abilities warrants the inference that the Mozart effect does not exist. Contra Collins, however, experimental methodology enabled the explication and testing of these assumptions, thus allowing scientists to overcome the interpretive impasse.
Against this background, her overall argument is that scientists often are, and should be, sceptical towards each other’s results. However, this is not because of inescapably tacit knowledge and the inevitable failure of epistemic strategies for validating results. Rather, it is at least in part because of varying tacit assumptions that researchers have about the meaning of concepts, about the world, and about what inferences to draw from it. Progressive experimentation serves to reveal these tacit assumptions, which can then be scrutinised, leading to the accumulation of knowledge.
There is also other philosophical literature on the experimenters’ regress, including Teira’s (2013) paper arguing that particular experimental debiasing procedures are defensible against the regress from a contractualist perspective, according to which self-interested scientists have reason to adopt good methodological standards.
3.2 Replication as a Distinguishing Feature of Science
There is a widespread belief that science is distinct from other knowledge accumulation endeavours, and some have suggested that replication distinguishes (or is at least essential to) science in this respect. (See also the entry on science and pseudo-science.) According to the Open Science Collaboration, “Reproducible research practices are at the heart of sound research and integral to the scientific method” (OSC 2015: 7). Schmidt echoes this theme: “To confirm results or hypotheses by a repetition procedure is at the basis of any scientific conception” (2009: 90). Braude (1979) goes so far as to say that reproducibility is a “demarcation criterion between science and nonscience” (1979: 2). Similarly, Nosek, Spies, and Motyl state that:
[T]he scientific method differentiates itself from other approaches by publicly disclosing the basis of evidence for a claim…. In principle, open sharing of methodology means that the entire body of scientific knowledge can be reproduced by anyone. (2012: 618)
If replication played such an essential or distinguishing role in science, we might expect it to be a prominent theme in the history of science. Steinle (2016) considers the extent to which it is such a theme. He presents a variety of cases from the history of science where replication played very different roles, although he understands “replication” narrowly to refer to when an experiment is re-run by different researchers. He claims that the role and value of replication in experimental practice is “much more complex than easy textbook accounts make us believe” (2016: 60), particularly since each scientific inquiry is always tied to a variety of contextual considerations that can affect the importance of replication. Such considerations include the relationship between experimental results and the background of accepted theory at the time, the practical and resource constraints on pursuing replication, and the perceived credibility of the researchers. These contextual factors, he claims, mean that replication was a key or even overriding determinant of acceptance of research claims in some cases, but not in others.
For example, sometimes replication was sufficient for embracing a research claim, even if it conflicted with the background of accepted theory and left theoretical questions unresolved. A case of this is high-temperature superconductivity, the effect whereby an electric current can pass with zero resistance through a conductor at relatively high temperatures. In 1986, physicists Georg Bednorz and Alex Müller reported finding a material which acted as a superconductor at 35 kelvin (−238 degrees Celsius). Scientists around the world successfully replicated the effect, and Bednorz and Müller were awarded the Nobel Prize in Physics a year after their announcement. This case is remarkable since not only did their effect contradict the accepted physical theory at the time, but there is still no extant theory that adequately explains the effects which they reported (Di Bucchianico 2014).
As a contrasting example, however, sometimes claims were accepted without any replication. In the 1650s, German scientist Otto von Guericke designed and operated the world’s first vacuum pump that would visibly suck air out of a larger space. He performed experiments with his device before various audiences. Yet the replication of his experiments by others would have been very difficult, if not impossible: not only was Guericke’s pump both expensive and complicated to build, but it was also unlikely that his descriptions of it sufficed to enable anyone to build the pump and so replicate his findings. Despite this, Steinle claims that “no doubts were raised about his results”, probably as a result of his “public performances that could be witnessed by a large number of participants” (2016: 55).
Steinle takes such historical cases to provide normative guidance for understanding the epistemic value of replication as context-sensitive: whether replication is necessary or sufficient for establishing a research claim will depend on a variety of considerations, such as those mentioned earlier. He consequently eschews wide-reaching claims, such as the claims that “it’s all about replicability” or that “replicability does not decide anything” (2016: 60).
3.3 Formalising the Logic of Replication
Earp and Trafimow (2015) attempt to formalise the way in which replication is epistemically valuable, and they do this using a Bayesian framework to explicate the inferences drawn from replication studies. They present the framework in a context similar to that of Collins (1985), noting that “it is well-nigh impossible to say conclusively what [replication results] mean” (Earp & Trafimow, 2015: 3). But while replication studies are often not conclusive, they do believe that such studies can be informative, and their Bayesian framework depicts how this is so.
The framework is set out with an example. Suppose an aficionado of Researcher A is highly confident that anything said by Researcher A is true. Some other researcher, Researcher B, then attempts to replicate an experiment by Researcher A, and Researcher B finds results that conflict with those of Researcher A. Earp and Trafimow claim that the aficionado might continue to be confident in Researcher A’s findings, but the aficionado’s confidence is likely to slightly decrease. As the number of failed replication attempts increases, the aficionado’s confidence accordingly decreases, eventually falling below 50% and thereby placing more confidence in the replication failures than in the findings initially reported by Researcher A.
Here, then, suppose we are interested in the probability that the original result reported by Researcher A is true given Researcher B’s first replication failure. Earp and Trafimow represent this probability with the notation \(p(T\mid F)\), where p is a probability function, T represents the proposition that the original result is true and F represents Researcher B’s replication failure. According to Bayes’s theorem below, this probability is calculable from the aficionado’s degree of confidence that the original result is true prior to learning of the replication failure \(p(T)\), their degree of expectation of the replication failure on the condition that the original result is true \(p(F\mid T)\), and the degree to which they would unconditionally expect a replication failure prior to learning of the replication failure \(p(F)\):

\[ p(T\mid F) = \frac{p(F\mid T)\, p(T)}{p(F)} \tag{1}\]
Relatedly, we could instead be interested in the confidence ratio that the original result is true or false given the failure to replicate. This ratio is representable as \(\frac{p(T\mid F)}{p(\neg T\mid F)}\), where \(\neg T\) represents the proposition that the original result is false. According to the standard Bayesian probability calculus, this ratio is in turn related to the product of two further ratios: the prior confidence ratio that the original result is true or false, \(\frac{p(T)}{p(\neg T)}\), and the ratio of the conditional expectations of a replication failure given that the original result is true or false, \(\frac{p(F\mid T)}{p(F\mid \neg T)}\).
This relation is expressed in the equation:

\[ \frac{p(T\mid F)}{p(\neg T\mid F)} = \frac{p(T)}{p(\neg T)} \times \frac{p(F\mid T)}{p(F\mid \neg T)} \tag{2}\]
Now Earp and Trafimow assign some values to the terms on the right-hand side of equation (2). Supposing that the aficionado is confident in the original results, they set the ratio \(\frac{p(T)}{p(\neg T)}\) to 50, meaning that the aficionado is initially fifty times more confident that the results are true than that the results are false.
They also set the ratio \(\frac{p(F\mid T)}{p(F\mid \neg T)}\) concerning the conditional expectation of a replication failure to 0.5, meaning that the aficionado is considerably less confident that there will be a replication failure if the original result is true than if it is false. They point out that the extent to which the aficionado is less confident depends on the quality of so-called auxiliary assumptions about the replication experiment. Here, auxiliary assumptions are assumptions which enable one to infer that particular things should be observable if the theory under test is true. The intuitive idea is that the higher the quality of the assumptions about a replication study, the more one would expect to observe a successful replication if the original result was true. While they do not specify precisely what makes such auxiliary assumptions have high “quality” in this context, presumably this quality concerns the extent to which the assumptions are probably true and the extent to which the replication experiment is an appropriate test of the veracity of the original results if the assumptions are true.
Once the ratios on the right-hand side of equation (2) are set in this way, one can see that a replication failure would reduce one’s confidence in the original results:

\[ \frac{p(T\mid F)}{p(\neg T\mid F)} = \frac{p(T)}{p(\neg T)} \times \frac{p(F\mid T)}{p(F\mid \neg T)} = 50 \times 0.5 = 25 \]
Here, then, a replication failure would reduce the aficionado’s confidence that the original result was true so that the aficionado would be only 25 times more confident that the result is true given a failure (as per \(\frac{p(T\mid F)}{p(\neg T\mid F)}\)) rather than 50 times more confident that it is true (as per \(\frac{p(T)}{p(\neg T)}\)).
Nevertheless, the aficionado may still be confident that the original result is true, but we can see how such confidence would decrease with successive replication failures. More formally, let \(F_N\) be the last replication failure in a sequence of N replication failures \(\langle F_1,F_2,\ldots,F_N\rangle\). Then, the aficionado’s confidence in the original result given the N replication failures is expressible in the equation:[3]

\[ \frac{p(T\mid F_1, F_2, \ldots, F_N)}{p(\neg T\mid F_1, F_2, \ldots, F_N)} = \frac{p(T)}{p(\neg T)} \times \prod_{i=1}^{N} \frac{p(F_i\mid T)}{p(F_i\mid \neg T)} \]
For example, suppose there are 10 replication failures, and so \(N=10\). Suppose further that the confidence ratios for the replication failures are set such that, for each \(i\) from 1 to 10, \(\frac{p(F_i\mid T)}{p(F_i\mid \neg T)} = 0.5\).
Then,

\[ \frac{p(T\mid F_1, F_2, \ldots, F_{10})}{p(\neg T\mid F_1, F_2, \ldots, F_{10})} = 50 \times 0.5^{10} \approx 0.05 \]
Here, then, the aficionado’s confidence in the original result decreases so that they are more confident that it was false than that it was true. Hence, on Earp and Trafimow’s Bayesian account, successive replication failures can progressively erode one’s confidence that an original result is true, even if one was initially highly confident in the original result and even if no single replication failure by itself was conclusive.[4]
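A minimal numerical sketch of this updating scheme, assuming (as above) prior odds of 50 to 1 that the original result is true and a likelihood ratio of 0.5 for each conditionally independent replication failure, can be written in a few lines of Python:

```python
# A sketch of the Bayesian updating described above. The prior odds (50:1)
# and the likelihood ratio per failure (0.5) are the values used in the text.
prior_odds = 50.0        # p(T) / p(not-T) before any replication failures
likelihood_ratio = 0.5   # p(F_i | T) / p(F_i | not-T) for each failure

posterior_odds = prior_odds
for n in range(1, 11):
    posterior_odds *= likelihood_ratio
    posterior_probability = posterior_odds / (1.0 + posterior_odds)
    print(f"After {n:2d} failed replications: "
          f"odds = {posterior_odds:7.3f}, p(T | failures) = {posterior_probability:.3f}")
```

On these assumptions, the aficionado’s confidence in the original result drops below 50% after the sixth failed replication and falls to roughly 5% after the tenth.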
Some putative merits of Earp and Trafimow’s account, then, are that it provides a formalisation whereby replication attempts are informative even if they are not conclusive, and furthermore, the formalisation provides a role for both quantity of replication attempts as well as auxiliary assumptions about the replications.
4. Open Science Reforms: Values, Tone, and Scientific Norms
The aforementioned meta-science has unearthed a range of problems which give rise to the reproducibility crisis, and the open science movement has proposed or promoted various solutions—or reforms—for these problems. These reforms can be grouped into four categories: (a) methods and training, (b) reporting and dissemination, (c) peer review processes, and (d) evaluating new incentive structures (loosely following the categories used by Munafò et al. 2017 and Ioannidis et al. 2015). In subsections 4.1–4.4 below, we present a non-exhaustive list of initiatives in each of the above categories. These initiatives are reflections of various values and norms that are at the heart of the open science movement, and we discuss these values and norms in 4.5.
4.1 Methods and Training
4.2 Reporting and Dissemination
4.3 Peer Review
4.4 Incentives and Evaluations
4.5 Values, Tone, and Scientific Norms in Open Science Reform
There has long been philosophical debate about what role values do and should play in science (Churchman 1948; Rudner 1953; Douglas 2016), and the reproducibility crisis is intimately connected to questions about the operations of, and interconnections between, such values. In particular, Nosek et al. (2017) argue that there is a tension between truth and publishability. More specifically, for reasons discussed in section 2 above, the accuracy of scientific results is compromised by the value which journals place on novel and positive results and, consequently, by scientists who, valuing career success, seek to publish such results exclusively in those journals. Many others in addition to Nosek et al. (Hackett 2005; Martin 1992; Sovacool 2008) have also taken issue with the value which journals and funding bodies have placed on novelty.
Some might interpret the tension as a manifestation of how epistemic values (such as truth and replicability) can be compromised by (arguably) non-epistemic values, such as the value of novel, interesting or surprising results. Epistemic values are typically taken to be values that, in the words of Steel, “promote the acquisition of true beliefs” (2010: 18; see also Goldman 1999). Canonical examples of epistemic values include the predictive accuracy and internal consistency of a theory. Epistemic values are often contrasted with putative non-epistemic or non-cognitive values, which include ethical or social values like, for example, the novelty of a theory or its ability to improve well-being by lessening power inequalities (Longino 1996). Of course, there is no complete consensus as to precisely what counts as an epistemic or non-epistemic value (Rooney 1992; Longino 1996). Longino, for example, claims that, other things being equal, novelty counts in favour of accepting a theory, and convincingly argues that, in some contexts, it can serve as a “protection against unconscious perpetuation of the sexism and androcentrism” in traditional science (1997: 22). However, she does not discuss novelty specifically in the context of the reproducibility crisis.
Giner-Sorolla (2012), however, does discuss novelty in the context of the crisis, and he offers another perspective on its value. He claims that one reason novelty has been used to define what is publishable or fundable is that it is relatively easy for researchers to establish and for reviewers and editors to detect. Yet, Giner-Sorolla argues, novelty for its own sake perhaps should not be valued, and should in fact be recognized as merely an operationalisation of a deeper concept, such as “ability to advance the field” (567). Giner-Sorolla goes on to point out how such shallow operationalisations of important concepts often lead to problems, for example, using statistical significance to measure the importance of results, or measuring the quality of research by how well outcomes fit with experimenters’ prior expectations.
Values are closely connected to discussions about norms in the open science movement. Vazire (2018) and others invoke norms of science—communality, universalism, disinterestedness and organised skepticism—in setting the goals for open science, norms originally articulated by Robert Merton (1942). Each such norm arguably reflects a value which Merton advocated, and each norm may be opposed by a counternorm which denotes behaviour that is in conflict with the norm. For example, the norm of communality (which Merton called “communism”) reflects the value of collaboration and the common ownership of scientific goods since the norm recommends such collaboration and common ownership. Advocates of open science see such norms, and the values which they reflect, as an aim for open science. For example, the norm of communality is reflected in sharing and making data open, and in open access publishing. In contrast, the counternorm of secrecy is associated with a closed, for profit publishing system (Anderson et al. 2010). Likewise, assessing scientific work on its merits upholds the norm of universalism—that the evaluation of research claims should not depend on the socio-demographic characteristics of the proponents of such claims. In contrast, assessing work by the age, the status, the institution or the metrics of the journal it is published in reflects a counternorm of particularism.
Vazire (2018) and others have argued that, at the moment, scientific practice is dominated by counternorms and that a move to Mertonian norms is a goal of the open science reform movement. In particular, self-interestedness, as opposed to the norm of disinterestedness, motivates p-hacking and other Questionable Research Practices. Similarly, a desire to protect one’s professional reputation motivates resistance to having one’s work replicated by others (Vazire 2018). This in turn reinforces a counternorm of organized dogmatism rather than organized skepticism which, according to Merton, involves the “temporary suspension of judgment and the detached scrutiny of beliefs” (Merton, 1973).
Anderson et al.’s (2010) focus groups and surveys of scientists suggest that scientists do want to adhere to Merton’s norms but that the current incentive structure of science makes this difficult. Changing the structure of penalty and reward systems within science to promote communality, universalism, disinterestedness and organized skepticism instead of their counternorms is an ongoing challenge for the open science reform movement. As Pashler and Wagenmakers (2012) have said:
replicability problems will not be so easily overcome, as they reflect deep-seated human biases and well-entrenched incentives that shape the behavior of individuals and institutions. (2012: 529)
The effort to promote such values and norms has generated heated controversy. Some early responses to the Reproducibility Project: Psychology and the Many Labs projects were highly critical, not just of the substance of the work but also of the nature and process by which it was carried out. Calls for openness were interpreted as reflecting mistrust, and attempts to replicate others’ work as personal attacks (e.g., Schnall 2014 in Other Internet Resources). Nosek, Spies, & Motyl (2012) argue that calls for openness should not be interpreted as mistrust:
Opening our research process will make us feel accountable to do our best to get it right; and, if we do not get it right, to increase the opportunities for others to detect the problems and correct them. Openness is not needed because we are untrustworthy; it is needed because we are human. (2012: 626)
Exchanges related to this have become known as the tone debate.
5. Conclusion
The subject of reproducibility is associated with a turbulent period in contemporary science. This period has called for a re-evaluation of the values, incentives, practices and structures which underpin scientific inquiry. While the meta-science has painted a bleak picture of reproducibility in some fields, it has also inspired a parallel movement to strengthen the foundations of science. However, more progress is to be made, especially in understanding the solutions to the reproducibility crisis. In this regard, there are fruitful avenues for future research, including a deeper exploration of the role that epistemic and non-epistemic values can or should play in scientific inquiry.