Scientific psychology aims to shed light on human behavior. A successful behavioral science should be able both to explain what has happened in the past and to accurately predict what will happen in the future. In practice, however, it is difficult to tell the two goals apart. On one view, they are so intertwined that there is little point in separating them, except perhaps as a philosophical exercise. This view holds that explanation enables prediction: the model that most closely approximates the mental processes producing an observed behavior should also be the model that most accurately predicts future behavior. In principle, by cataloging the various causes of a set of behaviors, along with their moderating and mediating factors, we could measure all the relevant variables and predict a group of people's future behavior with a high degree of accuracy.
Unfortunately, although explanation and prediction may be philosophically compatible, there is good reason to believe that they are frequently at odds statistically and pragmatically. The statistical model that most closely mimics the process by which the data are generated is not necessarily the model that best predicts real-world outcomes (Hagerty & Srinivasan, 1991; Shmueli, 2010; Wu, Harris, & McAuley, 2007). Owing to a phenomenon known as overfitting, discussed below, a biased and psychologically implausible model can systematically outperform a more accurate but also more complex one. Moreover, there is no guarantee that the phenomena psychologists study are simple enough to be well approximated by models that researchers can readily interpret. In many areas of psychology, researchers may be forced to choose between complex models that accurately predict outcomes of interest but do not respect well-established psychological or neurobiological constraints, and simple models that appear theoretically elegant but have very limited ability to predict actual human behavior. Even when a simple explanatory model is waiting to be discovered, a researcher cannot know this in advance. Practically speaking, she must decide on a case-by-case basis whether to prioritize an explanation-focused strategy that aims to identify abstract, generalizable principles, or a predictive strategy that simply tries to mimic the outputs of the true data-generating process when given the same inputs.
We argue that the tension between prediction and explanation has profound implications for the conduct of psychological research. If ideal explanatory science is not ideal predictive science, and vice versa, then researchers must make a conscious choice between explaining and predicting. Historically, most of psychology has taken an explanatory approach rather than a predictive one. Our main point is that research programs focused on prediction rather than explanation are often more likely to succeed in both the short and the long term. One major reason psychologists have historically chosen explanation, we believe, is that until relatively recently predictive tools were poorly understood and rarely used in most fields of social and biomedical science. This has changed with the rise of machine learning theory and methodology, in which prediction of unobserved data is treated as the gold standard of success and explanation is typically of secondary importance, as well as with the increasing availability of large-scale datasets recording human behavior. A predictive approach to psychology is not only feasible; where researchers have adopted it, it has already produced a number of successes in behavioral science.
The rest of this article proceeds as follows. First, we review some of the difficulties of the typical explanatory approach as practiced in most of psychology, difficulties of which psychologists are becoming increasingly aware. These include "p-hacking" (Simmons, Nelson, & Simonsohn, 2011) and researchers' apparent inability to consistently replicate the results of previous experiments (Open Science Collaboration, 2015). We then outline an alternative approach that focuses not on a theoretically privileged regression coefficient or model fit statistic, but on the average discrepancy between a model's predictions and observed "out-of-sample" data, that is, data that were not used to fit the model. Next, we describe some of the most important principles and tools of modern predictive science as practiced in machine learning, including overfitting, cross-validation, and regularization, as well as the role of sample size in an approach where rejecting null hypotheses is not the primary goal, and illustrate how these could be applied to psychology. Finally, we argue that a short-term focus on prediction can, in the long run, improve our understanding of the underlying causes of human behavior. In other words, prediction is not an enemy of explanation, but a complement that can ultimately lead to a better grasp of theory.
Overfitting is a statistical model's propensity to treat sample-specific noise as if it were signal, and minimizing overfitting when training statistical models is one of the primary goals of machine learning (Domingos, 2012). To see why, recall that the standard goal in statistical modeling is to develop a model capable of generalizing to new observations that are similar, but not identical, to those we have sampled. We care relatively little about how accurately a model predicts scores for the observations in our current sample, because we already know those scores. The error we obtain when fitting the model to our sample is merely a proxy for the quantity that matters most: the prediction error we would get when applying the trained model to an entirely new set of observations sampled from the same population. We refer to this latter quantity as the test error, to distinguish it from the training error obtained when the model is first fit (in psychology, model fit indices are almost always reported strictly for the training sample). The test error will almost always be greater than the training error. This means that, no matter how well a model appears to perform when evaluated on the same dataset it was trained on, one cannot be confident that it will generalize to new observations unless steps have been taken to prevent overfitting.
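To make the training/test distinction concrete, the following sketch (purely illustrative; the simulated data, sample sizes, and polynomial-degree choices are our own assumptions rather than anything from a specific study) fits a simple and a highly flexible regression model to a small simulated sample, then compares each model's training R2 with its R2 on a large set of new observations drawn from the same population. The flexible model will typically show the higher training R2 but the lower test R2, which is the signature of overfitting.

```python
# Illustrative sketch: training R^2 overstates how well a flexible model
# predicts new observations drawn from the same population.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def sample(n):
    """Simulate a simple population: a weak linear effect plus noise."""
    x = rng.uniform(-1, 1, size=(n, 1))
    y = 0.5 * x[:, 0] + rng.normal(scale=1.0, size=n)
    return x, y

x_train, y_train = sample(50)    # small training sample
x_test, y_test = sample(10_000)  # large set of "new observations"

for degree in (1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_r2 = r2_score(y_train, model.predict(x_train))
    test_r2 = r2_score(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: training R^2 = {train_r2:.2f}, "
          f"test R^2 = {test_r2:.2f}")
```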
Many psychologists may be unalarmed by the quantitative examples presented so far. After all, overfitting at the model estimation stage is most pronounced when the number of predictors is large relative to the number of subjects, and relatively few researchers fit models with more than three or four predictors. Moreover, because our examples rely on comparing the average in-sample R2 with the out-of-sample test R2, they capture overfitting only at the model estimation stage. They therefore assume that researchers have been completely principled in setting up their analysis pipeline (the processing steps leading from raw data to the final statistical results) and have not engaged in any flexible analysis practices. In practice, however, overfitted findings often stem from analytical decisions that are not directly part of model estimation. In particular, researchers, readers, reviewers, editors, and journalists alike tend to prefer analysis procedures that produce "good" results, where a good result is one deemed more hypothesis-congruent, publication-worthy, societally interesting, and so on (Bakker, van Dijk, & Wicherts, 2012; Dwan et al., 2008; Ferguson & Heene, 2012; Ioannidis, 2012). In recent years, the practice of selecting analytical procedures based in part on the quality of the results they produce has come to be known as "p-hacking" (Simmons et al., 2011); a less pejorative term is "data-contingent analysis" (Gelman & Loken, 2013).
The impact of p-hacking on the generation of overfitted or spurious results is hard to overstate. An influential study by Simmons and colleagues demonstrated that false positive rates can exceed 60% given even a moderate degree of analytic flexibility, for example, choosing between two dependent variables or optionally including covariates in a regression analysis, a figure they convincingly argue is probably conservative (Simmons et al., 2011). Similar demonstrations abound. Strube (2006), for example, showed that the common practice of periodically computing a p-value during data collection and stopping collection once the value falls below the conventional .05 level is by itself sufficient to inflate the false positive rate several-fold. Unfortunately, surveys of academic psychologists suggest that questionable research practices are the norm rather than the exception. John and colleagues (2012) reported that more than half of psychologists admitted to optional stopping, 46% admitted to selectively reporting studies that "worked," and 38% admitted to deciding whether to exclude data only after looking at the results (and given the stigma associated with these practices, these self-reported numbers likely understate the true prevalence).
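A small simulation makes the optional-stopping problem concrete. The sketch below is our own illustration with arbitrarily chosen parameters (an initial sample of 10, batches of 5, a maximum of 100 participants), not a reproduction of Strube's (2006) simulations: data are generated with no true effect, a one-sample t-test is recomputed after each new batch, and collection stops as soon as p < .05 or the maximum sample size is reached. The resulting false positive rate typically ends up well above the nominal .05 level.

```python
# Illustrative sketch of optional stopping under the null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_simulations = 5_000
start_n, batch, max_n = 10, 5, 100   # arbitrary illustrative parameters
alpha = 0.05

false_positives = 0
for _ in range(n_simulations):
    data = list(rng.normal(size=start_n))   # true effect is zero
    while True:
        p = stats.ttest_1samp(data, 0.0).pvalue
        if p < alpha:                        # stop and "publish"
            false_positives += 1
            break
        if len(data) >= max_n:               # give up at the maximum N
            break
        data.extend(rng.normal(size=batch))  # collect another batch

print(f"Nominal alpha: {alpha:.2f}")
print(f"False positive rate with optional stopping: "
      f"{false_positives / n_simulations:.2f}")
```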
The term "Big Data" has garnered a great deal of attention from both behavioral scientists and neuroscientists over the past few years, though exactly what it means when applied to psychology is still up for debate (Yarkoni, 2014). In the tech industry, datasets measured in terabytes or petabytes are the norm, several orders of magnitude larger than the datasets used by all but the luckiest (or unluckiest) of psychologists. A cynic might therefore argue that, in psychological data analysis, Big Data is more buzzword than genuine paradigm shift.
However one defines it, though, the "Big" in Big Data is unambiguously a good thing. Statisticians and methodologically inclined psychologists have advocated larger samples for decades (Cohen, 1962, 1992), but until recently there was little evidence that these calls were being heeded (Sedlmeier & Gigerenzer, 1989). Modern technology appears to be turning the tide. With the advent of online and mobile data collection and access to enormous archival datasets from social networks and other websites, studies based on samples of tens of thousands, and in some cases millions, of participants are no longer unusual (Bond et al., 2012; Xu, Nosek, & Greenwald, 2014).
Machine learning concepts and techniques can often improve the efficiency and reproducibility of a researcher's analysis pipeline even when they have no discernible effect on the final results. For example, researchers frequently seek to establish the validity of a particular measure by comparing it with other measures and/or across different raters. The same ends can often be achieved far more efficiently by showing that the measure accurately predicts some other variable in out-of-sample data. Ekman (1992) identified six "basic" facial expressions of emotion: happiness, surprise, sadness, anger, fear, and disgust. Du, Tao, and Martinez (2014) showed that humans also reliably produce and detect additional "compound" expressions (e.g., happily surprised, happily disgusted, sadly fearful), and that such expressions can convey information about more complex emotional states than had previously been recognized. Although a similar conclusion could in principle have been drawn using human raters, the use of an automated classifier is more efficient, reproducible, and extensible (e.g., one would not have to recruit new raters when new photos were added to the stimulus set).
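The general strategy can be sketched as follows; the synthetic features and logistic regression classifier here are stand-ins of our own choosing (not the data or model used by Du and colleagues), but the logic carries over: instead of relying solely on agreement among human raters, one shows that a feature set predicts category labels for held-out observations via cross-validation.

```python
# Illustrative sketch: validate a measure by its out-of-sample predictive accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for, e.g., facial-feature measurements and expression labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=6, n_clusters_per_class=1, random_state=0)

clf = LogisticRegression(max_iter=1_000)
scores = cross_val_score(clf, X, y, cv=5)  # accuracy on held-out folds
print(f"Cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
print(f"Chance level: {1/6:.2f}")
```

If the cross-validated accuracy reliably exceeds chance, the features demonstrably carry information about the categories, without requiring new rounds of human rating as the stimulus set grows.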