Evaluating research: Validity and reliability
As you may have surmised, doing research is not exactly a science. You may have noticed that I switch between “social science research” and “social research.” I’m ambivalent on whether what we do is “science,” exactly—it depends what you mean by “science,” and smart people disagree on that point. I’m at peace with my ambivalence. While writing, I’ve been self-conscious about how I’m constantly qualifying my statements—I’ve used the word usually 37 times so far, and sometimes, 30. That’s not the mark of particularly good writing, but it does reflect an important point: There is not one right way to do any research project. When we’re making decisions about how to go about our research, we’re faced with many options. Identifying these options is a creative process; we brainstorm, we trade ideas with others, we tease out the implications of our theoretic bases, we look to previous research for inspiration, and we’re left with a myriad of options. If we’re interested in learning about public managers’ leadership styles, we could interview them, conduct focus groups with them, have them complete a web survey, or observe them in action. We could structure our observations in a cross-sectional research design, make cross-case comparisons, follow managers over time, or devise a clever experiment. When it comes to operationalizing any one of the many concepts we need to measure, we’re faced with still more choices. To decide how to operationalize a concept like transformational leadership, we’ll look to our fellow researchers, theories, and previous research, but we’ll still be left with infinite variations on how we could ask questions, extract data from administrative records, or record direct observations.
As creative as doing research is, however, it would be misleading to say that doing research is an art. It is a creative endeavor to be sure, but it's definitely not the case that what constitutes good research is "in the eye of the beholder." It's more like a craft. Doing research takes a lot of creativity, but it can be done well or poorly. Doing research is not a wholly subjective enterprise; there are standards that we can apply to judge research quality. Broadly speaking, the two standards used to judge the quality of research are validity and reliability. We use these terms as special bits of jargon in research methodology, where they take on meanings beyond their everyday, colloquial ones. (And to pile the po-mo even higher, I should note that of all the jargon we've covered, the jargon related to validity and reliability is the most inconsistently applied among social science methodologists. Methodologists all seem to have their own twist on how they use these terms, so understand that you're about to get my distillation of all that, and it won't necessarily always jibe with how you'll see the terms used elsewhere.) We should know how to apply these standards for two reasons: they help us decide how much stock to put in the research we read, and they help us design our own research so that it meets those standards.
We can think of evaluating research design on two levels: overall research design and operationalization of specific concepts. For any given research project, then, we can make holistic evaluations of the merits of the entire project, and we can also make evaluations of how each individual concept was measured, which could amount to dozens of discrete evaluations for a single research project.
When we’re evaluating the overall design of a research project, we apply the standards of internal validity, external validity, and reliability. Internal validity is the extent to which the inferences we make from our observations are true. Most often, the standard of internal validity is applied to causal inferences. If we assess a study’s internal validity, then, we’re assessing the degree to which the design of that study permits confident inferences about cause and effect. Experimental designs, when well done, are very high in internal validity; we can be confident that the observed changes in the dependent variable are, indeed, due to the changes in the independent variable. It’s important to see that strong internal validity is a function of the research design; characteristics of the research design itself—in the case of experiments, the random assignment of cases to experimental and control groups and the control of the experimental setting—allow us to make our causal claims with a lot of confidence.
Interestingly enough, the characteristics of experiments that strengthen internal validity are the same characteristics that tend to weaken external validity. External validity is the extent to which we can generalize the inferences we make from observations beyond the cases observed. Assessing external validity asks whether or not we can apply what we've learned from our observations to other cases, settings, or times. When we conduct an experiment, it's usually very artificial—the whole setting of the experiment has to be tightly controlled to ensure comparability of the experimental and control groups in every respect except their values for the independent variable. (I hope you thought about that when you read about students listening to conservative talk radio through their earbuds for four hours straight while sitting in a classroom—not a very realistic scenario.) This tight control is essential to achieving internal validity, but it makes it really hard to apply what we've learned to other settings (like real life)—that is, it makes it hard to achieve external validity.
Reliability is the extent to which we would get the same results if the study were repeated, whether by ourselves or by someone else. Most often, assessing reliability is a thought experiment—an exercise we carry out only in our imaginations. Let's return to the example of surveying people in inner-city neighborhoods about their eating habits. If I were to assess the reliability of our quasi-experimental research design, I would think through a few hypothetical scenarios. What if someone else had conducted this study? I'm a white male; what if a black female had conducted the interviews instead? Would she have gotten the same results as me? What if I could hit the cosmic reset button, go back in time, and conduct the study again myself? Would I, myself, get the same results again?
When we evaluate a study at the level of the operationalization of all its concepts, we apply the standards of operational validity and, again, reliability. Operational validity is the extent to which the way we have operationalized a concept truly measures that concept. Let’s consider the challenge of operationalizing a concept college students are familiar with, college readiness. If I were to take a stab at a nominal definition of college readiness, I’d say something like “a person’s preparedness for success in college.” How might we operationalize this concept? We have lots of options, but let’s say we’re going to administer a written questionnaire to college applicants, and we’ll include the following question as our measure of college readiness:
- What was your score on the ACT?
That seems straightforward enough, but let's evaluate this operationalization of college readiness in terms of its operational validity. Does this question really measure college readiness? We can assess operational validity from four different angles: face validity, content validity, discriminant validity, and criterion validity. (In introducing these terms, I should mention a quibble I have with lots of textbook authors. These aren't really different types of validity; they're all different aspects of operational validity—different ways of thinking about whether or not an operationalization really measures the concept it's intended to measure.)
Face validity is the most intuitive of these four ways to think about operational validity. When we assess the face validity of an operationalization, we’re just asking whether, on the face of it, the operationalization seems to measure its targeted concept. Here, I’d say sure—it seems very reasonable to use ACT scores as a measure of college readiness. As evidence for the face validity of this operationalization, I could refer to other researchers who have used this same operationalization to measure college readiness. Certainly, ACT score achieves face validity as a measure of college readiness.
Next, we can think about operational validity by assessing the measure's content validity (sometimes called construct validity). Many abstract concepts we want to measure are broad and complex. Think about college readiness. Surely it includes academic readiness, which itself is multifaceted—having adequate study skills, critical thinking skills, math skills, writing skills, computer skills, and so on. College readiness probably includes nonacademic factors as well, like self-motivation, openness to new ideas, ability to get along well in a group, and curiosity. I'm sure you can think of still more aspects of college readiness. When we assess content validity, we ask whether or not our operationalization measures the full breadth and complexity of a concept. Here, I think our ACT score might be in trouble. Of all the many aspects of college readiness, ACT scores only measure a swath of the academic skills. Those academic skills are, indeed, indicators of college readiness (and hence ACT scores do achieve face validity), but if we're relying solely on ACT scores as our full operationalization of college readiness, our operationalization fails to achieve content validity. We almost always require multiple measures when operationalizing complex concepts in order to achieve content validity.
At this point in our research design, we'd probably add some additional items to our questionnaire to operationalize college readiness more fully. Let's continue, though, assessing our original operationalization, relying only on ACT scores as a measure of college readiness. We can continue to assess the operational validity of this operationalization by considering its discriminant validity, which asks whether or not the way we've operationalized our concept will enable us to distinguish between the targeted concept and other concepts. We all had a friend in high school who didn't do so hot on the ACT and, in explaining the poor showing, unwittingly raised a question of discriminant validity: "ACT scores just show how good you are at taking standardized tests!" Your friend was saying that the ACT doesn't operationalize the concept it's intended to operationalize, college readiness, but another concept altogether, standardized-test-taking ability. Your friend was quite astute to consider whether the ACT achieves discriminant validity.
If considering face validity is the most intuitive way of assessing operational validity, considering criterion validity is the most formal. When we assess criterion validity, we test, usually statistically, whether or not our measures relate to other variables as they should if we have successfully operationalized our target concept. If ACT score successfully operationalizes college readiness, what should students’ ACT scores be statistically associated with? Well, if ACT scores really are a measure of college readiness, then students who had higher ACT scores should also tend to have higher college GPAs. If we test for that association, we’re using college GPA as a criterion variable (hence criterion validity) for determining whether or not ACT scores are a good way to operationalize college readiness. If there’s a strong association between ACT scores (the variable we’re testing) and college GPA (our criterion variable), then we’ll use that as evidence that our operationalization of college readiness (our target concept) demonstrates operational validity. We could think of other criterion variables as well—whether or not the student graduates from college and how long it takes come to mind. We don’t always have the opportunity to test for criterion validity, but when we do, it can provide very strong evidence for our measures’ operational validity.
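To make the idea of a criterion test a bit more concrete, here's a minimal sketch in Python of how we might check that association. The ACT scores and GPAs below are invented purely for illustration; in a real study we'd pull them from actual student records.

```python
# A minimal sketch of a criterion-validity check: does the measure we're
# testing (ACT score) correlate with a criterion variable (college GPA)?
# The data below are invented purely for illustration.
from scipy.stats import pearsonr

act_scores   = [21, 28, 17, 32, 24, 19, 30, 26, 23, 35]            # measure being tested
college_gpas = [2.8, 3.5, 2.4, 3.8, 3.1, 2.6, 3.6, 3.2, 3.0, 3.9]  # criterion variable

r, p_value = pearsonr(act_scores, college_gpas)
print(f"ACT score vs. college GPA: r = {r:.2f}, p = {p_value:.3f}")

# A strong, statistically significant correlation counts as evidence that the
# operationalization has criterion validity; a weak one would give us pause.
```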
Just as when we were evaluating the overall research design, we apply the standard of reliability when we evaluate the operationalization of an individual concept, likewise engaging in thought experiments to consider whether we’d get the same results if the observations were made by other researchers or even by ourselves if we could go back and do it again. We also consider, and sometimes quantify using statistical tools, the degree to which individual measures demonstrate random error. This is the amount of variation in repeated measures, whether repeated in reality or only hypothetically. Say we’re measuring the height of a wall using a tape measure. We know that the wall’s height is 96 inches. You can imagine, though, that your tape measure might read 95 ⅞ the first time you measure it, 96 ⅛ the second time, and 95 ¹⁵⁄₁₆ the third time. Your measurement is exhibiting some random error. If you were to repeat this over and over, the mean measurement would be about right, but any one measurement is bound to be off just a little.
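If you'd like to see that play out, here's a quick simulation in Python. I'm assuming, just for the sake of the sketch, that the tape measure's error is random, averages out to zero, and is typically on the order of an eighth of an inch.

```python
# Simulating random measurement error around a wall's true height of 96 inches.
# The error model (normal, roughly an eighth of an inch of spread) is assumed
# purely for illustration.
import random

TRUE_HEIGHT = 96.0  # inches
random.seed(1)      # fixed seed so the sketch is repeatable

readings = [TRUE_HEIGHT + random.gauss(0, 0.125) for _ in range(1000)]

print("First three readings:", [round(r, 2) for r in readings[:3]])
print(f"Mean of 1,000 readings: {sum(readings) / len(readings):.3f} inches")

# Any single reading is off by a little in one direction or the other,
# but the mean of many readings lands very close to 96—random error, not bias.
```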
In social research, some types of measures are more susceptible to random error than others. Imagine being asked to rate your agreement or disagreement with the statement I like campaign signs printed in all caps on a 7-point scale. I know I don’t have a particularly strong opinion on the matter, really. If you asked me this morning, I might rate it a 5, but this afternoon, it might be a 4, and tomorrow it might be a 7. We very rarely actually take measurements from the same cases over and over again (and if you did, I’d probably start always giving you the same answer anyway just for the sake of sounding consistent with myself), so we have to think about the consistency of hypothetical repeated measurements. Hypothetically, if we were to ask someone to rate how much he likes campaign signs in all caps, zap his memory of the experience, ask him again, zap, ask again, zap, ask again, zap, and ask again, I’d predict that we’d observe a lot of random error, meaning our question is probably not a very reliable way to operationalize the targeted concept, preference for capitalization of campaign sign text.
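On the rare occasions when we can quantify this, one common approach is a test-retest check: ask the same people the same question twice and correlate the two sets of answers. Here's a sketch of that logic in Python, with ratings invented to mimic the kind of inconsistency I just described.

```python
# A sketch of a test-retest reliability check on a 7-point survey item.
# The ratings are invented to illustrate an unreliable item.
from scipy.stats import pearsonr

ratings_time1 = [5, 3, 6, 2, 4, 7, 1, 4, 5, 3]  # first administration
ratings_time2 = [4, 6, 3, 2, 7, 5, 3, 4, 1, 6]  # same respondents, asked again later

r, _ = pearsonr(ratings_time1, ratings_time2)
print(f"Test-retest correlation: r = {r:.2f}")

# A correlation near 1 would suggest a reliable item; a correlation near 0,
# like the one these scattered ratings produce, suggests a lot of random error.
```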