Data collection structured by formal research designs

Data collection is the act of making and recording systematic observations. Those records of our observations become our data. The decisions facing the researcher embarking on data collection are myriad: What or who will your cases be? What kind of data will you collect? How will you structure your data collection so that you can convincingly draw conclusions from it later?

Sampling

The selection of cases to observe is the task of sampling. If you’re going to be collecting data from people, you might be able to talk to every person that you want your research to apply to, that is, your population. If you’re doing a study of state election commissioners, you might be able to talk to all 50 of them. In that case, you’d be conducting a census study. Often, though, we’re only able to collect data from a portion of the population, or a sample. We devise a sampling frame, a list of cases we select our sample from—ideally, a list of all cases in the population—but then which cases do we select for the sample? We select cases for our sample by following a sampling design, which comes in two basic varieties: probability sampling designs and nonprobability sampling designs.

In probability sampling designs, every case in the population has a known, greater-than-zero probability of being selected for the sample. This feature of probability sampling designs, along with the wonder of the central limit theorem and law of large numbers, allows us to do something incredibly powerful. If we’re collecting quantitative data from our sample, we can use these data to calculate statistics—quantified summaries of characteristics of the sample, like the median of a variable or the correlation between two variables. If we’ve followed a probability sampling design, we can then use statistics to estimate the parameters—the corresponding quantified characteristics of the population—with known levels of confidence and accuracy. This is what’s going on when you read survey results in the newspaper: “± 3 points at 95% confidence.” For example, if 30% of people in our sample say they’d like to work for government, then we’d be confident that if we were to repeat this survey a thousand times, 95% of the time (our level of confidence), we’d find that between 27 and 33% (because ± 3 points is our degree of accuracy) of the respondents would answer the same way. Put another way, we’d be 95% certain that 27 to 33% of the population would like to work for government.
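If you're curious how that arithmetic works, here is a minimal Python sketch of the standard margin-of-error calculation for a sample proportion. It assumes a simple random sample, and the sample size of 900 is a made-up number chosen so the margin comes out to about 3 points.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a sample proportion p from a simple random
    sample of size n, at the confidence level implied by z (1.96 ~ 95%)."""
    return z * math.sqrt(p * (1 - p) / n)

p_hat = 0.30  # 30% of our hypothetical sample would like to work for government
n = 900       # hypothetical sample size

moe = margin_of_error(p_hat, n)
print(f"Estimate: {p_hat:.0%} plus or minus {moe:.1%}")        # about 3 points
print(f"95% confidence interval: {p_hat - moe:.1%} to {p_hat + moe:.1%}")
```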

Again, this trick of using sample statistics to estimate population parameters with known levels of confidence and accuracy only works when we’ve followed a probability sampling design. The most basic kind of probability sampling design is a simple random sample. In this design, each case in the population has a known and equal probability of being selected for the sample. When social researchers use the term random, we don’t mean haphazard. (This word has become corrupted since I was in college, when my future sister-in-law started saying stuff like “A boy I knew in kindergarten just called—that was so random!” and “I just saw that guy from ‘Saved by the Bell’ at the mall—pretty random!”) It takes a plan to be random, to give every case in the population an equal chance of being selected for a sample. If we were going to randomly select 20 state capitals, we wouldn’t just select the first 20 working from west to east or the first 20 we could think of—that would introduce sampling bias. (We’ll have more to say about bias later, but you get the gist of it for now.) To ensure all 50 capitals had an equal probability of being selected (a probability of 0.4, in fact), we could list them all out on a spreadsheet, use a random number generator to assign them all random numbers, sort them by those numbers, and select the first 20; or we could write each capital’s name on same-sized pieces of paper, put them in a bag, shake them up, and pull out 20 names. (Some textbooks still have random number tables in the back, which you’re welcome to learn how to use on your own, but they’ve become pretty obsolete.)
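The spreadsheet routine described above translates directly into a few lines of Python. This is only an illustrative sketch (the list of capitals is truncated), and Python's built-in random.sample does the assign-random-numbers, sort, and take-the-first-20 routine in a single call.

```python
import random

capitals = ["Montgomery", "Juneau", "Phoenix", "Little Rock", "Sacramento"]
# ...imagine all 50 state capitals listed here

# The spreadsheet method: give every capital a random number,
# sort by those numbers, and keep the first 20.
shuffled = sorted(capitals, key=lambda capital: random.random())
sample = shuffled[:20]

# The same thing in one call; every capital has the same 20/50 = 0.4 chance.
sample = random.sample(capitals, k=min(20, len(capitals)))
print(sample)
```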

Selecting a simple random sample may be too much of a hassle because you just have a long, written list in front of you as your sampling frame, like a printed phonebook. Or, selecting a simple random sample may be impossible because you’re selecting from a hypothetically infinite number of cases, like the vehicles going through an intersection. In such scenarios, you can approximate a random sample by selecting every 10th or 20th or 200th or whateverth case to reach your desired sample size, which is called systematic sampling. This works fine as long as periodicity isn’t present in your population, meaning that there’s nothing odd about every 10th (or whateverth) case. If you were sampling evenings to observe college life, you wouldn’t want to select every 7th case, or you’d introduce severe sampling bias. Just imagine trying to describe campus nightlife by observing only Sunday evenings or only Thursday evenings. As long as periodicity isn’t a problem, though, systematic sampling approximates simple random sampling.
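Here is what systematic sampling looks like as a quick Python sketch, assuming a numbered sampling frame of 1,000 hypothetical cases and a desired sample of 50: pick a random starting point, then take every kth case.

```python
import random

frame = list(range(1, 1001))   # a numbered frame of 1,000 hypothetical cases
desired_n = 50
k = len(frame) // desired_n    # sampling interval: every 20th case

start = random.randrange(k)    # random starting point, so case 1 isn't always first
sample = frame[start::k]
print(len(sample), sample[:5])
```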

Our goal in selecting a random (or systematic) sample is to construct a sample that is like the population so that we can use what we learn about the sample to generalize to the population. What if we already know something about our population, though? How can we make use of that knowledge when constructing our sample? We can replicate known characteristics of the population in our sample by following another probability sampling design, a proportionate stratified sampling design. Perhaps we’d like to sample students at a particular college, and we already know students’ sex, in-state versus out-of-state residency, and undergraduate versus graduate classification. We can use sex, residency, and classification as our strata and select a sample with the same proportions of male versus female, in-state versus out-of-state, and undergraduate versus graduate students as the population. If we determine that 4% of our population are male graduate students from out-of-state and we want a sample of 300 students, we’d select (using random sampling or systematic sampling) 12 (300*4%) male graduate students from out-of-state to be in our sample. We’d carry on similarly sampling students with other combinations of these characteristics until we had a sample proportionally representative of the population in terms of sex, residency, and classification. We probably would have gotten similar results if we had used a simple random sampling strategy, but now we’ve ensured proportionality with regard to these characteristics.
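The per-stratum arithmetic is simple enough to sketch out; the strata and proportions below are invented, and in a real study we would list every combination. Within each stratum, we would still select the actual students by simple random or systematic sampling.

```python
sample_size = 300

# Invented population proportions for a few strata (sex x residency x classification);
# a real study would list every combination.
strata = {
    "male / out-of-state / graduate": 0.04,
    "female / in-state / undergraduate": 0.35,
    "male / in-state / undergraduate": 0.30,
}

for stratum, share in strata.items():
    n = round(sample_size * share)
    print(f"{stratum}: select {n} students")
# male / out-of-state / graduate: select 12   (that's the 300 * 4% from above)
```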

Sometimes, though, proportionality is exactly what we don’t want. What if we were interested in comparing the experiences of students who had been homeschooled to students who were not homeschooled? If we followed a simple random sampling design or a proportionate stratified sampling design, we would probably end up with very few former homeschoolers—not enough to provide a basis of comparison to the never homeschooled. We may even want half of our sample to be former homeschoolers, which would require oversampling from this group to have their representation in the sample disproportionately high compared to the population, achieved by following a disproportionate stratified sampling design. Importantly, this is still a probability sampling design. With some careful math, we can still calculate the probability of any one case in the population being selected for the sample; it’s just that for former homeschoolers, that probability would be higher than for the never homeschooled. Knowing these probabilities still permits us to use statistics to estimate parameters for the entire population of students; we just have to remember to make the responses of former homeschoolers count less and the responses of the never homeschooled count more when calculating our parameter estimates. This is done using weights, which are based on those probabilities, in our statistical calculations.
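Here is a bare-bones sketch of the weighting idea, with invented selection probabilities and responses: each respondent's weight is the inverse of his or her probability of selection, so the oversampled former homeschoolers count less and the never homeschooled count more.

```python
# Invented selection probabilities under a disproportionate design:
# former homeschoolers were oversampled relative to everyone else.
prob_selected = {"homeschooled": 0.20, "not_homeschooled": 0.02}

# Invented respondents: (group, hours studied per week)
respondents = [("homeschooled", 25), ("homeschooled", 18),
               ("not_homeschooled", 12), ("not_homeschooled", 20)]

# Design weight = 1 / probability of selection
weighted_sum = sum(hours / prob_selected[group] for group, hours in respondents)
total_weight = sum(1 / prob_selected[group] for group, _ in respondents)

print(f"Weighted estimate of mean study hours: {weighted_sum / total_weight:.1f}")
```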

One final probability sampling design, cluster sampling design, is commonly used to sample cases that are dispersed throughout a broad geographic region. Imagine the daunting task of needing to sample 2,000 parents of kindergarteners from across the United States. There is no master list of kindergarten students or their parents to serve as a sampling frame. Constructing a sampling frame by going school to school across the country would likely consume more resources than the rest of the study itself—the thought of constructing such a sampling frame is ridiculous, really. We could, though, first randomly select, say, 20 states, and then 10 counties within each of those 20 states, and then 1 school from each of those counties, and then 10 kindergartners from each of those schools. At each step, we know the probability of each state, county, school, and kid being selected for the sample, and we can use those probabilities to calculate weights, which means we can still use statistics to estimate parameters. We’ll have to modify our definition for probability sampling designs just a bit, though. We could calculate the probability of any one case in the population being included in the study, but we don’t. Being able to calculate the probabilities of selection for each sampling unit (states, counties, schools, kids), though, does the same job, so we still count cluster sampling designs as one of the probability sampling designs. To modify our definition of probability sampling designs, we might say that every case in the population has a known or knowable, greater-than-zero probability of being selected for the sample.
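To see why those stage-by-stage probabilities are "knowable," consider one hypothetical kindergartner; the counts below are invented. The overall probability of selection is just the product of the probabilities at each stage, and its inverse becomes that child's weight.

```python
# Invented counts for one sampled kindergartner's chain of clusters
p_state  = 20 / 50    # 20 of 50 states selected
p_county = 10 / 67    # 10 of this state's 67 counties selected
p_school = 1 / 4      # 1 of this county's 4 schools with kindergartens selected
p_child  = 10 / 52    # 10 of this school's 52 kindergartners selected

p_overall = p_state * p_county * p_school * p_child
weight = 1 / p_overall

print(f"Probability this child was selected: {p_overall:.5f}")
print(f"Design weight for this child: {weight:.0f}")
```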

Using a probability sampling design is necessary, but not sufficient, if we want to use statistics to estimate parameters. We still need an adequate sample size. How do we calculate an adequate sample size? Do we, say, select 10% of the population? It would be handy to have such an easy rule of thumb, but as it turns out, the size of the population is only one factor we have to consider when determining the required sample size. (By the way, this is probably the most amazing thing you’ll learn in this text.) In addition to population size, we also have to consider required level of confidence (something you decide yourself), required level of accuracy (something else you decide), and the amount of variance in the parameter (something you don’t get to decide; it is what it is).

As you’d probably guess, the larger the population size, the larger the required sample size. However, the relationship between population size and required sample size is not linear (thus no rule of thumb about selecting 10% or any other percent of the population for your sample). If we have a somewhat small population, we’ll need a large proportion of it in our sample. If we have a very large population, we’ll need a relatively small proportion of it in our sample. In fact, once the population size goes above around 20,000, the sample size requirement hardly increases at all (thanks again to the central limit theorem and the law of large numbers).

We also have to consider how much the parameter varies. Imagine that I’m teaching a class of 40 students, and I know that everyone in the class is the same age, I just don’t know what that age is. How big would my sample size need to be for me to get a very good (even perfect) statistic, the mean age of my students? Think. One! That’s right, just one. My parameter, the mean age of the class, has zero variation (my students are all the same age), so I need a very small sample to calculate a very good statistic. What if, though, my students’ ages were all over the place—from one of those 14-year-old child geniuses to a 90-year-old great grandmother who decided to finish her degree? I’d be very reluctant to use the mean age of a sample of 3, 4, or even 10 students to estimate the whole class’s mean age. Because the population parameter varies a lot, I’d need a large sample. The rule, then: The more the population parameter varies, the more cases I need in my sample.

The astute reader should, at this point, be thinking “Wait a sec. I’m selecting a sample so I can calculate a statistic so I can estimate a parameter. How am I supposed to know how much something I don’t know varies?” Good question. Usually, we don’t, so we just assume the worst, that is, we assume maximum variation, which places the highest demand on sample size. When we specify the amount of variation (like when using the sample size calculators I’ll say more about below), we express it as the percentage of cases taking one value of a parameter that has only two values, like the percentage answering “yes” to a yes/no question. If we wanted to play it safe and assume maximum variation in a parameter, then, we’d specify 50%; if 50% of people in a population would answer “yes” to a yes/no question, the parameter would exhibit maximum variation—it can’t vary any more than a 50/50 split. Specifying 0% or 100% would be specifying no variation, and, as it may have occurred to you already, specifying 25% would be the same as specifying 75%.
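A quick sketch confirms the intuition: for a two-value parameter, variation (measured as p times 1 minus p) peaks at 50% and is identical at 25% and 75%.

```python
for pct in (0, 10, 25, 50, 75, 90, 100):
    p = pct / 100
    print(f"{pct:>3}% yes -> variation p*(1-p) = {p * (1 - p):.4f}")
# 50% gives the maximum (0.2500); 25% and 75% give the same value (0.1875);
# 0% and 100% give no variation at all.
```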

Very astute readers might have another question: “You’ve been referring to a required sample size, but required for what? What does it mean to have a required sample size? Isn’t that what we’re trying to figure out?” Another good question. Given the size of the population (something you don’t control) and the amount of variance in the parameter (something else you don’t control), the sample must be at least a certain size if we want to achieve a desired level of confidence and a desired level of accuracy, the factors you do control. We saw examples of accuracy and confidence previously. We might say “I am 95% certain [so I have a 95% confidence level] that the average age of my class is in the 19 to 21 range [so I have a ± 1 year level of accuracy].” A clumsier way to say the same thing would be “If I were to repeat this study over and over again, selecting my sample anew each time, 95% of my samples would have average ages in the range of 19 to 21.” Confidence and accuracy go together; it doesn’t make sense to specify one without specifying the other. As I’ve emphasized, you get to decide on your levels of confidence and accuracy, but there are some conventions in social research. The confidence level is most often set at 95%, though sometimes you’ll see 90% or 99%. The level of accuracy, which is usually indicated as the range of percentage point estimates, is often set at ±1%, 3%, or 5%. If you’re doing applied research, you might want to relax these standards a bit. You might decide that a survey giving you ±6% at an 85% confidence level is all you can afford, but it will help you make decisions better than no survey at all.

So far, I’ve just said we need to “consider” these four factors—population size, parameter variation, degree of accuracy, and degree of confidence—but, really, we have to do more than just consider them: we have to plug them into a formula to calculate the required sample size. The formula isn’t all that complicated, but most people take the easy route and use a sample size calculator instead, and so will we. Several good sample size calculators will pop up with a quick internet search. You enter the information and get your required sample size in moments. Playing around with these calculators is a bit mind-boggling. Try it out. What would be a reasonable sample size for surveying all United States citizens? What about for all citizens of Rhode Island? What’s surprising about these sample sizes? Play around with different levels of confidence, accuracy, and parameter variation. How much do small changes affect your required sample sizes?
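For the curious, here is a sketch of the textbook formula most of those calculators use (a normal-approximation sample size with a finite population correction). The population figures are rough, round numbers, and real calculators may differ in minor details, but the punch line holds: the required sample size barely budges once populations get large.

```python
import math

def required_sample_size(population, z=1.96, accuracy=0.03, p=0.5):
    """Required n for estimating a proportion with simple random sampling.
    z:        z-score for the confidence level (1.96 is roughly 95%)
    accuracy: desired margin of error (0.03 means plus or minus 3 points)
    p:        assumed variation in the parameter (0.5 = maximum variation)"""
    n0 = (z ** 2) * p * (1 - p) / (accuracy ** 2)        # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))   # finite population correction

print(required_sample_size(330_000_000))  # roughly all U.S. residents -> about 1,068
print(required_sample_size(1_100_000))    # roughly Rhode Island       -> about 1,067
print(required_sample_size(20_000))       # a small city               -> about 1,014
print(required_sample_size(500))          # a small population         -> about 341
```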

And note the interplay of confidence and accuracy. For any given sample size, you can have different combinations of confidence and accuracy, which will have an inverse relationship—as one goes up, the other goes down. With the same sample, I could choose either to be very confident about an imprecise estimate or to be not-so-confident about a precise estimate. I can look over a class of undergraduates and predict with near certainty that their average age is between 17 and 23, or I can predict with 75% confidence that their average age is between 19 and 20.
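Here is that tradeoff in miniature, holding the sample size fixed at a hypothetical 400 respondents and assuming maximum variation: raising the confidence level widens the margin of error, and demanding a tighter margin means settling for less confidence.

```python
import math

n, p = 400, 0.5   # hypothetical sample size, assuming maximum variation
z_for = {"90%": 1.645, "95%": 1.96, "99%": 2.576}

for level, z in z_for.items():
    moe = z * math.sqrt(p * (1 - p) / n)
    print(f"{level} confidence -> about plus or minus {moe:.1%}")
# 90% -> 4.1%, 95% -> 4.9%, 99% -> 6.4%: more confidence, less accuracy.
```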

It’s important to realize what we’re getting from the sample size calculator. This is the minimum sample size if we’re intending to use statistics to estimate single parameters, one by one—that is, we’re calculating univariate statistics. If, however, we’re planning to compare any groups within our sample or conduct any bivariate or multivariate statistical analysis with our data, our sample size requirements will increase accordingly (and necessitate consulting statistics manuals).

Calculating a minimum sample size based on the desired accuracy and confidence only makes sense if we’re following a probability sampling design. Sometimes, though, our goal isn’t to generalize what we learn from a sample to a population; sometimes, we have other purposes for our samples and use nonprobability sampling designs. Maybe we’re doing a trial run of our study. We just want to try out our questionnaire and get a feel for how people will respond to it, so we use a convenience sampling design, which is what it sounds like—sampling whatever cases are convenient. You give your questionnaire to your roommate, your mom, and whoever’s waiting in line with you at the coffee shop. Usually, convenience sampling is used for field testing data collection instruments, but it can also be used for exploratory research—research intended to help orient us to a research problem, to help us figure out what concepts are important to measure, or to help us figure out where to start when we don’t have a lot of previous research to build on. We know that we have to be very cautious in drawing conclusions from exploratory research based on convenience samples, but it can provide a very good starting point for more generalizable research in the future.

In other cases, it would be silly to use a probability sampling design to select your case. What if you wanted to observe people’s behavior at Green Party rallies? Would you construct a sampling frame listing all the upcoming political rallies and randomly select a few, hoping to get a Green Party rally in your sample? Of course not. Sometimes we choose our sample because we want to study particular cases. We may not even describe our case selection as sampling, but when we do, this is purposive sampling. We can also use purposive sampling if we wish to describe typical cases, atypical cases, or cases that provide insightful contrasts. If I were studying factors associated with nonprofit organizational effectiveness, I might select organizations that seem similar but demonstrate a wide range of effectiveness to look for previously unidentified differences that might explain the variation. Purposive sampling is prominent in studies built around in-depth qualitative data, including case studies, which we’ll look at in a bit.

When purposively selecting cases of interest, we should take care not to draw unwarranted conclusions from cases selected on the dependent variable, the taboo sampling strategy. Imagine we want to know whether local governments’ spending on social media advertising encourages local tourism. Our independent variable is social media advertisement spending, and our dependent variable is the amount of tourism. If we were to adopt this taboo sampling strategy, we would identify localities that have experienced large increases in tourism. We may then, upon further investigation, learn they had all previously increased spending on social media advertising and conclude that more advertising spending leads to more tourism. Can we legitimately draw that conclusion, though? It may be that many other localities had also increased their social media advertising spending but did not see an increase in tourism; the level of spending may not affect tourism at all. It’s even possible that localities that saw no increase in tourism spent more on social media advertising than the ones we studied—we do not know, because we fell into the trap of selecting cases on the dependent variable.

We may wish to do probability sampling but lack the resources, potentially making a quota sampling design a good option. This is somewhat of a cross between convenience sampling design and the stratified sampling designs. Before, when we wanted to include 12 male out-of-state graduate students in our sample, we constructed a sampling frame and randomly selected them. We could, however, select the first 12 male out-of-state graduate students we stumble upon, survey them to meet our quota for that category of student, and then seek out students in our remaining categories. (This is what those iPad-carrying marketing researchers at the mall and in theme parks are doing—and why they’ll ignore you one day and chase you down the next.) We’d still be very tentative about generalizing from this sample to the population, but we’d feel more confident than if our sample had been selected completely as a matter of convenience.

One final nonprobability sampling design is useful when cases are difficult to identify beforehand, like meth users, sex workers, or the behind-the-scenes movers-and-shakers in a city’s independent music scene. What’s a researcher wanting to interview such folks to do? Post signs and ask for volunteers? Probably not. She may be able to get that first interview, though, and, once that respondent trusts her, likes her, and becomes invested in her research, she might get referred to a couple more people in this population, which could lead to a few more, and so on. This is called (regrettably, I think, because I’d hate to have the term snowball in my serious research report) a snowball sampling design or (more acceptably but less popularly) a network sampling design, and it has been employed in a lot of fascinating research about populations we’d otherwise never know much about.

Data Collection Methods

The decision of how to select cases to observe may present a long list of options, but deciding what specific types of data to collect presents us with infinite options. It seems to me, though, that the kinds of data collection we do in empirical social research all fall in one of three broad categories: asking questions, making direct observations, and collecting secondary data.

Collecting data by asking questions can be somewhat like our everyday experience of carrying on conversations. If you have taken an introductory communications course, you have learned how interpersonal communication involves encoding our intended meaning in words, transmitting those words to our conversation partner, who then receives those words, decodes them to derive meaning, and then repeats the process in response. All of this can be derailed due to distractions, assumptions, moods, attitudes, social pressures, and motives. In normal conversation, both parties can try to keep communication on track by reading body language, asking clarifying questions, and correcting misunderstandings. When asking questions for research, though, you—the researcher—are solely responsible for crafting a question-and-answer exchange that yields valid data. The researcher must ensure the meaning she intends to encode in her questions is accurately decoded by the respondent; she must ensure the respondent is able to accurately encode his intended meaning in his available response options; she must anticipate and mitigate threats to the accurate encoding and decoding of meaning posed by those distractions, assumptions, moods, attitudes, social pressures, and motives. Before thinking about the nuts and bolts of asking questions for research, understand that it is, essentially, two-way communication with all responsibility for ensuring its accuracy on the head of the researcher.

Volumes have been written about the craft of asking people questions for research purposes, but we can sum up the main points briefly. Researchers ask people questions face-to-face (whether in person or via web-based video conferencing), by telephone, using self-administered written questionnaires, and in web-based surveys. Each of these modes of administration has its advantages and disadvantages. It’s tempting to think that face-to-face interviewing is always the best option, and often, it is a good option. Talking to respondents face-to-face makes it hard for them to stop midway through the interview, gives them the chance to ask questions if something needs clarifying, and lets you read their body language and facial expressions so you can help if they look confused. A face-to-face interview gives you a chance to build rapport with respondents, so they’re more likely to give good, thorough answers because they want to help you out. That’s a double-edged sword, though: Having you stare a respondent in the face might tempt him to give answers that he thinks you want to hear or that make him seem like a nice, smart, witty guy—the problem of social desirability bias.

Combating bias is one of the most important tasks when designing a research project. Bias is any systematic distortion of findings due to the way that the research is conducted, and it takes many forms. Imagine interviewing strangers about their opinions of a particular political candidate. How might their answers be different if the candidate is African-American and the interviewer is white? What if the respondent is interviewed at her huge fancy house and the interviewer is wearing tattered shoes? The human tendencies to want to be liked, to just get along, and to avoid embarrassment are very strong, and they can strongly affect how people answer questions asked by strangers. To the extent that respondents are affected similarly from interview to interview, the way the research is being conducted has introduced bias.

So, then, asking questions face-to-face may be a good option sometimes, but it may be the inferior option if social desirability bias is a potential problem. In those situations, maybe having respondents answer questions using a self-administered written questionnaire would be better. Completing a questionnaire in private goes a long way in avoiding social desirability bias, but it introduces other problems. Mail is easier to ignore than someone knocking at your door or making an appointment to meet with you in your office. You have to count more on the respondent’s own motivation to complete the questionnaire, and if motivated respondents are systematically different from unmotivated nonrespondents, your research plan has introduced self-selection bias. You’re not there to answer questions the respondent may have, which pretty much rules out complicated questionnaire design (such as questionnaires with a lot of skip patterns—“If ‘Yes,’ go to Question 38; if ‘No,’ go to Question 40” kind of stuff). On the plus side, it’s much easier and cheaper to mail questionnaires to every state’s director of human services than to visit them all in person.

You can think through how these various pluses and minuses would play out with surveys administered by telephone. If you’re trying to talk to a representative sample of the population, though, telephone surveys have another problem. Think about everyone you know under the age of 30. How many of them have telephones—actual land lines? How many of their parents have land lines? Most telephone polling is limited to calling land lines, so you can imagine how that could introduce sampling bias—bias introduced when some members of the population are more likely to be included in a study than others. When cell phones are included, you can imagine that there are systematic differences between people who are likely to answer the call and those who are likely to ignore the unfamiliar Caller ID—another source of sampling bias. If you are a counseling center administrator calling all of your clients, this may not be a problem; if you are calling a randomly selected sample of the general population, the bias could be severe.

Web-based surveys have become a very appealing option for researchers. They are incredibly cheap, allow complex skip patterns to be carried out unbeknownst to respondents, face no geographic boundaries, and automate many otherwise tedious and error-prone data entry tasks. For some populations, this is a great option. I once conducted a survey of other professors, a population with nearly universal internet access. For other populations, though—low-income persons, homeless persons, disabled persons, the elderly, and young children—web-based surveys are often unrealistic.

Deciding what medium to use when asking questions is probably easier than deciding what wording to use. Crafting useful questions and combining them into a useful data collection instrument take time and attention to details easily overlooked by novice researchers. Sadly, plentiful examples of truly horribly designed surveys are easy to come by. Well-crafted questions elicit unbiased responses that are useful for answering research questions; poorly crafted questions do not.

So, what can we do to make sure we’re asking useful questions? There are many good textbooks and manuals devoted to just this topic, and you should definitely consult one if you’re going to tackle this kind of research project yourself. Tips for designing good data collection instruments for asking questions, whether questionnaires, web-based surveys, interview schedules, or focus group protocols, boil down to a few basics.

Perhaps most important is paying careful attention to the wording of the questions themselves. Let’s assume that respondents want to give us accurate, honest answers. For them to do this, we need to word questions so that respondents will interpret them in the way we want them to, so we have to avoid ambiguous language. (What does often mean? What is sometimes?) If we’re providing the answer choices for them, we also have to provide a way for respondents to answer accurately and honestly. I bet you’ve taken a survey and gotten frustrated that you couldn’t answer the way you wanted to.

I was once asked to take a survey about teaching online. One of the questions went something like this:

  1. Do you think teaching online is as good as teaching face-to-face?
    1. ❑  Yes
    2. ❑  No
    3. ❑  I think they’re about the same

I’ve taught online a lot, I’ve read a lot about online pedagogy, I’ve participated in training about teaching online, and this was a frustrating question for me. Why? Well, if I answer no, my guess is that the researchers would infer that I think online teaching is inferior to face-to-face teaching. What if I am an online teaching zealot? By no, I may mean that I think online teaching is superior to face-to-face! There’s a huge potential for disconnect between the meaning the respondent attaches to this answer and the meaning the researcher attaches to it. That’s my main problem with this question, but it’s not the only one. What is meant, exactly, by as good as? As good as in terms of what? In terms of student learning? For transmitting knowledge? My own convenience? My students’ convenience? A respondent could attach any of these meanings to that phrase, regardless of what the researcher has in mind. Even if I ignore this, I don’t have the option of giving the answer I want to—the answer that most accurately represents my opinion—it depends. What conclusions could the researcher draw from responses to this question? Not many, but uncritical researchers would probably report the results as filtered through their own preconceptions about the meanings of the question and answer wording, introducing a pernicious sort of bias—difficult to detect, particularly if you’re just casually reading a report based on this study, and distorting the findings so much as to actually convey the opposite of what respondents intended. (I was so frustrated by this question and fearful of the misguided decisions that could be based on it that I contacted the researcher, who agreed and graciously issued a revised survey—research methods saves the day!) Question wording must facilitate unambiguous, fully accurate communication between the researcher and respondent.

Just as with mode of administration, question wording can also introduce social desirability bias. Leading questions are the most obvious culprit. A question like Don’t you think public school teachers are underpaid? makes you almost fall over yourself to say “Yes!” A less leading question would be Do you think public school teachers are paid too much, paid too little, or paid about the right amount? To the ear of someone who doesn’t want to give a bad impression by saying the “wrong” answer, all of the answers sound acceptable. If we’re particularly worried about potential social desirability bias, we can use normalizing statements: Some people like to follow politics closely and others aren’t as interested in politics. How closely do you like to follow politics? would probably get fewer trying-to-sound-like-a-good-citizen responses than Do you stay well informed about politics?

Closed-ended questions—questions that give answers for respondents to select from—are susceptible to another form of bias, response set bias. When respondents look at a range of choices, there’s subconscious pressure to select the “normal” response. Imagine if I were to survey my students, asking them:

  1. How many hours per week do you study?
    1. ❑  Less than 10
    2. ❑  10 – 20
    3. ❑  More than 20

That middle category just looks like it’s the “normal” answer, doesn’t it? The respondent’s subconscious whispers “Lazy students must study less than 10 hours per week; more than 20 must be excessive.” This pressure is hard to avoid completely, but we can minimize the bias by anticipating this problem and constructing response sets that represent a reasonable distribution.

Response sets must be exhaustive—be sure you offer the full range of possible answers—and the responses must be mutually exclusive. How not to write a response set:

  1. How often do you use public transportation?
    1. ❑  Never
    2. ❑  Every day
    3. ❑  Several times per week
    4. ❑  5 – 6 times per week
    5. ❑  More than 10 times per week

(Yes, I’ve seen stuff this bad.)

Of course, you could avoid problems with response sets by asking open-ended questions. They’re no panacea, though. Closed- and open-ended questions have their advantages and disadvantages. Open-ended questions can give respondents freedom to answer how they choose, they remove any potential for response set bias, and they allow for rich, in-depth responses if a respondent is motivated enough. However, respondents can be shockingly ambiguous themselves, they can give responses that obviously indicate the question was misunderstood, or they can just plain answer with total nonsense. The researcher is then left with a quandary—what to do with these responses? Throw them out? Is that honest? Try to make sense of them? Is that honest? Closed-ended questions do have their problems, but the answers are unambiguous, and the data they generate are easy to manage. It’s a tradeoff: With closed-ended questions, the researcher is structuring the data, which keeps things nice and tidy; with open-ended questions, the researcher is giving power to respondents to structure the data, which can be awfully messy, but it can also yield rich, unanticipated results.

Choosing open-ended and closed-ended questions to different degrees gives us a continuum of approaches to asking individuals questions, from loosely structured, conversational-style interviews, to highly standardized interviews, to fill-in-the-bubble questionnaires. When we conduct interviews, it is usually in a semi-structured interview style, with the same mostly open-ended questions asked, but with variations in wording, order, and follow-ups to make the most of the organic nature of human interaction.

When we interview a small group of people at once, it’s called a focus group. Focus groups are not undertaken for the sake of efficiency—it’s not just a way to get a lot of interviews done at once. Why do we conduct focus groups, then? When you go see a movie with a group of friends, you leave the theater with a general opinion of the movie—you liked it, you hated it, you thought it was funny, you thought it meant …. When you go out for dessert afterward and start talking with your friends about the movie, though, you find that your opinion is refined as it emerges in the course of that conversation. It’s not that your opinion didn’t exist before or, necessarily, that the discussion changed your opinion. Rather, it’s in the course of social interaction that we uncover and use words to express our opinions, attitudes, and values that would have otherwise lain dormant. It’s this kind of emergent opinion that we use focus groups to learn about. We gather a group of people who have something in common—a common workplace, single parenthood, Medicaid eligibility—and engage them in a guided conversation so that the researcher and participants alike can learn about their opinions, values, and attitudes.

Asking questions is central to much empirical social research, but we also collect data by directly observing the phenomena we’re studying, called field research or simply (and more precisely, I think) direct observation. We can learn about political rallies by attending them, about public health departments by sitting in them, about public transportation by riding it, and about judicial confirmation hearings by watching them. In the conduct of empirical social research, such attending, sitting, riding, and watching aren’t passive or unstructured. To prepare for our direct observations, we construct a direct observation tool (or protocol), which acts like a questionnaire that we “ask” of what we’re observing. Classroom observation tools, for example, might prompt the researcher to record the number of students, learning materials available in the classroom, student-teacher interactions, and so on.

The advice for developing useful observation tools isn’t unlike the advice for developing useful instruments for asking questions; the tool must enable an accurate, thorough, unbiased description of what’s observed. Likewise, a potential pitfall of direct observation is not unlike social desirability bias: When people are being observed, their knowledge of being observed may affect their behavior in ways that bias the observations. This is the problem of participant reactivity. Surely the teacher subjected to the principal’s surprise visit is a bit more on his game than he would have been otherwise. The problem isn’t insurmountable. Reactivity usually tapers off after a while, so we can counter this problem by giving people being observed enough time to get used to it. We can just try to be unobtrusive, we can make observations as participants ourselves (participant observation), or, sometimes, we can keep the purpose of the study a mystery so that subjects wouldn’t know how to play to our expectations even if they wanted to.

Finally, we can let other people do our data collection for us. If we’re using data that were collected by someone else for their own purposes, our data collection strategy is using secondary data. Social science researchers are fortunate to have access to multiple online data warehouses that store datasets related to an incredibly broad range of social phenomena. In political science, for example, we can download and analyze general public opinion datasets, results of surveys about specific public policy issues, voting data from federal and state legislative bodies, social indicators for every country, and on and on. Popular data warehouses include Inter-University Consortium for Political and Social Research (ICPSR), University of Michigan’s National Elections Studies, Roper Center for Public Opinion Research, United Nations Common Database, World Bank’s World Development Indicators, and U.S. Bureau of the Census. Such secondary data sources present research opportunities that would otherwise outstrip the resources of many researchers, including students.

A particular kind of secondary data, administrative data, are commonly used across the social sciences, but are of special interest to those of us who do research related to public policy, public administration, and other kinds of organizational behavior. Administrative data are the data collected in the course of administering just about every agency, policy, and program. For public agencies, policies, and programs, they’re legally accessible thanks to freedom of information statutes, and they’re frequently available online. Since the 1990s, these datasets have become increasingly sophisticated due to escalating requirements for performance measurement and program evaluation. Still, beware: Administrative datasets are notoriously messy. These data usually weren’t collected with researchers in mind, so the datasets require a lot of cleaning, organizing, and careful scrutiny before they can be analyzed.

Formal research designs

Simply collecting data is insufficient to answer research questions. We must have a plan, a research design, to enable us to draw conclusions from our observations. Different methodologists divvy up the panoply of research designs different ways; we’ll use five categories: cross-sectional, longitudinal, experimental, quasi-experimental, and case study.

Cross-sectional research design is the simplest. Researchers following this design are making observations at a single point in time; they’re taking a “snapshot” of whatever they’re observing. Now, we can’t take this too literally. A cross-sectional survey may take place over the course of several weeks. The researcher won’t, however, care to distinguish between responses collected on day 1 versus day 2 versus day 28. It’s all treated as having been collected in one wave of data collection. Cross-sectional research design is well suited to descriptive research, and it’s commonly used to make cross-case comparisons, like comparing the responses of men to the responses of women or the responses of Republicans to the responses of Democrats. If we’re interested in establishing causality with this research design, though, we have to be more careful, because we have to be sure that cause comes before effect. Sometimes it’s not a problem. If you’re interested in determining whether respondents’ region of birth influences their parenting styles, you can be sure that the respondents were born wherever they were born before they developed any parenting style, so it’s OK that you’re asking them questions about all that at once. However, if you’re interested in determining whether interest in politics influences college students’ choice of major, a cross-sectional design might leave you with a chicken-and-egg problem: Which came first? A respondent’s enthusiasm for following politics or taking her first political science course? Exploring causal research questions using cross-sectional design isn’t verboten, then, but we do have to be cautious.

Longitudinal research design involves data collection over time, permitting us to measure change over time. If a different set of cases is observed every time, it’s a time series research design. If the same cases are followed over time, with changes tracked at the case level, it’s a panel design.

Experimental research design is considered by most to be the gold standard for establishing causality. (This is actually a somewhat controversial statement. We’ll ignore the controversy here except to say that most who would take exception to this claim are really critical of the misapplication of this design, not the design itself. If you want to delve into the controversy, do an internet search for federally required randomized controlled trial program evaluation designs.) Let’s imagine an experimental-design study of whether listening to conservative talk radio affects college students’ intention to vote in an upcoming election. I could recruit a bunch of students (with whichever sampling plan I choose) and then have them all sit in a classroom listening to MP3 players through earbuds. I would have randomly given half of them MP3 players with four hours of conservative talk radio excerpts and given the other half MP3 players with four hours of muzak. Before they start listening, I’ll have them respond to a questionnaire item about their likelihood of voting in the upcoming election. After the four hours of listening, I’ll ask them about their likelihood of voting again. I’ll compare those results, and if the talk radio group is now saying they’re more likely to vote while the muzak group’s intentions stayed the same, I’ll be very confident in attributing that difference to the talk radio.

My talk radio experiment demonstrates the three essential features of experimental design: random assignment to experimental and control groups, control of the experimental setting, and manipulation of the independent variable. Control refers to the features of the research design that rule out competing explanations for the effects we observe. The most important way we achieve control is by the use of a control group. The students were randomly assigned to a control group and an experimental group. The experimental group gets the “treatment”—in this case, the talk radio, and the control group gets the status quo—in this case, listening to muzak. Everything else about the experimental conditions, like the time of day and the room they were sitting in, was controlled as well, meaning that the only difference in the conditions surrounding the experimental and control groups was what they listened to. This experimental control let me attribute the effects I observed—increases in the experimental group’s intention to vote—to the cause I introduced—the talk radio.

The third essential feature of experimental design, manipulation of the independent variable, simply means the researcher determines which cases get which values of the independent variable. This is simple with MP3 players, but, as we’ll see, it can be impossible with the kinds of phenomena many social researchers are interested in.

Experimental methods are such strong designs for exploring questions of cause and effect because they enable researchers to achieve the three criteria for making causal claims—the standards we use to assess the validity of causal claims: time order, association, and nonspuriousness. Time order is the easy one (unless you’re aboard the starship Enterprise). We can usually establish that cause preceded effect without a problem. Association is also fairly easy. If we’re working with quantitative data (as is usually the case in experimental research designs), we have a whole arsenal of statistical tools for demonstrating whether and in what way two variables are related to each other. If we’re working with qualitative data, good qualitative data analysis techniques can convincingly establish association, too.

Meeting the third criterion for making causal claims, nonspuriousness, is trickier. A spurious relationship is a phony relationship. It looks like a cause-and-effect relationship, but it isn’t. Nonspuriousness, then, requires that we establish that a cause-and-effect relationship is the real thing—that the effect is, indeed, due to the cause and not something else. Imagine conducting a survey of freshmen college students. Based on our survey, we claim that being from farther away hometowns makes students more likely to prefer early morning classes. Do we meet the first criterion? Yes, the freshmen were from close by or far away before they ever registered for classes. Do we meet the second criterion? Well, it’s a hypothetical survey, so we’ll say yes, in spades: Distance from home to campus and average class start time are strongly and inversely correlated.

What about nonspuriousness, though? To establish nonspuriousness, we need to think of any competing explanations for this alleged cause-and-effect relationship and rule them out. After running our ideas past the admissions office folks, we learn that incoming students from close by usually attend earlier orientation sessions, those from far away usually attend later orientation sessions, and—uh-oh—they register for classes during orientation. We now have a potential competing explanation: Maybe freshmen who registered for classes later are more likely to end up in early morning classes because classes that start later are already full. The students’ registration date, then, becomes a potentially important control variable. It’s potentially important because it’s quite plausibly related to both the independent variable (distance from home to campus) and the dependent variable (average class start time). If the control variable, in fact, is related to both the independent variable and dependent variable, then that alone could explain why the independent and dependent variables appear to be related to each other when they’re actually not. When we do the additional analysis of our data, we confirm that freshmen from further away did, indeed, tend to register later than freshmen from close by, that students who register later tend to end up in classes with earlier start times, and, when we control for registration date, there’s not an actual relationship between distance from home and average class start time. Our initial causal claim does not achieve the standard of nonspuriousness.
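If you'd like to see this play out in data, here is a small simulated sketch (every number is invented): registration date drives both distance from home and class start times, so the raw correlation between distance and start time looks impressive, but it disappears once we compare students within the same registration group.

```python
import random
from statistics import mean

def corr(xs, ys):
    """Pearson correlation, just enough for this illustration."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
students = []
for _ in range(1000):
    distance = random.uniform(5, 500)        # invented miles from home
    registered_late = distance > 250         # far-away students register later
    # Average class start time depends ONLY on registration date, not on distance:
    start_time = (8 if registered_late else 10) + random.gauss(0, 1)
    students.append((distance, registered_late, start_time))

distances = [s[0] for s in students]
start_times = [s[2] for s in students]
print("Raw correlation, distance vs. start time:", round(corr(distances, start_times), 2))

for late, label in [(False, "registered early"), (True, "registered late")]:
    d = [s[0] for s in students if s[1] == late]
    t = [s[2] for s in students if s[1] == late]
    print(f"Within the '{label}' group:", round(corr(d, t), 2))
# The raw correlation is strongly negative, but within each registration group
# it is near zero: the distance/start-time relationship was spurious.
```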

The beauty of experimental design—and this is the crux of why it’s the gold standard for causal research—is in its ability to establish nonspuriousness. When conducting an experiment, we don’t even have to think of potential control variables that might serve as competing explanations for the causal relationship we’re studying. By randomly assigning (enough) cases to experimental and control groups and then maintaining control of the experimental setting, we can assume that the two groups and their experience in the course of the study are alike in every important way except one—the value of the independent variable. Random assignment takes care of potential competing explanations we can think of and competing explanations that never even occur to us. In a tightly controlled experiment, any difference observed in the dependent variable at the conclusion of the experiment can confidently be attributed to the independent variable alone.

“Tightly controlled experiments,” as it turns out, really aren’t that common in social research, though. Too much of what we study is important only when it’s out in the real world, and if we try to stuff it into the confines of a tightly controlled experiment, we’re unsure if what we learn applies to the real thing. Still, experimental design is something we can aspire to, and the closer we can get to this ideal, the more confident we can be in our causal research. Whenever we have a research design that mimics experimental design but is missing any of its key features—random assignment to experimental and control groups, control of the experimental setting, and manipulation of the independent variable—we have a quasi-experimental design.

Often, randomly assigning cases to experimental and control groups is prohibitively difficult or downright impossible. We can’t assign school children to public schools and private schools, we can’t assign future criminals to zero tolerance states and more lax states, and we can’t assign pregnant women to smoking and nonsmoking households. We often don’t have the power to manipulate the independent variable, like deciding which states will have motor-voter laws and which won’t, to test its effects on voting behaviors. Very rarely do we have the ability to control the experimental setting; even if we could randomly assign children to two different kindergarten classrooms to compare curricula, how can other factors—the teachers’ personalities, for instance—truly be the same?

Quasi-experimental designs adapt to such research realities by getting as close to true experimental design as possible. There are dozens of variations on quasi-experimental design with curious names like regression discontinuity and switching replications with nonequivalent groups, but they can all be understood as creative responses to the challenge of approximating experimental design. When we divide our cases into two groups by some means other than random assignment, we don’t get to use the term control group anymore, but comparison group instead. The closer our comparison group is to what a control group would have been, the stronger our quasi-experimental design. To construct a comparison group, we usually try to select a group of cases similar to the cases in our experimental group. So, we might compare one kindergarten classroom enjoying some pedagogical innovation to an adjacent kindergarten classroom with the same old curriculum or Alabama drivers after a new DUI law to Mississippi drivers not bound by it.

If we’re comparing these two groups of drivers, we’re also conducting a natural experiment. In a natural experiment, the researcher isn’t able to manipulate values of the independent variable; we can’t decide who drives in Mississippi or Alabama, and we can’t decide whether or not a state would adopt a new DUI law. Instead, we take advantage of “natural” variation in the independent variable. Alabama did adopt a new DUI law, and Mississippi did not, and people were driving around in Alabama and Mississippi before and after the new law. We have the opportunity for before-and-after comparisons between two groups, it’s just that we didn’t introduce the variation in the independent variable ourselves; it was already out there.

Social researchers also conduct field experiments. In a field experiment, the researcher randomly assigns cases to experimental and comparison groups, but the experiment is carried out in a real-life setting, so experimental control is very weak. I once conducted a field experiment to evaluate the effectiveness of an afterschool program in keeping kids off drugs and such. Kids volunteered for the program (with their parents’ permission). There were too many volunteers to participate all at once, so I randomly assigned half of them to participate during fall semester and half to participate during spring semester. The fall semester kids served as my experimental group and, during the fall semester, the rest of the kids served as my comparison group. At the beginning of the fall semester, I had all of them complete a questionnaire about their attitudes toward drug use, etc., then the experimental group participated in the program while the comparison group did whatever they normally did, and then at the end of the semester, all the kids completed a similar questionnaire again. Sure enough, the experimental group kids’ attitudes changed for the better, while the comparison group kids’ attitudes stayed about the same (or even changed a bit for the worse). All throughout the program, the experimental group and comparison group kids went about their lives—I certainly couldn’t maintain experimental control to ensure that the only difference between the two groups was the program.

Very strong research designs can be developed by combining one of the longitudinal designs (time series or panel) with either experimental or quasi-experimental design. With such a design, we observe values of the dependent variable for both the experimental and control (or comparison) groups at multiple points in time, then we change (or observe the change of) the independent variable for the experimental group, and then we observe values of the dependent variable for both groups at multiple points in time again.

That’s a bit confusing, but an example will clarify: Imagine inner-city pharmacies agree to begin stocking fresh fruits and vegetables, which people living nearby otherwise don’t have easy access to. We might want to know whether this will affect area residents’ eating habits. There are lots of ways we could go about this study, but probably the strongest design would be an interrupted time series quasi-experimental design. Here’s how it might work: Before the pharmacies begin stocking fresh produce, we could conduct door-to-door surveys of people in two inner-city neighborhoods—one without a pharmacy and one with a pharmacy. We could survey households once a month for four months before the produce is stocked, asking folks about how much fresh produce they eat at home.

(A quick aside: We’d probably want to talk to different people each time since, otherwise, the mere fact that we keep asking them about their eating habits might lead them to change what they eat—an example of a measurement artifact, which we try to avoid. We want to measure changes in our dependent variable, eating habits, that are due to change in the independent variable, availability of produce at pharmacies, not due to respondents’ participation in the study itself.)

After the pharmacies begin stocking fresh produce, we would then conduct our door-to-door surveys in both neighborhoods again, perhaps repeating them once a month for another four months. Once we’re done, we’d have a very rich dataset for estimating the effect of available produce on eating habits. We could compare the two neighborhoods before the produce was available to establish just how similar their eating habits were before, and then we could compare the two neighborhoods afterward. We might see little difference one month after the produce became available as people became aware of it, then maybe a big difference in the second month in response to the novelty of having produce easily available, and then maybe a more moderate, steady difference in the third and fourth months as some people returned to their old eating habits and others continued to purchase the produce. With this design, we can provide very persuasive evidence that the experimental and comparison groups were initially about the same in terms of the dependent variable, which increases our confidence that any changes we see later are indeed due to the change in the independent variable. We can also capture change over time, which is frequently very important when we’re measuring behavioral changes, which tend to diminish over time.
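Here is a sketch of how we might summarize that comparison, using entirely invented monthly averages: we track the gap between the pharmacy neighborhood and the no-pharmacy neighborhood month by month, before and after the produce appears.

```python
# Invented monthly averages: servings of fresh produce eaten at home per week,
# four months before and four months after the pharmacy starts stocking produce.
months = ["pre1", "pre2", "pre3", "pre4", "post1", "post2", "post3", "post4"]
pharmacy_neighborhood    = [3.1, 3.0, 3.2, 3.1, 3.4, 4.6, 4.0, 3.9]
no_pharmacy_neighborhood = [3.0, 3.1, 3.1, 3.0, 3.1, 3.0, 3.2, 3.1]

for month, treated, comparison in zip(months, pharmacy_neighborhood,
                                      no_pharmacy_neighborhood):
    print(f"{month:>5}: gap = {treated - comparison:+.1f} servings")
# Near-zero gaps before the change show the neighborhoods started out alike;
# the pattern afterward (small bump, novelty spike, then a steadier moderate gap)
# is what we would attribute to the newly available produce.
```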

Case study research design is the oddball of the formal research designs. Many researchers who feel comfortable with all the other designs would feel ill-equipped to undertake a case study. A case study is an in-depth, holistic, systematic study of a complex case. Unlike the other designs, we’re just studying a single case, which is usually something like an event, such as a presidential election, or a program, such as the operation of a needle exchange program. With the other designs, we usually rely on a single data collection method, but with case study research design, we use multiple data collection methods, with a heavy emphasis on collecting qualitative data. In the course of a single case study, we might conduct interviews, conduct focus groups, administer questionnaires, review administrative records, and conduct extensive direct observations. We make enough observations in as many different ways as necessary to enable us to write a rich, detailed description of our case. This written report is, itself, called a case study.

The richness of case studies highlights another key difference between this and the other research designs. The contrast with experimental design is sharpest: If you think about experimental design, its beauty lies in ignoring complexity. If I were to randomly assign a bunch of teenagers to experimental and control groups, my express intention would be to ignore all their pimply, hormonal, awkward, exuberant complexity and the group dynamics that would undoubtedly emerge in the two groups. I count on random assignment and experimental control to make all differences between the two groups a complete wash except the difference in the independent variable. With case studies, though, we embrace this complexity. The whole point is to describe this rich complexity, bringing only enough organization to it to make it understandable to people who can’t observe it directly—those people who will ultimately read our written case studies.

There are many elaborations on these formal research designs. A few more, along with a system of notation for depicting research designs, are presented in Appendix B.