Appendix B: More research designs

This appendix recaps some of the formal research designs covered in the main text and introduces some elaborations on these designs. We’ll learn about these designs as applied to program evaluation. Program evaluation is the use of research methods to learn about programs—such as job training programs, dropout prevention programs, substance abuse treatment programs, and so on—with the goal of learning about their effectiveness or how to improve them. I find that students tend to grasp the idea of using research methods this way very intuitively, so it’s a helpful lens for learning about research methods generally. You’ve all casually evaluated programs a lot—think about why you chose one college over others, why you chose your major, and how you’ve come up with ideas for how to make your major even better. Program evaluation involves this same kind of thinking, but based on systematic observations made using the tools of empirical social science research.

Along the way, we’ll also learn the standard notation system for research designs. This system of notation makes it much easier for us to communicate about research designs, so be sure you master it in addition to learning about the evaluation designs themselves.

Our notation will use three letters: R, X, and O. R stands for random assignment (and will only be used to depict research designs that use random assignment). X represents our program “happening”—the “intervention” in the terminology of clinical psychology. O stands for observation. This refers to observing our outcome indicators. In research methods jargon, X represents the value of the independent variable (IV) that we want to know the effect of, and O represents the act of measuring the dependent variable (DV). So, if we were evaluating a job placement program, X would represent clients participating in the program, and O would represent measuring the key outcomes of that program—whether or not the clients are employed, or maybe their earnings. Program implementation functions as an independent variable (it “happens” to particular people or not), and our outcomes (employment status, wages) function as our dependent variables. The program manager’s hope is that the program (IV) will have a positive effect on the outcomes (DV).

We can use these three letters to depict all sorts of research designs. We could start with simple outcome measurement. With this type of evaluation, we make observations (O) of our outcomes just once—at the conclusion of an instance of program implementation (like at the conclusion of a client participating in the program). This should remind you of a familiar research design: the cross-sectional design—our observations are made at one point in time with no effort to track change in our DV over time.

We can depict this design like this:

X O

We read that from left to right: The program happens (X), and then we make our observations (O). Another term for this is the single-group posttest-only evaluation design. That means we’re making observations of just one group (usually people participating in our program, but it could also be, say, stretches of highway in an anti-litter program), and we’re measuring our outcomes only after the program.

(That term, posttest, like pretest, which we will see in a minute, makes it sound like the only way we measure outcomes is by administering tests—fortunately, that’s very much not the case, but it is an unfortunate implication of the term. You can use other terms, like before and after, to get around that bit of confusion, but we’ll go with these terms for now.)

This is a very simple evaluation design, and it’s very common. Sometimes, it’s sufficient because we can confidently attribute the outcomes we observe to the program. Imagine a program in which employees attend a one-hour workshop on how to use the new campus intranet. There’s no way they would have had that knowledge beforehand, so if we observe indicators of their knowledge of the system after the program (like on a quiz—always makes for a fun way to end a workshop!), we can be quite confident that they gained that knowledge during the workshop.

Often, however, the single-group posttest-only design is weak because we can’t know that the observed outcomes are truly due to the program. (This would be weak internal validity, remember, in research methods jargon.) Imagine, instead, a 3-month program of weekly, one-hour workshops intended to improve employees’ workplace communication skills. You could use the simple X O design, but what if you observed indicators of excellent workplace communication skills? How confidently can you attribute those outcomes to the program? How do you know the participants didn’t already have strong communication skills? Or that they started with good communication skills, and now they have just slightly better communication skills? Or that they started with excellent communication skills, and now their skills are actually worse because they’re so afraid of messing up? The X O design doesn’t let us explore any of those possibilities.

There are two main approaches (and many, many elaborations on these two approaches) to strengthening the internal validity of our evaluations: (1) making observations over time, and (2) making comparisons. Let’s start with making observations over time. That should call to mind our longitudinal designs—time series and panel. We’ll usually be using panel designs.

For example, our workplace communication workshop participants might take a pretest—a measure of our outcome before the program—and then a posttest—again, a measure of our outcome—after the program. That way, we can track changes in the individual participants’ levels of communication skills over time. This is a single-group pretest/posttest design, depicted like this:

O1 X O2

Notice that we’re now designating our observations with subscript numbers to help us keep them straight.

The single-group pretest/posttest design is a big improvement over the single-group posttest-only design. We can now see if our outcome indicators actually change from before to after the program. This is also a very common evaluation design, and, like the X O design, it may be adequate if you can confidently attribute the changes you observe to the program and not to some other factor. If we did see improvements in our participants’ workplace communication skills, we’d probably be pretty confident in attributing those improvements to our program.

Let’s imagine still another scenario, though. What if we’re evaluating a 12-week youth development program that involves weekly small group meetings with the goal of helping middle schoolers improve their self-image? A single-group pretest/posttest design would be better than nothing, but what if we did see improvement in our self-image indicators? How would we know that the program had made the difference? What if improvements in self-image just tend to happen naturally as kids become more acclimated to their middle schools and make new friends and so on? Or what if something else happened during the program—like what if they all happened to start doing yoga in PE, and that made the difference in their self-image? How do we know that these kids’ self-images wouldn’t have improved even without the program? To answer those questions, we need to use that second strategy for strengthening the internal validity of our evaluations: making comparisons.

Here’s where we come to the design that, as we’ve already learned, is considered the gold standard in evaluation: the experimental design. Here’s how we depict the classic experimental design:

R O1 X O2
R O3 O4

Now we have two rows, which indicates that we have two groups. The top row depicts the experimental group, also called the treatment group. In a client-serving program, this would be a group of people participating in our program. The second row depicts the control group. This is a group of people who do not participate in the program—they receive no services or just whatever the status quo is.

The Rs indicate that the clients participating in our evaluation were randomly assigned to the experimental and control groups. Remember, random doesn’t mean haphazard. Random assignment means that all of our cases—usually the people participating in our evaluation—had an equal probability of being assigned to the experimental group or the control group. This is really important because it means that, with a large enough number of participants, we can figure that the two groups were, on average, pretty much the same. They’re the same in terms of things we might think about—like motivation for change or pre-existing knowledge—and they’re also the same even in terms of things we don’t ever think about. The only difference, then, between the two groups is that the experimental group participates in the program and the control group does not.

The features of the experimental design give us a lot of confidence in attributing changes in outcomes to the program. We can see before-to-after change by comparing O1 to O2, and we can rule out the possibility that the change would have occurred even without the program by observing how the control group’s outcome indicators change from O3 to O4. This is key—because of random assignment, we can assume that the two groups started out pretty much the same in terms of the outcome we’re interested in and even in terms of everything else that might affect outcomes—things like their motivation or pre-existing knowledge. We can even double-check some of this by comparing O1 to O3, which we’d expect to be close to the same. And if there would have been some “natural” improvement in the outcome even without the program, we can account for that.

This is accomplished by calculating the difference in differences—that’s [(O2 - O1) - (O4 - O3)]—very literally the difference between the two groups’ before-to-after differences.

Let’s look at some numbers to help that make sense. Let’s say we’re measuring our youth development program’s effect on our participants’ self-image using some kind of an assessment that gives a score from 0 to 100, and that we observe these average scores for our experimental and control groups before and after the program:

R 60 X 80
R 60 70

Here, I’ve substituted the two groups’ average pretest and posttest scores for O1, O2, O3, and O4. First, note that our random assignment worked—our average pre-program outcome measures are the same for our experimental and control groups. (In real life, these numbers wouldn’t be exactly the same, but they should be close.)

So, did our program work? Well, the program participants’ scores increased by an average of 20 points, so that’s good. But our control group’s scores increased by an average of 10 points, even without participating in the program. What would be our measure of the program’s effectiveness, then? We calculate the difference in differences—we calculate the change for the control group and subtract that from the change for the experimental group: 20 minus 10, or 10 points. We can be very confident, then, that our program accounted for a 10-point improvement in our participants’ self-image scores.
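If it helps to see that arithmetic spelled out, here is a minimal sketch in Python using the scores from our example. (The function here is just for illustration; it isn’t part of any standard evaluation toolkit.)

```python
def difference_in_differences(pre_treat, post_treat, pre_control, post_control):
    """Program effect estimate: (O2 - O1) - (O4 - O3)."""
    treatment_change = post_treat - pre_treat    # O2 - O1 = 80 - 60 = 20
    control_change = post_control - pre_control  # O4 - O3 = 70 - 60 = 10
    return treatment_change - control_change

# Average self-image scores from the example above
effect = difference_in_differences(pre_treat=60, post_treat=80,
                                   pre_control=60, post_control=70)
print(effect)  # 10 -> the 10-point improvement we attribute to the program
```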

We can also see how the experimental design is a big improvement over the other designs. Imagine we had used a single-group posttest-only design:

X 80

We’d be pleased to see a nice, high average outcome score, but we wouldn’t be very confident at all in attributing that score to our program. If we used a single-group pretest/posttest design:

60 X 80

… we’d know that our outcome measures had, on average, increased during the program. We’d be mistaken, though, to attribute the entire increase to our program—a mistake we couldn’t have caught without the control group for comparison.

There are lots of variations on experimental designs. You might be comparing two different program models instead of comparing a program to no program, which we could depict like this:

R O1 X1 O2
R O3 X2 O4

… now with two experimental groups participating in two different programs, represented by the two Xs, instead of one program and one no-treatment control group.

If you’re concerned about testing artifacts—the possibility that the act of taking the pretest might help your participants score better on the posttest—you can explore that possibility with a Solomon 4-group design:

R O1 X O2
R O3 O4
R X O5
R O6

Pause for a moment and think about how you would go about looking for a testing artifact. Which observations, or pre-to-post differences, would you compare?
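Once you’ve given it some thought, here is one rough sketch of those comparisons in Python, using hypothetical average posttest scores just to show the logic:

```python
# Hypothetical average posttest scores for the four Solomon groups
O2 = 82  # pretested, participated in the program
O4 = 71  # pretested, no program
O5 = 80  # no pretest, participated in the program
O6 = 70  # no pretest, no program

# If taking the pretest itself boosts posttest scores (a testing artifact),
# the pretested groups should outscore their unpretested counterparts.
testing_effect_with_program = O2 - O5     # 2 points
testing_effect_without_program = O4 - O6  # 1 point

# The program's effect, estimated with and without the pretest in the picture
program_effect_pretested = O2 - O4    # 11 points
program_effect_unpretested = O5 - O6  # 10 points

print(testing_effect_with_program, testing_effect_without_program)
print(program_effect_pretested, program_effect_unpretested)
```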

OK. Hopefully, you understand why experimental designs are considered the gold standard for evaluating programs’ effectiveness. They use both strategies for strengthening the internal validity of our designs—we can measure change over time, and we can make good comparisons. Random assignment means that we can be very confident in our comparisons because the only difference between our experimental group and control group is the program, so we can attribute any differences we observe in their outcomes to the program.

Very often, though, experimental designs aren’t feasible. A program might be a full coverage program, meaning that everyone who is eligible participates, so there’s no viable control group. Or maybe it poses too great an ethical dilemma to withhold services from the control group (though maybe you can overcome that by providing services to the control group after the evaluation). Or maybe it’s just too complicated or expensive—very common problems with experimental designs. If these problems cannot be overcome, then the second-best option is often a quasi-experimental design.

There are many, many types of quasi-experimental designs. One of the thickest books on my bookshelves is nothing but an encyclopedia of quasi-experimental designs. Obviously, we’re not going to cover all of those, but they all have one thing in common: These evaluation designs are all trying to get as close as possible to experimental design while creatively overcoming whatever obstacles keep us from carrying out an experiment in the first place. For the most part, I’m going to leave it at that—all of these quasi-experimental designs are creative solutions to overcoming challenges to carrying out experimental designs. Here’s the most common example, though …

If our basic experimental design looks like this:

R O1 X O2
R O3 O4

Then a very basic quasi-experimental design looks like this:

O1 X O2
O3 O4

This is called a nonequivalent comparison group design. All we’ve done is taken away random assignment. Instead of random assignment, we’ve used some other way to come up with our comparison group (which, recall, we must now call a comparison group, not a control group—the term control group is reserved for when we’ve used random assignment). Maybe we found a similar group—like a class of students in study hall to compare to the class of students participating in our program.

However we found our comparison group, the goal is to have a comparison group that is as similar to our experimental group as possible—just like a true control group would have been. This can be very, very tricky.

One big problem is what’s called self-selection bias, which we considered briefly before. If kids volunteered to participate in our program, meaning they self-selected into our program, then they probably tend to be different somehow from the average non-participant. If we just choose a bunch of other kids to be our comparison group, they’re probably not a very good comparison group. We’d need to figure out some way to find a comparison group that had similar motivations—like a group of kids who volunteered for the program but couldn’t participate because of scheduling conflicts or had to be placed on a waiting list because we had too many volunteers. There are a lot of other ways of dealing with this problem and other problems you may encounter when designing a quasi-experimental evaluation, but we’re going to leave our discussion there, and you can learn more about quasi-experimental designs on an as-needed basis when you’re working on your own evaluations.

Sometimes, you’re going to be stuck with a single-group design, like in the full coverage scenario I mentioned earlier or when you otherwise just can’t develop a strong comparison group. In that case, we do have some strategies for improving the single-group design beyond the basic X O or O1 X O2.

I bet you can learn one way just by looking at the notation. See if you can interpret this:

O1 O2 O3 O4 X O5 O6 O7 O8

As I’m sure you can figure out, here we have a panel design with multiple pretests and multiple posttests. This is called an interrupted panel design (or, if we’re observing different cases over time, an interrupted time series design—recall the difference between panel and time series designs). This way, we can have a sense of any changes that are ongoing before the program and take that into account when interpreting our outcome measures after the program. If those middle school students’ self-images were gradually improving before the program and then continued to gradually improve after the program, we’d be very cautious in attributing the changes to our program—something we may have missed if we’d done a simple before-and-after design.
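To make the logic of the interrupted design concrete, here is a small sketch with made-up average scores. It fits a straight-line trend to the pre-program observations and then checks whether the post-program observations sit above what that trend alone would have predicted:

```python
import numpy as np

# Made-up average outcome scores for O1-O4 (before the program) and O5-O8 (after)
pre = np.array([58, 60, 61, 63])
post = np.array([70, 72, 73, 75])

# Fit a straight-line trend to the pre-program observations...
time_pre = np.arange(1, 5)  # observation points 1 through 4
slope, intercept = np.polyfit(time_pre, pre, deg=1)

# ...and project that trend forward to the post-program observation points
time_post = np.arange(5, 9)  # observation points 5 through 8
projected = intercept + slope * time_post

# If the observed post-program scores sit well above the projected trend, that is
# stronger evidence that the program, and not a pre-existing trend, made the difference.
print(projected)         # what we'd expect if the old trend simply continued
print(post - projected)  # how far the observed scores jump above that trend
```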

By the way—to back up a little bit—we can have a really strong quasi-experimental design by combining the interrupted panel design and the nonequivalent comparison group design like this:

O1 O2 O3 X O4 O5 O6
O7 O8 O9 O10 O11 O12

This is called a multiple interrupted panel design or multiple interrupted time series design. Pause for a moment to make sure you understand what we’re doing here and why it would be such a strong evaluation design.

Back to improving the single-group design. We can also do something that’s a bit harder to depict with our notation: Make some outcome measures during the program itself. These are, rather inelegantly, called “during” measures, and you’ll even see these designs referred to as single-group before-during-during-during-after designs. That’s pretty awful sounding, but very descriptive, too! I’ve seen one stab at depicting this design like this:

O1 [X … O2 O3 O4 … X] O5

… with the brackets suggesting that the observations are taking place while the program is underway. If we did this with our 12-week youth development program, we could see if there were any changes in response to particular parts of the program. This design gives us the opportunity to associate changes in outcomes with specific events in the program, which gives us a lot more confidence in attributing changes in outcomes to the program than a simple before-and-after design.

One final option is a dose-response design. This design might be depicted just like the other single-group designs, but in the previous designs, we’ve treated the independent variable as a dichotomy—either the program happened or it didn’t. With a dose-response design, instead, we treat the independent variable as a continuous variable—as a program that can happen a little or a lot. In our youth development program, for example, some kids may participate in the program for 6 hours, others may participate for 10 hours, others may participate for 12 hours, and so on. We can make the most of this variation in the independent variable to determine if “more” program results in better outcomes. We’d have to make sure we’re not accidentally seeing the results of something else—like the kids’ motivation to participate—but this design can give us another opportunity to determine if changes in outcomes really can be attributed to the program, even with just a single-group design.
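Here is a quick sketch of the dose-response idea, again with made-up numbers, checking whether more hours of participation go along with higher outcome scores. A positive slope and a strong correlation are consistent with a program effect, but, as noted above, they don’t prove it.

```python
import numpy as np

# Made-up data: hours of program participation ("dose") and post-program
# self-image scores ("response") for eight participants
hours = np.array([6, 8, 10, 10, 12, 12, 14, 16])
scores = np.array([68, 70, 74, 72, 78, 76, 80, 84])

# Slope of the best-fit line: the average change in score per additional hour
slope, intercept = np.polyfit(hours, scores, deg=1)

# Correlation between dose and response
r = np.corrcoef(hours, scores)[0, 1]

print(round(slope, 2), round(r, 2))
# Something like the kids' motivation could still explain both the higher dose
# and the better outcomes, so treat this as suggestive, not conclusive.
```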

Finally, we can also take a case study approach to our program evaluation. Case studies, with their multiple sources of data and multiple data collection methods, create a very in-depth, holistic description of the program. This is an especially helpful approach if your evaluation is intended to pursue a formative purpose—the purpose of learning how to improve a program.