Appendix C: Inferential statistics

I’m including this appendix as a very general refresher on inferential statistics for students who may be a bit fuzzy on the concept. It should also help you make connections between what you learn in a statistics course and what you learn in research methods, so some of our research methods concepts are included in this review as well. We’ll also review the uses and limitations of p-values and consider two strategies for overcoming those limitations.

Inferential statistics is the branch of statistics that helps us use characteristics of a sample to estimate characteristics of a population. This contrasts with descriptive statistics, which are those statistical tools that describe the data at hand without attempting to generalize to any broader population.

We’re very familiar with examples of inferential statistics. For example, news outlets commonly report the results of public opinion surveys, like the presidential approval rating. The surveyors—like the Gallup organization or CNN—randomly select maybe 1,500 adults from across the country and ask them whether or not they approve of the president’s performance. They might learn that, say, 45% of those surveyed approve of the president’s performance. The point, though, is to estimate what percentage of all adults approve of the president, not just the 1,500 adults they talked to. They use the 45% approval rating as an estimate of all adults’ approval rating.

Let’s use this example to learn (and review) some vocabulary:

Population: The population is the entire set of cases that we want to learn about. In our example, the population is all of the country’s adults. Note, however, that the population doesn’t have to be people. We could want to learn about a population of counties, a population of Supreme Court decisions, a population of high schools, or a population of counseling sessions.

Sample: The sample is the set of cases that you actually collect data for. In our example, the sample is the 1,500 adults actually surveyed.

Statistic: Obviously, we’ve seen this term before (like at the top of this page!), but here, we’re using the term statistic in a narrower sense of the term. A statistic is a quantified characteristic of a sample. A quantified characteristic could be a mean, median, mode, frequency, standard deviation, or any number of other measures. In our example, 45% is a statistic. It’s a characteristic of the sample of 1,500 adults who were surveyed.

Parameter: A parameter is a quantified characteristic of the population. We usually don’t know the parameter—that’s why we’re collecting data from a sample. Our statistics, then, are used to estimate parameters. In our example, we don’t know the parameter we’re interested in. We don’t know the percentage of all adults who approve of the president’s performance. We just know the statistic, so we use that to estimate the parameter. We know it’s very unlikely that our statistic is exactly equal to the parameter, but it’s our best estimate. If we had taken a different sample, we would have gotten a different statistic, even though the parameter was exactly the same.

Sampling frame: The sampling frame is a list. It’s the list that we choose our sample from. Ideally, the sampling frame would include every case in the population. In our example, the ideal sampling frame would be a list of every adult in the United States and their phone numbers. Obviously, no such list exists, so the pollsters have to come up with another strategy.

Sampling strategy: The sampling strategy is the set of rules followed in selecting the sample. A very common sampling strategy is simple random sampling. In simple random sampling, every case in the population has an equal (greater than zero) probability of being selected for the sample. In our example, if we could take the name of every adult, write them on index cards, dump all the index cards in a gigantic hat, mix up the cards really well, and then draw out 1,500 cards, we would have used a simple random sampling strategy. Every case in our population (that is, every adult in our country) would have had an equal probability of being selected for our sample. We learned about other sampling strategies earlier.
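If it helps to see the “gigantic hat” in code, here’s a toy sketch of simple random sampling in Python. The population is just a range of made-up ID numbers standing in for adults; nothing here comes from a real survey.

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# A stand-in population: one made-up ID number per adult in the country.
population = range(1, 250_000_001)

# Simple random sampling: every ID has the same chance of being drawn,
# like mixing every index card in a gigantic hat and pulling out 1,500.
sample = random.sample(population, k=1500)

print(len(sample), sample[:5])
```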

Level of confidence and level of accuracy: Two terms, but we have to talk about them at the same time. When we use a statistic to estimate the corresponding parameter, we have to report how confident we are in that estimate. In our example, we might see a news report like 45% of American adults approve of the president’s performance, and then in the fine print, 95% level of confidence, ±3%. That fine print means that if we were to repeat this survey again and again and again at the same time, each time with a different random sample and an interval built around its statistic in the same way, we’d expect about 95% of those intervals to capture the true parameter. Put another way (a little loosely), we’re 95% sure that the parameter is somewhere between 42% and 48%. (This is due to the central limit theorem, which tells us that statistics, when calculated from the same population again and again and again, many, many times, will follow a normal distribution. This is amazing stuff. Order out of chaos! It’s what makes most inferential statistics work. But back to confidence and accuracy…) In that statement from the news report, 95% is (obviously) the level of confidence, and ±3% is the level of accuracy. Here’s why we can only talk about these at the same time: Using our same survey of 1,500 adults, if we want to be more confident, like 99% confident, we’d have to be less accurate in our estimate, like maybe ±10%. (Note that ±10% is less accurate than ±3% because it’s less precise—don’t be fooled by the bigger number.) So, using the same data, we might say we’re 99% confident (almost positive!) that the population’s presidential approval rating is somewhere between 35% and 55%. Not very impressive, right? It’s easy to be really, really certain about a really, really imprecise estimate.
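If you’re curious where numbers like ±3% come from, here’s a minimal Python sketch using the usual normal approximation for a proportion. The sample size and approval rate are just the illustrative figures from the example above; published polls typically report a rounded or more conservative margin, so don’t worry that this back-of-the-envelope version comes out a bit under ±3%.

```python
import math
from scipy.stats import norm

n = 1500           # sample size from the example
p_hat = 0.45       # sample statistic: 45% approve
confidence = 0.95  # level of confidence

# Normal-approximation standard error of a sample proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)

# Critical z value for the chosen confidence level (about 1.96 for 95%).
z = norm.ppf(1 - (1 - confidence) / 2)

margin = z * se
print(f"Level of accuracy: +/- {margin:.1%}")
print(f"We're 95% confident the parameter is between "
      f"{p_hat - margin:.1%} and {p_hat + margin:.1%}")
```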

We often use inferential statistics to estimate measures of relationships between variables in the population. For example, we might want to know if men and women have different average presidential approval ratings. We could look at our sample data of 1,500 adults, which might include 750 men and 750 women. We could find in our sample that 43% of the men approve of the president’s performance, but 47% of the women approve of the president’s performance. Here’s the thing: Even if men’s and women’s presidential approval ratings are exactly the same in the population, we wouldn’t expect them to be exactly the same in our sample—that would be an amazing coincidence. We’re interested in knowing whether the difference between men and women in our sample reflects a real difference in the population. To do that, we’ll conduct what’s called hypothesis testing.

We’ll imagine that there is no relationship between our two variables—gender and presidential approval—in the population. We’re just imagining that—we don’t know (that’s why we’re collecting and analyzing data!). We’ll then consider our sample data—men’s 43% approval rating and women’s 47% approval rating—and ask, What’s the probability that we would see that big of a difference between the men and women in our sample if there’s really no difference between men and women in the whole population? Put another way, we’re asking What’s the probability that we’re observing this relationship between the two variables (gender and presidential approval) if there’s really no relationship in the population? If that probability is really low, we’ll reject the idea that there’s no relationship and conclude that there most likely is a relationship between the variables in the population. If that probability isn’t low enough to satisfy us, we’ll say we don’t have evidence to reject that idea, so we’ll assume there’s no relationship between the variables in the population until we get evidence that there is. Our initial assumption that there is no relationship between the variables in the population is called the null hypothesis.

The idea that there is a relationship between the variables in the population is called the alternative hypothesis. It’s the alternative to the null hypothesis suggested by our sample data that we’re interested in testing. It’s sometimes called the research hypothesis.

Statistics is a very cautious field, so we tend to require a high standard of evidence before we reject the null hypothesis and accept the alternative hypothesis. Most often, we’ll reject the null hypothesis and “believe” our sample data if there’s no more than a 5% chance that we’re rejecting the null hypothesis when we really shouldn’t. Put another way, we’ll reject the null hypothesis if there’s no more than a 5% chance of getting results like ours when there’s really no relationship in the population. Sometimes, people will use a 1% or 10% standard, but it’s always a pretty conservative standard so that we’re very confident in the conclusions we draw about the population from our sample.
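To make that concrete, here’s a rough Python sketch of such a test using made-up counts that match the men-versus-women approval example above (323 of 750 men and 353 of 750 women, roughly 43% and 47%). It uses a two-proportion z-test, which is one common way to test a difference between two proportions.

```python
import math
from scipy.stats import norm

# Made-up counts matching the example: about 43% of men and 47% of women approve.
men_approve, n_men = 323, 750
women_approve, n_women = 353, 750

p_men = men_approve / n_men
p_women = women_approve / n_women

# Under the null hypothesis both groups share one approval rate,
# so we pool the samples to estimate it.
p_pooled = (men_approve + women_approve) / (n_men + n_women)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_men + 1 / n_women))

# How many standard errors apart are the two sample percentages?
z = (p_women - p_men) / se

# Two-sided p-value: the probability of a gap at least this big if the null is true.
p_value = 2 * norm.sf(abs(z))

print(f"difference = {p_women - p_men:.3f}, z = {z:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis at the 5% standard.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```

With these particular made-up numbers the p-value comes out around 0.12, well above the 5% standard, so we would not reject the null hypothesis; a four-point gap between groups of 750 could plausibly be a fluke of sampling.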

Even with such a high standard for evidence, though, there’s still a chance that our conclusions are wrong. That’s the risk we take if we want to use sample data to draw conclusions about the whole population. If we reject the null hypothesis when we shouldn’t have, we’ve committed what’s called a Type I error. If we fail to reject the null hypothesis when we should have, we’ve committed a Type II error. In other words, if we conclude from our sample data that there really is a relationship between our variables in the population when there really isn’t, we’ve committed a Type I error; if we conclude from our sample data that there is no relationship between our variables in the population when there really is, we’ve committed a Type II error.
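Here’s a small simulated illustration of why the 5% standard and Type I errors go hand in hand. Everything below is invented: we simulate a world where the null hypothesis really is true, run many tests, and count how often we reject it anyway.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Simulate a world where the null hypothesis is true: both "groups" are drawn
# from exactly the same population, so any difference we observe is pure chance.
trials = 2000
false_alarms = 0
for _ in range(trials):
    group_a = rng.normal(50, 10, size=100)
    group_b = rng.normal(50, 10, size=100)
    if ttest_ind(group_a, group_b).pvalue < 0.05:
        false_alarms += 1  # we rejected a true null hypothesis: a Type I error

print(f"Type I error rate: {false_alarms / trials:.1%}")  # lands near 5%
```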

If you back up two paragraphs, you may notice that I didn’t use the term p-value, but if you recently took a statistics course, I’m sure it rings a bell. The precise meaning of this term is almost comically debated among statisticians and methodologists. I’m not willing to enter the fray, so I’m going to totally cop out and quote Wikipedia. (Please don’t tell your professor.) Here you go:

“In statistical hypothesis testing, the p-value or probability value is the probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is correct. A very small p-value means that the observed outcome is possible but not very likely under the null hypothesis, even under the best explanation which is possible under that hypothesis. Reporting p-values of statistical tests is common practice in academic publications of many quantitative fields. Since the precise meaning of p-value is hard to grasp, misuse is widespread and has been a major topic in metascience.” (https://en.wikipedia.org/wiki/P-value, retrieved July 10, 2020)

That’s a good definition. The debate has to do with how we tend to forget what, exactly, we’re comparing the observed outcome (the result of our statistical analysis) to. Honestly, the important thing to remember is that a very small p-value—again, less than 0.05 is a common convention—means the results we get from our statistical analysis probably represent a “real” relationship in the population, not just a fluke of our data analysis. I’m going to leave it at that, but if you want to have some fun, delve into the debates over p-value interpretation.

I do, however, want to make the case that p-values are important, but insufficient, for drawing conclusions from our statistical analysis. This is emphasized in introductory statistics courses much more often than it used to be, but I’ll take the opportunity to make the point here just in case you haven’t encountered it before.

Let’s start by considering this question: Why aren’t p-values enough? We use p-values as a measure of statistical significance—a measure of how likely or unlikely it would be to get the results we got (like from a correlation or a t-test) if, in fact, there were no relationship or difference—whatever we’re testing for—at all (and if all the real data in the population look like what we assume they look like, such as being normally distributed). (That convoluted last sentence gives you a sense of what the p-value interpretation debates are about!) P-values let us draw conclusions like It’s really likely that our finding is a fluke; we’d probably get a totally different result with a different sample and Our finding is almost definitely not a fluke; it almost definitely represents a real relationship in the population. Note two things: (1) These conclusions don’t say how strong the relationship is, just that it’s flukey or not, and (2) p-values are extremely sensitive to sample size. It’s easy to get a statistically significant finding for a really weak relationship if we have a big enough sample. P-values are important but insufficient.
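Here’s a small simulated illustration of point (2), using made-up numbers: two groups whose true means differ by a trivially small amount still produce a tiny p-value once the samples get huge.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Two populations whose true means differ by a trivially small amount
# (0.02 standard deviations), sampled at two very different sample sizes.
small_a = rng.normal(0.00, 1.0, size=100)
small_b = rng.normal(0.02, 1.0, size=100)
large_a = rng.normal(0.00, 1.0, size=200_000)
large_b = rng.normal(0.02, 1.0, size=200_000)

print("n = 100 per group:     p =", ttest_ind(small_a, small_b).pvalue)
print("n = 200,000 per group: p =", ttest_ind(large_a, large_b).pvalue)
```

With samples that large, the second p-value comes out far below 0.05 nearly regardless of the seed, even though a difference of 0.02 standard deviations is negligible by any practical standard. That’s exactly why a p-value alone can’t tell you whether a finding matters.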

We need to do additional analysis, then. I’ll commend two tools to you: emphasizing confidence intervals and calculating effect sizes.

We’ve already learned about confidence intervals when we talked about the degree of accuracy in the section about sampling and then again up above. A point statistic alone can connote an unwarranted degree of precision. It’s more honest to report confidence intervals whenever we can—to say, for example, that we’re 95% sure the average weekly hours spent studying in the population of students is between 10 and 20 rather than just reporting the point statistic, a mean of 15 hours.
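Here’s what that looks like as a quick Python sketch, using the t distribution. The sample size and standard deviation are invented so the interval comes out near the 10-to-20-hour range in the example.

```python
import math
from scipy.stats import t

# Invented summary numbers, loosely matching the studying-hours example.
n = 25        # students in the sample
mean = 15.0   # sample mean: 15 hours per week
sd = 12.0     # sample standard deviation

se = sd / math.sqrt(n)            # standard error of the mean
t_crit = t.ppf(0.975, df=n - 1)   # critical t value for a 95% interval
low, high = mean - t_crit * se, mean + t_crit * se

print(f"95% confidence interval: {low:.1f} to {high:.1f} hours")
```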

Effect sizes may be new to you, so we’ll spend more time here. Effect sizes are what they sound like—a way to gauge “how big” the effect of an independent variable is on a dependent variable. They can also be a way to gauge the strength of non-causal relationships. There are many measures of effect size: different statistical tests call for different ones, and most tests offer several for you to choose from.

We’re going to learn about effect sizes in general by learning about one specifically: Cohen’s d. This is a very widely used measure that gives us an effect size when we’re comparing the means of two groups or the means of the same group in before-and-after measures. Sound familiar? If you’ve already taken a statistics course, this should call to mind t-tests, and, yes, Cohen’s d is often coupled with t-tests. The t-test gives us the p-value, our measure of statistical significance, and Cohen’s d gives us the rest of the information we want, the effect size.

There are a couple of variations of Cohen’s d. We’re going to use the simplest and most widely used version. It uses standard deviation as a measuring stick; you can interpret Cohen’s d as the number of standard deviations of difference between two means. (If you haven’t taken a statistics course yet, just keep skimming for the general idea and come back here once you’ve taken that course.) Since we’re calculating means from two groups, we’re faced with the question of which group’s standard deviation to use. We dodge the question by combining both groups’ spread into a single measure, called the pooled standard deviation. The formula for Cohen’s d is:

[(group 1 mean) – (group 2 mean)]/(pooled standard deviation)

That’s just the difference in the two groups’ means divided by the standard deviation pooled across both groups.

Which group should be group 2 and which should be group 1? If we’re doing a before-and-after analysis, you’d want to subtract the “before” group’s mean from the “after” group’s mean so that increases in measures from before to after would yield positive effect sizes (and decreases would yield negative effect sizes—yes, that’s a thing). You could think of that effect size formula as:

[(the “after” group’s mean) – (the “before” group’s mean)]/(pooled standard deviation)

If you’re comparing two groups’ means on a dependent variable to determine the effect of an independent variable, you need to consider what value of the independent variable you want to know the effect of. If you were evaluating the effect of a program with an experimental design, you would deliver the program to one group of people and not deliver the program to a second group of people. Recall, these groups are called the experimental group and control group, respectively. Your independent variable could be called whether or not someone participated in the program (a little wordy, but clear enough!), and your dependent variable would be your measure of the program’s effectiveness. In this situation, you’d want to subtract the control group’s DV mean from the experimental group’s DV mean so that if the program has a positive effect, the effect size is positive (and if the program has a negative effect, the effect size is negative). You could think of that effect size formula as:

[(the experimental group’s mean) – (the control group’s mean)]/(pooled standard deviation)
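Here’s a minimal Python sketch that pulls the whole calculation together, using the conventional pooled standard deviation (a weighted average of the two groups’ variances); the test scores are made up for illustration.

```python
import math

def cohens_d(group_1, group_2):
    """Cohen's d: difference in means divided by a pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    mean1 = sum(group_1) / n1
    mean2 = sum(group_2) / n2

    # Sample variances (dividing by n - 1).
    var1 = sum((x - mean1) ** 2 for x in group_1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in group_2) / (n2 - 1)

    # One common pooled standard deviation: a weighted average of the variances.
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

    return (mean1 - mean2) / pooled_sd

# Made-up test scores for an experimental group and a control group.
experimental = [82, 75, 90, 68, 88, 79, 85, 73]
control      = [70, 65, 80, 62, 74, 71, 77, 69]

print(f"Cohen's d = {cohens_d(experimental, control):.2f}")
```

With these made-up scores the function returns a d of about 1.3, which the rules of thumb discussed below would call a large effect.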

Here’s an example: Let’s say we want to measure the effectiveness of a math tutoring program. We do this by giving a group of students a math test, then we enroll that same group of students in the math tutoring program for 12 weeks, and then we give that same group of students the math test again. Here’s the data we gather:

Mean score on the math test before the tutoring program: 68
Mean score on the math test after the tutoring program: 84
Standard deviation of all the tests (before and after): 19

We’ll use the formula we looked at above for the before-and-after scenario and plug in those numbers:

(84 – 68) / 19
= 0.84

Our effect size, as measured by Cohen’s d, then, is 0.84.

Here’s another example: Let’s say we’re going to measure the effectiveness of that math tutoring program, but we’re going to do that by randomly assigning one group of students to participate in the program and another group to not participate in the program. (We randomly assign them so that the two groups are as similar to each other as possible, except one is participating in our program, but the other isn’t. That way, if there’s a difference in the two groups’ math test scores, we can confidently attribute that difference to the program instead of something else, like the students’ motivation or knowledge.) We enroll the first group (the experimental group) in the tutoring program for 12 weeks. We let the second group (the control group) just go about doing whatever they would have done anyway. At the end of the tutoring program, we give both groups a math test. Here’s the data we gather:

Experimental group’s mean score on the math test: 80
Control group’s mean score on the math test: 72
Standard deviation calculated based on all the math tests: 18

We’ll plug those numbers into our formula for Cohen’s d:

(80 – 72) / 18
= 0.44

Our effect size, as measured by Cohen’s d, then, is 0.44.
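Since both examples hand us the means and the pooled standard deviation directly, the calculation is a one-liner; here’s the same arithmetic in Python as a quick check.

```python
def cohens_d_from_summaries(mean_1, mean_2, pooled_sd):
    """Cohen's d when the means and pooled standard deviation are already known."""
    return (mean_1 - mean_2) / pooled_sd

# Before-and-after tutoring example: after mean 84, before mean 68, pooled SD 19.
print(round(cohens_d_from_summaries(84, 68, 19), 2))  # 0.84

# Experimental-versus-control example: means 80 and 72, pooled SD 18.
print(round(cohens_d_from_summaries(80, 72, 18), 2))  # 0.44
```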

Cohen (the guy who came up with this measure) suggested some rules of thumb for interpreting effect sizes:

d = 0.2 is a small effect

d = 0.5 is a medium effect

d = 0.8 is a large effect

Cohen himself, though, emphasized that these are just rough guidelines and that we would be better off comparing the effect sizes we obtain to what other, similar studies find, to get an idea of the range of typical values and what might be considered “small” or “large” in the context of those similar studies. Really, though, most people just kind of blindly apply the rules of thumb.

Notice one other benefit of Cohen’s d: We could compare evaluations of the same program that use different measures of effectiveness. For example, we could compare findings of a 2001 evaluation of The Math Tutoring Program that used the Fraser Test of Math Ability as the effectiveness measure to a 2010 evaluation of The Math Tutoring Program that used the Wendell Math Aptitude Test as its measure of effectiveness by comparing their Cohen’s d statistics. This has become a very common and fruitful application of Cohen’s d and similar effect size statistics.