## Standard Error and Confidence Interval (30-45 min)

Some of you are working with data that is not a sample but a population. That is, you have all the data from all the things. For instance, I believe Joe is working with information about all the courses in American Ninja Warrior–that is a population; nothing is left out of it. Others are working with samples, but those may not be very representative of the population (e.g., not random data). Or, they have a population, but only of the slice they are looking at (e.g., 10 years of climate data is not *all* climate data; one year of hockey player statistics is not *all* hockey player statistics, or even the last 5 years).

Today and over the next few classes, we are talking about inferential statistics. These are statistics that you can use on a representative sample of a population to make inferences about that population. There are a lot of techniques for doing this, and we will look at two basic ones. After that, we will look at correlation statistics. The goal here is not to teach you how to do these things perfectly, or to suggest they will always be relevant for writing with data (they won’t be! even for this class, since not everyone has data where inferential stats would apply). The point is the following: **how do we write about advanced calculations in accessible but still persuasive ways when making arguments or narratives about our data?**

Probability is the foundation of the ways in which data analysts can make inferences about the topics they write about. Those who haven’t taken stats courses (and even those who have!) might not realize how much *theoretical* knowledge is crucial for making inferences from samples. The idea of the **sampling distribution** is crucial to creating the **standard error**, which helps us calculate how likely it is that we got our results by chance, or how much possibility of error we should bake into our results given our sample. In terms of rhetoric, this helps us make arguments or tell stories about our analysis in more or less confident terms.

I need six volunteers who are reasonably flexible. Imagine that each person represents 100 (the number does not really matter, just “a lot” is meant here) random samples (n=50) of college students graduating in the U.S. this year. Each sample will have a different mean GPA.

Because of the Central Limit Theorem, we know that as you take many samples (and especially if those samples are large, at least 30), the distribution of means for those samples will look approximately normal. That is, bell-shaped. It would look like how our volunteers are arranged right now.

No one ever does this. No one ever takes this many random samples of anything. It is too expensive. But this theoretical idea helps practically in terms of inference because the sample you *do have* would fit *somewhere* into this sampling distribution. So, we can guess how likely it is or is not that we got our results by chance, or how wide an interval we should give for the estimate we have.
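The volunteer demo above can be sketched in code. This is a minimal simulation with a made-up population of GPAs (all numbers here are invented for illustration; nothing about our volunteers’ actual data is assumed):

```python
import random
import statistics

random.seed(1)

# A made-up population of 10,000 GPAs, spread evenly between 2.0 and 4.0 --
# deliberately NOT bell-shaped.
population = [round(random.uniform(2.0, 4.0), 2) for _ in range(10_000)]

# Draw many random samples of n=50 (here 600 of them, standing in for
# our 6 volunteers x 100 samples each) and record each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, 50))
    for _ in range(600)
]

# Even though the population is flat, the sample means pile up near the
# population mean -- the sampling distribution looks roughly normal.
print(statistics.mean(population))    # population mean, around 3.0
print(statistics.mean(sample_means))  # very close to the population mean
print(statistics.stdev(sample_means)) # much smaller spread: the "standard error" idea
```

Plotting `sample_means` as a histogram would show the bell shape the Central Limit Theorem predicts.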

One way to do that is by using a **confidence interval**; another way of expressing this is the **margin of error**. In order to calculate the confidence interval, we need to figure out the standard error.

The formula for the standard error divides the standard deviation of the population by the square root of the sample size. Because the sampling distribution is nearly always theoretical, we use the **estimated standard error** in place of the actual standard error. That is, instead of the population standard deviation, we assume that the standard deviation of the sample will suffice.

Another way to think about standard error is how it is really just the standard deviation of the sampling distribution. That is, it just gives us a sense of how varied the different means (or whatever calculation) are for each drawn sample in the sampling distribution.
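As a quick sketch of the estimated standard error described above (the sample values here are made up for illustration):

```python
import math
import statistics

# A small made-up sample of GPAs.
sample = [3.1, 2.8, 3.5, 3.9, 2.4, 3.0, 3.6, 2.9, 3.3, 3.2]

n = len(sample)
s = statistics.stdev(sample)       # sample standard deviation
standard_error = s / math.sqrt(n)  # estimated standard error: s / sqrt(n)

print(round(standard_error, 3))    # about 0.137 for these values
```

The larger `n` gets, the smaller this number becomes, which is why bigger samples give tighter intervals.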

Once we have the standard error, we can construct a confidence interval. Most people use either **95% confidence** or **99% confidence** when reporting the confidence interval.

Think back to the number of standard deviations in a normal distribution. There are usually three to remember: about 68% of all values are within one standard deviation, about 95% are within two standard deviations, and about 99.7% are within three standard deviations…with everything else on the outskirts.

So, a 95% confidence interval says that if you kept drawing samples and building an interval this way each time, about 95 out of 100 of those intervals would contain the true population mean. Therefore, the true population mean is probably in this range. But this is all an estimate because we base the calculation only on the sample (so there can be error there), and the interval may not be accurate since we don’t actually have the real sampling distribution. There is a great image of what a confidence interval looks like in this article.

A confidence interval generally looks like this for the min and max of the interval:

- lower bound = point estimate (e.g., mean of your sample) **–** [test statistic (e.g., z, t) **×** standard error]
- upper bound = point estimate (e.g., mean of your sample) **+** [test statistic (e.g., z, t) **×** standard error]

“z” is a test statistic corresponding either to the 95% level (i.e., about two standard deviations away from the mean) or the 99% level (i.e., about three standard deviations), based on the standardized normal distribution. The actual numbers here are 1.96 and 2.58, respectively. This would be used if you have categorical data (e.g., the proportion of votes for a political candidate relative to other candidates). This is pretty straightforward and can be done by hand.
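The by-hand z-interval might look like this for a hypothetical poll (the counts below are invented for illustration):

```python
import math

# Hypothetical poll: say 520 of 1,000 respondents favor a candidate.
n = 1000
p_hat = 520 / n                          # point estimate: the sample proportion, 0.52

# Standard error for a proportion: sqrt(p(1-p)/n)
se = math.sqrt(p_hat * (1 - p_hat) / n)

z = 1.96                                 # z for 95% confidence
lower = p_hat - z * se
upper = p_hat + z * se

print(round(lower, 3), round(upper, 3))  # 0.489 0.551
```

So for this made-up poll we would report something like “52%, with a margin of error of about ±3 percentage points.”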

“t” is a test statistic corresponding to the 95% or 99% confidence levels for the “t-distribution,” which is a bell curve with fatter tails **that also varies depending on the sample size.** In other words, it produces more conservative estimates, which is great for smaller sample sizes (i.e., less than 30) or for quantitative data generally, to help account for the issues of working from an estimated standard error. This is a little harder and should be done with software (though it can also be done by hand by looking up the t-score on a t-table).
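A t-interval sketch for a small made-up quantitative sample; here the t critical value is looked up on a t-table rather than computed by software (for 95% confidence with n−1 = 9 degrees of freedom, it is about 2.262):

```python
import math
import statistics

# Made-up sample of 10 weights (Lbs).
sample = [148.2, 151.0, 155.4, 149.8, 160.1, 152.3, 147.9, 158.6, 150.5, 153.0]

n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error

# t critical value from a t-table: 95% confidence, 9 degrees of freedom.
t = 2.262

lower = mean - t * se
upper = mean + t * se
print(round(lower, 1), round(upper, 1))
```

Notice that 2.262 is larger than the z value of 1.96, so the interval comes out wider: that is the more conservative estimate at work.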

In public writing, we more often see the margin of error (the ± half-width of the interval) rather than the full confidence interval.

THINGS TO KEEP IN MIND:

- If your dataset is a population (that is, all values are present), inferential statistics are not useful. You already know what the true mean (or true anything) is (though, you should consider distribution of your data and other contextual factors for how to interpret).
- If your data were not randomly collected, inferential statistics really should not be used. See here for what qualifies as random data. When finding a dataset in the wild that is a sample and not a population, you need to do some contextualizing to reason out whether it is random or not, and how suitable it might be for any inferential statistics you might want to perform.
- Since the standard error is *estimated*, the confidence interval (or any statistic) is always an estimate and can never be definite. However, because of the Central Limit Theorem, the higher the sample size, the better your chance of a good estimate (and, thus, why you get a more precise confidence interval with a larger sample size).
- When calculating, *never* estimate. Use the full number. When writing, that is the place to round.
- Especially for public polling data: this often captures *one* point in time. Things change, so you can’t read too much into how representative it is. The same goes for how representative the sample is.

Let’s try this out. Download “Jupyter Notebook 2-confidence interval” and save it to your computer (somewhere you can find it). If you are using Google Colab instead of JN, I emailed you the Colab version.

Open “Jupyter Notebook” on your computer and upload this notebook there (see instructions from August 29 for getting things up there, including the csv file, and opening it up in JN).

Then, go to CourseWeb>Course Documents>Confidence Interval Materials. Download the csv file of randomized health data for the class example today. Save it to your computer. Upload it to the JN in the browser (again, see August 29 instructions).

Compute the confidence interval for the column “Weight (Lbs)” by following the instructions where the code is. Let’s just see if we get it.
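The notebook has its own instructions, but the calculation might look roughly like this. The weights below are stand-in values I made up so the sketch runs on its own (the real class csv will differ), and the notebook’s actual code may take a different approach:

```python
import pandas as pd
from scipy import stats

# Stand-in data: the class csv has a "Weight (Lbs)" column; these values
# are invented so the example is self-contained.
df = pd.DataFrame(
    {"Weight (Lbs)": [150, 162, 175, 143, 158, 181, 166, 149, 172, 160]}
)

weights = df["Weight (Lbs)"]
n = len(weights)
mean = weights.mean()
se = stats.sem(weights)  # estimated standard error

# 95% t-interval; scipy looks up the t critical value for us.
lower, upper = stats.t.interval(0.95, n - 1, loc=mean, scale=se)
print(round(lower, 1), round(upper, 1))
```

With the real file you would replace the made-up DataFrame with something like `pd.read_csv(...)` on the csv you uploaded.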

Then, let’s write about it. How would you write to explain what this means accessibly but without saying too much?

Here is an example of how confidence intervals are communicated to audiences that are familiar with the concept, in more technical circles. What do you notice?

Okay, let’s see some examples of this in the wild:

Example 4 (this is the first page, click link at bottom for next page on ‘methods’)

How are confidence intervals expressed here? How are they written about for a public audience (with perhaps the fourth example written for a public audience though more technically inclined)? What has changed mathematically, rhetorically, or both?

## Revision Plan (10-15 min)

Based on what we talked about in terms of aligning your piece for a specific publication or organization you are representing, elements of design and accessibility, and the style elements, let’s come up with a plan for revision.

Based on the following, let’s talk about the revision plan for your public writing draft:

- aligning your piece for a specific publication or organization you are representing
- elements of design and accessibility
- style elements for emphasis and readability
- “big picture” elements drawn from comments from me and your peers (e.g., contextualizing your data, what you can actually say about the data you have, how the piece is organized in terms of argument or narrative, the quality of the argument or narrative you are constructing)
- new possible directions to go in (more granular analysis? more secondary sources? a different direction for the argument or narrative?)

Go to CourseWeb and download the “revision plan.” We will work on this for the rest of class and continue over the next couple of classes as you work on revising.

## Next Time (5 min)

-Review reading for today. It was most relevant for thinking about causality and statistical significance. We will talk more about that next class and the 2-3 classes after that one.

-Complete Journal 4. Notice that I rearranged the schedule so the journal will appear differently than it did before today. You will need to download the CSV file about bear data. Choose any column you wish.

-Get that revision plan in action