10-17-2019 WWD Lesson Plan

Correlation Approaches (20-30 min)

  1. For categorical variables, one common approach is the contingency table, or “cross-tabulation,” which lines up the categories of one variable against the categories of another and counts how many cases fall into each combination. For instance, you might want to see whether there is an association between gender identity and political party affiliation (e.g., are more women than men independents?). If the sample is random/representative, one common way to test whether an association is likely to exist in the population is the chi-squared test (we won’t go over this, but if anyone is working with categorical variables from a random sample, reach out to me and we can get you up to speed). There are also several ways to measure the strength of that association.
  2. For quantitative variables, one common approach is the Pearson correlation coefficient. This statistic tells you how strong an association is in your sample, and if you have a random/representative sample, you can also use a t-test to test the significance of the correlation. For instance: number of posts and number of followers; minutes spent doing cardio and heart rate.
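
To make the cross-tabulation idea from item 1 concrete, here is a minimal sketch in Python. The survey responses are made up purely for illustration (they are not real data), and the helper name row_share is just a hypothetical label. It counts each gender/party combination and then shows each count as a share of its row, which is one simple way to compare groups of different sizes.

```python
from collections import Counter

# Made-up survey responses for illustration only:
# each record is (gender identity, party affiliation)
responses = (
    [("woman", "Democrat")] * 30
    + [("woman", "Republican")] * 20
    + [("woman", "Independent")] * 10
    + [("man", "Democrat")] * 25
    + [("man", "Republican")] * 30
    + [("man", "Independent")] * 5
)

# The contingency table: one count per combination of the two variables
table = Counter(responses)

genders = ["woman", "man"]
parties = ["Democrat", "Republican", "Independent"]


def row_share(gender, party):
    """Proportion of respondents of this gender who chose this party."""
    row_total = sum(table[(gender, p)] for p in parties)
    return table[(gender, party)] / row_total


# Print the table with raw counts and row percentages
print("gender".ljust(8) + "".join(p.rjust(18) for p in parties))
for g in genders:
    cells = "".join(
        f"{table[(g, p)]:>4} ({row_share(g, p):6.1%})".rjust(18) for p in parties
    )
    print(g.ljust(8) + cells)
```

In this invented sample, a larger share of women than men are independents. Whether a gap like that is likely to hold in the population is what the chi-squared test would address.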

Pearson correlation coefficient

This statistic essentially says how closely two variables follow a straight line, and in which direction. Positive correlation would be “as x increases, y increases” whereas negative correlation would be “as x increases, y decreases.”

The formula for this statistic is the covariance of x and y divided by the product of the standard deviations of x and y: r = cov(x, y) / (s_x * s_y). To get the covariance, take each datapoint’s difference from the mean of x, multiply it by that same datapoint’s difference from the mean of y, add up those products, and divide by n - 1. The numerator can be positive or negative while the denominator can only be positive, so the sign of the result tells you the direction of the correlation. And because both numerator and denominator are built from the same differences between datapoints and their means, dividing one by the other rescales the result into a fixed range.
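
As a rough sketch of that arithmetic, the function below computes the coefficient step by step: the covariance (the sum of each pair’s product of deviations from the two means, divided by n - 1) over the product of the two sample standard deviations. The function name pearson_r and the five data points are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance: sum of paired deviation products, divided by n - 1
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Sample standard deviations of x and y
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov_xy / (sd_x * sd_y)


# Made-up values for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(f"r = {pearson_r(x, y):.3f}")
```

Because the n - 1 appears in both the numerator and the denominator, it cancels out. That is part of why the result always lands in the same range no matter how many datapoints you have.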

The formula produces a value between -1 and 1.

  • 0 would represent no correlation
  • 1 would represent a perfect positive correlation
  • -1 would represent a perfect negative correlation

Here is the gist of how most people interpret these results:

  • Between -0.3 and 0.3 is generally considered a weak correlation (with 0, obviously, showing no correlation).
  • Around -0.4/-0.5 or 0.4/0.5 is a decent correlation.
  • Beyond that (below -0.5 or above 0.5) is pretty strong, with -0.8/0.8 and beyond very strong.
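
If it helps to see those rules of thumb spelled out, here is a small sketch that turns a coefficient into a rough verbal label. The function name describe_strength is hypothetical, and the exact cutoffs (especially where to put values between 0.3 and 0.4) are my assumption, since these conventions are loose and vary by field.

```python
def describe_strength(r):
    """Rough verbal label for a Pearson coefficient (cutoffs are judgment calls)."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must be between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign only gives direction
    if size <= 0.3:
        return "weak"
    if size <= 0.5:
        return "moderate"
    if size < 0.8:
        return "strong"
    return "very strong"


for value in (0.1, -0.45, 0.7, -0.9):
    print(value, describe_strength(value))
```

Treat a label like this as a starting point for your writing, not a substitute for looking at the scatterplot.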

This page has a great display of how this would look visually in terms of scatterplots. (NOTE: Always look at the scatterplots! More on this later.)

Hypothesis test for Pearson correlation coefficient

You can run a t-test for this statistic, and it functions in much the same way as the test comparing two means (confidence intervals, though, are a little trickier). It is the same kind of idea: if the true correlation were zero, how likely would we be to draw a random sample with a correlation at least this strong? If unlikely, then there probably is some correlation. If decently likely, then we can’t rule out that the true correlation is zero, so we should not assume our sample’s correlation is close to the true one.

This is based on the estimated standard error calculated from our sample and on the test statistic (i.e., how many standard errors away from zero our correlation coefficient would be if zero were at the center of the sampling distribution).
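
Here is a minimal sketch of that calculation, using the standard textbook formulas for testing whether a correlation differs from zero: the estimated standard error is sqrt((1 - r^2) / (n - 2)), and the test statistic is r divided by that standard error. The function name and the example values of r and n are made up for illustration.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """Return (t, standard_error) for testing a sample correlation against zero.

    r is the sample Pearson coefficient and n is the number of paired datapoints.
    """
    if n < 3:
        raise ValueError("need at least 3 pairs of datapoints")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    standard_error = sqrt((1 - r ** 2) / (n - 2))
    t = r / standard_error  # how many standard errors r sits away from zero
    return t, standard_error


# Made-up example: r = 0.5 from a sample of 27 pairs
t, se = correlation_t_statistic(0.5, 27)
print(f"standard error = {se:.3f}, t = {t:.3f}, degrees of freedom = {27 - 2}")
```

The p-value then comes from comparing t to a t distribution with n - 2 degrees of freedom. In practice you would let software do that step (for example, scipy.stats.pearsonr reports the coefficient and the p-value together).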

Assumptions:

  1. The data are approximately normally distributed, i.e., they look like a bell curve (for at least one of the variables). If you want to measure correlation and your data are not normal, there is an alternative; let me know.
  2. You are using quantitative data (and, to see most clearly, continuous data).
  3. The data are randomly sampled or at least representative of the population.
  4. There are no outliers (or, at least, there are very few of them relative to the other data, they are not very different from the rest of the data, etc.). Anything that relies on the mean, as this statistic does, will usually have an issue with outliers.

Another thing to watch out for, which is why it is always important to look at the scatterplot, is when you get curvilinear data.

Question: What do you think is the correlation between caloric intake at breakfast of a student and their test score?

The relationship might look like this: the scatterplot shows a real pattern, but because it does not follow a straight line in either direction, the Pearson correlation coefficient is not a good way to measure it.

Let’s try it out

Open up the bear data again from CourseWeb. I’m going to split you up into five groups to try to measure the correlation between the following:

  1. Height and weight
  2. Head length and width
  3. Chest and weight
  4. Neck circumference and head width
  5. Height and head length

Steps for the activity:

  1. Open the bear data file (make sure it is the original, so it might be safer to get it from CourseWeb again) and save it to your computer.
  2. Open Jupyter Notebook
  3. Download Jupyter Notebook 4 from CourseWeb
  4. Save it to your computer rather than opening it
  5. Upload JN4 in Jupyter Notebook in your browser
  6. Upload the bear data if not there already
  7. Think about the assumptions that need to be met. We already know some (e.g., we know the data come from a random sample), but you’ll have to look at histograms and scatterplots to confirm some others.
  8. Calculate the correlation coefficient and the p-value.
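
For step 8, here is a rough sketch of what the calculation could look like in a notebook cell, assuming SciPy is installed. The two lists are made-up stand-ins, not the actual bear data, and the variable names are hypothetical; in JN4 you would use the columns from the bear data file instead.

```python
from scipy import stats

# Made-up stand-in values for illustration only (NOT the actual bear data)
heights = [40, 45, 50, 55, 60, 65, 70, 75]
weights = [60, 90, 110, 160, 200, 230, 310, 350]

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_value = stats.pearsonr(heights, weights)

print(f"r = {r:.3f}")
print(f"p-value = {p_value:.5f}")
```

The p-value here is for the test that the true correlation is zero, so it only means something if the assumptions from step 7 hold.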

What did you all get?

Writing About Correlation (20-30 min)

Okay, now we have something. So how do we write about it? We should consider causality and thus our word choice.

Miller notes a few things to consider in chapter 3:

  1. What came first? If not sure, what is worth saying? If you can say one came before the other, are you sure there is a causal relationship? How do you know?
  2. Confounding variables: is it possible that a third (or more) variable explains the correlation? E.g., white hair and death are correlated because a third variable, age, explains both. Or, some variables are so hard to separate from one another that it is hard to single out any one of them as the cause.
  3. Sampling or measurement issues. Was the sample representative of the population? Did you use the right measure? Did social norms influence how people responded to questions? Get to know your data and how they were created!

Since the 1964 Surgeon General’s report on smoking and health, and A.B. Hill’s article a year later laying out nine criteria for assessing causality, most statistical assessments of causality have relied on criteria like these, roughly summarized by Miller:

  1. Consistency of association. Several different studies, populations, and study designs have produced similar results.
  2. Strength of association. There is a large difference in the outcome between cases where the predictor is present and cases where it is not.
  3. Temporal relationship. We can clearly tell that the cause precedes the effect.
  4. Mechanism. Based on what we know about this topic, we can confidently assert that this would make sense (i.e., other knowledge we possess does not contradict the findings we have).

Also: There are techniques to “control” for other variables that help figure out how connected any two variables are, but we don’t have the time to get into much more detail.

Writing considerations:

  • Avoid causal language unless you are leaning on other experts who can help you with the causality criteria. Even then, use what you know to check any expert’s evaluation: if they are making a claim of causality, make sure they have a good reason for doing so.
  • If you can’t establish causality (which is very typical!), try to restrict your evaluation to the sample you have and what the effects are, while qualifying possible issues. Pages 43-44 in Miller help show the range of this approach. Watch out for verbs or nouns that explicitly state causality (e.g., this caused, the reason this happened) or that might imply causality (e.g., affected, increased).
  • For less technical pieces, it is absolutely fine to speculate, but even here, stick to avoiding evaluating anything as causal.
  • For more technical pieces, speculation is also fine as long as you are qualifying by bringing up limitations with your analysis and/or data, a need for more research (and what kind), and referring to other research that has explored the topic.
  • Use a combination of caution and your knowledge about the world. No one ever really knows “for sure” what causes what; they have to estimate, even with the most certain of things. Sometimes people are too cautious, as was the case with those who criticized research about the impact of smoking on lung cancer. Statistics are arguments; they can rarely prove anything concretely. Thus, there are stronger arguments, weaker arguments, and falsehoods rather than a pure binary truth/falsehood determination.

Activity: Okay, so try to write about the correlation you found (or did not find) in the bear data. How would you write about it in a way that is ethical, interesting, persuasive, etc. for both a lay audience (e.g., a popular science news article about bears, an informational pamphlet at a zoo that houses bears) and a more technical audience (e.g., scientists who study bears)? Think back to everything we have talked about in regard to telling a story or making an argument with data as you work on this as a group. Come up with a two-sentence description for each audience.

Proposal for Next Project (20 min)

Let’s spend some time in class working on your proposal. Go to last Thursday’s lesson plan under “Next Project” to think about questions related to the proposal. Next class we will talk more about genres of writing for this assignment, but for now, think more about something you’d genuinely want to get practice writing.

  • If you want to do an academic piece for a real academic journal, check out this great resource on possible undergraduate research journals you could eventually submit to. Look through some journals and see where your piece might be a fit. NOTE: Be sure to note the “tabs” for different academic disciplines.
  • If you want to do a technical report, start researching organizations that would produce a report on your subject matter. What non-profits, government agencies, or private companies would be interested in publishing such a document for a specialized audience? Where can you find examples of technical reports by those entities? How can you align your piece with their expectations?
  • If you want to do a grant proposal, check out some example grant proposals I have up on CourseWeb.

Next Time (2-5 min)

-Read chapter 11 in Miller

-Submit proposal for next project

-Will talk more about genres of writing for next project