Emotion and Embodiment in Writing with Data (30 minutes)

Dora writes, “Evoking emotion does not mean the data is biased; evoking emotion is just part of the collected data” and “if you don’t exhibit any feelings towards the information you are looking at, you most likely will not be motivated to do anything about it.”

Jesus sees the value of the “data-ink ratio” but recognizes limitations.

So, when do we want to up the "ink" side of that ratio, and why? Does that ever lead to issues? Or nah?

 

Think about this:

You are in a doctor’s office. You have to consider getting surgery. The doctor says the following:

There is a 2% chance your knee injury will worsen. Do you want surgery to reduce this chance?

Also consider if the doctor said this instead:

One in 50 people will have this knee condition worsen. Do you want surgery to reduce this chance?

How do you feel hearing one version vs. the other? (Mathematically, they are identical: 1 in 50 is 2%.) Is either deceptive? Is one version more “emotional” than the other? If so, what does that mean?

 

Rhetoric

There are lots of definitions of rhetoric. But what it essentially comes down to is this: rhetoric is what we are doing when we make intentional choices in how we make meaning (e.g., writing, speaking, creating visuals) to try to influence others.

Are we ever not doing that?

Depending on your answer to that question, you might have different thoughts on the relationship between rhetoric, data, and emotions.

 

God-Trick and Partial Perspective

Going back to last class, we can now think about a “feminist objectivity,” in which we acknowledge that our knowledge is always partial and limited, that we can be enriched by incorporating multiple perspectives, and that the potential harms and benefits have to be factored into the decisions we make with data–and *NONE* of this makes room for inaccuracy or falsehood. It just means nobody is going around thinking they are God.

We will keep talking and thinking about this in coming weeks.

 

Distributions and Variability (30 minutes)

Remember that you have to be careful about generalizing too much from data you use. Most of you have data from a sample, which means you have data not from all possible cases, but from some subset of those possible cases. But even if you have population data (i.e., a reasonable assumption that you have nearly every case possible), there are still qualifications on what you can say about the information you have about those cases.

Let’s drill down on population vs. sample.

For example, when you filled out the survey about rice on the first day of class, that was both a population and a sample.

It is a population in the sense that it represents all cases of our class.

It is a sample insofar as it could be analyzed to represent any one of these possible populations:

  • students in English classes at Baruch College
  • students in English classes
  • students at Baruch College
  • students at CUNY
  • students in NYC
  • students in NY
  • college students in the U.S.
  • students in the U.S.
  • college students in the world
  • people who eat rice
  • people
  • etc.

If you wanted to use that sample, you would have to qualify your analysis quite a bit.

I can confidently say something about how that survey represented the views of our class about preparing rice. I am far less confident about how I could use that data to say much about college students’ thoughts about rice in general, since you all are not representative of all college students.

This is where finding out how your data set was collected is important: it shapes how you phrase descriptions of your analysis results and how you use your analysis as part of a broader series of evidence in your Data-Driven White Paper.

Even if your sample is not very representative, there will be a way you can use it as part of a story about something it is representative of and how it might connect to other data-driven analyses you find elsewhere.

To think more about this, here are some questions to consider:

What do you know about the data collection in the data set you have? How does that inform how you interpret your analysis?

    1. What time period was the data collected in? One year? A couple of weeks? With what regularity–monthly, weekly, etc.?
    2. Where was the data collected? In the U.S.? Another country? Where in that country? What kinds of places within that country?
    3. How was the data collected? Who collected it? Was it a population or a sample of a population? Did they randomize the sample? If not randomized, how was data selected for inclusion in the sample? Was it an experiment of some kind; if so, what were the experimental conditions? Was data generated from a survey, interview, administrative records, observations of subjects, physical examination, scientific measurements of matter, or other material?
    4. Who/what is in the data? A population or a sample of that population? If a survey or study where people opted into responding, what was the response rate? Did people drop out of the study? Did you filter data out of the data set to answer your research question; if so, what did you filter? Did you take out data, like outliers? Was a lot of data missing; if so, what did you do about that? What variables does your analysis focus on, and why? What is the distribution shaped like for the quantitative variables you analyzed? What measure of central tendency would make sense based on that distribution’s shape? How large is the sample, if it is a sample?

What contextual information is important to know in interpretation?

  1. What secondary research can help?
  2. What perspectives are needed to ethically represent what this data can mean?
  3. What methods of analysis were used (especially important if you are using more advanced methods than measures of central tendency or count and/or for secondary sources using more advanced methods)?

Take 5 minutes to start to answer these questions.

 

If we have time, we can talk about the material below, too, but we may have to push it to next week:

Distributions and Variability

It is really important to know how values are distributed in your data set or database. If you are using a database, you should be able to get some visual or measure of variance to help you make meaning of your measures of central tendency or the shape of your data in general.

 

Using Visuals To See Distributions

Visuals can be really helpful here. Making a bar graph for categorical data or a histogram for numeric data can help you get a sense of things.

Categorical Data and Bar Graphs

Calculating the mode will probably be just as informative as seeing it as a bar graph. Still, sometimes visuals can allow us to see patterns better than alphanumeric writing.

Click here to learn more about this and how to make a bar chart in Excel.
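If you would rather script this than use Excel, here is a minimal sketch in Python, assuming you have pandas and matplotlib installed (the file name "rice_survey.csv" and column name "rice_method" are made up for illustration):

```python
# Minimal sketch: bar graph of one categorical variable.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("rice_survey.csv")        # hypothetical file name
counts = df["rice_method"].value_counts()  # frequency of each category
counts.plot(kind="bar")                    # bar graph of those frequencies
plt.xlabel("Rice preparation method")
plt.ylabel("Number of students")
plt.tight_layout()
plt.show()
```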

 

Numeric Data and Histograms

Seeing the distributions of numerical data can be much more important. It is foundational to both the simplest and the most complex statistical analyses.

Click here to learn about the different types of distributions and how to create a histogram in Excel.
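As with the bar graph, here is a minimal Python sketch, again assuming pandas and matplotlib and a made-up numeric column:

```python
# Minimal sketch: histogram of one numeric variable.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("rice_survey.csv")              # hypothetical file name
plt.hist(df["cups_per_week"].dropna(), bins=10)  # hypothetical column; 10 bins
plt.xlabel("Cups of rice per week")
plt.ylabel("Number of students")
plt.show()
```

Changing the number of bins can change the apparent shape of the distribution, so try a few values.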

 

Variance, Standard Deviation, Interquartile Range

Variance is just the sum of the squared differences between each value and the mean, divided by the population size (if you have all values in existence) or by the sample size minus 1 (to reflect that you do not have all values in existence).
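If you want to double-check a variance calculation outside of Excel, Python’s built-in statistics module has both versions (the data below are made up):

```python
# Population variance divides by N; sample variance divides by n - 1.
import statistics

values = [4, 8, 6, 5, 3, 7]          # made-up data
print(statistics.pvariance(values))  # population variance (divide by N)
print(statistics.variance(values))   # sample variance (divide by n - 1)
```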

Standard Deviation

To make using variance more manageable, the standard deviation is conventionally used, which is just the square root of the variance (taking the square root returns the measure to the original units of the data).

It can be helpful to use standard deviation as a way to think about any one value compared against other values. For example, if there are a lot of values clustered around the mean, creating a standard deviation of 1.07, a difference of two points between two values will be really far apart in terms of spread (i.e., nearly two standard deviations away from one another).

By contrast, if the distribution is more spread out and, say, has a standard deviation of 2.18, a difference of two points between values would be within one standard deviation and would not be very far apart in terms of spread. (previous example adapted from Jane Miller, 2015, The Chicago Guide to Writing about Numbers, Second Edition, pp. 79-81).
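To make that arithmetic concrete: 2 / 1.07 ≈ 1.87, so the two values are nearly two standard deviations apart; 2 / 2.18 ≈ 0.92, so the same two-point difference is less than one standard deviation.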

Click here to learn how to calculate standard deviation, quartiles, and interquartile range.

 

 

Quartiles and Interquartile Range

Interquartile range is another way to describe the variability in your numeric data. A quartile is a cut point at or below which a certain percentage of the recorded values fall. There are four quartiles:

  • Q1 (25%)
  • Q2 (50%–i.e., the median)
  • Q3 (75%)
  • Q4 (100%–i.e., the maximum).

You could include all of that information, along with the minimum, to get a sense of the spread of the data.

You could also calculate what is called the interquartile range, which can be especially helpful if you have outliers. Since standard deviation relies on the mean, if there are outliers, then the standard deviation may not be that useful in explaining the spread (because outliers will greatly inflate the standard deviation).

To calculate the interquartile range, you simply subtract Q1 from Q3.

Click here to learn how to calculate standard deviation, quartiles, and interquartile range.

A convention in statistics is to use the interquartile range as a way to help see if you have any outliers. To do so, multiply the interquartile range by 1.5.

Take that number and subtract it from Q1–if any number is below that result, then it is considered an outlier.

Additionally, take the IQR*1.5 figure and add it to Q3–if any number is above that result, then it is also considered an outlier.
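Here is a minimal sketch of that whole procedure in Python, using the built-in statistics module (the data are made up, and note that different tools compute quartiles slightly differently, so your numbers may differ a little from Excel’s):

```python
# Quartiles, IQR, and the 1.5 * IQR outlier fences.
import statistics

values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 42]  # made up; 42 looks like an outlier

q1, median, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
iqr = q3 - q1                                       # interquartile range

lower_fence = q1 - 1.5 * iqr   # anything below this counts as an outlier
upper_fence = q3 + 1.5 * iqr   # anything above this counts as an outlier
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers)
```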


Using Median and Mode vs. Mean

You can use all three common measures of central tendency, though, for continuous data–numeric data that can take on any value in a range (think decimal points). Discrete data–numeric data that comes in distinct, countable values, like the number of cloudy days in a year–can also use all three, but the mean might not be as useful as the median and mode.

If there are outliers or a really wide spread in the data, it might be better to use the median rather than the mean.
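A quick made-up example of why: a single outlier drags the mean but barely moves the median.

```python
# One outlier pulls the mean far more than the median.
import statistics

cloudy_days = [10, 12, 12, 13, 14, 15, 120]  # made-up data; 120 is an outlier
print(statistics.mean(cloudy_days))          # 28.0 -- pulled up by the outlier
print(statistics.median(cloudy_days))        # 13 -- barely affected
```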

 

Proposal (2-5 minutes)

The proposal is due Tuesday, March 15, by 11:59pm. Here is what I want you to do:

Tell me about your research question, the genre you will choose, the organization you are writing on behalf of, how you will analyze your data set, and any other secondary sources you have found so far. You should also include any questions you have. No word requirement, just include as much information as you’d like so I can help you.

We will talk more about the genres on March 15, but for now, you just have to think about the kind of writing you want to do: something more like non-fiction to entertain public audiences (long-form non-fiction essay) or something more professional and technical for decision-makers (white paper).

The readings for Tuesday, March 15 are examples of long-form non-fiction essays that are data-driven in various ways. I also have some example white papers on Blackboard in the assignment instructions folder for March 15.

 

Next Time (2-5 minutes)

-Read one of the three data journalism pieces.

-Do your proposal for your Data-Driven Argument assignment. The proposal is due to Blackboard by March 15 at 11:59pm.