8-29-2019 WWD

Data, like Soylent Green, is People (10-15 min)

I will count us off into groups of 3-4, and we will consider the questions below together. Use the reading in your discussion to support your claims and to consider claims from others in your group.

  1. Why provide context about a given dataset?
  2. What is meant by context, do you think?
  3. What are some ways to consider context as you explore a dataset and write about it?
  4. Where do you look to find this context?
  5. How do you account for what is missing? Are there situations where this might be the case? What could you do as a writer?
  6. Is there value in pausing periodically to ask yourself “what am I saying? What can I reasonably interpret from what is in front of me?” Why or why not?
  7. What is your role as a writer with data? Should you say what you think or leave it to your audience? Can you give an example? Are there exceptions?
  8. What bad things happen when people do not consider the context of the data they have? Brainstorm a bunch.

Looking for Data (10 min)

  1. Check out some of the dataset repositories I offered, but also look for a bunch of stuff with your search engine skills.
  2. Convert the data to a CSV file if it is not already available as one (CSV is one of the more common file types). If you come across a different file type, check its extension and do a search for “convert FILE to CSV” to see if there is a way to convert it; if not, the data may not be well suited for the CSV format. (A rough pandas sketch of one common conversion appears after this list.)
  3. Save the CSV file. If you change anything in your CSV file (e.g., maybe you are only interested in one year and you delete all other years from the original file), make sure you save it as a different version. Take notes for yourself about what each version is.
  4. It could be that there is a useful interface for the data you want to look at and you never have to touch a spreadsheet. That is cool with me if it makes sense for your project.
  5. CONTEXTUALIZE. Don’t jump in too fast. How were the data collected? How were different variables defined?
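
If you end up doing a conversion or version-saving step in Python instead of by hand, a minimal sketch might look like the one below. The file and column names are hypothetical, and it assumes pandas (plus openpyxl for .xlsx files) is installed.

    # A minimal sketch: read a hypothetical Excel file and save it as a CSV.
    import pandas as pd

    df = pd.read_excel("my_data.xlsx")      # read the spreadsheet into a DataFrame
    df.to_csv("my_data.csv", index=False)   # save it as a CSV

    # Save a filtered version under a new name so the original stays untouched
    # (assumes a hypothetical "Year" column).
    df_2019 = df[df["Year"] == 2019]
    df_2019.to_csv("my_data_2019_only.csv", index=False)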

Let’s do a quick example:

  1. Go to Awesome Data. Go under “SocialNetworks,” and then click “Twitter Data for Sentiment Analysis.”
  2. Click “Stanford link” under “Where is the training data?”
  3. Open the zip file.
  4. Click the csv file beginning with ‘training’
  5. What is this? CONTEXTUALIZE. What do they do a good job of? What questions do you still have?
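
If you would rather peek at a big downloaded file in Python than in Excel, a minimal sketch is below. The file name is hypothetical; “found” CSVs sometimes have no header row or a non-UTF-8 encoding, so those options are shown as comments.

    # A minimal sketch for peeking at a downloaded CSV; the file name is hypothetical.
    import pandas as pd

    df = pd.read_csv("training_data.csv")
    # df = pd.read_csv("training_data.csv", header=None, encoding="latin-1")  # if the defaults fail

    print(df.shape)    # number of rows and columns
    print(df.head())   # the first five rows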

Exploring Data (30-45 min)

One disclaimer for this term is that I’m trying to strike a balance between keeping things simple and giving you options to do a variety of things. It is a tricky balance, so we will probably have hiccups along the way. If you run into difficulties, please stay in touch and DEFINITELY work with your classmates, as they might have mastered something you haven’t–and vice versa. Further, we are a collection of people from a lot of different backgrounds–we should share expertise as relevant as we go through the semester.

To make things simple, rather than looking at our example “found” data above, let’s look at some data we helped make the other day. We are going to clean it, try to filter certain aspects, and then analyze it a little together. We will see how far we get today.

Cleaning Data

Here is a good resource for data cleaning (it is a little technical, but not overly so–most of it is applicable to things we can do in this class).

The summary mostly comes down to the following (a rough pandas sketch of a few of these steps appears after the list):

  • Missing data. There are different choices for handling missing data. You might just leave a value missing if there are not that many values missing in the row overall. You might enter the mean or median. You might enter a value that resembles similar rows (e.g., hitters with about a .380 on-base percentage who hit about .250 tend to have around 150 strikeouts per year). You could also drop the rows with missing data if there aren’t many of them. No matter what you do, you should have a reason for doing it.
  • Consistency/Standardization. This can be an issue with categorical data in particular. If some values are written as “Tennessee” and others as “tennessee”, that could mess with your analysis. Misspellings are also a possibility (e.g., Tenessee and Tennessee). Data-type inconsistency is a more technical issue and less likely with what we’re doing, but you might still have a string (e.g., “six”) where the rest of the column has an integer (e.g., “6”). There could also be different words that mean the same thing (e.g., “information technology” and “IT”). Extra spaces can be tricky to spot–I have something written in the Jupyter Notebook to help find these–but they can also screw things up (e.g., “Tennessee” and “Tennessee ”).
  • Accuracy. This is hard to check, but something could set off red flags, or something might just be imprecise and you could do more digging. For instance, you might notice someone gave a county instead of a town for where an event occurred; there might be a way to find the town where that event took place. Other problems are more straightforward: the ‘total’ column does not match the values that make up the total, or a record describes something implausible (e.g., a young child in the U.S. listed as married).
  • Uniformity. One big issue here is different units of measurement. For instance, perhaps some values are in kilometers and others are in miles.
  • Irrelevant Data (for us, this is filtering–see below). If you are only interested in data about health, you don’t need the column on phone numbers. If you are only interested in certain countries, you don’t need to look at all countries.
  • Duplicates. Sometimes things are entered more than once. This could be for a variety of reasons. One common one is the same name or ID number appearing again because the person was polled more than once at different times or because previous data sources were combined. It can sometimes be hard to be sure (e.g., if you have more than one “Mike Jones”). If all the values in the row are the same, however, you can assume it is a duplicate.
  • Outliers. Outliers are often technically defined as values that fall more than 1.5 times the interquartile range (IQR) below the 25th percentile or above the 75th percentile; the IQR is just the difference between the value at the 75th percentile and the value at the 25th percentile. You can kind of use the “eye test” here, though, if you look at a visualization. Outliers can be difficult to deal with. You generally don’t want to remove them because it is difficult to argue that they are unrepresentative, inaccurate, or whatever. However, if there is a dramatic difference or if you have a good case for why the datapoint(s) just don’t fit with the rest of the data, you can consider dropping them.
  • Converting things. You might want to convert categorical variables into numbers for easier calculation and for visualizing purposes (e.g., “low income = 1, middle income = 2, high income = 3”).
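
Here is the rough pandas sketch mentioned above, showing what a few of these steps can look like in code. The file and column names are made up for illustration, and there are plenty of other reasonable ways to do each step.

    # A rough sketch of a few cleaning steps in pandas; file and column names are hypothetical.
    import pandas as pd

    df = pd.read_csv("survey.csv")

    # Missing data: fill a numeric column with its median (or drop those rows instead).
    df["GPA"] = df["GPA"].fillna(df["GPA"].median())
    # df = df.dropna(subset=["GPA"])   # the "drop the rows" option

    # Consistency/standardization: trim extra spaces and fix capitalization.
    df["State"] = df["State"].str.strip().str.title()   # "tennessee " becomes "Tennessee"

    # Duplicates: drop rows where every value is identical.
    df = df.drop_duplicates()

    # Converting things: map categories to numbers.
    df["Income_num"] = df["Income"].map({"low income": 1, "middle income": 2, "high income": 3})

    # Save the cleaned version under a new name.
    df.to_csv("survey_cleaned.csv", index=False)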

Activity: Okay, let’s try to clean some data! In CourseWeb, there is a CSV file of the “Background and Interests” survey results from Tuesday (NOTE: I randomly reordered the GPAs to ensure no one can match a GPA to other information). Download that CSV file and, with a partner, try to clean the data based on the above criteria. Keep a record of your decisions with a justification for why you are making them.

Some hints:

There are other things to consider: Is there categorical data worth converting into a number? Is the data accurate? Is each column an appropriate data type for the analysis you’ll want to do? Etc.

Don’t worry about outliers for now. The sample is so small that this would be meaningless anyway.

When finished, here is the answer key to some things you could have done to clean the data. Don’t peek until you are finished! And, by all means, there are things you could have done that I don’t have here.

NOTE: Excel doesn’t always care about capitalization, but you should still make those changes! Further, filtering in Excel makes it easy to keep similar categories together. It is a good habit to have your data standardized.

OTHER NOTE: If you didn’t already, rename columns: “GPA” for column about GPA, “Major” for column about majors.

Now that you have done this manually, here are a few things that could help when trying to use Excel to clean your CSV files.

Filtering

There are more efficient ways to do a lot of the cleaning above and the filtering I’m going to talk about now. I tried (and failed!) to make a user-friendly version that works in Jupyter Notebook. This is likely due to my poor programming skills rather than the impossibility of the task (any programmers in class–let me know if you have some thoughts on this!). A rough sketch of what filtering can look like in pandas appears after the Excel steps below.

To create a filter, click on the entire column by clicking on the letter that represents the column (e.g., A, B, C) which will then highlight the whole column. Then, on the home tab, click “Sort & Filter” and select “Filter.”

There should be a down arrow that appears on that column now. Click that and you’ll have checkboxes appear that show all of the values in that column. You can select and deselect in order to filter the values that you want to remain.

Once you have the data filtered for the stuff you want to look at, copy/paste those cells into a new CSV file and save it under a name that will let you know what is filtered here (e.g., STEM majors).
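
For comparison, here is roughly what the same filtering move might look like in pandas. The column name and values are hypothetical.

    # A rough sketch of filtering in pandas; the column name and values are hypothetical.
    import pandas as pd

    df = pd.read_csv("survey_cleaned.csv")

    # Keep only the rows you care about, e.g., a few majors.
    stem = df[df["Major"].isin(["Biology", "Computer Science", "Engineering"])]

    # Save the filtered rows under a name that tells you what is in the file.
    stem.to_csv("survey_STEM_majors.csv", index=False)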

Analysis and Visualization

There are a lot of ways to analyze data in Excel–see the resource here on different functions in Excel for statistical analysis. Some important ones to keep in mind, especially for visualization purposes:

  • To add up numerical data in a column: =SUM(cell1:cell_end_of_range)
  • To take the mean: =AVERAGE(cell1:cell_end_of_range)
  • To take the median: =MEDIAN(cell1:cell_end_of_range)
  • To count specific items in a column: =COUNTIF(cell1:cell_end_of_range, number or “text”)
  • Same as COUNTIF but for multiple criteria: =COUNTIFS(cell1:cell_end_of_range, “criterion 1”, cell1:cell_end_of_range, “criterion 2”)

The COUNTIF and COUNTIFS functions are especially helpful for turning categorical data into counts (i.e., data made up of words into counts of those words). This will be important if you want to make a bar graph, for instance.
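
If you end up doing this in the Jupyter Notebook instead, the rough pandas equivalents look something like the following (the column names are hypothetical).

    # Rough pandas equivalents of the Excel functions above; column names are hypothetical.
    import pandas as pd

    df = pd.read_csv("survey_cleaned.csv")

    df["GPA"].sum()      # like =SUM(...)
    df["GPA"].mean()     # like =AVERAGE(...)
    df["GPA"].median()   # like =MEDIAN(...)

    # Like =COUNTIF(...): count how often each value appears in a column.
    df["Major"].value_counts()

    # Like =COUNTIFS(...): count rows that meet multiple criteria.
    ((df["Major"] == "Biology") & (df["GPA"] > 3.0)).sum()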

A lot of this is easier with proprietary software like Stata or SPSS. That stuff eventually costs money, though (Stata, at least, is free at Pitt for now). Programming languages like Python and R are free, so that’s part of the reason why I wanted to do some stuff with Python through a Jupyter Notebook (which lets you work through a browser in a more pleasing, iterative interface compared to the command line).

Below are some steps for working with this program.

Steps for Analyzing via Jupyter Notebook

  1. Go to CourseWeb>Course Documents and download Jupyter Notebook 1 to your computer (or, any future versions we use).
  2. Find “Jupyter Notebook” app on your computer and open it up.
  3. Click “Upload” in the top right of the Jupyter Notebook screen in your browser. Upload Jupyter Notebook 1 from your computer to your folder in the browser there.
  4. IMPORTANT: for the class example, delete the last column in the csv file for the background interests/info survey. For some reason, it does not like it and I can’t figure out why. Everything will work fine without that last column.
  5. Click “Upload” again and upload the csv file you want to analyze.
  6. Follow the instructions in each cell for the sorts of operations you want to perform. You’ll always want to import the libraries and load the csv file into the program (a rough sketch of that basic pattern appears after these steps). Other than that, things can exist fairly independently.
  7. When instructions say to fill in information, this will be for things like the column title as it is written in the csv file (computers aren’t smart, so you have to tell them exactly what you mean), adding labels to the x or y axis in a visualization, changing elements of visualization, etc.
  8. When you are ready to execute the code, make sure the cursor is blinking inside the cell that has the code. Click the “run” button and an output will be generated. I also have the “save” button highlighted below–even though there are auto-saves, you might want to save just in case it crashes.
  9. When you “run” a cell, do not run the next cell while the current one is still running. You know the cell is still running when you see [*] to the left of the cell. If you see output below the cell and/or the brackets to the left of the cell filled with a number (e.g., [3]), then you can move on and run the next cell.
  10. You can clear all outputs by going to “Cell” and then “All Outputs” and then “Clear.” Might be easier to work with when you are using the program multiple times.
  11. Finally, when you are finished working with the program, go to “File” and then “Close and Halt.” This helps things close out a bit more cleanly.
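
Here is the rough sketch of the basic pattern mentioned in step 6. This is not the notebook’s exact code–just the general shape of importing a library, loading a csv file, and getting some quick descriptive statistics (the file name is hypothetical).

    # A sketch of the basic pattern, not the notebook's exact code; the file name is hypothetical.
    import pandas as pd

    # Load the csv file you uploaded into the same folder as the notebook.
    df = pd.read_csv("background_interests.csv")

    df.head()        # look at the first few rows
    df.describe()    # descriptive statistics for the numeric columns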

Visualization with Python, Excel, and Word

Seeing visuals of your data early on can be really helpful to give you a broad idea of the shape of your data–where things cluster, how varied your data are.

Some of the more basic visualizations are just easier to make in Word and Excel, but there was some stuff in the Python above that was a bit different. You can make bar graphs, pie charts, and line graphs more easily in Excel–see this resource. You can also make scatter plots and histograms there, but I think those are a little easier in the Python program we were working from.
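
As one example of the Python side, a bar graph of a categorical column only takes a few lines. This is a sketch with a hypothetical file and column name, and it assumes pandas and matplotlib are installed.

    # A sketch of a bar graph from a categorical column; file and column names are hypothetical.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("background_interests.csv")

    counts = df["Major"].value_counts()   # turn the categories into counts
    counts.plot(kind="bar")               # draw the bar graph
    plt.xlabel("Major")
    plt.ylabel("Number of students")
    plt.title("Majors in the class")
    plt.show()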

In Word, you are limited to making tables–but tables can be very effective visuals.

Other programs that just generally do stuff with images can also be useful: PowerPoint, InDesign, Photoshop, etc. Something to mess around with, especially as you get toward the end of the semester.

In-class Data Exploration and Journal 2 (10-15 min)

With a partner, talk about some interests you have that you explored in Journal 1. Any feedback for where your partner might look to develop a project? After a brief chat, start exploring.

-Some things you can do:

  • Review your Journal 1 response and explore dataset repositories and the internet more broadly to see what might be related to your interests.
  • Start opening up some datasets and just look inside.
  • See if there might be any technical issues (e.g., a weird file format) and whether you might be able to use it or not.
  • What metadata are available to describe the data? Thinking back to D’Ignazio and Klein, is there anything potentially amiss here? Do the way the data are collected/defined make it useful for some purposes and less useful for others?
  • Try to clean or filter the data in the CSV file, save it as a new CSV file, and analyze it from there. Conversely, you can try my code for filtering and cleaning, but I’m worried it is kind of messed up, so we might have to do it the old-fashioned way for now.
  • Get the descriptive statistics for various columns and filtered versions of columns.
  • Create a few visualizations for various columns and filtered versions of columns.
  • Take some notes on what you find interesting about these data. Why do you find it interesting?
  • Start to sketch out any possible directions you could go in writing about these data.

Upcoming projects (5 min)

-Learning Narrative 1

-Journal 2

-Public Writing Proposal