Cleaning Data (20-30 minutes)

So what do you make of Au’s claim about cleaning as analysis? Terence thinks it makes sense but there are important sub-distinctions. Aftar says they are different but maybe it is a little murky sometimes to see differences. What do you all think?

Data cleaning is essentially preparing your data before you analyze it…..or maybe something more complicated than that. To make sure your analysis gives you results that are meaningful, it is important to make sure your data are a number of things before you can analyze it:

  • complete (i.e., is stuff missing that should be there?)
  • accurate (i.e., is stuff just flat out wrong in some places or calculated incorrectly?)
  • relevant (i.e., are some rows, columns, or cells irrelevant to the purpose of the data collection or analysis?)
  • not corrupt (i.e., did something get weird through the process of collection?)
  • consistent (i.e., are standards kept in data entry that ensure that data can be aggregated correctly?)
  • formatted correctly (i.e., are some entries filled out in ways that won’t be able to be processed in analysis?)
  • free of duplicates (i.e., sometimes things are entered more than once, which would create issues in analysis)

If any of the above issues are present, then that means you have decisions to make. These decisions that you make is data cleaning. As Au talks about, these involve value judgments which is why some of the thinking we did during the Data Set Critical Biography assignments were helpful to think about.

 

Cleaning Activity (30-45 minutes)

Some of you are working with databases and you cannot access the data (or maybe you can, but it is too complex or in a file format you can’t navigate) so you won’t really be able to worry about cleaning data. This should be fine, though it is always a good thing to be able to look at how things are formed into a data set to look for things like:

Click here to learn more about issues in data sets that you would have to clean up.

person cleaning glass door with yellow rubber glove on and squeegee

The above link has tips on cleaning data for:

  • missing data
  • inconsistent data
  • inaccurate data
  • data that is not uniform
  • duplicated data
  • outliers
  • data that could benefit from conversion

 

ACTIVITY

Click here to download an Excel file of data we are going to try to clean. Does it look familiar?

Take about 10 minutes to write down things that look off, decided what makes it look off (e.g., duplicate? not uniform?), and then talk about the decision you would make to clean it.

There are about 10 things you could do, but there’s a lot you could do to ready this for anlaysis. Use this link (same as above link) to help you sort out reasons why you would clean data.

After spending some time cleaning, we will go over it as a class.

Question

Any of the cleaning we did: were there value judgments baked in that were “analytical” to a degree? What do you think?

 

Preliminary Analysis (10-15 minutes)

You gotta look at stuff in different ways! Today we will talk about filtering and sorting. Next week we will talk about measures of central tendencies, different ways to count things, and distributions/variability.

 

Try out filtering and sorting

Au mentioned “exploratory data analysis.” Once your data is cleaned (and, this could be a recursive process as you learn more about your data–that is, you go back and clean again if seeing some odd things in later analysis), you gotta mess around with it to see some things that you can find. You want to see if you can get a sense of the sort of hypothesis you might have by taking a look at a variety of things in your data.

We won’t get into all of the details with this process. If you want to learn more about EDA, go here. For our purposes, we are just using this term for early investigations of our data to work toward finding the path that works for the writing we want to do this semester.

Using Excel In Early Stages

There are more automated ways to do the kind of filtering, sorting, and other sorts of things to help with some exploratory and preliminary analysis. But, for simplicity’s sake and to keep us all in the same program (unless you opt for something different), we are going to stick with Excel.

As a reminder, if you don’t have Excel on your device, you can download it here.

If you have issues with your device, let me know. There are ways to get a loaner laptop, a Chromebook, etc. [though, if you are working with a Chromebook, you will likely have to work with Google Sheets, which has very similar, but not the same, ways of doing things as Excel]

 

Filtering/Sorting/Counting

To create a filter in Excel, click on the entire column by clicking on the letter that represents the column (e.g., A, B, C) which will then highlight the whole column.

Then, on the home tab, click “Sort & Filter” and select “Filter.”

Screenshot of filter button in Excel

Once the filter is created for that column, you can click the arrow at the top of the column to see what you can filter out.

Excel screenshot of arrow to click to select things to filter in a column

A window will open up and you will see a series of check boxes that you can select or de-select. For instance, in the case of the filter I made for NYC neighborhoods in the Airbnb data set we worked with in class from February 22, I can filter out only specific neighborhoods I want to see. Like, for instance, Harlem (or Harlem and East Harlem…or whatever).

Excel screenshot of using filter on data set for only neighborhoods in column for Harlem and East Harlem

You don’t have to do this, but I want to offer another step in filtering to help keep things organized. Especially if it is something I want to keep working with, I like to select all, copy, and then paste it into a new tab in Excel (IMPORTANT: make sure you save your csv file as an Excel file, otherwise the tabs won’t save). The shortcut command on a PC to select all is holding down “ctrl” and “a”. On a Mac, it is “cmd” and “a” at the same time.

Once everything is selected, copy it and then paste it in a new tab. The tabs are at the bottom of the Excel document.

screenshot of tabs at bottom of Excel file

If you click on the “+” sign you will add a new tab. I like to label it with what I filtered.

If you do paste things into a new tab, you will want to save the file as an Excel file and NOT a csv file. Csv files are not really designed to have the tab feature, so it might not save in the way you want it to.

 

Setting Standards

To help consider if a value is meaningful in terms of the story you are telling or the argument you are making, it is useful to set a standard to compare values against. In a vacuum, this would be the measure of central tendency of your sample (more on that soon!). However, many of you are not working in a vacuum, so you might need more measures.

For instance, you might compare the weather on September 17, 2019 against the mean temperature of September 17 across several decades. Some measures are adjusted by age or they use cutoffs for years (e.g., everything since 1982) for contextual reasons. Thinking about what standards or cutoffs to use depends on your getting to know your data, the context that surrounds it, and the goals of your project.

If you use standards like this, you may have to use some filtering and sorting to help create a new data set out of something larger.

 

Next Time (2-5 minutes)

-Chapter 5 of Data Feminism

-Response Post if your turn, Comment if not.

-Keep thinking about your topic for your Data-Driven Argument assignment. Proposal due March 15.