What is a Data Set? A Database? (5-10 min)

Here is a good and quick read on what a data set is. Essentially, a data set is a collection of observations (rows), each with a number of attributes or variables (columns), stored in a file of some kind. In a lot of ways, data sets are just tables. For us, that will almost always mean a spreadsheet. As I’ll mention later, I will try to keep us working with .csv files that you can convert into Microsoft Excel files to work with.

For our purposes, a database is a little different. There are technical and disciplinary differences between these terms (data set and database), but for us the most useful distinction is that a database has a GUI of some kind (a graphical user interface, usually a website where you can enter input and receive output) that calls from one or more data sets.

Either is good to work with this term! But the idea here, for now, is that there is never just “data” to worry about.

To really be data, information has to be collected, and the frame of that collection is always going to be a data set or database of some kind. Even notes in a notebook form a data set: the name of the object signals an observation, and the attributes described (or “variables”) are the notes about that object.

Data is organized and transformed information, as we played around with earlier in the term (though I know some liked it the other way around, too; either way the model works the same: something is “there” and then it is organized in some fashion).

 

Asking Questions of Data (20-30 min)

  1. Who is doing the work (and who is not)?
    • Example: D’Ignazio and Klein cite the Amazon algorithm used to flag résumés for interviews; the model was trained on data from previous applicants that skewed heavily male.
  2. Who benefits (and who is overlooked or actively harmed)?
    • Example: Missing data, like the data on femicide in Mexico cited by D’Ignazio and Klein. The official murder data did not record enough information about the victims, and this gap opened up opportunities for sexist propaganda to fill it, benefiting the status quo of Mexico’s law enforcement.
  3. What goals are prioritized (and what are not)?
    • Example: the Allegheny County Office of Children, Youth, and Families, which prioritized bureaucratic efficiency even though its model oversampled poor families, who were more likely to use public services. Because the agency did not have enough resources to best help families, it chose efficiency as its goal. The goal was reached, but in doing so, poor families were unfairly harmed and targeted.
  4. How does the matrix of domination help with these questions?

Let’s look at the Long-Term Productivity Database. Learn more about this data set from the about page, but also click this spreadsheet to look into the data itself. (Some of the acronyms used here are explained on the about page linked.)

This is a good example because there is both the database and a downloadable spreadsheet, which is a data set. Let’s try to apply the questions above to pick up where we left off last week.

 

Data Set Critical Biography

Let’s shift gears toward Chapter 2 of Data Feminism by starting to examine the questions from the Data Set Critical Biography prompt.

Which of these questions remind you of the kinds of questions that you saw in the second chapter? Why?

Let’s try out some of these “chapter 2” questions now on the productivity database / data set.

 

 

Choosing a Research Question or Topic and then Finding Data Sets and Databases (20-30 min)

Let’s go back to our Class Glossary that is and will continue to be a work in progress.

I also want us to think about “social justice,” a term that gets thrown around a lot and has picked up some baggage, rightly or wrongly. In the syllabus I talk a bit about it, and I want to return to that here, too.

What topics do you think this class makes available for you to learn more about?

Let’s come up with a list of topics in our text channel for our lesson today in Discord (# feb-15-2022).

 

Finding Data

Let’s start with these resources here for publicly available data sets and databases. This page is also linked on our website under “Data Resources.”

Let’s go back to our list of topics and check out each one in groups.

This week, I will also look through what is available and send you some good ones that are decently easy to work with. I want to point out some technical considerations along those lines.

 

Technical Considerations

There are some technical considerations to think about long term. For instance, always consider the types of data you are working with (go to 2/1 lesson plan for more on that).

Because I can’t expect everyone to have experience with statistics or with programming, I decided the simplest thing would be to work with software you all have access to and to require that you all work with one common file type that data are often managed and analyzed in: comma-separated values (.csv).

CSV files work really well in Excel, a program you all have access to. If you do not already have Excel downloaded to your device, go to the Baruch Computing and Technology Center’s (BCTC) page for free downloads for students. Once there, you can download the Microsoft Office package that includes Excel.

I recommend Excel instead of just using Google Sheets for two reasons:

  1. There are more formulas you can use in Excel compared to Sheets.
  2. You should be saving many, many different versions of your files, which is much more intuitive to do in Excel compared to operating Sheets from the browser.

 

 

Basic Analysis in Excel

In future lessons and learning modules, we will talk about cleaning data, which will be very important in doing analysis. However, I thought it might be useful to give you a few basic analysis features in Excel now.

There are a lot of ways to analyze data in Excel; see the resource here on different functions in Excel for statistical analysis. Some important ones to keep in mind, especially for visualization purposes:

  • To add up numerical data in a column: =SUM(cell1:cell_end_of_range)
  • To take the mean: =AVERAGE(cell1:cell_end_of_range)
  • To take the median: =MEDIAN(cell1:cell_end_of_range)
  • To count specific items in a column: =COUNTIF(cell1:cell_end_of_range, number or "text")
  • Same as COUNTIF but for multiple criteria: =COUNTIFS(range1, "criterion 1", range2, "criterion 2")
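If you eventually move to Python (which I’ll show you later), these same summaries take only a few lines with the standard library. This is just a sketch with a tiny made-up table; the column names (School, DRate) are hypothetical stand-ins for whatever your data set uses.

```python
import csv
import io
import statistics

# A tiny made-up CSV standing in for a real data set
# (column names here are hypothetical).
data = """School,DRate
A,10.5
B,7.0
C,12.5
"""

rows = list(csv.DictReader(io.StringIO(data)))
rates = [float(row["DRate"]) for row in rows]

total = sum(rates)                 # like =SUM(B2:B4)
mean = statistics.mean(rates)      # like =AVERAGE(B2:B4)
median = statistics.median(rates)  # like =MEDIAN(B2:B4)
print(total, mean, median)         # 30.0 10.0 10.5
```

With a real file you would replace `io.StringIO(data)` with `open("yourfile.csv", newline="")`, but the rest stays the same.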

The COUNTIF and COUNTIFS functions are especially helpful for turning categorical data into discrete data (i.e., turning data made up of words into counts of those words). This will be important if you want to make a bar graph, for instance.
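In Python terms, COUNTIF-style tallies are what `collections.Counter` does. Here is a small sketch; the category values (state abbreviations) are made up for illustration.

```python
from collections import Counter

# A made-up categorical column, e.g. the state of each school.
states = ["NY", "CA", "NY", "TX", "NY", "CA"]

# One Counter replaces a whole column of =COUNTIF formulas.
counts = Counter(states)
print(counts["NY"])  # 3
```

The resulting counts are exactly the numbers you would feed into a bar graph.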

A lot of this is easier with other proprietary software like Stata or SPSS. That stuff costs money eventually, though (Stata, at least, is free for now at Pitt). Programming languages like Python and R are free, so that’s part of the reason why I will show you how to use programs I wrote (and I encourage you to learn this stuff at some point in your future if you don’t already know some of it).

Below are some steps for working with this program.

 

Basic Visualization in Excel

Seeing visuals of your data early on can be really helpful for giving you a broad idea of the shape of your data: where things cluster, how varied your data are.

Some of the more basic visualizations are just easier to make in Word and Excel compared to what I will show you later in Python. You can do bar graphs, pie charts, and line graphs more easily in Excel; see this resource. You can also do scatter plots and histograms, but I think those are a little easier in the Python program we will be working from, which I’ll show you later.

In Word, you are limited to making tables–but tables can be very effective visuals.

Other programs that just generally do stuff with images can also be useful: PowerPoint, InDesign, Photoshop, etc. Something to mess around with, especially as you get toward the end of the semester. I’ll also show you how to use Tableau toward the end of the semester.

 

Going Forward

CSV files can be easily handled by the programs I’m going to share with you in the coming weeks to do more advanced analysis and visualization.

As you look for data, please only work with files that are already .csv files or can easily be turned into .csv files (you may need help from me on that, and it often won’t work), or else operate from a database on a website and never work with any files at all.
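For what it’s worth, a lot of “almost-csv” files (tab- or semicolon-delimited exports) can be converted with a few lines of Python. Here is a rough sketch using the standard library’s `csv.Sniffer`; the sample text and column names are made up, and with a real file you would read from and write to disk instead of in-memory strings.

```python
import csv
import io

# Hypothetical raw text: semicolon-delimited, a common "almost-csv" export.
raw = "School;City;DRate\nBaruch;New York;7.0\nHunter;New York;9.5\n"

# Detect the delimiter from a short list of likely candidates.
dialect = csv.Sniffer().sniff(raw, delimiters=";,\t")
rows = list(csv.reader(io.StringIO(raw), dialect))

# Rewrite the same rows comma-separated.
out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue().splitlines()[0])  # School,City,DRate
```

This won’t rescue every odd file (which is why I said to ask me for help), but it handles the simple delimiter cases.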

One final caveat: if you have some experience working with data already, please do whatever you like. If you are comfortable working with JSON or like to work with statistical analysis programs you already know like SPSS or SAS or whatever, go for it. Same goes for the level of statistics you might use. If you know how to do a multiple regression, go for it. I just won’t expect anyone to do anything beyond manipulating csv files in Excel and in the programs I write or do any sort of statistical analysis beyond descriptive statistics.

 

Looking **Around** Not Just In

Sometimes finding stuff can be tricky, and will involve a little bit of clicking around.

Here is one of the sample data sets I gave you as an example in last week’s Learning Module. It was the first link among the Data is Plural newsletter options that I offered on student loans: Student Aid Data | Federal Student Aid

This is what it looks like when you click that link:

[Image: the federal student loan data webpage, with a few links on the front page]

Sometimes you will be led directly to a .csv file (or another file type) or to an interactive database hosted on the site. However, as in this case, it is not always clear where the data is or what it is.

So, you gotta click around! I clicked around the first few links to see what I could learn and where the links would take me.

Clicking on “Default Rates” led me to another page with more information about data on default rates (i.e., people not paying their loans back on time). Each one led to different kinds of options. One was “Cohort Default Rates by School, lender, state, and institution type.” I clicked that and went to another page where I found data.

Toward the bottom was this:

[Image: webpage with links to different types of data on student loan defaults in the U.S.]

This gave a little information on what the data were and offered some formats of the data that I could use (I could use Excel files easily; .csv is a similar file type that you can also manipulate in Excel).

The first file focused on default rates for 2017, 2016, and 2015. I opened it and got this:

[Image: spreadsheet of the default-rate data]

The first column was an ID number for each row of data. The second column was the name of the college. The third was the address. Then the city. Pretty basic, right? Well, there are lots of variables so we gotta figure out what they mean. Further to the right there are things like “Denom 1,” “DRate 1,” and “PRate 1.” What do those mean?

 

Finding Meaning of Variables

Most data sets you find will have **some** information that helps you piece together what each variable (i.e., what goes in each column) means. In this case, we had to do some more clicking around.

In the table at the bottom of the webpage where I downloaded the Excel file, there was a link called “instructions on using these files may be found here.” On that page, I was able to get a better sense of what each variable meant.

Still, it might mean I have to do some more research to learn about the relevance of those variables in a larger context (what is a default rate, anyway? Is this about all loans or just ones issued by the government? How would I know?).

 

Databases on Websites

Some of you may not ever work with files in the way shown above. There might be websites that host a database that you can query. If that is the case, you have to spend some time learning how to perform searches to bring up different kinds of tabulations of data that the website is drawing from. It might seem tricky at first, but usually, if someone spent the time making an interface, spending some time with it and clicking around will give you enough to learn how to use it to find the information you need for your project.

 

Next Time and Further (2-5 min)

  • Continue meeting with you all this week. A few of you haven’t signed up yet; sign up soon (let me know if these times don’t work for you and we can find another time).
  • Since I am doing 20 conferences of 20 minutes in length, I always save myself some labor by cancelling a class (a common practice for teacher conferences). No class on Thursday, February 17.
  • Start looking for data!!
  • Read Chapter 4 of Data Feminism and do a Response Post or Comment on Discord for February 22.
  • Data Set Critical Biography Part I is due February 24. Part II is due March 1.