Sometimes finding stuff can be tricky, and will involve a little bit of clicking around.
Here is one of the sample data sets I had given you as an example from last week’s Learning Module. It was the first link in the Data is Plural newsletter options that I offered on student loans: Student Aid Data | Federal Student Aid
This is what it looks like when you click that link:
Sometimes you will just be led directly to a .csv file (or another file type) or an interactive database hosted on the site. However, as in this case, it is sometimes not clear where the data is or what it is.
So, you gotta click around! I clicked around the first few links to see what I could learn and where the links would take me.
Clicking on “Default Rates” led me to another page with more information about data on default rates (i.e., people not paying their loans back on time). Each link on that page led to different kinds of options. One was “Cohort Default Rates by School, lender, state, and institution type.” I clicked that and went to another page where I found data.
Toward the bottom was this:
This gave a little information on what the data were and offered some file formats that I could use (I can use Excel files easily; .csv is a similar file type that you can also open and manipulate in Excel).
The first file focused on default rates for 2017, 2016, and 2015. I opened the file and got this:
The first column was an ID number for each row of data. The second column was the name of the college. The third was the address. Then the city. Pretty basic, right? Well, there are lots of variables so we gotta figure out what they mean. Further to the right there are things like “Denom 1,” “DRate 1,” and “PRate 1.” What do those mean?
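If you would rather list the column names with a few lines of code than scroll sideways in Excel, here is a minimal sketch using Python's built-in csv module. It assumes you saved the data as a .csv file. The sample text below is a made-up stand-in (the column names come from the description above, but the rows and values are invented for illustration), and `list_columns` and `preview_rows` are hypothetical helper names.

```python
import csv
import io
from itertools import islice

# Hypothetical stand-in for a downloaded .csv file. The values are invented.
SAMPLE_CSV = """ID,Name,Address,City,Denom 1,DRate 1,PRate 1
1,Example College,1 Main St,Springfield,200,5.0,x
2,Sample University,2 Oak Ave,Riverton,350,7.5,x
"""

def list_columns(csv_text):
    """Return the column headers (the first row) in order."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)

def preview_rows(csv_text, n=2):
    """Return the first n data rows as dictionaries keyed by column name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(islice(reader, n))

columns = list_columns(SAMPLE_CSV)
for name in columns:
    print(name)
```

With a real file, you would open it with `open("your_file.csv", newline="")` and pass the file object to `csv.reader` instead of using the sample text.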
Finding Meaning of Variables
Most data sets you find will have **some** information that helps you piece together what each variable (i.e., what goes in each column) means. In this case, we had to do some more clicking around.
In the table at the bottom of the webpage where I downloaded the Excel file, there was a link called “instructions on using these files may be found here.” On that page, I was able to get a better sense of what each variable meant.
Still, I might have to do some more research to learn about the relevance of those variables in a larger context (What is a default rate anyway? Is this about all loans or just ones issued by the government? How would I know?).
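One way to keep track of this is to write down what you know about each column and see which ones are still a mystery. The sketch below is only an illustration: `undocumented` is a hypothetical helper, and the notes are the ones I could fill in from looking at the file, with the harder variables deliberately left blank until you have read the instructions page.

```python
def undocumented(columns, notes):
    """Return the columns that have no non-empty note yet, in original order."""
    return [c for c in columns if not notes.get(c, "").strip()]

# Column names as described above.
columns = ["ID", "Name", "Address", "City", "Denom 1", "DRate 1", "PRate 1"]

# Notes so far. Anything missing here still needs to be looked up.
notes = {
    "ID": "ID number for each row",
    "Name": "name of the college",
    "Address": "street address",
    "City": "city",
}

todo = undocumented(columns, notes)
print(todo)
```

Whatever is left in `todo` is your list of things to look up before you start drawing conclusions from the data.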
Databases on Websites
Some of you may never work with files in the way shown above. Instead, you might use a website that hosts a database you can query. If that is the case, you have to spend some time learning how to perform searches that bring up different tabulations of the data the website draws from. It might seem tricky at first, but if someone took the time to build an interface, spending some time clicking around in it will usually teach you enough to find the information you need for your project.
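Behind the scenes, a query on one of these sites usually boils down to two choices: which rows to keep (a filter) and how to group them (a tabulation). The sketch below imitates that with Python's built-in sqlite3 module and a tiny in-memory table. The table name, column names, rows, and the `count_by_type` helper are all invented for illustration and do not come from any real site.

```python
import sqlite3

# A tiny in-memory database standing in for what a website might host.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, state TEXT, school_type TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?)",
    [
        ("A College", "NY", "Public"),
        ("B College", "NY", "Private"),
        ("C College", "NY", "Public"),
        ("D College", "NJ", "Public"),
    ],
)

def count_by_type(conn, state):
    """Filter to one state, then count the schools of each type."""
    cursor = conn.execute(
        "SELECT school_type, COUNT(*) FROM schools "
        "WHERE state = ? GROUP BY school_type ORDER BY school_type",
        (state,),
    )
    return cursor.fetchall()

ny_counts = count_by_type(conn, "NY")
print(ny_counts)
```

When you pick options from drop-down menus on a data website, you are typically choosing the filter and the grouping, and the site runs something like this for you.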
Task
For the data set you chose for your proposal, comment below on the following:
- Tell me some of the variables (the titles of each column) that seem most interesting to you and why. If working from a database, just talk about the different variables that you have an option to check or select when performing a query. (If you are just beyond confused, that is okay. You can tell me where you are stuck; see the next bullet.)
- Ask any questions you have that I can help with.
After commenting below, click the button below to continue:
In my dataset, I found it interesting and helpful to see columns for Economy (GDP) and Freedom. It seems that these two factors influence the happiness score of the country, which makes me wonder whether the score has a correlation to the country’s equality.
I think I also included this step, listing variables, on the previous page.
According to my dataset, Queens seems to have the lowest dropout rates compared to the other boroughs. I also noticed that there was a higher percentage of dropouts for males in contrast to females.
The columns that interest me most are the Percentile and the Percentage. First, the Percentage is easy for me to understand: it shows that the average wealth of China grows year by year. However, the Percentile seems a little tricky to me, because the data is written as, for example, p90p100. I think it could mean that the data covers the top 10% of people, who hold a much larger share than people with lower wealth status.
In my dataset, the columns were very self-explanatory, and not all that interesting. They describe what country is being examined, the interval at which rates were being described, and the date.
Not all that interesting, but fortunately they are quite easy to understand.
In my dataset, I find location to be the most interesting as well as criminal charges or convictions. It makes it easier to understand which locations have what numbers.
My dataset has two columns that interest me the most: the “Complainants” and the “Accused” columns. The “Complainants” column describes the person who is accusing an officer of misconduct by their age, race, and gender. The “Accused” column does the same, but describes the officer who is being accused by the same criteria. These columns are interesting because they provide more insight into factors that could contribute to certain cases of misconduct occurring against certain individuals.
The column that seems interesting to me is “From Inequality To Wealth Inequality” because it talks about what needs to be looked at to see income inequality. It also shows a method to measure the inequality of income and wealth from a worldwide point of view.
The columns of my data set are about depression, anxiety, obsessive compulsive disorder, insomnia, panic attack, counseling, and psychiatric. The ones that catch my attention are depression and anxiety, which had the highest number of internet searches for those words during quarantine.
In my dataset, what I found interesting was the trends in income and wealth levels in low, middle, and high income economies. Also, the dataset that I chose is interesting because it shows income derived from capital investments.
In my data set, I found it interesting how it differentiates mechanical restraints from everything else, as it gives a good example of how much influence it has.
Throughout the various data sets I have found, CO2 emissions, glacier mass balance, and sea level are the variables that interest me. These three data sets show me how climate change is affecting the world in a negative way.
In my data set, I thought it was interesting how certain columns were labelled. There were different categories of genders and ages. It was interesting how everything was labelled ‘targetHostCommunitiesMen’, etc. with women, boys, girls, refugees, returnees, and targetIDPs. After researching, targetIDPs basically set up your digital identities which was very interesting in understanding how profiles are created online. These various categories identified different sectors of the population that were to be given relief and resources during the COVID-19 pandemic.
The rows of my dataset present the ages of people, like 18-25 and 26 or older, and under that they include data on more specific ages (18-20, 21-25), which I found compelling.
In the rows of my data set, what caught my attention most is seeing the age groups that are being affected most by mental health issues/illness while living below the poverty line.
As for the columns, the most interesting to me are sex and age because I realized there is a correlation: females and people age 65 seem to be the most susceptible to getting heart disease.
The three variables I’m most interested in are total incarceration count, state population, and total violent crime. One is able to distinguish which jurisdictions exhibit the highest and lowest figures. One can also compare each variable in relation to the others, or see how these figures have fluctuated over time.
The variables in the data represent different search terms and track whether searches for those terms increased or decreased throughout the specified time period.