Numbers and People, Speaking

The opening example of chapter 6 of Data Feminism, about the GDELT database and how it overstated the completeness and accuracy of its data (i.e., “Big Dick Data”), is a pretty extreme and embarrassing example of something that frequently happens in data collection, management, and analysis: overpromising.

One of the goals of the Data Set Biography and Influence Project was for you to start thinking about potential issues in your data set that are worth mentioning.

D’Ignazio and Klein write that to refuse to acknowledge context is to assert mastery over your data set without addressing the complexity of what that data represents. Ultimately, it is lazy and irresponsible analysis! It is crossing your fingers and hoping for the best when you get some results.

For GDELT, both the people who created that database and the reporter for FiveThirtyEight crossed their fingers and hoped their data could say something about the prevalence of important events happening around the world.

For the Clery data about sexual assault reporting on college campuses, the people who created that system and those who reported the data hoped for the best: that it would show the prevalence of sexual assault.

  • GDELT, though, had a (very) partial perspective on the realities of media reporting, media framing, etc., which caused overrepresentation of certain events (like the kidnapping of high school girls in Nigeria in April 2014).
  • Clery reporting arguably does not give some schools enough of an incentive to report accurately, causing a disparity between colleges that create good environments for reporting and those that do not.

There will never be a perfect data set because there is never a “view from nowhere.” There can be bad, good, and better views, but they will always be somewhere and not “nowhere.” So where is your data set coming from, and how will you communicate that in your writing?

 

Preliminary Analysis from 3/15, DB&I Project, and Communicating Context

Go back to Monday’s (3/15) lesson under the heading “Preliminary Analysis.” Check out all of the questions there to consider important information about your data set or database.

Also, go back to your Data Set Biography and Influence project. What questions from the prompt did you try to answer and what was most important to know if someone were to analyze data from that data set or database?

Depending on how large a role your data set or database will play in your Data-Driven White Paper (e.g., maybe you only plan to talk about it in a couple of sentences vs. a whole section), how will you communicate important context?

 

Task

On 3/15, we talked about information to consider about data collection and important information about perspectives necessary for interpreting that data.

Let’s try out how you might communicate that. On Monday, March 22, we will talk about communicating about methods. Think of the task here as a first draft for writing about your methods of analysis.

Knowing what you know from your reading of chapter 6 of Data Feminism, the work you did for your Data Set Biography and Influence Project, and the thinking you did on Monday (3/15) about data collection and perspectives, try out the following:

In a comment below, in 2-4 sentences, introduce your data set or database along with the important contextual information your readers should know in order to interpret any analysis you do and its results.

You can’t include everything. But you can include the most important things. What do you think those most important things are, and why? In 1-2 sentences, explain why you included what you did.

After commenting below, click on the button to continue.

16 thoughts on “Numbers and People, Speaking”

  1. Arti says:

    My data set pertains to the COVID-19 relief funding provided to high-risk nations across the world, primarily in Africa, South America, and Asia. The data set shows the distribution of funding provided primarily by the World Bank and IMF. The most important aspect to understand when analyzing this data is that the allocation of resources is skewed toward nations that provide greater economic impact, rather than nations that pose greater health risks due to extreme poverty rates. I think this is important to focus on because when providing a program that aids nations deemed high risk, shouldn’t health be above economy? What is the priority or agenda here?

  2. LIAM SCHNEIDER says:

    What my dataset is communicating is essentially the outcomes of the binary questions on the FAFSA application. It shows the results over a 10 quarter period and how they changed each quarter. It is important to know that there is not much analysis that can be performed on this data in a vacuum. It requires supporting data to make claims, as it is particularly bland. It is also important to know that the data has been created by students filing an application. It is possible that some of the data could be false as students may lie, or manipulate their answers in order to receive the aid they require.

    The most important things to recognize in this data set are that the data are solely the result of students answering questions on their application, and that it does not reflect all student aid provided in the United States, only the government-funded aid.

  3. Lynden L Frank says:

    My dataset is communicating how much money was borrowed from the US government and the different types of loans that can be used by students. My dataset divides the loans into three different categories and also shows how many recipients have received loans. The most important context here that might be missing is the effects of the 2008-2010 recession.

  4. Andrea Flores says:

    My data set is communicating the frequency with which people were looking for information or help about psychological symptoms that they were developing during the quarantine period. The frequency was collected from Google Trends from 2019 to 2020, and from several countries, in order to compare and contrast what people in different countries were experiencing. The most important context here is to understand that everybody can go through a psychological stage in which we might feel trapped or overwhelmed by the situation.

  5. DALANDA BAH says:

    My database is about world income inequality. The important contextual information my readers should know is that the database focuses on providing open and convenient access to the most extensive available database on the historical evolution of the world distribution of income and wealth, within countries across the world. I think this is important because this is a great way for readers/researchers to learn more about income inequality and understand the concept.

  6. Kimberly Barrios says:

    My data set focuses on heart disease risk using clinical data collected by MIT. The most important information is the specific factors that are trending such as their medical data which would put them at risk. I wrote about gender and stress playing a role but have yet to find additional data that really supports this claim. I can see a correlation between women over the age of 50 being more at risk. I also have a speculation that stress generated by low income can contribute to poor eating habits and lack of exercise which isn’t shown in this particular data.

  7. Queen says:

    My dataset is about COVID-19 in ICE facilities, places that hold immigrants or protect detained people, particularly from the COVID-19 pandemic. The data mostly show personal information, divided into several categories such as gender, age, race/ethnicity, and reports of cases and deaths. The important thing to focus on in this data is analyzing which race/ethnicity has been affected the most by the disease.

  8. Elaine says:

    My data set is about Asian Hate and Counter-hate in Social Media during the COVID-19 Crisis. I want to draw the connection between social media and actual Asian Hate crime across the nation, whether people are more motivated to The data shows classifications for hate, counter-hate, and neutral, along with geolocations and an annotations csv. I think these are helpful categories to consider, especially because the geolocation of the Asian Hate tweets can tell whether there is higher crime in those areas compared to others. The creators of the dataset are of Asian background and the dataset is from Georgia Tech.

  9. MAHIMA KHANEJA says:

    My data set focuses on global income inequality by exploring the causes of such income differences and their consequences. The project picks inequality data set from the World Inequality Database because it is interactive, allowing comparisons between the top 10% and bottom 40% earners’ income levels. This data analysis will assist me in eliminating less important issues and understand issues where decisions can be fabricated and will also help me see the trends in income inequality over the years and condense voluminous information in little space.

  10. Gina DiGiacomo says:

    My dataset shows all of the cases of police misconduct in Chicago from 1988 to the present. The organization that created the dataset pulls reports from the Chicago police department and makes them public. The dataset has different categories such as race, gender, and age of the officers and civilians. It also specifies the location where each incident took place. I would say the most important parts of the dataset are race, gender, and location. I think this because, overwhelmingly, the data show that police misconduct is most likely to happen if the officer is a white man and the civilian is a black male. I also think this because the areas that see the most policing are lower income areas where the majority of the people living there are black.

  11. Leonida H. says:

    My data set focuses on the gender gap in income inequality, depicting the major differences in income/pay globally between men and women. It shows the disadvantages women face in the workforce, including being underrepresented and lacking opportunities in many industries. The most important part of my data set analysis is the emphasis on women being underpaid relative to men in the same field of work while holding the same qualifications and experience.

  12. Liz Fadel says:

    My data set relates to New York City Public High School 4 year Graduation rates.
    Remaining aware of the numerous factors that influence the graduation rate of NYC Public High School students, a study has been undertaken to respond to the question: to what extent does poverty influence the graduation rate of NYC Public High School students, compared with non-economically disadvantaged students, and with factors such as gender and learning disabilities? I think it is important because the study, released by Measure of America, shows that students who live in New York neighborhoods with a higher poverty rate have a significantly lower graduation rate than students who live in wealthier neighborhoods of New York. This implies that poverty could be one factor that lowers the high school graduation rate. Other than the poverty level, other factors possibly influencing graduation rates include gender and language barriers.

  13. SAMEER DHIMAN says:

    My data set is about how the temperature of the Earth is rising. It shows how the temperature is going up annually, thus causing global warming. It measures the temperature in Celsius since it’s widely used around the world and there is a chart in the data set that visually shows how it’s rising. The important part of this data set is to show that temperatures rising globally is having a negative impact on our Earth such as sea levels rising and glaciers melting along with other such negative impacts.

  14. KEMBPELL PORCENAT says:

    My dataset is about mass incarceration and violent crime. It documents the historical data on the number of various violent offenses and the number of people held in prisons. It records both figures throughout all state and federal jurisdictions in the United States. It shows the trend in violent crime and prison custody over time. In certain municipalities, such as New York, it shows both variables on a downward trajectory.

  15. MINGYI YOU says:

    My dataset is about police misconduct in NYC. The CCRB has 12056 total complaints on file and approximately 4000 officers involved in police misconduct. The database regarding police misconduct from ProPublica indicates that there are approximately 4000 in-service officers involved in police misconduct, and they received about 12000 complaints in total. On average, each of these officers would receive 2 to 3 complaints regarding police misconduct. By looking at the numbers, the audience would be able to perceive the chance for police misconduct to happen, and the high chance that an officer would commit misconduct again.

  16. Joseph Habert says:

    My dataset is about the use of restraint and seclusion on disabled students in US elementary schools. The CRDC has collected information on millions of students and believes that tens of thousands of students, primarily disabled students, are subjected to restraint and seclusion in school. The most important thing to include is the statistics in the data that prove what I am trying to show.

Comments are closed.