Cleaning Data / Doing Analysis

Randy Au argues that cleaning data is doing analysis. It might come early in the analytical process, but that does not mean it is somehow different from, or less important than, the rest of the work.

Au writes that “The act of cleaning data is the act of preferentially transforming data so that your chosen analysis algorithm produces interpretable results. That is also the act of data analysis.”

[Don’t be scared off by the word “algorithm”; it really boils down to a process of calculations…using a program to find a mean is essentially an algorithm.]

To clean data is to find methods for amplifying the signal and diminishing the noise. This is incredibly important!

Cleaning and other things we might typically call analysis are about value judgments. Everything works toward finding meaning and making judgments about those meanings.

The misspellings of Philadelphia are a great example to show what Au means:

Going back to the Philadelphia example, it may seem obvious to you that all those misspellings should be grouped together and cleaned to be the same, correct value. As a frequent analyst of crappy data from open text boxes on the internet, that’s my automatic reaction upon seeing it. This is because the work I do generally wants all data from a given location to be grouped together.

But what if you’re doing a linguistic study? Those typos are actually extremely valuable to you. Normalizing that variation away would make any analysis on that front impossible. If we didn’t store the original data (which is recommended best practice that only some people follow), it would forever be lost.
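The two choices above can be kept compatible: normalize a copy of the data while preserving the raw values. Here is a minimal sketch of that idea in Python; the specific misspellings and the `CANONICAL` mapping are invented for illustration.

```python
# Hypothetical open-text responses, including typos and stray whitespace.
RAW_RESPONSES = ["Philadelphia", "Philadephia", "Philly", "philadelphia "]

# Mapping from observed variants to one canonical value (an analyst's judgment call).
CANONICAL = {
    "philadelphia": "Philadelphia",
    "philadephia": "Philadelphia",
    "philly": "Philadelphia",
}

def normalize(city: str) -> str:
    """Return the canonical spelling, falling back to the trimmed input."""
    key = city.strip().lower()
    return CANONICAL.get(key, city.strip())

# Keep both columns: the raw value for linguistic work on the typos
# themselves, the cleaned value for grouping by location.
cleaned = [{"raw": c, "city": normalize(c)} for c in RAW_RESPONSES]
```

Because the `raw` column survives, a linguistic study of the variation is still possible even after the location-based analysis has been run.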

Choosing to transform the data in any one way necessarily implies you’ve already chosen to do certain things with it, to the exclusion of others. It doesn’t matter if the changes are big (imputing missing values, removing entries) or small (removing trailing whitespace, normalizing date formats).

 

Doing Data Cleaning “Better”

Au points to 5 elements of doing data cleaning “better” (see the final paragraphs of the essay):

  1. Don’t make permanent changes
  2. Leave a paper trail of all decisions you make
  3. Fix things that make the analysis not work
  4. Reduce unwanted variance, but keep the variance you want
  5. Make sure cleaning decisions don’t introduce bias
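The first two points above can be sketched in a few lines: work on a copy so the source data is never overwritten, and record every change you make. The record structure and field names here are invented for illustration.

```python
import copy

def clean_with_log(records):
    """Return a cleaned copy of `records` plus a log of every change made."""
    cleaned = copy.deepcopy(records)   # point 1: the original stays untouched
    log = []                           # point 2: the paper trail
    for i, row in enumerate(cleaned):
        value = row["city"]
        stripped = value.strip()
        if stripped != value:
            log.append(f"row {i}: stripped whitespace from {value!r}")
            row["city"] = stripped
    return cleaned, log

source = [{"city": " Philadelphia"}, {"city": "Boston"}]
cleaned, log = clean_with_log(source)
# `source` is unchanged; `log` lists each transformation applied.
```

The same pattern extends to bigger decisions (imputing values, dropping rows): each one gets a log entry, and the untouched source file makes every decision reversible.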

 

Task

Cleaning data is important! And it is not just “something to do” that is annoying! Please take it very seriously and do your best to make sure what you have is in good shape for analysis. I can’t expect you to do expert work here, but I do want you to make the effort.

In a comment below, do one of the following:

  1. Keeping Au’s 5 steps of data cleaning in mind, look through your data set. Why is your data set in good shape according to Au’s criteria? What are potential problems? Do you notice anything you should clean up? What and why?
  2. If you are working with a database and you don’t have access to their data, answer the following: Keeping Au’s 5 steps in mind, learn as much as you can about how the data was collected, managed, and maintained. Based on what you know about this data, what possible issues might have come up while they cleaned this data so people can use it? What might seem potentially tricky that would make you nervous about using this data without being able to see it?

After commenting below, click the button to continue:


17 thoughts on “Cleaning Data / Doing Analysis”

  1. Elaine says:

    I believe my data set is in good shape in terms of showing categories for different races, tests, and cases related to Covid-19. However, I think a potential problem with this data set is that the data has already been “cleaned” by previous organizations, as the Covid Tracker gets its data from hospitals, news, etc. It seems that there is more focus on certain areas than others. I wish I could have access to the raw data set, because I could see the whole picture and any other details that might be useful for my examination. I feel limited working with this data only.

  2. LIAM SCHNEIDER says:

    I believe my data set is in good shape as a whole. Because this data is based on an application, it seems to be reported in a clearly categorized, uniform format. Though I wish the application itself had asked other questions so that I had more data to look at, there appears to be nothing for me to clean here.

  3. PRATAP THAPA says:

    My dataset is categorized into segments like race/nationality, age, sexual orientation, training, conjugal status, pay, medical coverage, and income. The data was collected from a random selection of about 70,000 individuals. However, it does not display the raw data, which prevents me from accessing the total number of people participating based on their race/nationality and respective demographics. This makes me doubt the overall dataset provided to me.

  4. Queen says:

    I think my dataset is in good shape generally. Each category represents one type of information: the percentage of detainees infected by the disease, age, race, sexual orientation. The only thing that I would like to clean up from this dataset is the “Name” category, because the analysis is public, and patients’ names are personal and not necessary to show.

  5. Arti says:

    The dataset I’m using is in good shape. It leaves a constant trail of what’s going on and how they’re reacting to the financial relief system they’ve implemented in these poor, high-risk nations. The website has a blog trail of what’s going on and how they’re changing things if need be. I think this is a great way of creating a reactive cycle and incorporating necessary changes. The data also has many categories of higher-risk individual groups within the population, so it takes it a step further in categorizing and isolating the problem.

  6. DALANDA BAH says:

    I think my database is in good shape because it seems clearly to reduce unwanted variance. My database isn’t biased at all. My database provides open and suitable access to the most sizable available database on the historical evolution of the world distribution of income and wealth within different countries.

  7. Andrea Flores says:

    I think the data set I chose is in good shape because its organization makes it easy to read and easy to understand what the data is trying to say. However, I would change the ordering of the columns, sorting from the highest search-term frequency to the lowest, in order to quickly see which terms are more important than others without going back and forth to compare and contrast the results.

  8. Liz Fadel says:

    When I looked at my original data set, it was categorized with all the five boroughs of NYC high school graduates. It was in good shape, having all the calculations of percentage dropout rates and the percentage of poverty, SWD, regents, and ELL. I noticed that there were too many years to analyze. However, I only wanted to work on a few of the categories, so I had to manipulate the data. I cleaned out the regents and SWD data, and I only worked with ELL graduates, non-ELL, economically disadvantaged, non-economically disadvantaged, and gender, from the 2015-2019 four-year cohorts. It was too much data for me to perform the analysis from 2000-2020.

  9. Natalia Bielonko says:

    I think my data set could be presented better. For a data set presented by ICE, it’s very standard. It includes the locations of field offices, confirmed cases currently under isolation or monitoring, detainee deaths, and total confirmed COVID-19 cases. It doesn’t provide information like when exactly these detainees died from COVID-19 or any possibilities of how it spread in the facilities.

  10. Gina DiGiacomo says:

    My data is in pretty good shape. The data spans the years 1988-2018. The organization that created it regularly updates the information by requesting it from the City of Chicago. Some cases (my data is on police misconduct) have not been closed yet, so as updates come in, they put them in the dataset. A potential problem is that the organization states that, although they try, they cannot guarantee that all the data is accurate and current. This is an issue that could be resolved if, while cleaning the data, they made sure the data that got published was accurate. Also, you don’t know if the organization omitted cases from the dataset that didn’t fit a certain narrative while cleaning it. You don’t want to think someone would do this, but that is possible. They also grouped every race aside from Black, Hispanic, and White into one group called “other,” and that’s tricky because it makes me wonder why they chose to do that.

  11. MINGYI YOU says:

    Although I can’t say my data set is perfect, since nothing is, I believe it looks good enough for me to analyze. Percentile is used to classify; the year is used as a time frame; the percentage is used to show the value of a group. By Au’s criteria, all the variance is related to the analysis, and it does not contain bias. However, it is just tax data, and I have to look for other resources to help me fully analyze it.

  12. MAHIMA KHANEJA says:

    Data cleaning enables an individual to become familiar with datasets and also helps in attaining exciting results. Automating cleaning operations undermines the understanding of the data gained from analysis and interpretation of the outcomes. According to Au’s article, my data is in good shape, since all the things that could make the analysis choke have been eliminated. The data lacks the countless variations that can make the results unrealistic or misleading. Furthermore, one can understand the sequence of events just by looking at the data sequences. Therefore, in such circumstances, when analyzing data, one does not get invalid input, incorrect array size, or other software errors when the analyze button is clicked. I noticed that some sections of the data need clean-up. For instance, some countries lack data on gross domestic product for some years. The data is missing because, for these nations, especially the developing ones, gathering and recording such data was hard. The exercise of recording and monitoring gross domestic product data began only a few decades ago. Cleaning the data will help in getting substantive results that are generalizable.

  13. Kimberly Barrios says:

    My data set is pretty organized. I like that the column that codes 0 = no possibility of getting heart disease and 1 = possibility of having heart disease is pretty straightforward and uncluttered. I think that, regarding Au’s criteria, the data does not contain bias, and there is an evident correlation between the factors in the rows that leads to the final point.

  14. Joseph Habert says:

    I think my data sets are pretty organized according to the criteria here. The one thing I would change personally is the inconsistency of the colors used on the bar graphs. At first a purple bar is used to represent the “White” populace, but in a later graph they use a maroon bar. It is the same for most of the races. More consistency there would be better in my opinion.

  15. SAMEER DHIMAN says:

    I think my data sets are organized well. Besides an Excel file, there is a table as well to visualize the data. The variables used for the data are good too. They use Celsius, which is the most widely accepted unit for temperature, to show how it goes up each year. It shows the years and even the months. I feel like the data has already been cleaned up and is simple enough for people to look at and not be nervous or terrified of what they’re observing.

  16. KEMBPELL PORCENAT says:

    I think my dataset is in good shape, overall. Columns and rows are appropriately labeled. Rows are sorted alphabetically and in sequential order with respect to time. Columns are given meaningful labels. The author allows you to easily track prison and violent crime statistics from each jurisdiction over time, for example. There are no spelling errors. To better clean the dataset, I would adjust the formatting style and size of the text and numerical data to be the same.

  17. Lynden L Frank says:

    My data does not live up to Au’s standards. There should be more explanation of who the people were who took out the student loans each year, a breakdown of who these people might be and where they come from. Even a racial breakdown or wage breakdown would better prepare the reader of this dataset. The potential problem is that the fork creates too many prongs, where the data becomes much harder to slice and dice because it is insurmountable.
