Things to Clean Up – ENG 4950: Data and Writing Toward Social Change, Spring 2021

See below for common issues that come up in data collection and data entry as well as approaches to cleaning. Learn more here (a bit technical at times, but not so much that you can’t get a good sense of practical approaches).

Missing data. Missing data, different choices. You might just leave it missing if there is not that many values missing overall in the row. You might enter in the mean or median. You might enter a different value that resembles a value of similar rows (e.g., hitters with about a .380 on-base percentage that hit about .250 tend to have around 150 strikeouts per year). No matter what you do, you should have a reason for doing it. You could also drop the missing data if there isn’t a lot there.
Consistency/Standardization. This can be an issue with categorical data in particular. If you have some values that are written as “Tennessee” and other values written as “tennessee”, then that could mess with your analysis. Misspellings are also a possibility (e.g., Tenessee and Tennessee). You might have a string (e.g., “six”) when the rest of the column has an integer (e.g., “6”). Could be different words that mean the same thing, too (e.g., “information technology” and “IT”). Extra spaces can be tricky. There are programs and software you can use to help with some of this stuff, but sometimes you just have to do it manually (or at least be able to identify the issue first before you automate a solution)
Accuracy. This is hard to do, but something could set off red flags or something might just be imprecise and you could do more digging. For instance, you might notice someone gave a county for a location instead of a town for where an event occurred. There might be a way where you could find the town where that event took place. Other things are more straightforward: the ‘total’ column does not match the values that make up the total, a child in the U.S. can’t be legally married, etc.
Uniformity. One big one here is that there are different units of measurement. For instance, perhaps some values of distance are in kilometers and other values are in miles.
Irrelevant Data (for us, we will talk about sorting and filtering). If you are only interested in data about health, you don’t need the column on phone numbers. If you are only interested in certain countries, you don’t need to look at all countries. There could also be issues of data included that are not relevant to a specific column and should be taken out. Always got to save a new version when you choose to take things out.
Duplicates. Sometimes things are entered more than once. This could be for a variety of reasons. One common one is the same name or ID number because the person was polled more than once at different times or previous data sources were combined. It can be hard to be sure, sometimes (e.g., if you have more than one “Mike Jones”). If all the values in the row, however, are the same, you can assume duplicity.
Outliers. Outliers are technically defined anything outside of the “interquartile range (IQR)” multiplied by 1.5. This just means the difference between the value at the 75th percentile and the value at the 25th percentile. You can kind of use the “eye test” here, though, if you look at a visualization. This can be difficult to deal with. You generally don’t want to remove outliers because it is difficult to argue that they are not representative, inaccurate, or whatever. However, if there is such a dramatic difference or if you have a good case for why the datapoint(s) just don’t fit in with the rest of the data, you can consider dropping them.
Converting things. You might want to convert categorical variables into numbers for easier calculation and for visualizing purposes (e.g., “low income = 1, middle income = 2, high income = 3”).