Filtering and Sorting

Au mentioned “exploratory data analysis.” Once your data is cleaned (and, this could be a recursive process as you learn more about your data–that is, you go back and clean again if seeing some odd things in later analysis), you gotta mess around with it to see some things that you can find. You want to see if you can get a sense of the sort of hypothesis you might have by taking a look at a variety of things in your data.

We won’t get into all of the details with this process. If you want to learn more about EDA, go here. For our purposes, we are just using this term for early investigations of our data to work toward finding the path that works for the writing we want to do this semester.

In the learning module this week, we are going to talk about doing some filtering/sorting/counting, doing some measures of central tendency, and getting a sense of the kind of distribution of values you have for certain variables.

 

Using Excel In Early Stages

There are more automated ways to do the analysis work we are going to do with csv files on this page and the next two pages after this page, but I am going to limit the amount of computer code you have to work with because I don’t want to overwhelm anyone who has no experience with programming.

So, what we are going to work with in this Learning Module for this page and the two pages after it are things you can do in Microsoft Excel, since it is a common program that many are familiar with. And, even if you are not familiar with it, it is something you have access to and something you can learn more about (at least easier than some alternatives!).

As a reminder, if you don’t have Excel on your device, you can download it here.

If you have issues with your device, let me know. There are ways to get a loaner laptop, a Chromebook, etc. [though, if you are working with a Chromebook, you will likely have to work with Google Sheets, which has very similar, but not the same, ways of doing things as Excel]

 

Filtering/Sorting/Counting

To create a filter in Excel, click on the entire column by clicking on the letter that represents the column (e.g., A, B, C) which will then highlight the whole column.

Then, on the home tab, click “Sort & Filter” and select “Filter.”

Screenshot of filter button in Excel

Once the filter is created for that column, you can click the arrow at the top of the column to see what you can filter out.

Excel screenshot of arrow to click to select things to filter in a column

A window will open up and you will see a series of check boxes that you can select or de-select. For instance, in the case of the filter I made for NYC neighborhoods in the Airbnb data set we worked with in class from February 22, I can filter out only specific neighborhoods I want to see. Like, for instance, Harlem (or Harlem and East Harlem…or whatever).

Excel screenshot of using filter on data set for only neighborhoods in column for Harlem and East Harlem

You don’t have to do this, but I want to offer another step in filtering to help keep things organized. Especially if it is something I want to keep working with, I like to select all, copy, and then paste it into a new tab in Excel. The shortcut command on a PC to select all is holding down “ctrl” and “a”. On a Mac, it is “cmd” and “a” at the same time.

Once everything is selected, copy it and then paste it in a new tab. The tabs are at the bottom of the Excel document.

screenshot of tabs at bottom of Excel file

If you click on the “+” sign you will add a new tab. I like to label it with what I filtered.

If you do paste things into a new tab, you will want to save the file as an Excel file and NOT a csv file. Csv files are not really designed to have the tab feature, so it might not save in the way you want it to.

 

Setting Standards

To help consider if a value is meaningful in terms of the story you are telling or the argument you are making, it is useful to set a standard to compare values against. In a vacuum, this would be the measure of central tendency of your sample (more on that soon!). However, many of you are not working in a vacuum, so you might need more measures.

For instance, you might compare the weather on September 17, 2019 against the mean temperature of September 17 across several decades. Some measures are adjusted by age or they use cutoffs for years (e.g., everything since 1982) for contextual reasons. Thinking about what standards or cutoffs to use depends on your getting to know your data, the context that surrounds it, and the goals of your project.

 

Task

You have three options for the task on this page.

  1. If your data set is a csv file, open it in Microsoft Excel. Look over each column, and try to find something that catches your eye. Some grouping that would be interesting to look at based on your interests in your data set. Follow the steps above (or alternative steps that you know and are more comfortable with!) and filter your data set for that grouping. Once filtered, write in a comment below what you filtered, why you wanted to filter, and anything interesting that you might have noted while glancing through what you filtered.
  2. If you are working from a database on a website, look at all of the options you have to filter things through checking checkboxes, selecting items in a drop-down menu, clicking various buttons, etc. Once you filter something, write in a comment below what you filtered, why you wanted to filter, and anything interesting that you might have noted while glancing through what you filtered.
  3. If you have a database or if you can’t quite sort/filter in the ways #1 and #2 ask for, but you want to practice with Excel, you can use the Airbnb data set we used in class and complete #1 above (just do #1 but with the Airbnb data set).

After commenting below, click on the button to continue:

Button with text that reads click here to continue

17 thoughts on “Filtering and Sorting

  1. Elaine says:

    I filtered Covid-19 cases by race, ethnicity, age, gender as well as state. I wanted to see the correlation between these characteristics and the number of cases. Doing this allowed me to see which state had more death as well as cases and which race predominantly were.

    1. Queen says:

      Hi Elaine, I also used Multi-select option to filter race, ethnicity, age, and gender of those who are infected for the same purpose since we are working on a similar dataset. By doing this, it really helps emphasizing which ethnicity is affecting the most.

  2. LIAM SCHNEIDER says:

    I used the filter on the airbnb data set to look at full rooms/ apartments that were listed on airbnb for a larger majority of to the full year. I did this to try and find out how many of these apartments were being listed on airbnb specifically for rental.

  3. PRATAP THAPA says:

    I used a filter function on the airbnb dataset to sort the data on the entire home/app and privates room based on the location. I found out that entire home/apt is more popular in East Harlem while Private room in Harlem location.

  4. Arti says:

    I filtered the designation of the funding, it’s important to see which sectors of the society is being funded for COVID-19 relief. I thought this was essential because it specifically shows where money was being allocated into certain countries and what reaction this would leave and if changes need to be made.

  5. DALANDA BAH says:

    I filtered on the airbnb data set to see the reviews between private room and a entire home/apt because I just simply wanted to see the reviews and see which is better. While glancing through what I filtered I found out that private rooms is have better reviews.

  6. Andrea Flores says:

    I filtered on the data set that I chose, and I filtered the day when the term was searched with at its highest frequency: anxiety in February 2020, depression and panic attack in march 2020, ocd in December 2020, insomnia in may 2020, mental health October 2019, counseling and psychiatrist in January 2020.

  7. Liz Fadel says:

    For the dataset to be useful in the analysis, it had to be cleaned of the attributes that were not relevant to answering the question. The dataset was filtered by cohort to allow every cohort to display only 4 year August graduates. District 79 and non-numerical data were removed as they did not suit the extent of the view; Regent scores, TASC, SACC were removed because they were needless to the question. Next, the graduation classes were filtered to include cohorts 2011-2015. The final dataset displayed 4 years August graduates, Cohorts 2011-2015, five boroughs, economically disadvantaged, non-economic disadvantaged, gender, ELL, and Non-ELL. Based on this data, it was obvious that the variables that express a positive relationship are the economically disadvantaged, non-economically disadvantaged, male, female, and Non-ELL.

  8. Gina DiGiacomo says:

    Using my data, I filtered the columns “Allegation Type” by False Arrest Allegations and the race and gender of the category “Complainants”. I did this because I was curious to see what group of men report more cases of false arrest. It was interesting that the data shows that black men reported 2,500 false arrest reports in Chicago and white men only reported 377 false arrest reports.

  9. MINGYI YOU says:

    I have tried the filter function with the Airbnb data set. By filtering room_type column to the entire room, private room, and shared room. I can see that 1522 out of 3775 people choose the entire room, 2159 out of 3775 people choose a private room, while 94 out of 3775 people choose the shared room. The result shows that most people would like to live in places that are owned by themself. Therefore, the room would not be disrupted or seen by others. Also, by filtering this out, I can see what kind of room people want.

  10. MAHIMA KHANEJA says:

    I filtered the gross domestic product data by clicking on it since it was one of the variables I needed for analysis. I ignored other options that were not required in this exercise. I also selected all countries option to access data for all nations. I noted that the database provides various options that users can click to get the data they require. The database is also fast at generating the datasets after the filter checkboxes have been checked. These two were the most exciting things I noted when filtering the data.

  11. Kimberly Barrios says:

    With my data set I filtered the age column and the last “target” column that equaled 1 which means they are susceptible to getting or already have heart disease. This way, I can clearly see if there is a correlation between age and the end result. I saw that almost half the people were in the “1” target group but the age groups were scattered. I should soon be able to see the specific age range once I put them in order.

  12. Joseph Habert says:

    For the Airbnb data set I decided to filter by column K, which is minimum nights. I noticed that the smaller the number their is, the larger the amount of available rooms their are to an immense degree.

  13. KEMBPELL PORCENAT says:

    In my dataset I filtered the property crime total column. I was able to isolate which jurisdiction accumulated the least property crime offenses. Vermont and Wyoming were the lowest.

Comments are closed.