Milestone 4 – ETL Process

Meeting several times a week, we have completed the ETL process on Pentaho (Below are an example of two of our dimensions). As we are working through this, we have been taking screenshots of our ETL and documenting for our final report.

Below is the fuzzy match because we did not have a unique identifier to connect the two datasets. we used the combination of restaurant names and addresses from both the Yelp business and Nevada’s inspection data sets. We set the accuracy measure to 90%. 

Below are just a few of our dimensions we created on Pentaho.

Address Dimension:

Date dimension:

 

Our next step is to work on the analytics and dashboard. We have connected our Oracle cloud to Tableau and working through our KPIs.

Milestone 2 – Yelp Ratings and Inspection in Nevada Data Source and KPI

Yelp Data:

https://www.kaggle.com/yelp-dataset/yelp-dataset?select=yelp_academic_dataset_review.json

Restaurant Inspection:

https://data.world/lasvegasnevada/restaurant-inspections/workspace/file?filename=restaurant-inspections-1.csv

Violation Code (with explanation):

https://opendataportal-lasvegas.opendata.arcgis.com/datasets/restaurant-inspection-violation-codes/data?page=6

 

January 2017 – November 2017 Data 

Purpose: To identify if 2017 restaurant inspections in the Nevada Area have any affect on their Yelp ratings.

KPIs:

  • Name of restaurants with the highest percentage change in monthly average review stars before and after the inspection.
  • Name of restaurants that had the largest percentage in monthly average “useful” review stars before and after the inspection.
  • What is the average review (per restaurant) and frequency of inspection (monthly).

Grain:

The review table has a grain of: day, business, review, useful

The inspection table has a grain of: day, grade, inspection result

Dimensions:

Date (date, month), Location (address, city, state, postal code), Business (business_id, name, category), Review Text (Junk Dimension), Inspection Violations (Violation code, explanation)  

Facts:

  • review fact table (transaction fact table) (xx  located at xx received a xx review star on xx date.)
  • Inspection_grade fact table (Factless fact table) (xx located at xx received an inspection grade of xx with the inspection result of xx and xx violations codes on xx date.)

Dimensional Model Diagram:

https://drive.google.com/file/d/1QK4d7ya1AR7kcnf3pP2N1zVmEfmmBIS1/view?usp=sharing

Image of Dimensional Model Diagram:

https://drive.google.com/file/d/1cs_c-j0j-lXrURQsNnSKEgQTCD7FhFK-/view?usp=sharing

Milestone 1 – Project Ideas

JMS

CIS 9440- Course section 53483

Maliha Maliha: [email protected]

Junhao Zhang: [email protected]

Stephany Tong: [email protected]

Ideas:

1- COVID affecting employment based on industry

The project aims to focus on COVID-19 data set including cases and deaths based on the different counties in New York State along with the census labor data which includes employment for different industries and the wages for different time periods. 

Our goal is to analyze both data sets and gain insights into how this pandemic has affected different industries across New York state. The questions we would like to answer are:

  1. Which industry was the most affected based on the number of layoffs and which industry was the least affected?
  2. Is there a significant relationship between employment and COVID-19 cases/ deaths based on different industries across different time periods?
  3. Which county was the most affected in New York state?
  4. For the industries that laid off employees, were there substantial wage cuts?
  5. Did the industries that laid off employees start hiring again when New York state began lifting the restrictions?

Tentative Data Sets:

https://catalog.data.gov/dataset/quarterly-census-of-employment-and-wages-quarterly-data-beginning-2000/resource/5d662b63-78ee-49f6-8b8a-6d2f8bf32d57 (Employment by industries and on a county level for New York)

https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv (US county wise COVID data)

2- COVID and NYPD complaints

The project aims to focus on the NYPD complaints during the COVID-19 pandemic to identify the impacts of staying at home order.

Our goal is to analyze both data sets and get the insight into the below questions:

  1. Which type of complaints happen most frequently?
  2. Which county has the most complaints?
  3. Does the number of complaints have a positive relationship with the number of COVID cases?
  4. What type of complaints increase mostly, and what is the relationship with the number of COVID cases?

Source data:

NYPD Complaint Data Current (Year To Date)

https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Current-Year-To-Date-/5uac-w243/data

Coronavirus-data by-boro.csv

https://github.com/nychealth/coronavirus-data/blob/master/boro/boroughs-case-hosp-death.csv

3- COVID cases & import/export goods/ services in the US

The project focuses on COVID-19 cases and deaths by state within the US. Based on that data, what goods and services did the US import/ export most during the peak times? 

We would also like to answer the following questions: 

  1. When and where were COVID cases the most concentrated?
  2. What goods or services were mostly imported / exported during the peak time?
  3. How was import and export impacted by COVID?  
  4. What goods were being imported pre-COVID? 
  5. Based on this information, can businesses in the US start producing these products to promote domestic goods? This should result in increasing employment within the US. 

Potential Data Sources: 

US COVID Cases & deaths by state

https://catalog.data.gov/dataset/united-states-covid-19-cases-and-deaths-by-state-over-time-845b7

International trade in goods and services 

https://catalog.data.gov/dataset/us-international-trade-in-goods-and-services

Import export price indices 

https://catalog.data.gov/dataset/import-export-price-indices

COVID-19 Related Goods: US Imports & Tariffs 

https://catalog.data.gov/dataset/covid-19-related-goods-u-s-imports-and-tariffs-updated

4- Yelp Rating & Restaurant Inspection in the Nevada Area

The project focuses on the 2017 Yelp Review & Restaurant Inspection in the Nevada Area. Based on that data, it is to identify if 2017 restaurant inspections in the Nevada area have any effect on their Yelp ratings.

We would also like to answer the following questions: 

  1. What is the relationship between inspection and rating?
  2. What is the change in average rating before and after the inspection?

Potential Data Sources: 

Yelp Data:

https://www.kaggle.com/yelp-dataset/yelp-dataset?select=yelp_academic_dataset_review.json

Restaurant Inspection:

https://data.world/lasvegasnevada/restaurant-inspections/workspace/file?filename=restaurant-inspections-1.csv