Ideas Proposed by the NYC Government


Below are the project ideas proposed by the NYC Government for the 2018 CUNY-IBM Watson Case Competition.


The potential of IBM Watson AI technology in understanding how substance abuse impacts child welfare cases across the city. (TEMPORARILY UNAVAILABLE)

The New York City Administration for Children’s Services, Division of Child Protection (DCP) is charged with investigating all allegations of child abuse and maltreatment that the City receives from the New York Statewide Central Register of Child Abuse and Maltreatment. Each year the division investigates about 60,000 reported cases. The division comprises approximately 3,500 employees spread across the five boroughs at nineteen sites.

Over the last year, DCP has received a number of reports of alleged parental drug use. There have also been a number of high-profile cases and child fatalities that involved drug abuse concerns. Parental substance abuse is recognized as a risk factor for child maltreatment and child welfare involvement. Children who come from homes where parents use drugs are more likely to suffer from abuse and neglect, which may result in out-of-home placements.

Identifying substance abuse and meeting the complex needs of parents with substance use disorders and those of their children can be very challenging. The Division of Child Protection would like the CUNY/IBM project to explore the potential of IBM Watson AI technology to help DCP personnel identify potential patterns of parental drug use by analyzing the structured data (Connections) and unstructured data (progress notes) available to DCP, and to improve the Division’s understanding of how substance abuse impacts child welfare cases across the city.

DCP’s goal is to use the results of analyzing the available data to determine the services needed to address substance abuse concerns for parents and children, develop training for Child Protective Services (CPS) workers, improve coordination between service providers, and develop policies that address the child welfare and substance abuse treatment systems.



PROJECT: 2020 Census Challenge, Mayor’s Office of the Chief Technology Officer 

The results of the 2020 Census count will “define” New York City for the next decade. Census results determine the number of seats each state has in the House of Representatives. These results are used to draw political districts at federal, state and local levels, and affect the distribution of billions of dollars of federal funding to local communities for infrastructure and vital services like hospitals and schools.

Getting New York’s fair share of political representation and federal resources will hinge upon the willingness of residents to participate. Given the dynamic nature of its population, varied and complex housing, and the presence of a wide variety of racial, ethnic and national origin groups, the hurdles to a complete and accurate enumeration in New York City have been and continue to be formidable. An undercount in the Census damages our understanding of the city and its neighborhoods by underrepresenting specific groups, such as young children, members of minority groups, and immigrants. While there will always be logistical hurdles when taking a Census, like language barriers and issues involving the place of residence, it is the fear and distrust of government, concerns about privacy, and the plain lack of information on why the Census is important that can compromise response. Heightened tensions over federal policies have only served to exacerbate these problems, despite guarantees that individual responses on the Census are confidential.

The purpose of this project is to increase the participation of New Yorkers in the 2020 Census, yielding a more accurate count of the City’s population. In particular, ideas should focus on increasing Census completion among groups that are most often undercounted (lower-income, minority Black/Hispanic communities). Innovative solutions should leverage IBM Watson AI to increase Census response both initially, through the mail and the Internet, and in Census Bureau operations aimed at obtaining responses from those who failed to respond initially.

The 2020 Census Challenge is too complex and resource-intensive to be completed by any individual team in the limited time available during the CUNY-IBM Watson Case Competition. However, there are a number of specific issues related to the 2020 Census Challenge that could be addressed during the competition. Below is a set of questions, each identifying a potential project:

Idea 2.1. How might Watson AI encourage more people to respond to the Census online in 2020?

Idea 2.2. Are there tools, such as chat bots, that can be developed to better educate residents on the benefits of completing the Census and on the negative impact on the City if they are undercounted?

Idea 2.3. What steps can be taken to understand and quantify the sentiments of immigrants and other vulnerable groups toward the Census, to help inform our outreach strategy (e.g. sentiment analysis based on immigration legislation)?

Idea 2.4. How might we make the Census more accessible and understandable to residents who speak English as a second language?

Idea 2.5. How might we use unstructured, publicly accessible social media data to forecast which communities are less likely to respond to the Census and why?

Idea 2.6. How might we connect the issues residents care about (e.g. funding for local schools, improvements to buses & trains, greater access to healthier food) to the results of the Census (e.g. automated response system showing how Census funding impacts specific needs)?

Idea 2.7. How might we better track the costs and benefits of the City’s outreach for the 2020 Census? (Ultimately, the City would like to determine its return on investment, ROI.)

Background & Data Sources:



Using Machine Learning Methods to Improve Whole Genome Sequencing Quality Control Analysis (TEMPORARILY UNAVAILABLE)

The New York City (NYC) Department of Health and Mental Hygiene (DOHMH) recently introduced Whole Genome Sequencing (WGS) technology at the Public Health Laboratory (PHL). WGS is a rapidly growing field of public health infectious disease control, providing high-resolution disease information for surveillance, outbreak investigations, antimicrobial resistance detection, and other clinical and molecular testing methods.

To ensure satisfactory WGS data for genomic analysis, a large amount of quality control (QC) data is generated, manually reviewed, and submitted for further analysis. While the QC data is mostly structured statistical data, interpreting whether data quality is acceptable is time-consuming and subjective, as it depends on a combination of different quality metrics.

Over time, machine learning methods will improve automation and objectivity in WGS QC analysis. For this project, the students participating in the CUNY-IBM Watson competition will employ machine learning techniques to accurately predict whether WGS data quality is adequate for data submission, ultimately improving analytical turn-around time and minimizing inadvertent sharing of poor-quality data with collaborators. These developments could benefit other PHL analytical procedures, thus improving data for surveillance and outbreak investigations.


IDEA 4. 

Improving Disease Surveillance Facility and Provider Matching (2 team limit)

The New York City Department of Health and Mental Hygiene (DOHMH) currently has a process to match the providers and facilities named on incoming laboratory and provider case reports against a master list. The information in these reports is notoriously incomplete and dirty. The current automated process often cannot make a determination, requiring a manual review by members of the disease surveillance teams, which takes valuable time away from other surveillance issues. Not having this information can hinder case investigation, as it is unclear where the patient was treated and how to reach the providers involved in the person’s care. A more robust system would speed case processing and get the proper care to those in need. NYC DOHMH seeks a system to quickly and accurately match provider and facility information from the incoming laboratory and provider reports for the disease surveillance teams at the department.
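
To make the matching task concrete, here is a minimal Python sketch of one possible approach: normalize a reported facility name, score it against every master-list entry with the standard library's `difflib.SequenceMatcher`, and accept the best candidate only when it is both strong and unambiguous. The facility names, the function names `normalize` and `best_match`, and the `accept` and `margin` thresholds are all hypothetical illustrations, not part of any DOHMH system, and a real solution would likely also use addresses, phone numbers, and identifiers.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def best_match(reported: str, master: List[str],
               accept: float = 0.85, margin: float = 0.05) -> Optional[str]:
    """Return the best master-list entry, or None to flag the record for manual review.

    `accept` and `margin` are illustrative thresholds (assumptions), not tuned values.
    """
    target = normalize(reported)
    scored = sorted(
        ((SequenceMatcher(None, target, normalize(m)).ratio(), m) for m in master),
        reverse=True,
    )
    if not scored:
        return None
    top_score, top_name = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Accept only a strong match that clearly beats the next-best candidate.
    if top_score >= accept and top_score - runner_up >= margin:
        return top_name
    return None


# Hypothetical master list and reports, for illustration only.
MASTER = ["Northside Medical Center", "Riverside Family Clinic", "Harbor View Hospital"]
print(best_match("NORTHSIDE MED. CENTER", MASTER))
print(best_match("Unknown Lab", MASTER))
```

Returning `None` rather than a weak guess keeps uncertain records in the manual review queue, so the sketch reduces reviewer workload without silently introducing wrong matches.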



Provider Communication Improvements (2 team limit)

The New York City Department of Health and Mental Hygiene (DOHMH) is in regular communication with health care providers, giving information in emergency situations and disseminating news of public health import. Thus, maintaining an accurate, updated contact list is of vital importance. This is a complicated task, as the list is the result of consolidating various distribution lists from across and outside the Agency. These lists might have conflicting information, be updated at different times, and vary in accuracy. The data from the disparate sources need to be matched and cleaned appropriately to produce a universal provider contact database. DOHMH seeks a team to develop a cognitive computing solution that can: 1) reconcile providers between the different lists to create a common “gold standard” list of providers; and 2) automatically match new information to this “gold standard” as new data enters the Agency. DOHMH would also like to better understand the gaps in the current lists to guide outreach and better coverage.
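
The two goals above (a reconciled list and a view of coverage gaps) can be sketched in Python under a strong simplifying assumption: that a normalized e-mail address identifies a provider, and that the most recently updated record wins a conflict. The `Contact` fields, the list names, and the functions `reconcile` and `coverage_gaps` are hypothetical; real distribution lists would likely need fuzzier keys than e-mail alone.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Set, Tuple


@dataclass
class Contact:
    name: str
    email: str
    phone: str
    updated: date
    source: str  # which distribution list the record came from


def reconcile(*lists: List[Contact]) -> Tuple[Dict[str, Contact], Dict[str, Set[str]]]:
    """Merge lists into one record per normalized e-mail, keeping the newest record."""
    gold: Dict[str, Contact] = {}
    sources: Dict[str, Set[str]] = {}
    for contacts in lists:
        for c in contacts:
            key = c.email.strip().lower()
            sources.setdefault(key, set()).add(c.source)
            if key not in gold or c.updated > gold[key].updated:
                gold[key] = c
    return gold, sources


def coverage_gaps(sources: Dict[str, Set[str]]) -> List[str]:
    """Return providers that appear on only one source list."""
    return sorted(key for key, seen in sources.items() if len(seen) == 1)


# Hypothetical records, for illustration only.
emergency = [Contact("Dr. A Example", "A.Example@clinic.example", "555-0100",
                     date(2017, 1, 5), "emergency_list")]
newsletter = [
    Contact("Dr. Alex Example", "a.example@clinic.example ", "555-0199",
            date(2018, 2, 1), "newsletter_list"),
    Contact("Dr. B Sample", "b.sample@clinic.example", "555-0111",
            date(2016, 6, 1), "newsletter_list"),
]
gold, sources = reconcile(emergency, newsletter)
print(gold["a.example@clinic.example"].phone)
print(coverage_gaps(sources))
```

New data entering the Agency could be passed through the same `reconcile` call together with the existing gold records, so a newer record replaces an older one automatically.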



Adding National Provider Identifier to the Provider Data Warehouse (2 team limit)

The New York City Department of Health and Mental Hygiene (DOHMH) maintains a provider data warehouse that holds contact information for providers in the agency. The Informatics team would like to enrich this data with information from the National Provider Identifier (NPI) list. The National Provider Identifier list contains millions of records for providers throughout the country. The project would be to identify records of interest to the New York City DOHMH, match them to records in the department’s current data warehouse, and verify that the information provided by the NPI system is accurate. This project requires the management of large datasets and validation using a large number of sources that would be impossible to manage manually.
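
One inexpensive validation step that could run before any matching is a structural check of the identifier itself. An NPI is a 10-digit number whose last digit is a Luhn check digit computed over the number prefixed with `80840`. The Python sketch below implements that check; the function name `is_valid_npi` is a hypothetical helper, and passing the check only shows that a number is well formed, not that it belongs to the provider in question.

```python
def is_valid_npi(npi: str) -> bool:
    """Check the format and Luhn check digit of a 10-digit NPI."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    # The check digit is computed over the NPI prefixed with "80840".
    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(is_valid_npi("1234567893"))  # True
print(is_valid_npi("1234567890"))  # False
```

Records that fail this check could be set aside early, leaving the more expensive matching against the data warehouse for identifiers that are at least well formed.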