Dataset Cleanup: Consolidating Our Inventory

 
Figure: A visual breakdown of this year’s dataset consolidation

The goal of NYC Open Data is to take the complex information generated by the operations of the largest municipal government in the country and make it accessible to anyone. After fifteen years of publishing data, we now offer approximately 3,000 datasets from more than 90 different City agencies, making it possible to learn about virtually every facet of City life in just a few clicks. But as the number of datasets grows, it becomes harder to tell them apart and find the one you’re looking for. Meeting this challenge motivates our efforts both in vetting new data and in cleaning up our existing inventory. Over the past year, our team worked with agencies to consolidate 524 separate datasets into just 34 that contain the same information.

This effort builds on an ongoing project to improve the quality of the data that we make available. That work has included standardization of metadata and data dictionary templates, a dashboard to track dataset updates, and a data quality checklist and review process. The same philosophy is reflected in changes to our reporting for the Mayor’s Management Report (MMR). In recent years we have moved away from counting the total number of datasets published as a key metric, which could inadvertently incentivize publication for its own sake, and instead report the total number of rows available, which captures both the quantity of information and the success of automated updates.

Most recently, we’ve turned our attention to consolidating existing datasets that cover similar topics. A group of datasets might be identified for consolidation if there are separate reports on the same topic for consecutive years or months, or for different locations. Once candidates are identified, their structures, especially their level of aggregation and number of columns, are evaluated to determine whether they can readily be combined. In some cases, even datasets with slightly different structures can be combined, keeping the most granular information and removing redundant columns. For example, a 2022 Bronx dataset and a 2023 Queens dataset on the same topic can be combined into a single table that also allows additional data to be readily added in the future.
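
A minimal sketch of this kind of combination, in Python, might look like the following. It assumes that the year and borough previously lived only in each dataset's title, so they are turned into ordinary columns of the combined table. The `consolidate` helper, the column names (`site`, `visits`, `report_year`), and the sample values are all hypothetical, chosen only for illustration; they are not the actual schema or tooling behind any dataset.

```python
from typing import Dict, FrozenSet, List

Row = Dict[str, object]


def consolidate(sources: List[dict], redundant: FrozenSet[str] = frozenset()) -> List[Row]:
    """Stack several same-topic tables into one long table.

    Each source supplies the year and borough that used to be implied by
    the dataset title; they become regular columns in the result.
    Columns named in `redundant` are dropped.
    """
    # Build the union of column names, in first-seen order.
    columns: List[str] = ["year", "borough"]
    for source in sources:
        for row in source["rows"]:
            for name in row:
                if name not in columns and name not in redundant:
                    columns.append(name)

    combined: List[Row] = []
    for source in sources:
        for row in source["rows"]:
            # Columns missing from a given source are filled with None.
            merged: Row = {name: row.get(name) for name in columns}
            merged["year"] = source["year"]
            merged["borough"] = source["borough"]
            combined.append(merged)
    return combined


# Hypothetical inputs: two small tables with slightly different structures.
bronx_2022 = {
    "year": 2022,
    "borough": "Bronx",
    "rows": [{"site": "A", "visits": 120}, {"site": "B", "visits": 85}],
}
queens_2023 = {
    "year": 2023,
    "borough": "Queens",
    # report_year repeats information that the new year column already holds.
    "rows": [{"site": "C", "visits": 140, "report_year": 2023}],
}

combined = consolidate([bronx_2022, queens_2023], redundant=frozenset({"report_year"}))
for row in combined:
    print(row)
```

Because year and borough are now columns rather than part of a title, a later release (say, another borough or year) can be added as new rows instead of as a new dataset.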

After an initial assessment, we consult with the Open Data Coordinator whose agency manages the dataset, and often with staff who have expertise on the topic that the data covers. If consolidation is indeed possible, the structure and documentation are updated and all changes are tracked in the data dictionary’s change log. We also typically use this work as an opportunity to closely review the dataset’s documentation, often adding more context to the data dictionary and updating it to our latest template – just as we would for a brand-new dataset. Of the 524 datasets consolidated, we combined:

  • 268 Department of Education datasets into 10
  • 60 Department for the Aging datasets into 5
  • 49 Office of Technology & Innovation datasets into 6
  • 45 Mayor’s Office of Operations datasets into 2
  • 34 Department of Youth & Community Development datasets into 1
  • 30 Department of Probation datasets into 2
  • 38 datasets from smaller efforts across other agencies into 8
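
The figures in this list can be checked against the overall totals with a short tally. The numbers below are copied from the list above; the variable names are only illustrative.

```python
# (agency, datasets before consolidation, datasets after), as listed above.
consolidations = [
    ("Department of Education", 268, 10),
    ("Department for the Aging", 60, 5),
    ("Office of Technology & Innovation", 49, 6),
    ("Mayor's Office of Operations", 45, 2),
    ("Department of Youth & Community Development", 34, 1),
    ("Department of Probation", 30, 2),
    ("Other agencies", 38, 8),
]

total_before = sum(before for _, before, _ in consolidations)
total_after = sum(after for _, _, after in consolidations)

print(total_before, total_after)  # 524 34
```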


The result of this work is a better experience for our users and an easier-to-manage inventory for our team and City colleagues. For anyone using Open Data, fewer datasets on the same topic means it’s easier to find the right one. And with more data in the same place, it’s easier to start an analysis without first figuring out which datasets cover the same topic and how they might be joined. This consolidation also streamlines the dataset update process that our team manages, since updating an existing dataset is easier than creating a new one, and keeping the dataset current can often be automated.

Open Data tells an important story about New York City by sharing some of the daily workings of its government. As custodians of this resource, we will continue to share new datasets about the latest work of City agencies. This past year, we worked with those agencies to publish 97 new datasets. You can read about some of them in our highlights. We will also continue to improve the quality of existing datasets – adding more documentation, consolidating where possible, and sharing the most granular information that we can. If you have any questions about this work, or notice an opportunity to improve our data, please let us know at nyc.gov/askopendata!