Updates to Data Quality and Documentation Standards


 

The mission of NYC Open Data is to take the complex information generated by the operations of the largest municipal government in the country and make it accessible to anyone. After more than a decade of publishing, NYC Open Data offers approximately 3,000 datasets from more than 90 different City agencies and offices. As the data has proliferated, it has become possible to learn about virtually every facet of City life, from health and transportation to the environment and finance, in just a few clicks.

NYC Open Data datasets vary as much as the functions, missions, and sizes of the agencies that produce them. Some datasets undergo rigorous internal quality assurance checks before they’re finalized, while others are reviewed as historical snapshots long after they’re logged. Agencies’ internal data documentation follows a similarly wide variety of standards and practices. Improving the ability of users to discover datasets and distinguish between similar ones is critical to providing a single, authoritative source for NYC data.

Starting in 2020 and continuing through 2021, our team met with Open Data Coordinators in NYC agencies, and with open data programs in other cities, to discuss how they have managed their data documentation and ensured they are sharing high-quality datasets. After testing a revised data dictionary template with some new datasets, we held a public workshop at Open Data Week 2022’s School of Data community conference where we asked participants to identify gaps in current documentation standards. Ultimately, we created two new documents: a comprehensive quality assessment checklist, and a simplified and improved data dictionary template.

The quality assurance checklist acts as a “self-assessment” for agencies to review datasets internally. This represents a significant change in process for both agency staff and our team. When Open Data first began publishing datasets, the priority was to build up our inventory by making more data available and doing so quickly. More than a decade later, this self-assessment checklist prompts everyone to pay attention to elements that make data accessible and ensure we’re publishing responsibly: Does the dataset have a single level of aggregation? Is the information as granular as possible? Should the data be integrated with an existing dataset on the same topic? Does the dataset contain any duplicate records?
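Two of the checklist questions above, whether a dataset contains duplicate records and whether each row describes a single unit, lend themselves to a quick automated check. The following is a minimal sketch under stated assumptions, not the checklist's actual tooling: the function names, the sample rows, and the `tree_id` column are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def find_duplicate_records(rows):
    """Return fully identical records that appear more than once, with counts."""
    # Sort each row's items so that column order does not affect the comparison.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {record: n for record, n in counts.items() if n > 1}


def find_repeated_keys(rows, key_columns):
    """Return key values shared by more than one row.

    If every row is meant to describe one unit (one tree, one inspection),
    a repeated key hints at duplicates or mixed levels of aggregation.
    """
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample data, invented for this example.
rows = [
    {"tree_id": "1", "borough": "Bronx", "status": "Alive"},
    {"tree_id": "2", "borough": "Queens", "status": "Alive"},
    {"tree_id": "2", "borough": "Queens", "status": "Alive"},
    {"tree_id": "3", "borough": "Queens", "status": "Dead"},
]

duplicates = find_duplicate_records(rows)
repeated = find_repeated_keys(rows, ["tree_id"])
```

A check like this can only flag candidates for review. Deciding whether a repeated key is an error or a legitimate feature of the data still requires someone who knows how the dataset is produced.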

Every agency looking to publish a new dataset now receives written feedback from our team and, if necessary, additional guidance and coaching. The process of preparing a dataset for publication often provides an opportunity to think about how the data is used internally, and to discuss how it might be optimally structured for broader consumption. While this new quality assurance process can mean that new datasets take longer to publish, the feedback we share with agencies is based on common standards to both simplify the review and ensure that published data is more readily usable regardless of its origin.

Data documentation is as crucial as the quality of the data itself to ensure that Open Data datasets can be interpreted and used by the public. Particularly given the scale of NYC’s government and the variety of its responsibilities, each dataset needs clear documentation that communicates the intricacies behind it in plain language. While some users might have extensive knowledge of the topic they are exploring, we should not expect anyone to be an expert in order to understand a dataset.

The new data dictionary template incorporates the lessons learned from NYC’s 2018 Metadata for All initiative, which was intended to capture richer and more complete context about each dataset. The new template simplifies the process of using and updating documentation by integrating the nuanced information contained in the formerly separate “user guide” within the data dictionary itself. Some little-used fields have been removed (like one to note if the order of columns in the dataset differs from the order in the dictionary) because the improved quality assurance process has made them unnecessary. Other fields have been added, like one to differentiate between how frequently an Open Data dataset is updated versus how often the source data is changed, and one to explain more about the government action that results in an update to the dataset (for example, a tree is planted, a building is inspected, or a street is repaved). Finally, the new template includes a variant for documenting relational datasets that better communicates how to use the different datasets together.
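To make the added fields concrete, here is a rough sketch of how a data dictionary along these lines could be represented and checked for gaps. It is an illustration only: the class names, field names, and example values are assumptions made for this sketch and are not the actual layout of the NYC Open Data template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnEntry:
    """One documented column (hypothetical structure)."""
    name: str
    description: str


@dataclass
class DataDictionary:
    """Dataset-level documentation; all field names are illustrative."""
    dataset_name: str
    dataset_update_frequency: str  # how often the published dataset is refreshed
    source_update_frequency: str   # how often the underlying source data changes
    update_trigger: str            # government action that produces a new record
    columns: List[ColumnEntry] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """List dataset-level fields that were left blank."""
        required = {
            "dataset_update_frequency": self.dataset_update_frequency,
            "source_update_frequency": self.source_update_frequency,
            "update_trigger": self.update_trigger,
        }
        return [name for name, value in required.items() if not value.strip()]

    def undocumented_columns(self) -> List[str]:
        """List columns whose description was left blank."""
        return [c.name for c in self.columns if not c.description.strip()]


# Hypothetical example values, invented for this sketch.
dictionary = DataDictionary(
    dataset_name="Street Tree Plantings (example)",
    dataset_update_frequency="Weekly",
    source_update_frequency="Daily",
    update_trigger="",
    columns=[
        ColumnEntry("tree_id", "Unique identifier for each planted tree"),
        ColumnEntry("planted_date", ""),
    ],
)

gaps = dictionary.missing_fields()
blank_columns = dictionary.undocumented_columns()
```

Keeping the dataset's update frequency separate from the source data's update frequency mirrors the distinction the template draws, and a blank `update_trigger` is the kind of gap a reviewer would want flagged before publication.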

While all datasets published after July 1, 2022 follow these new standards, there are thousands of older datasets with varying documentation and data formatting. We’re still determining the best way to bring all of these older datasets up to the current template and standard. For now, our team is working with Open Data Coordinators (ODCs) to move data dictionaries to the new template and fill in any documentation gaps as substantial data updates are made. We are also exploring how an automated assessment of data quality could work.

Building together and sharing knowledge is a cornerstone of NYC Open Data. If you have any suggestions or feedback on the quality assessment checklist and standards, the data dictionary template, or what we might do to continue improving NYC’s data, please let us know!

