Introduction

Dear New Yorkers,

If you are curious enough about how New York City government works that you have made it here to our annual report, you probably already know that NYC Open Data is a big deal: it’s a massive collection of thousands of datasets about the large government of the mega-city that we call home. What you may not know about – and perhaps what you’ve come here to learn – is all of the behind-the-scenes work that goes into the creation of each of those datasets.

First, someone has to realize that information a City agency is collecting – about people they interact with, their assets, or their activities – is, or should be, a dataset. That could be a subject-matter expert at that agency who has been working in that area for years, or it could be a curious New Yorker like you, struggling to find an answer to a question and reaching out to our Open Data Help Desk. Other datasets originate from this report itself, which is the culmination of an internal inventory process. That inventory process has been expanded this year to connect public datasets to Mayor’s Management Report (MMR) performance indicators and to identify gaps where the data used to create those indicators could still be shared.

Once a dataset is identified, an agency works internally to figure out how best to structure it for public consumption. This can mean some data detective work to explore how a decades-old system stores data in a mainframe, or piecing together a collection of spreadsheets into a coherent picture of a government operation. Protecting New Yorkers’ privacy is an essential part of determining a dataset’s structure – this can involve removing certain fields, grouping similar values together, or even aggregating individual-level data into summaries. With the variety of skills involved, structuring a dataset can bring together program staff, data analysts, data engineers, and lawyers, all contributing their expertise to help construct the public dataset.

Even the best-designed dataset needs thorough documentation to make it usable. The job of creating documentation can be thought of as data translation: capturing the knowledge that someone has accumulated after years of working on a topic, along with as many of the nuances of making use of the data as possible. Like structuring a dataset, writing documentation is ideally a team effort, and given the public audience, staff who otherwise work on public communications are often involved here.

Before any dataset is released to the public, it is reviewed at least three times: by the Open Data Coordinator or team of the agency that is releasing it, and by at least two people on the central NYC Open Data Team – all using a standard checklist. These layers of review help to capture concepts that might need more explanation and to catch errors or inconsistencies in the data. Our goal is to channel a curious New Yorker and stress-test the dataset and its documentation.

And while publishing a dataset marks the end of one process, it is the beginning of another: each question can be used to improve data documentation, every new user is an opportunity to find an error or improve data quality, and every analysis or tool can inspire someone else to find their own insights in the data. We hope you will continue the journey of building NYC Open Data with us – whether that means picking up the basics at an Open Data Ambassadors class, asking questions about an existing dataset, suggesting a new dataset for publication, or sharing your work at the annual Open Data Week festival.

Sincerely,
The NYC Open Data Team