While non-electronic information resources, such as physical artifacts, are not subject to the Open Government Data principles, it is always encouraged that such resources be made available electronically to the extent feasible.
Primary data is an important aspect of compliance with the Open Government Data principles. All too often, audio, video, and images are made available to Internet users only at low resolution, making the data impossible to use in any professional application. A format chosen as an appropriately "low" resolution yesterday begins to look unusable by today's standards. If an entity chooses to transform data by aggregation or transcoding for use on an Internet site built for end users, it still has an obligation to make the full-resolution information available in bulk, both for others to build their own sites with and to preserve the data for posterity.
Just as one should not destroy information by presenting and preserving only low-resolution imagery, numeric or tabular data should not be aggressively aggregated for use in one particular Internet application at the cost of throwing away public information that could otherwise be used.
The determination of what is an acceptable level of granularity to present and preserve is a moving target and should be based on best practices of the time, with a heavy bias towards "more is better."
What is reasonable depends on the nature of the data set. As an example, when the data is a record of ongoing events, is relevant to current policy debate, or is otherwise time sensitive, a delay of more than one month is not acceptable. On the other hand, geographic data collected for purposes independent of any current policy debate, for example, may reasonably be released periodically in bulk.
Newly updated complete data sets should be provided in a timely manner as well. Time-sensitive data sets should be updated at the same frequency with which the data changes.
When individual records change, notices of the changes should also be made available in a timely manner.
Despite the foregoing, if data has not been released in a timely manner because of technical constraints, that is not a reason to continue delaying release. Better late than never!
Data must be made available on the Internet so as to accommodate the widest practical range of users and uses. This means considering how choices in data preparation and publication affect access for disabled users and for users of a wide variety of software and hardware platforms. Data must be published using current industry-standard protocols and formats, as well as alternative protocols and formats when industry standards impose burdens on wide reuse of the data, and this includes honoring accessibility initiatives.
If the data is retrievable from a Web interface, there must be some straightforward means of exporting it (flattening it) to be inspected in raw form directly, downloaded, and imported into other tools. Data is not accessible if it can be retrieved only through navigating web forms, or if automated tools are not permitted to access it because of a robots.txt file or other statement of policy.
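Whether a robots.txt policy blocks automated tools can be checked mechanically. A minimal sketch using Python's standard urllib.robotparser; the portal URL and policies are illustrative:

```python
from urllib.robotparser import RobotFileParser

# A policy that blocks all well-behaved crawlers, failing the
# accessibility test described above.
restrictive = [
    "User-agent: *",
    "Disallow: /",
]
# An open policy: an empty Disallow permits everything.
open_policy = [
    "User-agent: *",
    "Disallow:",
]

def permits_crawling(robots_lines, url="https://data.example.gov/bulk/records.csv"):
    # The URL is a hypothetical data set location; substitute the real one.
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch("*", url)
```

In practice the policy lines would be fetched from the site's own /robots.txt before crawling.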
The ability for data to be widely used requires that the data be properly encoded. Free-form text is not a substitute for, e.g., tabular and normalized records. Images of text are not a substitute for the text itself. Sufficient documentation on the data format and meanings of normalized data items must be available to users of the data.
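The difference between free-form text and a properly encoded record can be made concrete. A minimal sketch, with illustrative field names:

```python
# The same public fact, published two ways.
free_form = "Permit 42 was issued to Acme Co. on March 1, 2009 for $500."

# A normalized record: each fact in a typed, documented field.
normalized = {
    "permit_id": 42,
    "grantee": "Acme Co.",
    "issued": "2009-03-01",   # ISO 8601 date
    "fee_usd": 500,
}

# The normalized record can be queried directly; recovering the same
# fields from the free-form sentence requires fragile ad hoc parsing.
fee = normalized["fee_usd"]
```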
Following the principle that data must be accessible, accessibility must extend to automated access. If the data is accessible through some kind of interface, it must be possible to download the complete data set in raw form through an automated process.
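An automated process here can be as simple as a loop that pages through an export endpoint until it is exhausted. A minimal sketch, assuming a hypothetical paged interface; the fetch function is injected so a real HTTP call (e.g. urllib.request.urlopen) can be substituted:

```python
def download_all(fetch_page):
    """Pull the complete data set through a repeatable, automated process.

    fetch_page(n) should return the list of records on page n, and an
    empty list once the pages are exhausted; in practice it would wrap
    an HTTP request to the publisher's export endpoint.
    """
    records, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            return records
        records.extend(batch)
        page += 1
```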
Anonymous access to the data must be allowed for public data, including access through anonymous proxies. Data should not be hidden behind "walled gardens," accessible only to certain classes of Internet users. To use analogies from earlier periods of the Internet, data only accessible via AOL, Internet 2, or Bloomberg would be considered to be presented in a discriminatory manner. This principle reiterates some of the goals of principle 4, accessibility.
Proprietary formats add unnecessary restrictions over who can use the data, how it can be used and shared, and whether the data will be usable in the future.
While some proprietary formats are nearly ubiquitous, it is nevertheless not acceptable to use only proprietary formats. Likewise, the relevant non-proprietary formats may not reach a wide audience. In these cases, it may be necessary to make the data available in multiple formats.
Because government information is a mix of public records, personal information, copyrighted work, and other non-open data, it is important to be clear about what data is available and what licensing, terms of service, and legal restrictions apply. Data for which no restrictions apply should be marked clearly as being in the public domain.
The first part of the accessibility principle speaks of availability, meaning the ability for the entirety of the data to be acquired over the Internet. A data set being large does not exempt it from the requirements in this section. Disks are cheap and high definition video is no longer hard to achieve and distribute. When data sets are too large to be made available in whole, in bulk, directly from the source, assistance from the nonprofit and private sector must be sought. As a last resort, a rotation scheme can be deployed to make available a limited window of data at a time.
Accessibility also relates to uses by disabled individuals. Accessibility initiatives to be followed include the World Wide Web Consortium's Web Accessibility Initiative, and in the United States, Section 504 and Section 508 of the federal Rehabilitation Act and Section 255 of the federal Telecommunications Act.
Benchmarks for accessibility include whether existing tools are available to process the data; whether tools that use the data could enable vision-impaired individuals to achieve the same comprehension of the data as sighted individuals, for instance through a Braille workstation or a screen reader; and whether non-English-speaking individuals can use a web service to translate the data (in this case a document) into another language.
For tabular or structured data, each record should include an identifier. This identifier should be persistent across revisions to the data set so that external references to individual records can follow updates. The identifier can be a globally unique URI following Semantic Web best practices, for instance. The data format should be documented so that those familiar with the domain of the data set can understand it. All columns, tags, and abbreviations should be described. However, XML schema or the like are not necessary.
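What a persistent identifier looks like in practice can be sketched as follows, using a hypothetical URI scheme; the point is that revisions change field values but never the identifier:

```python
# Revision 1 of a record, carrying a persistent URI as its identifier.
record_v1 = {
    "id": "https://data.example.gov/permits/2009-00042",
    "issued": "2009-03-01",
    "status": "pending",
}

# Revision 2 updates a field; the identifier is untouched, so external
# references to this record continue to resolve across updates.
record_v2 = dict(record_v1, status="approved")
```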
A benchmark for meeting this requirement is whether a programmer can build a parser for the data in a scripting language in just an afternoon. That parser should be able to crawl through the published dataset and push the data into a database.
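The "afternoon parser" benchmark can be sketched in a few lines of Python: read a published tabular data set (an in-memory CSV stands in here for a downloaded file; the columns are illustrative) and push each record into a database.

```python
import csv
import io
import sqlite3

# Stand-in for a downloaded CSV export from a hypothetical data portal.
raw = """id,agency,amount
1,Parks,1200
2,Transit,534000
"""

# Push every record into a database, as the benchmark describes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE spending (id INTEGER PRIMARY KEY, agency TEXT, amount INTEGER)")
for row in csv.DictReader(io.StringIO(raw)):
    db.execute("INSERT INTO spending VALUES (?, ?, ?)",
               (int(row["id"]), row["agency"], int(row["amount"])))
db.commit()

total = db.execute("SELECT SUM(amount) FROM spending").fetchone()[0]  # 535200
```

If the published format is documented and properly encoded, nothing more exotic than the standard library is needed.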
There should be a means of notifying users of the data to changes in the data format. A mail list or RSS feed aimed at data users, plus a document describing the history of the data format, are recommended.
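A format-change notice can be as small as a single RSS item. A minimal sketch using Python's standard library, with hypothetical feed details and an illustrative column rename:

```python
import xml.etree.ElementTree as ET

# Build a one-item RSS 2.0 feed announcing a data-format change.
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Data format announcements"
ET.SubElement(channel, "link").text = "https://data.example.gov/format-news"

item = ET.SubElement(channel, "item")
ET.SubElement(item, "title").text = "Column 'amount' renamed to 'amount_usd'"
ET.SubElement(item, "pubDate").text = "Mon, 02 Mar 2009 09:00:00 GMT"

feed = ET.tostring(rss, encoding="unicode")
```

Data users subscribe once and learn of every format revision without polling the documentation.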
Benchmarks for meeting these requirements include whether the data can be used in applications based entirely on free software (including license and patent free), and whether individuals are able to redistribute the data without restriction, without requiring the permission of any third party (including the government).