In May 2011, Europe faced a significant health crisis: a deadly outbreak of the bacterium E. coli had emerged from an unknown food source, affecting 4000 people and killing 53. Researchers turned to the broader scientific community for help, releasing details of the sequenced bacterial genome via Twitter[i] and sharing publicly accessible sequence data via the NCBI (National Center for Biotechnology Information) database.[ii] Within 24 hours, international teams were uploading analyses and annotations to an open repository on GitHub,[iii] and within days, possible ancestral strains were being identified. At record speed, scientists were able to pinpoint the source of the contamination, allowing authorities to isolate the farms in question and to declare the outbreak over by the end of July.[iv]
Such collaboration was enabled by the open licensing of the genomic data under the ‘no rights reserved’ CC0 licence.[v] This licence, released by Creative Commons,[vi] enables copyright holders to waive their rights to materials, placing them as completely as possible in the public domain. This allows scientists, educators, artists, and other creators to build upon, enhance, and reuse these materials for any purpose, without restriction under copyright or database law.
‘Open data’ is data ‘that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike.’[vii] According to the Open Knowledge Foundation, open data’s key features are:[viii]
- Availability and Access: data must be available as a whole, and at no more than reasonable reproduction cost, preferably over the Internet. It must be in convenient and modifiable form;
- Reuse and Redistribution: data must be made available under terms that permit reuse and redistribution, including intermixing with other datasets. It must be machine-readable;
- Universal Participation: data must be available to everyone to use, reuse, and redistribute. There must be no restrictions against persons or groups, or against commercial interests, for example.
The many benefits of open data for both the institution and the community include greater accessibility, collaboration and innovation, greater transparency and accountability, and greater responsiveness of institutions to changing conditions, including emergency scenarios such as the outbreak described above.[ix]
As individuals and institutions face increasingly complex computational challenges and grapple with exponentially increasing amounts of data, there is an urgent need to establish frameworks to support open data distribution, use, and reuse. It is here that the library of the 21st century plays a critical role – that of digital curator.
What is Digital Curation? Why Does It Matter?
The UK’s Digital Curation Centre (DCC) defines this essential research activity as follows:[x]
‘Digital curation involves maintaining, preserving, and adding value to digital research data throughout its lifecycle.’
Through its education and oversight of the various stages of the digital curation lifecycle,[xi] depicted using the DCC model below, the library can ensure that data is appropriately captured, described, stored and secured, appraised and preserved, and disposed of, according to relevant policies, procedures, and legal requirements. Such frameworks will ensure that meaningful data is preserved for others to access, use, share, and re-use in both the short and long term.
The Digital Curation Centre’s Digital Curation Lifecycle Model[xii]
Digital curation processes also play a crucial role in guaranteeing that data is accurate, authentic, and has integrity; i.e., it is what it says it is, and has not been added to, deleted, or otherwise modified since creation. This is fundamental in many fields, and forms the very foundation of scientific endeavour. By its insistence on internationally recognised information standards, the library ensures that both the value and veracity of data can be established on an ongoing basis.
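One common way to check that a file has not been modified since it was deposited is to record a cryptographic checksum at ingest and recompute it later. The following is a minimal sketch of that idea in Python, not a description of any particular repository's procedure; the function names and the sample file are hypothetical and exist only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with the one recorded at deposit."""
    return sha256_of(path) == recorded_digest


if __name__ == "__main__":
    # Hypothetical demonstration using a throwaway file.
    with tempfile.TemporaryDirectory() as folder:
        dataset = Path(folder) / "sample_sequence.txt"
        dataset.write_bytes(b"ACGTACGT")

        recorded = sha256_of(dataset)  # stored alongside the data at ingest
        print("unchanged after deposit:", is_unchanged(dataset, recorded))

        dataset.write_bytes(b"ACGTACGA")  # simulate a later modification
        print("unchanged after edit:", is_unchanged(dataset, recorded))
```

A matching checksum shows only that the bytes are the same as when the digest was recorded; it says nothing about whether the data was accurate to begin with, which is why checksums complement rather than replace descriptive standards and appraisal.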
Embracing ‘Intelligent Openness’
Worldwide, scientific institutions such as the Royal Society have called for ‘intelligent openness,’[xiii] in which data and its associated metadata (‘data about data,’ which enables its retrieval, management, and use)[xiv] must be accessible, intelligible, assessable, and re-usable.
Here, the library plays an integral role in achieving intelligent openness – by encouraging owners of data to engage in the following steps, defined by the Open Knowledge Foundation:[xv]
- Make your data available: in bulk and in a useful format;
- Make it discoverable: put it on the web with its associated metadata;
- Apply an open licence to your datasets.[xvi]
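The three steps above can be sketched as a small machine-readable metadata record published alongside a dataset. This is only an illustration: the field names, the example URL, and the helper functions are assumptions made for this sketch and do not follow any particular metadata standard. The licence value is the Creative Commons address for the CC0 1.0 dedication.

```python
import json

# Hypothetical minimum set of fields for this sketch (not a formal standard).
REQUIRED_FIELDS = ("title", "description", "download_url", "format", "licence")


def build_record(title, description, download_url, file_format, licence):
    """Assemble a simple metadata record describing an open dataset."""
    return {
        "title": title,
        "description": description,
        "download_url": download_url,  # step 1: available in bulk
        "format": file_format,         # step 1: in a useful format
        "licence": licence,            # step 3: an open licence
    }


def missing_fields(record):
    """List required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


if __name__ == "__main__":
    record = build_record(
        title="Example genome sequence data",
        description="Illustrative record; all values are placeholders.",
        download_url="https://example.org/data/genome.fasta",
        file_format="FASTA",
        licence="https://creativecommons.org/publicdomain/zero/1.0/",
    )
    # Step 2: publishing this JSON on the web makes the dataset discoverable.
    print(json.dumps(record, indent=2))
    print("missing fields:", missing_fields(record))
```

Checking for missing fields before publication is a cheap way to catch the most common gap, a dataset released with no stated licence, which leaves potential users unsure whether they may reuse it at all.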
In this way, with the help of the library, we will be free to use, reuse, and redistribute data in all its forms.
[i] Wiles, S. (2011). An outbreak of crowdsourcing, Sciblogs. Retrieved October 1, 2013, from http://sciblogs.co.nz/infectious-thoughts/2011/06/09/an-outbreak-of-crowdsourcing/. The genome was originally sequenced by BGI and the University Medical Centre Hamburg-Eppendorf.
[ii] NCBI. (2011). Sequencing for E.coli. Retrieved October 1, 2013, from http://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra&from_uid=67657.
[iii] GitHub. (2011). E. coli O104:H4 Genome Analysis Crowdsourcing. Retrieved October 1, 2013, from https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki.
[iv] Edmunds, S. (2011). Notes from an E. coli “tweenome” – lessons learned from our first data DOI, GigaScience. Retrieved October 1, 2013, from http://blogs.biomedcentral.com/gigablog/2011/08/03/notes-from-an-e-coli-tweenome-lessons-learned-from-our-first-data-doi/.
[v] Creative Commons. (n.d.). About CC0 – “No Rights Reserved”. Retrieved October 1, 2013, from http://creativecommons.org/about/cc0. See also Creative Commons. (n.d.). CC0 FAQ. Retrieved October 1, 2013, from http://wiki.creativecommons.org/CC0_FAQ.
[vi] Creative Commons, http://creativecommons.org/.
[vii] Open Knowledge Foundation. (n.d.). Open Definition. Retrieved October 1, 2013, from http://opendefinition.org/. An open licence may require users of the data to credit the owner of the dataset (‘attribution’), or require users who mix the data with other data to release the results under an identical licence (‘share-alike’). For more information regarding open licence terms, see Creative Commons Australia. (n.d.). About the licences. Retrieved October 1, 2013, from http://creativecommons.org.au/learn-more/licences.
[viii] Open Knowledge Foundation. (n.d.). Open Data. Retrieved October 1, 2013, from http://okfn.org/opendata/.
[ix] See, for example, Open Knowledge Foundation. (2012). Why Open Data? Open Data Handbook. Retrieved October 1, 2013, from http://opendatahandbook.org/en/why-open-data/.
[x] Digital Curation Centre. (2013). What is digital curation? Retrieved October 1, 2013, from http://www.dcc.ac.uk/digital-curation/what-digital-curation.
[xi] Digital Curation Centre. (2013). DCC Curation Lifecycle Model. Retrieved October 1, 2013, from http://www.dcc.ac.uk/resources/curation-lifecycle-model.
[xiii] The Royal Society. (2012). Science as an open enterprise. Retrieved October 1, 2013, from http://royalsociety.org/uploadedFiles/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf.
[xiv] For more information about metadata, see National Information Standards Organization (NISO). (2004). Understanding Metadata. Retrieved October 1, 2013, from http://www.niso.org/publications/press/UnderstandingMetadata.pdf.
[xv] Open Knowledge Foundation. (2013). Open Data – An Introduction. Retrieved October 1, 2013, from http://okfn.org/opendata/.
[xvi] For more information on appropriate data licences, see Open Knowledge Foundation. (n.d.). Making Your Data Open: A Guide. Retrieved October 1, 2013, from http://opendatacommons.org/guide/ and Open Knowledge Foundation. (n.d.). Guide to Open Data Licensing. Retrieved October 1, 2013, from http://opendefinition.org/guide/data/.