The Purdue University Research Repository
HUBzero customization for dataset publication and digital preservation
Carly C. Dearborn, Amy J. Barton and Neal A. Harmeyer
Purdue University Libraries, Purdue University, West Lafayette, Indiana, USA
Received 1 July 2013
Revised 18 July 2013
Accepted 18 July 2013
Abstract
Purpose – The purpose of this case study is to discuss the creation of robust preservation
functionality within PURR. The study seeks to discuss the customization of the HUBzero platform,
composition of digital preservation policies, and the creation of a novel, machine-actionable metadata
model for PURR’s unique digital content. Additionally, the study will trace the implementation of the
Open Archival Information System (OAIS) model and track PURR’s progress towards Trustworthy
Digital Repository certification.
Design/methodology/approach – This case study discusses the use of the Center for Research
Libraries Trusted Repository Audit Checklist (TRAC) certification process and ISO 16363 as a rubric
to build an OAIS institutional repository for the publication, preservation, and description of unique
datasets.
Findings – ISO 16363 continues to serve as a rubric, barometer and set of goals for PURR as
development continues. To become a trustworthy repository, the PURR project team has consistently
worked to build a robust, secure, and long-term home for collaborative research. In order to fulfill its
mandate, the project team constructed policies, strategies, and activities designed to guide a
systematic digital preservation environment. PURR expects to undertake the full ISO 16363 audit
process at a future date in expectation of being certified as a Trustworthy Digital Repository. Through
its efforts in digital preservation, the Purdue University Research Repository expects to better serve
Purdue researchers and their collaborators and to move scholarly research efforts forward worldwide.
Originality/value – PURR is a customized instance of HUBzero, an open source software platform
that supports scientific discovery, learning, and collaboration. HUBzero was a research project funded
by the United States National Science Foundation (NSF) and is a product of the Network for
Computational Nanotechnology (NCN), a multi-university initiative of eight member institutions. PURR
is only one instance of HUBzero customization; versions have been implemented in many
disciplines nation-wide. PURR maintains the core functionality of HUBzero, but has been modified to
publish datasets and to support their preservation. Long-term access to published data is an essential
component of PURR services and Purdue University Libraries’ mission. Preservation in PURR is
vital not only to the Purdue University research community, but also to the larger digital preservation
issues surrounding dynamic datasets and their long-term usability.
Keywords Institutional repositories, Data sharing, Digital preservation, Metadata, Data curation
Paper type Case study
OCLC Systems & Services, Vol. 30 No. 1, 2014, pp. 15-27. © Emerald Group Publishing Limited, 1065-075X. DOI 10.1108/OCLC-07-2013-0022
Data sharing and open access are no longer simply buzz phrases in the scientific and
publishing communities. Benefits of data sharing and reuse include increased
collaboration, interdisciplinary innovation, and new solutions to pervasive social
problems. The life and use of scientific data should extend beyond its original purpose.
The new and sometimes competing demands placed on data have created what many
call the “data deluge.” Coupled with the sheer bulk of data created in modern research,
rapid advances in technology and tools and interdisciplinary research
objectives can make open access a challenging objective (Faniel and Zimmerman,
2011).
While the benefits of open access seem clear, the logistics surrounding
sustainability are less so. A key factor in developing and sustaining open access to
data is addressing issues surrounding preservation and data management. The
emerging field of data curation combines the scalability of data management with the
commitment to long-term preservation. An effective data management plan will factor
in issues such as use of open standards for file formats, well-formed metadata, and
information management literacy with the goal of viable future access (Lee and Tibbo,
2007). The strategies of data management are currently being addressed more and
more by university libraries and institutional repositories. These bodies are
increasingly providing assistance with data creation, management, curation, and
ultimately, preservation (Cragin et al., 2010). Purdue University Libraries specifically
sought to operationalize this narrative by providing a platform on which Purdue
researchers can receive data management support from subject librarians, fulfill the
data management requirements of most funding agencies, take steps toward long-term
data preservation, and provide immediate access to their research data.
Incentives and mandates for data sharing are changing the landscape of university
libraries, especially at Purdue – a leader in science, technology, and engineering
research. More and more federal funding agencies require data management plans as
part of their grant awarding process. In January of 2011, the National Science
Foundation (NSF) began requiring a two-page data management plan while other
agencies have been requiring them for much longer. The National Institutes of Health
required its grant applicants to take measures towards data management as early as
2003. The National Institute of Justice requires awardees to submit a data-archiving
policy 90 days prior to the end of a funded project (Witt, 2012).
The increased focus on data management and data sharing by these major funding
agencies has necessitated a sea change in university library core functions. Librarians
have the unique training to help researchers handle their data and prepare sustainable
approaches to its management. It was this understanding that prompted the Purdue
University Dean of Libraries, the Purdue University Vice President of Information
Technology, and the Purdue University Vice President for Research to plan the
development of a campus-wide data management platform using HUBzero software.
This group of Purdue administrators created the Purdue University Research
Repository Working Group in March of 2011. This group represented the major
stakeholders from the University community and included experts from Purdue
Libraries, Sponsored Program Services, and Information Technology at Purdue (ITaP).
The Working Group began planning and developing the data management platform,
now realized as the Purdue University Research Repository (PURR). Invested parties
from the Libraries gradually formed the PURR Project Development Team (PURR
Team). This team includes librarians, archivists, software engineers, and graduate
students and is largely responsible for the continued development and maintenance of
the repository.
This case study will discuss the progress of the Working Group and the PURR
Team as they continue to develop PURR’s platform and preservation infrastructure.
This discussion will include the creation of guiding policies and procedures, plans to
place those policies into action, and the unique metadata which describes and informs
these actions. While PURR is operational, many of the components discussed within
this case study are still in development and have not been fully implemented within
PURR’s environment.
Figure 1. PURR’s OAIS workflow
entire submission meets with PURR’s collection policy. If not, the Gatekeeper will send
the submission back to the producer with comments for review or suggestions to place
it in another repository.
Once the Gatekeeper process approves the SIP, the process to create an Archival
Information Package (AIP) begins. While the AIP creation tool has been written and
tested, it is still in development and is not fully integrated with the live PURR site.
PURR uses the Library of Congress BagIt specifications to package the dataset and its
associated metadata – the descriptive information and the preservation description
information. BagIt is a hierarchical file packaging format used primarily for storage
and transfer of preservation-quality digital content. BagIt “bags” consist of a
“payload,” or the dataset encapsulated in the bag, and “tags,” the metadata used to
record bag transfer and storage (Boyko et al., 2009). The “bags” are read-only and
cannot be altered once serialized.
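The bag layout described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not PURR's AIP creation tool: the function names, the SHA-256 checksum algorithm, the BagIt version string, and the sample file name are all assumptions made for the example. It writes the payload under data/, adds the bag declaration and a payload manifest as tag files, and then shows how a later change to the payload is detected against the manifest.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, payload_files: dict[str, bytes]) -> Path:
    """Write a minimal bag: payload under data/, declaration, and manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    # Payload: the dataset files themselves.
    for name, content in payload_files.items():
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    # Tag file: the bag declaration (version number is an assumption here).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Tag file: one "<checksum>  <relative path>" line per payload file.
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}"
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    ]
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        if sha256_of(bag_dir / rel_path) != expected:
            return False
    return True


def demo() -> tuple[bool, bool]:
    """Build a bag in a temporary directory, then tamper with its payload."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = make_bag(Path(tmp) / "example_bag", {"readings.csv": b"t,v\n0,1.5\n"})
        intact = verify_bag(bag)
        (bag / "data" / "readings.csv").write_bytes(b"tampered")
        return intact, verify_bag(bag)
```

Calling demo() returns a pair of booleans: whether the freshly written bag verifies, and whether it still verifies after the payload file has been overwritten. A production workflow would more likely rely on an established BagIt library than on hand-written code like this.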
These standards, chosen for their acceptance by the information management and
academic communities, are expected to remain supported for the foreseeable future.
More specifically, METS was chosen because it was designed to be extended by
incorporating other defined metadata standards within the descriptive metadata and
the administrative metadata container. METS acts as the wrapper into which the other
standards are embedded. DCMI dcterms was selected for the descriptive metadata in
anticipation of Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
implementation within PURR. The MODS standard was selected to identify ownership
and access conditions of the published dataset. Lastly, PREMIS was selected to
support PURR’s requirement for long-term preservation of published datasets.
In order to achieve a high level of dataset discoverability, the descriptive metadata
must be complete and indexed. Datasets are described using the DCMI Metadata
Terms (dcterms) standard. Through the project creation and dataset publication
processes, the producer fills out online forms that capture descriptive metadata. The
fields on the forms include Project Name, Project Alias, Title, Synopsis, Abstract,
Authors, Tags, License, and Release Notes. The Project Name and Project Alias fields
are populated by the producer at the time a project is created. During the publication
process, the Synopsis field captures a succinct description of the dataset or the research
that produced the dataset. The Abstract field provides more space for the producer to
describe the dataset and/or the research project. The Authors field is repeatable and
can capture a single author or the primary author and additional authors and/or
contributors. The Tags field captures keywords that are indexed for dataset searching.
The producer typically provides natural language terms in the Tags field.
Once submitted, the publication is queued until a subject librarian reviews the
publication to ensure appropriateness, checks grammar and spelling, and adds
controlled vocabulary subject terms in the Tags field. The natural language and
controlled vocabulary terms in the Tags field, along with the other bibliographic fields
such as Title, Author, Abstract, etc. enrich the description and are indexed for
searching and discoverability. The License field is a dropdown list of Creative
Commons[12] licenses reviewed and approved by the PURR Steering Committee.
Finally, the Release Notes field is a place for the producer to include notes
on file descriptions and any other pertinent information about the dataset or
research methods.
Once the dataset is approved and published, the descriptive metadata values
provided by the producer and subject librarian are stored in tables in PURR’s database.
The approval and subsequent publication of a dataset will eventually trigger an AIP
Creation Tool to run. The Tool first creates the AIP BagIt bag and then begins
to dynamically generate and serialize the metadata in a well-formed, validated
Extensible Markup Language (XML) file to be included in the completed AIP. First, the
METS metadata wrapper is written to the file. Next, the AIP Creation Tool maps the
producer-, librarian-, project-, and system-generated descriptive metadata values to
dcterms elements and inserts the elements in the METS descriptive metadata
container. The mapping is shown in Table I.
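As a rough illustration of this step, the sketch below wraps dcterms elements in a METS descriptive metadata section using Python's standard xml.etree.ElementTree module. It is not the AIP Creation Tool itself. The field-to-element mapping is a hypothetical stand-in for Table I, the function name and the sample record are invented for the example, and the output is well-formed XML only; a schema-valid METS document needs further sections (such as a structural map) that are omitted here.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
DCTERMS_NS = "http://purl.org/dc/terms/"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("dcterms", DCTERMS_NS)

# Hypothetical mapping from form fields to dcterms elements.
# The article's Table I holds the mapping PURR actually uses.
FIELD_TO_DCTERMS = {
    "Title": "title",
    "Synopsis": "description",
    "Abstract": "abstract",
    "Authors": "creator",
    "Tags": "subject",
    "License": "license",
}


def build_mets(record: dict[str, list[str]]) -> ET.Element:
    """Return a METS root whose dmdSec wraps the record as dcterms elements."""
    mets = ET.Element(f"{{{METS_NS}}}mets")
    dmd_sec = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": "dmd1"})
    md_wrap = ET.SubElement(dmd_sec, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "DC"})
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
    for field, element in FIELD_TO_DCTERMS.items():
        # Repeatable fields (Authors, Tags) yield one element per value.
        for value in record.get(field, []):
            ET.SubElement(xml_data, f"{{{DCTERMS_NS}}}{element}").text = value
    return mets


# Invented sample values, for illustration only.
sample_record = {
    "Title": ["Soil moisture readings"],
    "Authors": ["A. Researcher", "B. Collaborator"],
    "Tags": ["soil", "hydrology"],
}
sample_xml = ET.tostring(build_mets(sample_record), encoding="unicode")
```

The same pattern would extend to the administrative metadata container, where the MODS and PREMIS elements described earlier would be embedded.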
the files’ storage locations. It also records the names of the files included in the AIP.
After the AIP Creation Tool finishes generating the metadata, the well-formed,
validated metadata are included in the AIP along with all the files and other
preservation data and preservation files. The AIP will then be considered completed
and ready for preservation.
In special cases, an AIP can serve as a SIP. For example, in the case of transfer
from one preservation system to another, a current AIP would serve as the submission
package to this new platform. A Dissemination Information Package (DIP) is created
from the same source material as the AIP; however, it is not a copy or a derivative of an
AIP. A DIP is generated on demand once a member of the designated community visits
the web interface and downloads the dataset. The designated community also has the
ability to download the DIP’s associated metadata files. This is not a typical process
but one that works best for PURR’s publication model, especially as the AIP workflow
is still in development. Generating the DIP from the submission materials allows PURR
to provide immediate access to published datasets.
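The on-demand character of the DIP can be illustrated with a short Python sketch. The article does not describe how PURR packages a download, so the ZIP container, the function name, the data/ and metadata/ folder names, and the include_metadata switch are all assumptions. The point is only that the package is assembled from the submitted files at request time rather than copied out of a stored AIP.

```python
import io
import zipfile


def build_dip(
    dataset_files: dict[str, bytes],
    metadata_files: dict[str, bytes],
    include_metadata: bool = True,
) -> bytes:
    """Assemble a download package in memory from the submitted files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The dataset as submitted by the producer.
        for name, content in dataset_files.items():
            archive.writestr(f"data/{name}", content)
        # The associated metadata files, which the user may also download.
        if include_metadata:
            for name, content in metadata_files.items():
                archive.writestr(f"metadata/{name}", content)
    return buffer.getvalue()
```

Because nothing is written to disk, each request produces a fresh package from the current submission materials, which matches the immediate-access goal described above.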
Conclusion
While development of PURR’s preservation infrastructure is ongoing, the team is
making progress toward the goal of becoming a trusted digital repository. PURR will
utilize a distributed digital preservation model as a strategy for AIP back-ups. In early
2013, Purdue University Libraries became a member of the MetaArchive Cooperative.
Developed in partnership between six southeastern US university libraries with
backing from NDIIPP, MetaArchive utilizes LOCKSS software to create a digital
preservation network which approaches digital preservation through replication and
geographic distribution. While still in the early phases of integration, MetaArchive
promises to provide PURR with robust archival backup, in addition to Purdue’s local
and satellite storage infrastructure. Once Purdue and PURR are fully integrated with
the cooperative, PURR will be able to satisfy additional ISO 16363 items.
ISO 16363 continues to serve as a rubric, barometer and set of goals for PURR as
development continues. To become a trustworthy repository, the PURR project team
has consistently worked to build a robust, secure, and long-term home for collaborative
research. In order to fulfill its mandate, the project team constructed policies, strategies,
and activities designed to guide a systematic digital preservation environment. PURR
expects to undertake the full ISO 16363 audit process at a future date in expectation of
being certified as a Trustworthy Digital Repository. Through its efforts in digital
preservation, the Purdue University Research Repository expects to better serve
Purdue researchers and their collaborators and to move scholarly research efforts forward
worldwide.
Notes
1. The PURR Digital Preservation Policy was written by Paul Bracke, Associate Dean for
Assessment and Technology, Jake Carlson, Data Services Specialist and Associate Professor of
Library Science, and Sammie Morris, Head, Archives and Special Collections and Associate
Professor of Library Science. The policy may be accessed on the Purdue University Research
Repository web site, available at: [Link]
2. PURR Designated Community definition, available at: [Link]
PURR/whocancreateanewproject
3. Information regarding both TRAC and ISO 16363 can be found online at the Center for
Research Libraries, Metrics for Repository Assessment, available at: [Link]/archiving-
preservation/digital-archives/metrics-assessing-and-certifying
4. As of July 1, 2013, these PURR documents are not available online.
5. DROID. The National Archives of the United Kingdom, available at: [Link].
[Link]/information-management/our-services/[Link]
6. Figure 1 designed by Brandon Beatty (Purdue University Research Repository).
7. In 2010 Purdue University Libraries became a founding member in DataCite, an
international consortium which promotes the sharing of datasets by issuing DOIs, available
at: [Link]
international-cooperative-to-advance-research/
8. METS. Metadata Encoding and Transmission Standard. Library of Congress, available at:
[Link]/standards/mets/
9. DCMI. Dublin Core Metadata Initiative, available at: [Link]
terms/
10. MODS. Metadata Object Description Schema, Library of Congress, available at: [Link].
gov/standards/mods/
11. PREMIS. Library of Congress. Preservation Metadata Maintenance Activity, available at:
[Link]/standards/premis/
12. Creative Commons licenses, available at: [Link]
13. PURR Terms of Deposit, available at: [Link]
14. PREMIS. Library of Congress. Preservation Metadata Maintenance Activity, available at:
[Link]/standards/premis/
15. Data Dictionary for Preservation Metadata, available at: [Link]/content/dam/
research/activities/pmwg/[Link]
References
Boyko, A., Kunze, J., Littman, J., Madden, L. and Vargas, B. (2009), NDIIPP Content Transfer
Project: The BagIt File Packaging Format, available at: [Link]
display/Curation/BagIt
Cragin, M., Palmer, C., Carlson, J. and Witt, M. (2010), “Data sharing, small science and
institutional repositories”, Philosophical Transactions of the Royal Society A, Vol. 368
No. 1926, p. 4023.
Faniel, I.M. and Zimmerman, A. (2011), “Beyond the data deluge: a research agenda for
large-scale data sharing and reuse”, The International Journal of Digital Curation, Vol. 6
No. 1, p. 59.
Klimeck, G., McLennan, M., Brophy, S.P., Adams, G.B. and Lundstrom, M.S. (2008),
“[Link]: advancing education and research in nanotechnology”, Computing in
Science and Engineering, Vol. 10 No. 5, pp. 17, 19, 22.
Lee, C. and Tibbo, H. (2007), “Digital curation and trusted repositories: steps toward success”,
Journal of Digital Information, Vol. 8 No. 2, available at: [Link]
php/jodi/article/view/229/183
Witt, M. (2012), “Co-designing, co-developing, and co-implementing an institutional data
repository service”, Journal of Library Administration, Vol. 52 No. 2, p. 3.
Corresponding author
Carly C. Dearborn can be contacted at: cdearbor@[Link]