A Capability Maturity Model for Research Data Management

Infrastructure, Standards, and Policies for Research Data Management

Last modified by Arden Kirkland on 2014/10/05 15:20

Aug 04 2014

Note: This post is an excerpt from my paper presented last year at COINFO 2013. For the full paper, please visit http://jianqin.metadataetc.org/?p=144

Research data management has gained increasing recognition for its value and importance among funding agencies and research institutions, as evidenced by the fast growth of data repositories at the disciplinary community and institutional levels. Examples of these repositories include the Global Biodiversity Information Facility (GBIF, http://www.gbif.org/), Dryad (http://datadryad.org/), and GenBank (http://www.ncbi.nlm.nih.gov/genbank/), among others. While these disciplinary repositories are important venues for data curation and sharing, they target the end product of a research lifecycle. The large amount of work necessary for data to reach the submission point is left to researchers to deal with.

Two years ago, Science magazine surveyed its peer reviewers from the previous year about the availability and use of data. The 1,700 responses represented input from an international and interdisciplinary group of scientific leaders. As the Science editorial reported, “About 20% of the respondents regularly use or analyze data sets exceeding 100 gigabytes, and 7% use data sets exceeding 1 terabyte. About half of those polled store their data only in their laboratories—not an ideal long-term solution. Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving” [1].

The Science survey points to two major problems in the current state of scientific data management across the research lifecycle: a lack of funding and staff support for managing active data, and a lack of metadata standards and tools for managing it. What does it take to solve these problems? In other words, what needs to be done to provide the support necessary for improving research productivity through effective data management? The answers lie in a good understanding of the research and data lifecycles and their implications for data management and for the support needed to manage scientific data.

Key concepts

Research lifecycle and data lifecycle

Lifecycle is a term frequently used in our technology-driven society. Examples include the information systems lifecycle, the information transfer lifecycle, and many other variations depending on the domain in which the term is used. In the science data management domain, the term is used in several contexts: research lifecycle, data lifecycle, data curation lifecycle, and data management lifecycle. Each version has a different emphasis, but they are often related or overlap in one way or another.

A research lifecycle generally includes study concept and design, data collection, data processing, data access and dissemination, and analysis [2]. As a research project progresses through these stages, different data will be collected, processed, calibrated, transformed, segmented, or merged. Data at these stages move from one state to the next after certain processing is performed on them or certain conditions are met. Some of these data are in the active state and may be changed frequently, while others, such as raw data and analysis-ready datasets, will be tagged with metadata for discovery and reuse. At each stage of this lifecycle, the context and type of research can directly affect the types of data generated and the requirements for how the data will be processed, stored, managed, and preserved.
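The idea that data move from one state to the next after a processing step can be sketched as a small state machine. This is only an illustration: the state names, the operation names, and the allowed transitions below are assumptions made for the example and are not taken from the lifecycle model in [2].

```python
from enum import Enum


class DataState(Enum):
    """Illustrative data states (assumed names, not from the source)."""
    RAW = "raw"
    PROCESSED = "processed"
    ANALYSIS_READY = "analysis-ready"


# Assumed transitions: (current state, operation) -> next state.
TRANSITIONS = {
    (DataState.RAW, "calibrate"): DataState.PROCESSED,
    (DataState.PROCESSED, "transform"): DataState.PROCESSED,
    (DataState.PROCESSED, "quality_check"): DataState.ANALYSIS_READY,
}


def advance(state, operation):
    """Return the state reached by applying an operation, or raise ValueError."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(
            f"{operation!r} is not allowed in state {state.value!r}"
        ) from None


# Walk one hypothetical dataset through its processing steps.
state = DataState.RAW
for op in ("calibrate", "transform", "quality_check"):
    state = advance(state, op)
final_state = state
```

Making the transitions explicit is one way to record which datasets are still active and which have reached a state where they should be tagged with metadata for discovery and reuse.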

Regardless of the context and nature of research, scientific data need to be stored, organized, documented, preserved (or discarded), and made discoverable and usable. The work and time involved in these processes are daunting, intellectually intensive, and costly. The personnel performing these tasks must be highly trained in technology and subject fields and able to communicate effectively with different stakeholders. In this sense, the lifecycle of research and data is not only a technical domain but also a domain requiring management and communication skills. To manage scientific data at the community, institution, and project levels without reinventing the wheel, a data infrastructure is necessary to provide the efficiency and services for scientific research as well as data management.

Research data management as an infrastructure service

The data-centric research lifecycle no doubt relies heavily on effective research data management. But what is research data management? In a nutshell, research data management is essentially a series of services that an organization develops and implements through institutionalized data policies, technological infrastructures, and information standards. The concept of data infrastructure adopts the principle of “Infrastructure as a Service (IaaS),” which is “a standardized, highly automated offering, where compute resources, complemented by storage and networking capabilities are owned and hosted by a service provider and offered to customers on-demand” [3]. In the context of a data infrastructure, stakeholders will be able to carry out data management functions through a Web-based user interface.  

Infrastructure is a notion of modern society. Being modern is to live within and by means of infrastructures: basic systems and services that are reliable, standardized, and widely accessible, at least within a community. Susan Leigh Star and Karen Ruhleder [4] neatly summarized the features of infrastructures:

  • Embeddedness. Infrastructure is sunk into, inside of, other structures, social arrangements, and technologies.
  • Transparency. Infrastructure does not have to be reinvented each time or assembled for each task, but invisibly supports those tasks.
  • Reach or scope beyond a single event or a local practice.
  • Learned as part of membership.
  • Links with conventions of practice.
  • Embodiment of standards. 
  • Built on an installed base.
  • Becomes visible upon breakdown.
  • Is fixed in modular increments, not all at once or globally. [4]

These characteristics also describe the infrastructure that supports scientific data management. For example, a service that ingests a large number of small data files to build a searchable and filterable database can be scaled up for any discipline that has the same data management need.
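A minimal sketch of such an ingest service is shown below, using Python's built-in sqlite3 and csv modules. The table layout, the column names (site, variable, value), and the sample files are all hypothetical and chosen only for illustration; a real service would read files from disk or a repository and would follow the discipline's own data format.

```python
import csv
import io
import sqlite3


def ingest(conn, files):
    """Load many small CSV files (name -> text) into one queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations "
        "(source_file TEXT, site TEXT, variable TEXT, value REAL)"
    )
    for name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            conn.execute(
                "INSERT INTO observations VALUES (?, ?, ?, ?)",
                (name, row["site"], row["variable"], float(row["value"])),
            )
    conn.commit()


def search(conn, variable, min_value):
    """Filter the combined table by variable name and a lower bound on value."""
    cur = conn.execute(
        "SELECT source_file, site, value FROM observations "
        "WHERE variable = ? AND value >= ? ORDER BY value",
        (variable, min_value),
    )
    return cur.fetchall()


# Hypothetical sample data standing in for small files on disk.
SAMPLE_FILES = {
    "plot_a.csv": "site,variable,value\nA1,temperature,14.2\nA1,humidity,0.61\n",
    "plot_b.csv": "site,variable,value\nB7,temperature,17.8\n",
}

conn = sqlite3.connect(":memory:")
ingest(conn, SAMPLE_FILES)
hits = search(conn, "temperature", 15.0)
```

Because the ingest and search functions know nothing about the discipline beyond the column names, the same pattern could in principle be reused wherever the data management need is the same.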

Three dimensions of data infrastructure services

The Technology Dimension

The technology infrastructure covers a wide range of technologies for collecting, storing, processing, organizing, transmitting, and preserving data as well as platforms for communication and collaboration. Included in this dimension of the data infrastructure are networks, databases, authentication systems, and software applications. Scientific data and databases differ from conventional ones used for business transactions or employee records because of the idiosyncrasies of scientific data. Not only are scientific data collected from various sources such as observations, experiments, crowd contributions (e.g., data generated from citizen science projects), or computer modeling and simulations, but they also come in a wide variety of types and formats and at varying levels of processing. Raw data collected from observations, experiments, modeling, or simulations often need to go through a series of processing, transformation, and quality-check steps before they can be used for analysis. Differences in data types and formats across disciplines, or even within the same discipline, can become barriers to data sharing and reuse [5]. The technological dimension of data infrastructure, therefore, is not simply a technical issue but is closely tied to policies and standards.

The Dimension of Data and Metadata Standards

Another important dimension of a data infrastructure is data and metadata standards. Scientific data can be grouped into three large blocks based on discipline and type:

  • Physical and chemical data: include element data, chemical data, isotope data, and particle data;
  • Earth and astronomical data: range from weather and climate data, geodesy data to astronomical data for static and dynamic properties of stars, planets, and other objects; and
  • Life sciences data: this group contains a long list of varieties, including genome data, flora and fauna data, protein data, nucleotide sequences, biomedical and clinical data, and the list can go on.

What complicates the diverse types of scientific data is the large number of data format standards that have been developed since the introduction of computers into research. Data formats range from the very basic physical level to metaformats to specialized scientific data formats. As data formats move from the basic level to more specialized formats, their diversity and complexity increase drastically.

Metadata standards for scientific data are designed to document who collected the data and where, what the data content is about, and how the data were collected. All of these details are critical for effective data discovery and use. The complexity of scientific data mentioned earlier has led to complex metadata standards. It is not uncommon for metadata standards in the scientific data domain to have hundreds of elements with deeply nested structures. Large, complex metadata standards do provide a comprehensive description of data sets and satisfy the requirements for data discovery and use, but their size can become a barrier to metadata description: large standards make automatic metadata creation almost impossible, while manual metadata creation is time-consuming and expensive and can never keep up with the pace of scientific data growth. At present, each metadata standard has its own tool(s), and most of them are standalone; that is, names of entities and controlled vocabularies are not automatically linked, and relationships between data and publications need to be added manually. A data infrastructure will be able to tackle these problems by turning metadata schemas, entity instances, and controlled vocabularies into infrastructural services.
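To make the who, where, what, and how of a metadata description concrete, the sketch below defines a deliberately tiny record and checks one field against a shared controlled vocabulary. The field names, the vocabulary terms, and the validation rules are assumptions for illustration only and do not correspond to any particular metadata standard.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary that an infrastructure service could share
# across tools instead of each tool keeping its own list.
COLLECTION_METHODS = {"observation", "experiment", "simulation", "crowdsourcing"}


@dataclass
class DatasetMetadata:
    creator: str   # who collected the data
    location: str  # where the data were collected
    subject: str   # what the data are about
    method: str    # how the data were collected


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field_name in ("creator", "location", "subject", "method"):
        if not getattr(record, field_name).strip():
            problems.append(f"missing {field_name}")
    if record.method.strip() and record.method not in COLLECTION_METHODS:
        problems.append(f"method not in controlled vocabulary: {record.method}")
    return problems


good = DatasetMetadata("J. Doe", "Plot A1", "soil temperature", "observation")
bad = DatasetMetadata("", "Plot A1", "soil temperature", "survey")
good_problems = validate(good)
bad_problems = validate(bad)
```

Real standards carry hundreds of elements rather than four, which is why sharing the schema and vocabularies as services, rather than re-implementing checks like this in every standalone tool, matters.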

The Policy Dimension

Policies for scientific data cover a wide range of topics. From national and global perspectives, data policies are mostly related to data sharing, intellectual property protection, ethical issues, and open access [5]. At this level, the role of data policies is to guide the practices of data management, sharing, and use. The National Institutes of Health (NIH) implemented guidelines on data sharing as early as 2003, which require projects exceeding $500,000 “in direct cost in any year” to include plans for data sharing [6]. In 2011, NSF also made it mandatory for research grant proposals to include a supplementary document labeled “Data Management Plan” (DMP). This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results [7].

The NSF mandate for data management plans also raised issues that many institutions had not thought through before. For example, a DMP requires research proposals to specify the types and formats of data to be produced and how they will be stored, shared, and managed. To address these requirements, researchers must make their DMP compliant with their institutional data policies in addition to the federal mandate. Researchers need to know what the institutional policies say about which data types and formats should be archived, whether the institution has a data repository for storing their data files, and what procedures they should establish when sharing data with colleagues and the community. In a content analysis of institutional data policies, Bohémier et al. identified six aspects of data policies that should be addressed: data curation, management, use, access, publishing, and sharing. They discovered that data policies are implemented unevenly across institutions: only 15% of all policies applied to the institution as a whole, while most applied only to specific disciplines, collections, or projects [8].

In many ways, the process of developing data policies is also a process of institutionalization. “To institutionalize something means to establish a standard practice or custom within a human system” [9]. Data management in many institutions and disciplinary fields is still an area to be studied. The survey findings mentioned at the beginning of this paper demonstrate the importance of institutionalizing data management, which includes establishing data policies, administrative support that ensures the funding and personnel for data management operations, and best practice guidelines.


[1] Science Staff, “Challenges and opportunities,” Introduction to special section Dealing with Data. Science, 11 February 2011: Vol. 331, pp. 692-693. 

[2] C. Humphrey, “e-Science and the life cycle of research,” unpublished, 2006. http://datalib.library.ualberta.ca/~humphrey/lifecycle-science060308.doc

[3] Gartner, “IT glossary.” http://www.gartner.com/it-glossary/infrastructure-as-a-service-iaas/

[4] S.L. Star & K. Ruhleder, “Steps toward an ecology of infrastructure: Design and access for large information space.” Information Systems Research, Vol. 7, pp. 111-134, 1996. 

[5] W. L. Anderson, “Some challenges and issues in managing, and preserving access to, long-lived collections of digital scientific and technical data.” Data Science Journal, Vol. 3, pp. 191–202. http://www.jstage.jst.go.jp/article/dsj/3/0/191/_pdf

[6] National Institutes of Health, “NIH Data sharing policy and implementation guidance,” Office of Extramural Research, March 5, 2003. http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm

[7] National Science Foundation, “NSF Data Management Plan Requirements,” December 8, 2010. http://www.nsf.gov/eng/general/dmp.jsp

[8] K.T. Bohémier, A. Atwood, A. Kuehn, & J. Qin, “A content analysis of institutional data policies,” Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL’11), June 13-17, 2011, Ottawa, Canada, pp. 409-410. http://eslib.ischool.syr.edu/wp/a-content-analysis-of-institutional-data-policies-2/

[9] M. Kramer, “Make it last forever: The institutionalization of service learning in America,” 2000, p. 14. http://www.nationalserviceresources.org/filemanager/download/NatlServFellows/kramer.pdf

Created by Jian Qin on 2014/08/04 11:03
