A Capability Maturity Model for Research Data Management
Infrastructure, Standards, and Policies for Research Data Management
Last modified by Arden Kirkland on 2014/10/05 15:20
Edited by Jian Qin on 2014/08/04
Lifecycle is a term frequently used in our technology-driven society. Examples include the information systems lifecycle, the information transfer lifecycle, and many other variations depending on the domain in which the term is used. In the science data management domain, the term appears in several contexts: research lifecycle, data lifecycle, data curation lifecycle, and data management lifecycle. Each version has a different emphasis, but they are often related or overlap in one way or another.

A research lifecycle generally includes study concept and design, data collection, data processing, data access and dissemination, and analysis [2]. As a research project progresses through these stages, different data will be collected, processed, calibrated, transformed, segmented, or merged. Data pass from one state to the next as processing steps or conditions are applied to them. Some data are in an active state and may change frequently, while others, such as raw data and analysis-ready datasets, will be tagged with metadata for discovery and reuse. At each stage of the lifecycle, the context and type of research directly affect the types of data generated and the requirements for how those data will be processed, stored, managed, and preserved.
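The stage-to-stage movement of data described above can be sketched as a simple state machine. This is a minimal illustration: the stage names follow the lifecycle cited from [2], but the `DataSet` class and its fields are invented for the example.

```python
# Sketch of data moving through research lifecycle stages.
# Stage names follow the lifecycle described in the text; the
# DataSet class and its attributes are illustrative assumptions.

STAGES = [
    "study concept and design",
    "data collection",
    "data processing",
    "data access and dissemination",
    "analysis",
]

class DataSet:
    def __init__(self, name):
        self.name = name
        self.stage_index = 0   # every dataset starts at concept/design
        self.metadata = {}     # tags added for discovery and reuse
        self.active = True     # active data may still change frequently

    @property
    def stage(self):
        return STAGES[self.stage_index]

    def advance(self):
        """Move to the next stage once this stage's processing is done."""
        if self.stage_index < len(STAGES) - 1:
            self.stage_index += 1

    def freeze(self, **metadata):
        """Tag an analysis-ready dataset with metadata and stop changes."""
        self.metadata.update(metadata)
        self.active = False

ds = DataSet("ocean-temps-2014")
ds.advance()                       # -> data collection
ds.advance()                       # -> data processing
ds.freeze(creator="J. Qin", format="netCDF")
```

The point of the sketch is that "active" and "frozen" are distinct states: only frozen, metadata-tagged datasets are candidates for discovery and reuse.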

Regardless of the context and nature of the research, scientific data need to be stored, organized, documented, preserved (or discarded), and made discoverable and usable. The amount of work and time involved in these processes is daunting, intellectually intensive, and costly. The personnel performing these tasks must be highly trained in both technology and subject fields and able to communicate effectively among different stakeholders. In this sense, the lifecycle of research and data is not only a technical domain but also one requiring management and communication skills. To manage scientific data at the community, institution, and project levels without reinventing the wheel, a data infrastructure is necessary to provide efficiency and services for scientific research as well as data management.

==== Research data management as an infrastructure service ====
The data-centric research lifecycle no doubt relies heavily on effective research data management. But what is research data management? In a nutshell, it is a series of services that an organization develops and implements through institutionalized data policies, technological infrastructures, and information standards. The concept of data infrastructure adopts the principle of “Infrastructure as a Service (IaaS),” which is “a standardized, highly automated offering, where compute resources, complemented by storage and networking capabilities are owned and hosted by a service provider and offered to customers on-demand” [3]. In the context of a data infrastructure, stakeholders can carry out data management functions through a Web-based user interface.
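The idea of data management functions offered on demand through a common interface can be sketched as a toy service registry. All service and function names here are hypothetical illustrations, not part of any actual infrastructure API.

```python
# Toy sketch of data management functions offered "as a service":
# a registry maps service names to callables that a web front end
# could dispatch to on demand. All names here are hypothetical.

services = {}

def service(name):
    """Register a callable as an on-demand data management service."""
    def register(fn):
        services[name] = fn
        return fn
    return register

@service("store")
def store(dataset, repository):
    # In a real infrastructure this would move files into a repository.
    return f"stored {dataset} in {repository}"

@service("describe")
def describe(dataset, **metadata):
    # In a real infrastructure this would write a metadata record.
    return {"dataset": dataset, **metadata}

# A web-based UI would translate a user request into a dispatch:
result = services["store"]("ocean-temps-2014", "institutional-repo")
```

The design point is that the catalog of services, not any single tool, is what the stakeholder interacts with.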

Infrastructure is a notion of modern society. To be modern is to live within and by means of infrastructures: basic systems and services that are reliable, standardized, and widely accessible, at least within a community. Susan Leigh Star and Karen Ruhleder [4] neatly summarized the features of infrastructures:

* (% style="font-size: 14px;" %)Embeddedness. Infrastructure is sunk into, inside of, other structures, social arrangements, and technologies.
* (% style="font-size: 14px;" %)Transparency. Infrastructure is transparent to use; it does not have to be reinvented for each task.
* (% style="font-size: 14px;" %)Reach or scope, beyond a single event or one-site practice.
* (% style="font-size: 14px;" %)Learned as part of membership in a community of practice.
* (% style="font-size: 14px;" %)Links with conventions of practice.
* (% style="font-size: 14px;" %)Embodiment of standards.
* (% style="font-size: 14px;" %)Built on an installed base.
* (% style="font-size: 14px;" %)Becomes visible upon breakdown.
* (% style="font-size: 14px;" %)Is fixed in modular increments, not all at once or globally. [4]

(((
These characteristics equally describe the infrastructure that supports science data management. For example, a service that ingests a large number of small data files to build a searchable and filterable database can be scaled up for any discipline with the same data management need.
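The ingest service just described can be sketched with the standard library alone. This is a minimal, discipline-neutral illustration; the file layout, table schema, and column names are assumptions.

```python
import sqlite3
from pathlib import Path

# Sketch: ingest many small data files into a searchable, filterable
# database. Only the file name, size, and text content are indexed;
# the schema and the *.txt layout are illustrative assumptions.

def build_index(data_dir, db_path=":memory:"):
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(name TEXT, size INTEGER, content TEXT)"
    )
    for path in Path(data_dir).glob("*.txt"):
        con.execute(
            "INSERT INTO files VALUES (?, ?, ?)",
            (path.name, path.stat().st_size, path.read_text()),
        )
    con.commit()
    return con

def search(con, term):
    """Filter indexed files whose content mentions the search term."""
    cur = con.execute(
        "SELECT name FROM files WHERE content LIKE ?", (f"%{term}%",)
    )
    return [row[0] for row in cur]
```

Because nothing in the schema is discipline-specific, the same service scales to any field whose data arrive as many small files.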

== Three dimensions of data infrastructure services ==

==== The Technology Dimension ====

The technology infrastructure covers a wide range of technologies for collecting, storing, processing, organizing, transmitting, and preserving data, as well as platforms for communication and collaboration. This dimension of the data infrastructure includes networks, databases, authentication systems, and software applications. Scientific data and databases differ from those used for business transactions or employee records because of the idiosyncrasies of scientific data. Not only are scientific data collected from various sources such as observations, experiments, crowd contributions (e.g., data generated by citizen science projects), and computer modeling or simulations, but they also come in a wide variety of types and formats and at varying levels of processing. Raw data collected from observations, experiments, modeling, or simulations often need to go through a series of processing, transformation, and quality-check steps before they can be used for analysis. Differences in data types and formats across disciplines, or even within the same discipline, can become barriers to data sharing and reuse [5]. The technological dimension of data infrastructure, therefore, is not simply a technical issue; it is closely tied to policies and standards.
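The raw-to-analysis-ready progression (processing, transformation, quality check) can be sketched as a minimal pipeline. The cleaning rule, unit conversion, and quality thresholds below are illustrative assumptions, not any standard.

```python
# Sketch of a raw-data pipeline: process -> transform -> quality check.
# The cleaning rule, Celsius-to-Kelvin step, and QC thresholds are
# illustrative assumptions chosen for the example.

def process(raw):
    """Basic cleaning: drop records with missing values."""
    return [r for r in raw if r is not None]

def transform(values, offset=273.15):
    """Example transformation: convert Celsius readings to Kelvin."""
    return [v + offset for v in values]

def quality_check(values, lo=0.0, hi=400.0):
    """Reject the dataset if any value falls outside a plausible range."""
    return all(lo <= v <= hi for v in values)

raw = [21.5, None, 19.8, 22.1]
clean = transform(process(raw))
assert quality_check(clean)   # only QC-passing data are analysis-ready
```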

==== The Dimension of Data and Metadata Standards ====

Another important dimension of a data infrastructure is data and metadata standards. Scientific data can be grouped into three large blocks based on discipline and type:

* (% style="font-size: 14px;" %)Physical and chemical data: element data, chemical data, isotope data, and particle data;
* (% style="font-size: 14px;" %)Earth and astronomical data: ranging from weather and climate data and geodesy data to astronomical data on the static and dynamic properties of stars, planets, and other objects; and
* (% style="font-size: 14px;" %)Life sciences data: a long list of varieties, including genome data, flora and fauna data, protein data, nucleotide sequences, biomedical and clinical data, and more.

(% style="font-size: 14px;" %)What complicates the diverse types of scientific data is the large number of data format standards that have been developed since the introduction of computers into research. Data formats range from the very basic physical level through metaformats to specialized scientific data formats, and as they move from the basic level to more specialized formats, their diversity and complexity increase drastically.

(% style="font-size: 14px;" %)Metadata standards for scientific data are designed to document who collected the data, where, what the data content is about, and how the data were collected, all of which is critical for effective data discovery and use. The complexity of scientific data mentioned earlier has led to complex metadata standards: it is not uncommon for metadata standards in the scientific data domain to have hundreds of elements nested in deep layers of structure. Although complex, large metadata standards provide comprehensive descriptions of data sets and satisfy the requirements for data discovery and use, their size can become a barrier to metadata description: large standards make automatic metadata creation almost impossible, while manual metadata creation is time consuming, expensive, and unable to keep up with the pace of scientific data growth. At present, each metadata standard has its own tool(s), and most of them are standalone; that is, names of entities and controlled vocabularies are not automatically linked, and relationships between data and publications must be added manually. A data infrastructure can tackle these problems by turning metadata schemas, entity instances, and controlled vocabularies into infrastructural services.
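Treating a metadata schema and a controlled vocabulary as shared services, rather than features of standalone tools, can be illustrated as follows. The field names and vocabulary terms are invented for the example and do not come from any particular standard.

```python
# Sketch: a shared controlled vocabulary and a schema validator offered
# as common services, so records produced by different tools stay
# interoperable. Field names and vocabulary terms are hypothetical.

CONTROLLED_VOCAB = {"oceanography", "geodesy", "genomics"}
REQUIRED_FIELDS = {"creator", "title", "discipline", "date_collected"}

def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = [
        f"missing field: {f}" for f in REQUIRED_FIELDS - record.keys()
    ]
    if record.get("discipline") not in CONTROLLED_VOCAB:
        problems.append("discipline not in controlled vocabulary")
    return problems

record = {
    "creator": "J. Qin",
    "title": "Ocean temperature series",
    "discipline": "oceanography",
    "date_collected": "2014-08-04",
}
assert validate(record) == []
```

Because every tool calls the same `validate` service against the same vocabulary, entity names and terms stay linked without manual reconciliation.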

==== The Policy Dimension ====

Policies for scientific data cover a wide range of topics. From national and global perspectives, data policies mostly concern data sharing, intellectual property protection, ethical issues, and open access [5]. At this level, the role of data policies is to guide the practices of data management, sharing, and use. The National Institutes of Health (NIH) implemented guidelines on data sharing as early as 2003, requiring projects exceeding $500,000 “in direct cost in any year” to include plans for data sharing [6]. In 2011, the National Science Foundation (NSF) made it mandatory for research grant proposals to include a supplementary document labeled “Data Management Plan” (DMP), describing how the proposal will conform to NSF policy on the dissemination and sharing of research results [7].

The NSF mandate for data management plans also raised issues that many institutions had not thought through before. For example, the DMP requirement asks research proposals to specify the types and formats of data to be produced and how they will be stored, shared, and managed. To address these requirements, researchers must make their DMPs compliant with their institutional data policies in addition to the federal mandate. Researchers need to know what the institutional policies are regarding which data types and formats should be archived, whether the institution has a repository for storing their data files, and what procedures they should establish when sharing data with colleagues and the community. In a content analysis of institutional data policies, Bohémier et al. identified six aspects that data policies should address: data curation, management, use, access, publishing, and sharing. They also found that data policies are implemented unevenly across institutions: only 15% of the policies applied to the institution as a whole, while most applied only to specific disciplines, collections, or projects [8].

In many ways, the process of developing data policies is also a process of institutionalization. “To institutionalize something means to establish a standard practice or custom within a human system” [9]. Data management in many institutions and disciplinary fields is still an area to be studied. The survey findings mentioned at the beginning of this paper demonstrate the importance of institutionalizing data management, which includes establishing data policies, securing the administrative support that ensures funding and personnel for data management operations, and developing best practice guidelines.
)))

== References ==

[1] Science Staff, “Challenges and opportunities,” introduction to the special section Dealing with Data. //Science//, Vol. 331, pp. 692-693, 11 February 2011.

[2] C. Humphrey, “e-Science and the life cycle of research,” unpublished, 2006. [[http:~~/~~/datalib.library.ualberta.ca/~~~~humphrey/lifecycle-science060308.doc>>http://datalib.library.ualberta.ca/~~humphrey/lifecycle-science060308.doc||rel="__blank"]]

[3] Gartner, “IT glossary.” [[http:~~/~~/www.gartner.com/it-glossary/infrastructure-as-a-service-iaas/>>http://www.gartner.com/it-glossary/infrastructure-as-a-service-iaas/||rel="__blank"]]

[4] S.L. Star & K. Ruhleder, “Steps toward an ecology of infrastructure: Design and access for large information spaces.” //Information Systems Research//, Vol. 7, pp. 111-134, 1996.

[5] W.L. Anderson, “Some challenges and issues in managing, and preserving access to, long-lived collections of digital scientific and technical data.” //Data Science Journal//, Vol. 3, pp. 191-202. [[http:~~/~~/www.jstage.jst.go.jp/article/dsj/3/0/191/_pdf>>http://www.jstage.jst.go.jp/article/dsj/3/0/191/_pdf||rel="__blank"]]

[6] National Institutes of Health, “NIH data sharing policy and implementation guidance,” Office of Extramural Research, March 5, 2003. [[http:~~/~~/grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm>>http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm||rel="__blank"]]

[7] National Science Foundation, “NSF data management plan requirements,” December 8, 2010. [[http:~~/~~/www.nsf.gov/eng/general/dmp.jsp>>http://www.nsf.gov/eng/general/dmp.jsp||rel="__blank"]]

[8] K.T. Bohémier, A. Atwood, A. Kuehn, & J. Qin, “A content analysis of institutional data policies,” Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL ’11), June 13-17, 2011, Ottawa, Canada, pp. 409-410. [[http:~~/~~/eslib.ischool.syr.edu/wp/a-content-analysis-of-institutional-data-policies-2/>>http://eslib.ischool.syr.edu/wp/a-content-analysis-of-institutional-data-policies-2/||rel="__blank"]]

[9] M. Kramer, “Make it last forever: The institutionalization of service learning in America,” 2000, p. 14. [[http:~~/~~/www.nationalserviceresources.org/filemanager/download/NatlServFellows/kramer.pdf>>http://www.nationalserviceresources.org/filemanager/download/NatlServFellows/kramer.pdf||rel="__blank"]]

== Extract ==
Although many resources have been made available for research data management, most of them are developed as “islands” and lack linking mechanisms. The lack of integrated and interconnected resources has contributed to high cost and duplicated efforts in data management operations. The vision of research data management as an infrastructure service is not only to improve the efficiency of research data management but also the productivity of the research enterprise. Each of the three dimensions—infrastructure, standards, and policies—addresses a critical aspect of research data management to make the data infrastructure services work.
