A Capability Maturity Model for Research Data Management

0.3 Research Data Management Maturity Levels

Last modified by Arden Kirkland on 2014/06/30 09:36

Perhaps the best-known aspect of the CMM is its five levels of process or capability maturity, which describe how far the practices in a particular organization have developed, representing the “degree of process improvement across a predefined set of process areas” and corresponding to the generic goals listed in the previous section. The initial level describes an organization with no defined processes: in the original CMM, this means that software is developed (i.e., the specific software-related goals are achieved), but in an ad hoc and unrepeatable way, making it impossible to plan or predict the results of the next development project. As the organization increases in maturity, processes become more refined, institutionalized and standardized, achieving the higher-numbered generic processes and meaning that the organization can be assured of project results. The CMM thus describes an evolutionary improvement path from ad hoc, immature processes to disciplined, mature processes with improved software quality and organizational effectiveness (CMMI Product Team, 2006, p. 535).

Our goal in this document is to lay out a similar path for the improvement of research data management. RDM practices as carried out in research projects similarly range from ad hoc to well-planned and well-managed processes (D’Ignazio & Qin, 2008; Steinhart et al., 2008). The generic practices described above provide a basis for mapping these maturity levels into the context of RDM, as illustrated in Figure 1 and described below. 


Figure 1. Capability maturity levels for research data management

0.3.1 Level 1: Initial

The initial level of the CMM describes an organization with no defined or stable processes. Paulk et al. describe this level thus: “In an immature organization,… processes are generally improvised by practitioners and their managers during a project” (1993, p. 19). At this level, RDM is needs-based, ad hoc in nature and tends to be done intuitively. The effectiveness of RDM relies on competent people and heroic efforts rather than on documented processes. The knowledge of the field and the skills of the individuals involved (often graduate students working with little input) limit the effectiveness of data management. When those individuals move on or focus elsewhere, there is a danger that RDM will not be sustained; these changes in personnel will have a great impact on the outcomes (e.g., the data collection process will change depending on the person doing it), rendering the data management process unreliable.

0.3.2 Level 2: Managed

Maturity level 2 characterizes projects with processes that are managed through policies and procedures established within the project. At this level of maturity, the research group has discussed and developed a plan for RDM. For example, local data file naming conventions and directory organization structures may be documented. However, these policies and procedures are idiosyncratic to the project, meaning that the RDM capability resides at the project level rather than drawing on organizational or community process definitions. For example, in a survey of science, technology, engineering and mathematics (STEM) faculty, Qin and D’Ignazio (2010) found that respondents predominantly used local sources to decide what metadata to create when representing their datasets: their own planning, discussion with their lab groups or, somewhat less often, the examples provided by peer researchers. Guidelines from research centers or discipline-based sources had far less impact. Government requirements or standards also seemed to provide comparatively little help (Qin and D’Ignazio, 2010). As a result, at this level, developing a new project requires redeveloping processes, with possible risks to the effectiveness of RDM. Individual researchers will likely have to learn new processes as they move from project to project. Furthermore, aggregating or sharing data across multiple projects will be hindered by the differences in practices across projects.

0.3.3 Level 3: Defined

In the original CMM, “Defined” means that the processes are documented across the organization and then tailored and applied for particular projects. Defined processes are those with inputs, standards, work procedures, validation procedures and compliance criteria. At this level, an organization can establish new projects with confidence in stable and repeatable execution of processes, rather than each new project having to invent these from scratch. For example, projects at this level likely employ a metadata standard with best practice guidelines. Data sets/products are represented by some formal semantic structures (controlled vocabularies, ontologies or taxonomies), though these standards may be adapted to fit the project. For example, the adoption of a metadata standard for describing datasets often involves modification and customization of the standard in order to meet project needs.

In parallel to the SEI CMM, the RDM process adopted might reflect institutional initiatives in which organizational members or task forces within the institution discuss policies and plans for data management, set best practices for technology, and adopt and implement data standards. For example, the Purdue Distributed Data Curation Center (D2C2, http://d2c2.lib.purdue.edu/) brings researchers together to develop optimal ways to manage data, which could lead to formally maintained descriptions of RDM practices. Level 3 organizations can also draw on research-community-based efforts to define processes. Examples include the Hubbard Brook Ecosystem Studies (http://www.hubbardbrook.org/), the Long Term Ecological Research Network (LTER, http://www.lternet.edu/) and the Global Biodiversity Information Facility (GBIF, http://www.gbif.org/). Government requirements and standards in regard to research data are often targeted at a higher level of data management, e.g., the community level or discipline level.

0.3.4 Level 4: Quantitatively Managed

Level 4 in the original CMM means the processes have quantitative quality goals for the products and processes. The processes are instrumented and data are systematically collected and analyzed to evaluate the processes.

To move from level 3 to level 4, the quantitatively managed level, institutions and projects will "establish quantitative objectives for quality and process performance and use them as criteria in managing processes" (CMMI Product Team, 2006, p. 37). These quantitative objectives are determined based on the goals and user requirements of RDM. For example, if one of the goals is to minimize unnecessary repetitive data entry when researchers submit datasets to a repository, then it might be useful to ask data submission interface users to record the number of times the same piece of data (author name, organization name, project name, etc.) is keyed in. An analysis of unnecessary repetitions in data entry may show where in the RDM process the efficiency of data entry can be improved. The key here is to collect the statistics while action is being taken rather than after the fact. As a result, process performance at the quantitatively managed maturity level is more predictable, because "the performance of processes is controlled using statistical and other quantitative techniques, and is quantitatively predictive" (CMMI Product Team, 2006, p. 38).
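
The repetition count in the data entry example could be tallied along the following lines. This is only a minimal sketch under stated assumptions: the field names, the `(field, value)` log format and both function names are hypothetical and are not taken from the CMM or from any particular repository's submission interface.

```python
from collections import Counter

# Hypothetical field names; the text mentions author, organization and
# project names only as examples of data that may be keyed in repeatedly.
TRACKED_FIELDS = ("author_name", "organization_name", "project_name")


def count_repeated_entries(entry_log):
    """Return the number of extra entries for each (field, value) pair.

    entry_log is an iterable of (field, value) tuples, assumed to be
    recorded at the moment the data is keyed in. Values are compared
    after trimming whitespace and lowercasing.
    """
    counts = Counter(
        (field, value.strip().lower())
        for field, value in entry_log
        if field in TRACKED_FIELDS
    )
    # The first entry is necessary; every later one is a repetition.
    return {key: n - 1 for key, n in counts.items() if n > 1}


def repetitions_per_field(entry_log):
    """Sum the repetitions for each tracked field."""
    totals = Counter()
    for (field, _value), extra in count_repeated_entries(entry_log).items():
        totals[field] += extra
    return dict(totals)


if __name__ == "__main__":
    log = [
        ("author_name", "Jane Doe"),
        ("project_name", "Soil Survey"),
        ("author_name", "jane doe "),
        ("author_name", "Jane Doe"),
    ]
    print(repetitions_per_field(log))
```

Fields with the highest totals would be the first candidates for reducing repeated entry, although which remedy is appropriate depends on the interface in question.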

0.3.5 Level 5: Optimizing

Level 5, Optimizing, means that the organization is focused on improving the processes: weaknesses are identified and defects are addressed proactively. Processes introduced at this level of maturity address generic techniques for process improvement.

While the CMM has been around for two decades and has been applied in various contexts to improve processes and performance, it has only recently begun to draw attention from the research data management community. RDM is still a relatively new domain, and much of the research has been devoted to specific fields and practices such as metadata and data repositories. Examples of using the CMM for data management processes and other goals have begun to emerge in the last couple of years (see note 1), with slightly different focuses and interpretations. This document takes a holistic view of RDM and uses the CMM lens to examine RDM processes in the hope that we can identify the weaknesses of RDM and find ways to improve RDM processes.


Brooks Jr, F. P. (2010). The design of design: Essays from a computer scientist. Pearson Education.

CMMI Product Team. (2006). CMMI for Development Version 1.2. CMU/SEI-2006-TR-008. Pittsburgh, PA, USA: Carnegie Mellon Software Engineering Institute. Retrieved from http://repository.cmu.edu/sei/387

D’Ignazio, J., & Qin, J. (2008). Faculty data management practices: A campus-wide census of STEM departments. Proceedings of the American Society for Information Science and Technology, 45(1), 1-6. doi:10.1002/meet.2008.14504503139. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/meet.2008.14504503139/abstract

Paulk, M. C., Curtis, B., Chrissis, M. B., & Weber, C. (1993). Capability maturity model, Version 1.1. IEEE Software, 10(4), 18-27. Retrieved from http://www.computer.org/csdl/mags/so/1993/04/s4018-abs.html

Qin, J., & D’Ignazio, J. (2010). The central role of metadata in a science data literacy course. Journal of Library Metadata, 10(2), 188-204. doi:10.1080/19386389.2010.506379. Retrieved from http://www.tandfonline.com/doi/abs/10.1080/19386389.2010.506379

Steinhart, G., Saylor, J., Albert, P., Alpi, K., Baxter, P., Brown, E., et al. (2008). Digital Research Data Curation: Overview of Issues, Current Activities, and Opportunities for the Cornell University Library (Working Paper). Retrieved from http://hdl.handle.net/1813/10903


Created by Jian Qin on 2013/07/28 11:27
