A Capability Maturity Model for Research Data Management

2.2 Ability to Perform


Ability to Perform describes the preconditions that must exist in the project or organization to implement the process competently. Ability to Perform typically involves resources, organizational structures, and training.

2.2.1 Develop data file formats

Data collected for a research study typically form a data set comprising a set of data files, where each data file contains a set of data items representing the observed data as well as data about how those data were collected. The project should define and document the formats of the files that will store collected data, both at the level of whole files and for the specific data items within a file (Hook et al., 2010).

It is important to develop data file formats carefully to ensure that data are stored consistently both within and across files (Hook et al., 2010). Data need to be represented in consistent formats to facilitate integration with data in other data files and data sets (Hale et al., 2003; DataONE, 2011a). Documentation of data file formats is necessary to ensure that data creators store data correctly and data users interpret data correctly.
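
For illustration, the following is a minimal sketch of how such format decisions might be documented in a machine-readable data dictionary, written here in Python. The column names, units, value ranges, and codes are hypothetical examples, not part of the cited guidance.

# Minimal data dictionary sketch: one entry per data item (column) in a data file.
# All names, units, ranges, and codes below are hypothetical examples.
DATA_DICTIONARY = {
    "site_id":    {"type": "string", "description": "Unique code for the observation site"},
    "obs_date":   {"type": "date",   "format": "YYYY-MM-DD", "description": "Date of observation"},
    "air_temp_c": {"type": "float",  "units": "degrees Celsius", "range": (-60.0, 60.0)},
    "qc_flag":    {"type": "string", "allowed": ["OK", "ESTIMATED", "SUSPECT"]},
}

# Whole-file documentation kept alongside the data files.
FILE_FORMAT_NOTES = (
    "Encoding: UTF-8; delimiter: comma; one observation per row; "
    "missing values recorded as 'NA'; files written with the Python 3 csv module."
)

# Print the dictionary in a human-readable form.
for column, spec in DATA_DICTIONARY.items():
    print(column, spec)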

At the whole-file level, electronic data files should be stored in non-proprietary formats, e.g., a simple text format such as tab- or comma-separated values (CSV) (DataONE, 2011j) or a more complex format such as NetCDF (Network Common Data Form) or Hierarchical Data Format (HDF). More complex formats offer additional features, such as error-correcting codes to detect and recover from errors in the underlying data store. Using software such as spreadsheets (e.g., Excel) that save data in proprietary formats limits how data can be used and increases the risk of the data becoming unreadable due to file corruption or changes in the software (DataONE, 2011h). Data that are stored in a proprietary format should include documentation of the specific software and versions used to create them (Hook et al., 2010). The formats of multimedia files such as sound, images, or video should similarly be documented.
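
As a concrete example, tabular data can be written to a plain comma-delimited text file with Python's standard csv module instead of being saved in a proprietary spreadsheet format. This is only a sketch; the file name, column names, and values are hypothetical.

import csv

# Hypothetical observations: one dictionary per observation (one row each).
observations = [
    {"site_id": "A01", "obs_date": "2013-07-01", "air_temp_c": "21.4", "qc_flag": "OK"},
    {"site_id": "A02", "obs_date": "2013-07-01", "air_temp_c": "NA",   "qc_flag": "SUSPECT"},
]
fieldnames = ["site_id", "obs_date", "air_temp_c", "qc_flag"]

# Write a non-proprietary, comma-delimited text file with a header row.
with open("observations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(observations)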

It is also important to document the layout of data within each file. Observational data files are generally structured like spreadsheets, with rows and columns and a value at the intersection of each row and column; each row represents an observation, and each column holds either data about the observation (e.g., time or location) or a type of data collected.

The file format should be such that additional observations are added as new rows, not new columns (Borer et al., 2009). Each row should have one column or set of columns that uniquely identifies the observation (a key field) (Borer et al., 2009).

Each column of a data file should represent a single type of data (DataONE, 2011h). Storing multiple values in a single cell complicates data analysis (Borer et al., 2009). Each column should have a header that describes the variable in that column (Borer et al., 2009). Data and annotations of data should be stored in separate columns (Hook et al., 2010). A separate column should also be used for data qualifiers, descriptions and flags (DataONE, 2011i).  
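
The sketch below illustrates how such layout rules might be checked for a comma-delimited file: the header row is present, the key column uniquely identifies each row, and no cell packs several values together. The file name, the key field, and the use of a semicolon as a sign of multiple values in one cell are assumptions for illustration.

import csv

def check_layout(path, key_field="site_id"):
    """Sketch of simple layout checks for a comma-delimited data file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)            # the first row is treated as the header
        if key_field not in (reader.fieldnames or []):
            raise ValueError("missing key column: %s" % key_field)
        seen = set()
        for row in reader:
            key = row[key_field]
            if key in seen:                   # the key field should uniquely identify rows
                print("duplicate key:", key)
            seen.add(key)
            for column, value in row.items():
                if ";" in (value or ""):      # crude sign of several values stored in one cell
                    print("possible multi-value cell in column %s: %r" % (column, value))

check_layout("observations.csv")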

Formats for representing collected data items should be clearly defined. The data type and precision (i.e., the number of digits) should be selected to be appropriate for the data in each column (DataONE, 2011g). It is important to establish these formats to ensure that stored data are consistently recorded and can be unambiguously interpreted, and to reduce the complexity of processing data.

A consistent set of data types should be used across a data set (DataONE, 2011e). Date and time formats in particular should be consistent across the data set (DataONE, 2011b). If the date or time associated with an observation is not completely known (e.g., only the date but not the time for certain observations), separate columns should be used for the parts that are known (DataONE, 2011b). If data are collected at diverse locations, it may be necessary to record the time zone associated with each time value (Hook et al., 2010). Location information in a data set should all use the same coordinate system and representation (Hook et al., 2010). Categorical values should be represented by a consistent set of terms or codes (DataONE, 2011k); these should not be specific to a particular column or data file but should be consistent across the data set. Missing values should be represented in a consistent way across a data set (DataONE, 2011f).
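
A minimal sketch of item-level conventions of this kind follows, assuming an ISO 8601 date format, a single missing-value code, and a small controlled vocabulary for one categorical column; the specific codes and formats are illustrative choices, not prescribed by the cited sources.

from datetime import datetime

MISSING_CODE = "NA"                               # one missing-value code for the whole data set
CATEGORY_CODES = {"OK", "ESTIMATED", "SUSPECT"}   # controlled vocabulary for a categorical column
DATE_FORMAT = "%Y-%m-%d"                          # ISO 8601 calendar date, used everywhere

def parse_date(value):
    """Return a date object, or None when the value is the missing-value code."""
    if value == MISSING_CODE:
        return None
    return datetime.strptime(value, DATE_FORMAT).date()

def check_category(value):
    """Flag categorical values that fall outside the agreed set of codes."""
    return value == MISSING_CODE or value in CATEGORY_CODES

print(parse_date("2013-07-01"))     # 2013-07-01
print(check_category("SUSPECT"))    # True
print(check_category("ok"))         # False: by this convention, codes are case-sensitive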

The format of observations stored in a single file should be consistent. Ideally, each observation corresponds to one row in the file. An optimal layout has data in every column rather than a sparse arrangement with many blank cells (DataONE, 2011d). Mixing different kinds of data (e.g., from different types of observations) in a single file complicates further processing or integration of the data. If many observations of different types of measurements are collected, each type of measurement should be recorded in a separate file (Hook et al., 2010).
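
Where a raw file does mix measurement types, it can be split so that each type ends up in its own file, along the lines of the sketch below; the input file name and the measurement_type column are hypothetical.

import csv

# Split a mixed observation file into one file per measurement type (sketch).
writers, files = {}, {}
with open("mixed_observations.csv", newline="") as src:
    reader = csv.DictReader(src)
    for row in reader:
        kind = row["measurement_type"]        # hypothetical column naming the measurement type
        if kind not in writers:
            out = open("observations_%s.csv" % kind, "w", newline="")
            files[kind] = out
            writers[kind] = csv.DictWriter(out, fieldnames=reader.fieldnames)
            writers[kind].writeheader()
        writers[kind].writerow(row)
for out in files.values():
    out.close()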

2.2.2 Develop data quality control procedures

Projects should develop and document procedures for controlling the quality of data collected (DataONE, 2011c). Procedures can address quality control in both data collection and data capture.

Having documented procedures is important to ensure that data quality tasks are performed consistently and correctly. 

The specific tasks required are highly dependent on the type of data and the observations. For example, a simple procedure is to establish reasonable ranges for data items and to double-check recorded values that fall outside these ranges. If a batch of data is entered (e.g., from a hand-written data collection form), a simple check is that the number of items entered matches the number recorded in the original document. Slightly more complicated is the technique of "casting out nines": repeatedly adding up all of the digits entered and comparing the result to the same sum computed from the original document. For some kinds of data, it may be possible to audit a sample of data to ensure that they were collected and recorded correctly and to estimate the proportion of erroneous data in the unaudited data set.
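
The following sketch illustrates these kinds of checks in Python: a range check on a numeric column, a count check against the original form, and a digit-sum comparison in the spirit of casting out nines. The ranges, counts, and values are made-up examples.

def range_check(values, low, high):
    """Return values that fall outside the expected range and need double-checking."""
    return [v for v in values if not (low <= v <= high)]

def count_check(entered, expected_count):
    """Check that the number of items entered matches the count on the original form."""
    return len(entered) == expected_count

def digit_sum(values):
    """Sum every digit in the entered values, reduced modulo 9 (casting out nines)."""
    return sum(int(ch) for v in values for ch in str(v) if ch.isdigit()) % 9

entered = [21.4, 19.8, 26.0, 20.1]      # hypothetical batch of values as entered
original = [21.4, 19.8, 25.0, 20.1]     # the values on the hand-written form
print(range_check([21.4, 19.8, 250.0, 20.1], -60.0, 60.0))   # [250.0] -> re-check this value
print(count_check(entered, 4))                               # True: four items on the form, four entered
print(digit_sum(entered) == digit_sum(original))             # False: a mistyped digit changes the sum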

Procedures should be reviewed periodically to ensure that they are up to date, complete and effective (DataONE, 2011c).

Rubric

Rubric for 2.2 - Ability to Perform

Level 0
General: This process or practice is not being observed.
Criterion: No steps have been taken to provide for resources, structure, or training with regard to file formats or quality control procedures.

Level 1: Initial
General: Data are managed intuitively at the project level without clear goals and practices.
Criterion: Resources, structure, and training with regard to file formats or quality control procedures have been considered minimally by individual team members, but not codified.

Level 2: Managed
General: The DM process is characterized for projects and is often reactive.
Criterion: Resources, structure, and training with regard to file formats or quality control procedures have been recorded for this project, but wider community needs or standards have not been taken into account.

Level 3: Defined
General: DM is characterized for the organization/community and is proactive.
Criterion: The project provides resources, structure, and training with regard to file formats or quality control procedures as defined for the entire community or institution.

Level 4: Quantitatively Managed
General: DM is measured and controlled.
Criterion: Quantitative quality goals have been established for resources, structure, and training with regard to file formats or quality control procedures, and both data and practices are systematically measured for quality.

Level 5: Optimizing
General: The focus is on process improvement.
Criterion: Processes regarding resources, structure, and training with regard to file formats or quality control procedures are evaluated on a regular basis, and necessary improvements are implemented.

References


Borer, E. T., Seabloom, E. W., Jones, M. B., & Schildhauer, M. (2009). Some Simple Guidelines for Effective Data Management. Bulletin of the Ecological Society of America, 90(2), 205–214. http://dx.doi.org/10.1890/0012-9623-90.2.205


DataONE. (2011a). Consider the compatibility of the data you are integrating. Retrieved from https://www.dataone.org/best-practices/consider-compatibility-data-you-are-integrating


DataONE. (2011b). Describe formats for date and time. Retrieved from https://www.dataone.org/best-practices/describe-formats-date-and-time


DataONE. (2011c). Develop a quality assurance and quality control plan. Retrieved from https://www.dataone.org/best-practices/develop-quality-assurance-and-quality-control-plan


DataONE. (2011d). Document your data organization strategy. Retrieved from https://www.dataone.org/best-practices/document-your-data-organization-strategy


DataONE. (2011e). Ensure basic quality control. Retrieved from https://www.dataone.org/best-practices/ensure-basic-quality-control


DataONE. (2011f). Identify missing values and define missing value codes. Retrieved from https://www.dataone.org/best-practices/identify-missing-values-and-define-missing-value-codes


DataONE. (2011g). Maintain consistent data typing. Retrieved from https://www.dataone.org/best-practices/maintain-consistent-data-typing


DataONE. (2011h). Preserve information: keep your raw data raw. Retrieved from https://www.dataone.org/best-practices/preserve-information-keep-your-raw-data-raw


DataONE. (2011i). Separate data values from annotations. Retrieved from https://www.dataone.org/best-practices/separate-data-values-annotations


DataONE. (2011j). Use appropriate field delimiters. Retrieved from https://www.dataone.org/best-practices/use-appropriate-field-delimiters


DataONE. (2011k). Use consistent codes. Retrieved from https://www.dataone.org/best-practices/use-consistent-codes


Hale, S. S., Miglarese, A. H., Bradley, M. P., Belton, T. J., Cooper, L. D., Frame, M. T., et al. (2003). Managing Troubled Data: Coastal Data Partnerships Smooth Data Integration. Environmental Monitoring and Assessment, 81(1-3), 133–148. http://dx.doi.org/10.1023/A:1021372923589


Hook, L. A., Vannan, S. K. S., Beaty, T. W., Cook, R. B., & Wilson, B. E. (2010). Best Practices for Preparing Environmental Data Sets to Share and Archive. Oak Ridge National Laboratory Distributed Active Archive Center. Retrieved from http://daac.ornl.gov/PI/BestPractices-2010.pdf
