A Capability Maturity Model for Research Data Management
CMM for RDM » 2. Data acquisition, processing and quality assurance
Last modified by Arden Kirkland on 2014/05/11 15:33
From version 19.2
edited by crowston
on 2013/09/22 08:04
To version 20.1
edited by crowston
on 2013/09/22 08:21
Change comment: There is no comment for this version

Content changes

... ... @@ -20,32 +20,36 @@
20 20
21 21 == 2.2 Ability to perform ==
22 22
23 -**Ability to Perform** describes the preconditions that must exist in the project or organization to implement the process competently. Ability to Perform typically involves resources, organizational structures, and training.
23 +**Ability to Perform** describes the preconditions that must exist in the project or organization to implement the process competently. Ability to Perform typically involves resources, organizational structures, and training, in this case for data collection, processing and quality assurance.
24 24
25 25 === 2.2.1 Develop data file formats ===
26 26
27 -Data collected form a data set that includes a set of data files, electronic or on paper. Each data file includes a set of data items representing the observed data. The project should define and document formats for the files that will store collected data.
27 +Data collected form a data set that includes a set of data files. Each data file includes a set of data items representing the observed data as well as data about how those data were collected. The project should define and document the formats of the files that will store collected data.
28 28
29 -It is important to develop data file formats carefully to ensure that data are stored in consistent formats both within and across files. Data need to be represented in consistent formats to facilitate integration with data in other data files and data sets. Documentation of data file formats is necessary to ensure that data creators store data correctly and data users interpret data correctly.
29 +It is important to develop data file formats carefully to ensure that data are stored consistently both within and across files. Data need to be represented in consistent formats to facilitate integration with data in other data files and data sets. Documentation of data file formats is necessary to ensure that data creators store data correctly and data users interpret data correctly.
30 30
31 -(% style="font-size: 14px;" %)Electronic data files should be stored in non-proprietary formats. Use of software such as spreadsheets that save data in proprietary formats limit how data can be used and increase the risk of the data becoming unreadable (e.g., due to changes in the software). Data that are stored in a proprietary format should include documentation of the specific software and versions used to create it.
31 +Data files are structured like spreadsheets, with rows and columns and a value at the intersection of each row and column.
32 32
33 -The format of data stored in each file should be consistent. Mixing different kinds of data (e.g., from different kinds of observations) in a single file makes further processing or integration of the data difficult. If many observations of different types of measurements are collected, each measurement should be stored in a separate file.
33 +(% style="font-size: 14px;" %)Each column of a data file should represent a single type of data. Storing multiple values in a single cell complicates data analysis. Each column should have a header that describes the variable in that column. Data and annotations of data should be stored in separate columns. A separate column should be used for data qualifiers, descriptions and flags. Time zones for times should be stored in a separate column.
34 34
35 -Within a file, data should be organized in columns with each column representing a single kind of data. Each column should have a header that describes the variable in that column. The format of the file should be such that only rows are added for additional observations, not columns. Each row should have a column or set of columns that uniquely identify the observation (a key field). An optimal data format has data in each column rather than being sparse, with many blank cells. Again, if there are different kinds of observations with different fields, these can be stored in separate files.
35 +The format of the file should be such that only rows are added for additional observations, not columns. Each row should have a column or set of columns that uniquely identifies the observation (a key field).
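As an illustration of the column and key-field guidance above, the following is a minimal sketch using only the Python standard library. The sample data, the column names and the choice of `KEY_FIELDS` are hypothetical and are not part of the model; a project would substitute its own documented file format.

```python
import csv
import io

# Hypothetical sample file: one header row, one variable per column,
# a separate column for flags and a separate column for the time zone.
SAMPLE = """site_id,obs_date,obs_time,time_zone,temp_c,temp_flag
A1,2013-09-22,08:04,UTC,14.2,
A1,2013-09-22,08:21,UTC,14.5,estimated
B2,2013-09-22,08:04,UTC,13.9,
"""

# Assumed set of columns that together identify one observation.
KEY_FIELDS = ("site_id", "obs_date", "obs_time")


def find_duplicate_keys(text, key_fields):
    """Return the key tuples that occur more than once in the file."""
    reader = csv.DictReader(io.StringIO(text))
    seen = set()
    duplicates = []
    for row in reader:
        key = tuple(row[name] for name in key_fields)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


duplicates = find_duplicate_keys(SAMPLE, KEY_FIELDS)
print(duplicates)
```

An empty result means every row is uniquely identified by the assumed key; any tuple returned points to rows that need to be reconciled before new observations are appended.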
36 36
37 -=== (% style="font-size: 20px; line-height: 1.2em; color: rgb(72, 92, 90);" %)2.2.2 Develop data item formats(%%) ===
38 38
39 -Projects should clearly define the format for representing collected data items. Data type and precision should be selected to be appropriate for the data in each column. It is important to establish these formats to ensure that stored data can be unambiguously interpreted and to reduce the complexity of processing data.
38 +The format for representing collected data items should be clearly defined. The data type and precision (i.e., how many digits) should be selected to be appropriate for the data in each column. It is important to establish these formats to ensure that stored data are consistently recorded and can be unambiguously interpreted, and to reduce the complexity of processing data.
40 40
41 -A consistent set of data types should be used across a data set. (% style="font-size: 14px;" %)Date and time formats in particular should be consistent across the data set. If the date or time associated with an observation is not completely known, then separate columns should be used to separate the parts that are known. Location information in a data set should all use the same coordinate system and representation. Categorical values should be represented by a consistent set of codes. These should not be specific to a particular column or data file but should be consistent across the data set. Missing values should be represented in a consistent way across a data set.
40 +A consistent set of data types should be used across a data set. Date and time formats in particular should be consistent across the data set. If the date or time associated with an observation is not completely known (e.g., only date but not time for certain observations), then separate columns should be used to store the parts that are known. All location information in a data set should use the same coordinate system and representation. Categorical values should be represented by a consistent set of codes. These should not be specific to a particular column or data file but should be consistent across the data set. Missing values should be represented in a consistent way across a data set.
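One way to check that dates and missing values follow a single convention is sketched below. The date format (`DATE_FORMAT`) and the missing-value code (`MISSING`) are assumptions chosen for illustration; the model does not prescribe either, only that whatever is chosen be used consistently.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # assumed convention: year-month-day, zero padded
MISSING = "NA"            # assumed missing-value code for the whole data set


def check_date(value):
    """Return True if value is the missing code or a date in DATE_FORMAT."""
    if value == MISSING:
        return True
    try:
        parsed = datetime.strptime(value, DATE_FORMAT)
    except ValueError:
        return False
    # strptime accepts unpadded fields such as "2013-9-22", so compare
    # the re-formatted date with the input to enforce one exact spelling.
    return parsed.strftime(DATE_FORMAT) == value


values = ["2013-09-22", "22/09/2013", "NA", "2013-9-22"]
results = [check_date(v) for v in values]
print(results)
```

The same pattern could be applied to categorical codes by testing each value for membership in a documented set of codes shared across the data set.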
42 42
43 -Data and annotations of data should be sorted separately. A separate column should be used for data qualifiers, descriptions and flags. Time zones for times should be stored in a separate field.
42 +The format of observations stored in a single file should be consistent. An optimal data format has a value in each column of every row, rather than being sparse with many blank cells. Mixing different kinds of data (e.g., from different types of observations) in a single file complicates further processing or integration of the data. If many observations of different types of measurements are collected, each measurement should be recorded in a separate file.
44 44
45 -=== 2.2.3 Develop data quality control procedures ===
46 46
47 -The project should develop and document procedures for ensuring the quality of data collected. Having documented procedures is important to ensure that data quality tasks are performed consistently and correctly. The specific tasks required are highly dependent on the type of data and the observations.
48 48
46 +Electronic data files should be stored in non-proprietary formats such as tab- or comma-separated values (CSV). Use of software such as spreadsheets that save data in proprietary formats limits how data can be used and increases the risk of the data becoming unreadable due to file corruption or changes in the software. Data that are stored in a proprietary format should include documentation of the specific software and versions used to create them.
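For example, records held in memory can be written as plain comma-separated text with the Python standard library, without relying on spreadsheet software. This is a sketch only; the field names and values are hypothetical.

```python
import csv
import io

# Hypothetical observations; values are kept as text to preserve precision.
rows = [
    {"site_id": "A1", "temp_c": "14.2"},
    {"site_id": "B2", "temp_c": "13.9"},
]


def to_csv_text(records, fieldnames):
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


text = to_csv_text(rows, ["site_id", "temp_c"])
print(text)
```

The resulting text can be opened by any text editor or statistics package, which is the property the guidance on non-proprietary formats is aiming for.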
47 +
48 +
49 +=== 2.2.2 Develop data quality control procedures ===
50 +
51 +The project should develop and document procedures for controlling the quality of data collected. Having documented procedures is important to ensure that data quality tasks are performed consistently and correctly. The specific tasks required are highly dependent on the type of data and the observations. The procedures should be reviewed periodically to ensure that they are up to date, complete and effective.
52 +
49 49 == 2.3 Activity performed ==
50 50
51 51 **Activities Performed** describes the roles and procedures necessary to implement a key process area, in this case, to acquire, process and assure the quality of data.
