A vignette for the datacheck package (version 0.9.6)

Reinhard Simon, International Potato Center, Lima, Peru

The datacheck package provides some simple functions to check the consistency of a dataset. It assumes the data are available in tabular format, typically a CSV file with objects or records in rows and attributes or variables in columns.

In a database setting, the variables would be controlled by the database, at least for conformance to types (character, numeric, etc.) and allowed minimum and maximum values. Often, however, data are gathered in simple spreadsheets or otherwise lack such constraints. Here, data constraints such as allowed types or values, expected values, and relationships between variables can be defined using R commands and syntax. This allows much more flexibility and fine-grained control, but it typically also demands considerable domain knowledge from the user. It is therefore often useful to reuse such domain-aware rule files across tables with similar content. To support this reuse, the tool is forgiving: a rule that cannot be executed because a variable is not present in the table being analyzed is reported as failed rather than stopping the analysis.
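
To make the idea of rules written in R syntax concrete, here is a minimal sketch in base R. It does not use datacheck itself: the data frame `soil`, the rule table `rules`, and the helper `apply_rule` are hypothetical names chosen for illustration, and the column layout (Variable, Type, Rule) merely mirrors the results table shown later in this vignette.

```r
# Toy table: records in rows, variables in columns (illustrative values only)
soil <- data.frame(
  ID   = 1:4,
  pH   = c(6.5, -200, 7.1, 8.0),
  Sand = c(40, 55, 120, 60)
)

# Hypothetical rule table: one R expression per rule, stored as text
rules <- data.frame(
  Variable = c("ID", "pH", "pH", "Sand"),
  Type     = c("integer", "numeric", "numeric", "numeric"),
  Rule     = c("!duplicated(ID)", "pH >= 0", "pH <= 14",
               "Sand >= 0 & Sand <= 100"),
  stringsAsFactors = FALSE
)

# Evaluate one rule text with the table's columns visible as variables;
# the result is one TRUE/FALSE per record
apply_rule <- function(rule_text, data) {
  eval(parse(text = rule_text), envir = data)
}

results <- lapply(rules$Rule, apply_rule, data = soil)
names(results) <- rules$Rule

# Row numbers of the records that violate each rule
violations <- lapply(results, function(ok) which(!ok))
print(violations)
```

Because each rule is ordinary R code, anything that returns a logical value per record can serve as a rule, including comparisons between columns.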

Using the HTML interface

Use the following commands to copy some example files to your current working directory (uncomment the two file.copy commands):

library(datacheck)
library(xtable)  # provides xtable(), used below to print tables

atable = system.file("examples/soilsamples.csv", package = "datacheck")
srules = system.file("examples/soil_rules.R", package = "datacheck")

# Uncomment the next two lines
# file.copy(atable, 'soilsamples.csv')
# file.copy(srules, 'soil_rules.R')

Then type the command runDatacheck() in the R console.

Use the upload buttons to load the respective files from your working directory. Review the results.

Using the command line interface

Assuming you have copied the above-mentioned files into your working directory, proceed to read in the data.



atable = read.csv(atable, header = TRUE, stringsAsFactors = FALSE)
srules = read.rules(srules)
profil = datadict.profile(atable, srules)

You can inspect a graphical summary of rules per variable:

ruleCoverage(profil)

[Figure: graphical summary of rules per variable]
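
The quantity behind a rule-coverage summary is simply the number of rules defined for each variable. A minimal base R sketch of that tally follows; it assumes a hypothetical rule table `rules` with a `Variable` column and is not the implementation of ruleCoverage. The rules shown are a subset of those listed in the results table later in this vignette.

```r
# Hypothetical rule table (subset of the soil rules, for illustration)
rules <- data.frame(
  Variable = c("ID", "ID", "ID", "Latitude", "Latitude", "pH", "pH", "pH"),
  Rule = c("sapply(ID, is.integer)", "!duplicated(ID)", "ID > 0 & ID < 1754",
           "sapply(Latitude, is.numeric)", "Latitude < 0",
           "sapply(pH, is.numeric)", "pH >= 0", "pH <= 14"),
  stringsAsFactors = FALSE
)

# Number of rules per variable
coverage <- table(rules$Variable)
print(coverage)

# A bar chart of the same counts could be drawn with:
# barplot(coverage, las = 2, ylab = "Number of rules")
```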

A second plot shows the cumulative number of records with increasing scores:

scoreSum(profil)

[Figure: cumulative number of records with increasing scores]

Or see the tables (only the first 20 records and first 6 columns shown):

xtable(atable[1:20, 1:6])
    ID  Latitude  Longitude  Country  Adm1          Adm2
 1   1     -7.48     -78.97  Peru     Cajamarca     Contumazá
 2   2     -7.48     -78.97  Peru     Cajamarca     Contumazá
 3   3     -7.48     -78.97  Peru     Cajamarca     Contumazá
 4   4    -18.18     -70.47  Peru     Tacna         Tacna
 5   5    -12.26     -75.07  Peru     Huancavelica  Tayacaja
 6   6    -12.26     -75.07  Peru     Huancavelica  Tayacaja
 7   7    -12.24     -75.05  Peru     Huancavelica  Tayacaja
 8   8    -12.24     -75.05  Peru     Huancavelica  Tayacaja
 9   9    -12.08     -76.95  Peru     Lima          Lima
10  10    -12.08     -76.95  Peru     Lima          Lima
11  11    -12.03     -75.24  Peru     Junin         Huancayo
12  12    -11.13     -75.36  Peru     Junin         Chanchamayo
13  13    -10.58     -75.40  Peru     Pasco         Oxapampa
14  14     -9.10     -76.59  Peru     Huanuco       Huacaybamba
15  15     -5.89     -76.11  Peru     Loreto        Alto Amazonas
16  16     -3.80     -73.32  Peru     Loreto        Maynas
17  17     -3.80     -73.32  Peru     Loreto        Maynas
18  18     -3.80     -73.32  Peru     Loreto        Maynas
19  19     -3.80     -73.32  Peru     Loreto        Maynas
20  20     -3.80     -73.32  Peru     Loreto        Maynas

The score table can be viewed in the same way. In addition to the scores, it contains the total score per record (last column, Record.score) and per variable (row Attribute.score), as well as the maximum score per variable (row Rules.per.variable).

ps = profil$scores
recs = c(1:10, nrow(ps) - 1, nrow(ps))
cols = c(1:4, ncol(ps))
xtable(ps[recs, cols])
                         ID  Latitude  Longitude  Country  Record.score
1                      3.00      2.00       3.00     2.00         35.00
2                      3.00      2.00       3.00     2.00         35.00
3                      3.00      2.00       3.00     2.00         35.00
4                      3.00      2.00       3.00     2.00         35.00
5                      3.00      2.00       3.00     2.00         35.00
6                      3.00      2.00       3.00     2.00         35.00
7                      3.00      2.00       3.00     2.00         35.00
8                      3.00      2.00       3.00     2.00         35.00
9                      3.00      2.00       3.00     2.00         35.00
10                     3.00      2.00       3.00     2.00         31.00
Attribute.score     5259.00   3490.00    5243.00  3506.00      61055.00
Rules.per.variable     3.00      2.00       3.00     2.00         35.00
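
The figures in this excerpt are consistent with a simple counting scheme: each cell holds the number of rules a record passes for that variable, Record.score is the row total, Attribute.score is the column total, and Rules.per.variable is the number of rules defined per variable. For example, 3 rules times 1753 records gives 5259 for ID. The base R sketch below reproduces that arithmetic on a toy example. It is an assumption about how the scores add up, not the package's actual code, and `passed`, `rule_variable`, and `scores` are hypothetical names.

```r
# Toy results: rows are records, columns are rules (TRUE = rule passed)
passed <- cbind(
  ID_unique   = c(TRUE, TRUE, TRUE),
  ID_positive = c(TRUE, TRUE, TRUE),
  pH_min      = c(TRUE, FALSE, TRUE),
  pH_max      = c(TRUE, TRUE, TRUE)
)

# The variable each rule (column of `passed`) belongs to
rule_variable <- c("ID", "ID", "pH", "pH")

# Score per record and variable: number of that variable's rules passed
scores <- sapply(unique(rule_variable), function(v) {
  rowSums(passed[, rule_variable == v, drop = FALSE])
})

record_score       <- rowSums(scores)   # total per record
attribute_score    <- colSums(scores)   # total per variable
rules_per_variable <- table(rule_variable)[colnames(scores)]  # maximum per cell

print(scores)
print(record_score)
print(attribute_score)
```

A record that passes every rule reaches the sum of Rules.per.variable, so a lower Record.score flags a record worth inspecting.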

A last visualization is a heatmap of the score table, which groups similar records and similar rule profiles to help detect patterns.

[Figure: heatmap of the score table]
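
Grouping similar records and similar rule profiles can be sketched with hierarchical clustering in base R. The example below reorders a toy score matrix so that similar rows and columns end up next to each other; base R's heatmap() applies a comparable reordering by default. Whether the package's heatmap works exactly this way is an assumption, and the matrix `scores` is made up for illustration.

```r
# Toy score matrix: records in rows, variables in columns
scores <- rbind(
  r1 = c(ID = 3, pH = 2, Sand = 2),
  r2 = c(ID = 3, pH = 1, Sand = 2),
  r3 = c(ID = 3, pH = 2, Sand = 1),
  r4 = c(ID = 3, pH = 1, Sand = 2)
)

# Cluster records (rows) and variables (columns) by Euclidean distance
row_order <- hclust(dist(scores))$order
col_order <- hclust(dist(t(scores)))$order

# Reordered table: records with identical profiles (r2, r4) become neighbors
ordered_scores <- scores[row_order, col_order]
print(ordered_scores)

# The corresponding plot could be drawn with:
# heatmap(scores, scale = "none")
```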

Checking tables with data inconsistencies

For comparison, we deliberately introduce a few errors into our table, as shown below. We also exclude the rule on soil types (rule 33) for better display.

atable$P[1] = -100
atable$pH[11] = -200
srule1 = srules[-c(33), ]
profil = datadict.profile(atable, srule1)

To get a better handle on the data, it is always informative to review simple descriptive summaries. A custom summary function is included in the package to display such a summary in tabular form:

xtable(shortSummary(atable))
                   n  missing  unique  value       min       max    Mean      sd      .05      .10      .25      .50      .75      .90      .95
ID              1753        0    1753                1      1753     877  506.19     88.6    176.2    439.0    877.0   1315.0   1577.8   1665.4
Latitude        1737       16     168         -18.2976   -3.6159  -12.28    3.23  -18.182  -15.884  -15.833  -12.070  -11.130   -7.157   -5.894
Longitude       1737       16     169          -80.823  -69.0654   -74.6    2.71   -77.61   -76.95   -76.95   -75.35   -72.10   -70.03   -70.03
Country         1753        0       1   Peru
Adm1            1753        0      22
Adm2            1743       10      58
Adm3            1738       15     110
pH              1751        2     333             -200     10.46   6.361    5.12    4.100    4.510    5.200    7.100    7.600    7.900    8.185
Conductivity    1752        1     443             0.02      42.4   1.571    2.51     0.08     0.11     0.21     0.52     2.34     3.95     5.25
CaCO3           1752        1     190                0      94.8   1.722    6.69    0.000    0.000    0.000    0.000    0.380    3.393   10.100
Organic_matter  1752        1     370             0.03      50.9   2.336    3.71    0.290    0.500    0.890    1.500    2.400    4.210    6.918
P               1752        1     453             -100     503.7   19.36   25.57     2.80     3.70     6.00    13.20    22.30    46.48    58.90
Sand            1687       66      73                0       100   54.97   16.29     26.0     33.2     46.0     54.0     66.0     76.0     82.0
Lime            1686       67      58                0        74   29.01    9.82       12       18       24       28       35       40       44
Clay            1686       67      54                0        76   16.01   10.25        2        4        8       16       20       28       36
Soil_texture    1692       61      12
Altitude        1753        0     157            -9999      4417    1661  1940.9       78       78      235      839     3299     3846     3846
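
A summary of this kind can be approximated in base R. The sketch below computes the same statistics for a single numeric vector; `numeric_summary` is a hypothetical helper written for illustration and is not the package's shortSummary function. With made-up pH values it shows how an introduced error such as -200 stands out in the min column.

```r
# Descriptive summary of one numeric variable, ignoring missing values
numeric_summary <- function(x) {
  probs   <- c(.05, .10, .25, .50, .75, .90, .95)
  present <- x[!is.na(x)]
  c(
    n       = length(present),
    missing = sum(is.na(x)),
    unique  = length(unique(present)),
    min     = min(present),
    max     = max(present),
    mean    = mean(present),
    sd      = sd(present),
    setNames(quantile(present, probs, names = FALSE),
             c(".05", ".10", ".25", ".50", ".75", ".90", ".95"))
  )
}

# Made-up pH values with one missing value and one implausible entry
ph <- c(6.5, 7.1, NA, -200, 7.6)
s  <- numeric_summary(ph)
print(round(s, 3))
```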

A summary of the results by rule can be seen from the profil object:

xtable(profil$checks)
    Variable      Type       Rule                                     Comment         Execution  Error.sum  Error.list
 1  ID            integer    sapply(ID, is.integer)                   None            ok         0          none
 2  ID            integer    !duplicated(ID)                          None            ok         0          none
 3  ID            integer    ID > 0 & ID < 1754                       None            ok         0          none
 4  Latitude      numeric    sapply(Latitude, is.numeric)             None            ok         0          none
 5  Latitude      numeric    Latitude < 0                             None            ok         0          none
 6  Longitude     numeric    sapply(Longitude, is.numeric)            None            ok         0          none
 7  Longitude     numeric    Longitude < 180 & Longitude > -180       None            ok         0          none
 8  Longitude     numeric    is.null(Longitude) == is.null(Latitude)  None            ok         0          none
 9  Adm1          character  sapply(Adm1, is.character)               None            ok         0          none
10  Adm2          character  sapply(Adm2, is.character)               None            ok         0          none
11  Adm3          character  sapply(Adm3, is.character)               None            ok         0          none
12  Country       character  sapply(Country, is.character)            None            ok         0          none
13  Altitude      integer    sapply(Altitude, is.integer)             None            ok         0          none
14  Adm1          character  is.null(Adm1) == is.null(Longitude)                      ok         0          none
15  Adm2          character  is.null(Adm2) == is.null(Longitude)      None            ok         0          none
16  Adm3          character  is.null(Adm3) == is.null(Longitude)      None            ok         0          none
17  Country       character  is.null(Country) == is.null(Longitude)   None            ok         0          none
18  Altitude      integer    is.null(Altitude) == is.null(Longitude)  None            ok         0          none
19  pH            numeric    sapply(pH, is.numeric)                   None            ok         0          none
20  pH            numeric    pH >= 0                                  pH bigger than  ok         1          11
21  pH            numeric    pH <= 14                                 pH lesser than  ok         0          none
22  Conductivity  numeric    sapply(Conductivity, is.numeric)         None            ok         0          none
23  Conductivity  numeric    Conductivity >= 0                        None            ok         0          none
24  CaCO3         numeric    sapply(CaCO3, is.numeric)                None            ok         0          none
25  CaCO3         numeric    CaCO3 >= 0                               None            ok         0          none
26  Sand          numeric    sapply(Sand, is.numeric)                 None            ok         0          none
27  Sand          numeric    sapply(Sand, is.withinRange, 0, 100)     None            ok         0          none
28  Lime          numeric    sapply(Lime, is.numeric)                 None            ok         0          none
29  Lime          numeric    sapply(Lime, is.withinRange, 0, 100)     None            ok         0          none
30  Clay          numeric    sapply(Clay, is.numeric)                 None            ok         0          none
31  Clay          numeric    sapply(Clay, is.withinRange, 0, 100)     None            ok         0          none
32  Soil_texture  character  sapply(Soil_texture, is.character)       None            ok         0          none
34  P             numeric    sapply(P, is.numeric)                    None            ok         0          none
35  P             numeric    P >= 0                                   None            ok         1          1
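
Several rules above compare is.null() results for two columns. In base R, is.null() applied to an existing column returns a single FALSE regardless of missing values, so such a comparison yields one TRUE for the whole column. A per-record test of "both present or both missing" would typically use is.na() instead. The sketch below only illustrates this base R behavior with made-up values; how datacheck treats a length-one rule result is not covered here.

```r
# Made-up coordinates: record 2 lacks both values, record 3 lacks only Longitude
Latitude  <- c(-7.48, NA, -12.26)
Longitude <- c(-78.97, NA, NA)

# is.null() looks at the whole object, so this is a single TRUE
whole_column <- is.null(Longitude) == is.null(Latitude)
print(whole_column)

# is.na() works element-wise, so this flags record 3 as inconsistent
per_record <- is.na(Longitude) == is.na(Latitude)
print(per_record)
print(which(!per_record))
```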

The checks table lists, in its last column, all erroneous records for each rule. This list may be too long for printing. For this purpose, the helper function prep4rep displays only the first n erroneous records, where n = 5 is the default.

atable$Sand[20:30] = -1
profil = datadict.profile(atable, srule1)

xtable(prep4rep(profil$checks))
    Variable      Type       Rule                                     Comment         Execution  Error.sum  Error.list
 1  ID            integer    sapply(ID, is.integer)                   None            ok         0          none
 2  ID            integer    !duplicated(ID)                          None            ok         0          none
 3  ID            integer    ID > 0 & ID < 1754                       None            ok         0          none
 4  Latitude      numeric    sapply(Latitude, is.numeric)             None            ok         0          none
 5  Latitude      numeric    Latitude < 0                             None            ok         0          none
 6  Longitude     numeric    sapply(Longitude, is.numeric)            None            ok         0          none
 7  Longitude     numeric    Longitude < 180 & Longitude > -180       None            ok         0          none
 8  Longitude     numeric    is.null(Longitude) == is.null(Latitude)  None            ok         0          none
 9  Adm1          character  sapply(Adm1, is.character)               None            ok         0          none
10  Adm2          character  sapply(Adm2, is.character)               None            ok         0          none
11  Adm3          character  sapply(Adm3, is.character)               None            ok         0          none
12  Country       character  sapply(Country, is.character)            None            ok         0          none
13  Altitude      integer    sapply(Altitude, is.integer)             None            ok         0          none
14  Adm1          character  is.null(Adm1) == is.null(Longitude)                      ok         0          none
15  Adm2          character  is.null(Adm2) == is.null(Longitude)      None            ok         0          none
16  Adm3          character  is.null(Adm3) == is.null(Longitude)      None            ok         0          none
17  Country       character  is.null(Country) == is.null(Longitude)   None            ok         0          none
18  Altitude      integer    is.null(Altitude) == is.null(Longitude)  None            ok         0          none
19  pH            numeric    sapply(pH, is.numeric)                   None            ok         0          none
20  pH            numeric    pH >= 0                                  pH bigger than  ok         1          11
21  pH            numeric    pH <= 14                                 pH lesser than  ok         0          none
22  Conductivity  numeric    sapply(Conductivity, is.numeric)         None            ok         0          none
23  Conductivity  numeric    Conductivity >= 0                        None            ok         0          none
24  CaCO3         numeric    sapply(CaCO3, is.numeric)                None            ok         0          none
25  CaCO3         numeric    CaCO3 >= 0                               None            ok         0          none
26  Sand          numeric    sapply(Sand, is.numeric)                 None            ok         0          none
27  Sand          numeric    sapply(Sand, is.withinRange, 0, 100)     None            ok         11         20,21,22,23,24 … more
28  Lime          numeric    sapply(Lime, is.numeric)                 None            ok         0          none
29  Lime          numeric    sapply(Lime, is.withinRange, 0, 100)     None            ok         0          none
30  Clay          numeric    sapply(Clay, is.numeric)                 None            ok         0          none
31  Clay          numeric    sapply(Clay, is.withinRange, 0, 100)     None            ok         0          none
32  Soil_texture  character  sapply(Soil_texture, is.character)       None            ok         0          none
34  P             numeric    sapply(P, is.numeric)                    None            ok         0          none
35  P             numeric    P >= 0                                   None            ok         1          1
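
The shortened error list in the Sand row can be imitated with a few lines of base R. The helper `truncate_ids` below is hypothetical and only mimics the output format seen in the table; it is not the code of prep4rep.

```r
# Show at most n record numbers, followed by a marker if more exist
truncate_ids <- function(ids, n = 5) {
  if (length(ids) == 0) return("none")
  shown <- paste(head(ids, n), collapse = ",")
  if (length(ids) > n) shown <- paste(shown, "\u2026 more")
  shown
}

print(truncate_ids(20:30))       # eleven erroneous records, as for Sand
print(truncate_ids(11))          # a single erroneous record
print(truncate_ids(integer(0)))  # no erroneous records
```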

Using rules that can't be executed

This may happen if the syntax of a rule is wrong. Another reason, particularly when reusing rule files across tables, may be that a variable name is not present among the column names of the current table. The tool simply skips such a rule and reports a 'failed' execution. Let us modify an existing rule as below:

srule1$Variable[25] = "caCO3"
srule1$Rule[25] = "caCO3 >= 0"
profil = datadict.profile(atable, srule1)

Now let us just look at an excerpt of the results table:

xtable(prep4rep(profil$checks[20:30, ]))
    Variable      Type       Rule                                     Comment         Execution  Error.sum  Error.list
20  pH            numeric    pH >= 0                                  pH bigger than  ok         1          11
21  pH            numeric    pH <= 14                                 pH lesser than  ok         0          none
22  Conductivity  numeric    sapply(Conductivity, is.numeric)         None            ok         0          none
23  Conductivity  numeric    Conductivity >= 0                        None            ok         0          none
24  CaCO3         numeric    sapply(CaCO3, is.numeric)                None            ok         0          none
25  caCO3         numeric    caCO3 >= 0                               None            failed     0          NA
26  Sand          numeric    sapply(Sand, is.numeric)                 None            ok         0          none
27  Sand          numeric    sapply(Sand, is.withinRange, 0, 100)     None            ok         11         20,21,22,23,24 … more
28  Lime          numeric    sapply(Lime, is.numeric)                 None            ok         0          none
29  Lime          numeric    sapply(Lime, is.withinRange, 0, 100)     None            ok         0          none
30  Clay          numeric    sapply(Clay, is.numeric)                 None            ok         0          none
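
One way to obtain this forgiving behavior in plain R is to wrap the evaluation of each rule in tryCatch(), so that a misspelled variable or a syntax error yields a 'failed' status instead of an error. The following is a minimal sketch under that assumption. `run_rule` and `soil` are hypothetical names, and the package's own mechanism may differ.

```r
# Evaluate one rule; report "failed" instead of stopping on any error
run_rule <- function(rule_text, data) {
  tryCatch({
    # enclos = baseenv() keeps the lookup from falling back to unrelated
    # objects in the workspace when a column is missing
    ok <- eval(parse(text = rule_text), envir = data, enclos = baseenv())
    list(execution = "ok", errors = which(!ok))
  }, error = function(e) {
    list(execution = "failed", errors = NA)
  })
}

# Made-up table with a single column
soil <- data.frame(CaCO3 = c(0, 1.2, -3))

good   <- run_rule("CaCO3 >= 0", soil)  # runs; record 3 violates the rule
typo   <- run_rule("caCO3 >= 0", soil)  # no such column: failed
broken <- run_rule("CaCO3 >= ", soil)   # syntax error: failed

print(c(good$execution, typo$execution, broken$execution))
```

Because failures are recorded rather than raised, the same rule file can be applied to tables that contain only some of the variables it mentions.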

End of tutorial