‘vtreat’ is a package that prepares arbitrary data frames into clean data frames that are ready for analysis. A clean data frame:

To achieve this a number of techniques are used. Principally:

For more details, see the ‘vtreat’ article and its update.

The main pattern is to use ‘designTreatmentsC()’ or ‘designTreatmentsN()’ to design a treatment plan, and then to use the returned structure with ‘prepare()’ to apply the plan to data frames. The main feature of ‘vtreat’ is that all data preparation is “y-aware”: it uses the relations of the effective variables to the dependent (outcome) variable to encode the effective variables.
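To make “y-aware” concrete, here is a minimal base R sketch of an unsmoothed impact code: each level of a categorical variable is replaced by that level's mean outcome minus the overall mean outcome. This is a simplified illustration only, not the encoding ‘vtreat’ itself computes. The helper names ‘fit_impact_code()’ and ‘apply_impact_code()’ and the ‘NA_level’ label are hypothetical.

```r
# Simplified, unsmoothed impact coding in base R (illustration only;
# this is not vtreat's implementation).
dTrainN <- data.frame(x = c('a','a','a','a','b','b',NA),
                      y = c(0,0,0,1,0,1,1),
                      stringsAsFactors = FALSE)

# Hypothetical helper: learn one number per level, namely the level's
# mean outcome minus the overall mean outcome. NA is its own level.
fit_impact_code <- function(x, y) {
  lev <- ifelse(is.na(x), "NA_level", as.character(x))
  level_means <- vapply(split(y, lev), mean, numeric(1))
  level_means - mean(y)
}

# Hypothetical helper: look up the learned codes. Levels never seen
# during fitting get 0 (no information beyond the overall mean).
apply_impact_code <- function(codes, x) {
  lev <- ifelse(is.na(x), "NA_level", as.character(x))
  out <- unname(codes[lev])
  out[is.na(out)] <- 0
  out
}

codes <- fit_impact_code(dTrainN$x, dTrainN$y)
x_coded <- apply_impact_code(codes, c('a', 'b', 'c', NA))
print(x_coded)
```

The encoding is learned once from data that includes the outcome and can then be applied to frames that have no outcome column, which is the same design-then-apply split as ‘designTreatmentsN()’ followed by ‘prepare()’.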

The structure returned from ‘designTreatmentsN()’ or ‘designTreatmentsC()’ includes informational fields. The main fields are mostly vectors with names (all with the same names in the same order):

In addition to these vectors, ‘designTreatmentsC()’ and ‘designTreatmentsN()’ return a data frame named ‘scoreFrame’, which contains the columns:

- ‘varName’: name of the new variable.
- ‘origName’: name of the original variable that the new variable was derived from (can repeat).
- ‘varMoves’: logical, TRUE if the variable varied during training; only variables that move will be in the treated frame.
- ‘PRESSRsquared’: a PRESS held-out R-squared of a linear fit from each variable to the y-value. Scores of zero and below are very bad; scores near one are very good.
- ‘psig’: significance of the observed variable ‘PRESSRsquared’ value under an in-sample permutation test.
- ‘catPRSquared’: for categorical outcomes, a deviance-based pseudo-R-squared.
- ‘csig’: for categorical outcomes, significance of the observed variable ‘catPRSquared’ value under an in-sample permutation test.
- ‘sig’: ‘csig’ for categorical outcomes, ‘psig’ otherwise.
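One way to use these columns is to keep only the variables that move and whose ‘sig’ falls below a chosen cutoff. The sketch below uses base R on a hand-built data frame that copies a few ‘scoreFrame’ values from the categorical example in this document; the cutoff of 0.3 is an arbitrary assumption chosen for illustration, not a recommended value.

```r
# Hand-built stand-in for a scoreFrame (values copied from the
# categorical example's printed output in this document).
scoreFrame <- data.frame(
  varName  = c("z_clean", "z_isBAD"),
  origName = c("z", "z"),
  varMoves = c(TRUE, FALSE),
  sig      = c(0.2601608, 1.0000000),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff, chosen only so the toy example selects something.
sig_threshold <- 0.3

# Keep variables that varied during training and look significant.
keep <- scoreFrame$varName[scoreFrame$varMoves &
                           scoreFrame$sig < sig_threshold]
print(keep)
```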

In all cases we have two undesirable upward biases on the scores:

‘vtreat’ uses a number of cross-training and jackknife style procedures to try to mitigate these effects. The suggested best practice is (if you have enough data) to split your data randomly into at least the following disjoint data sets:

The idea is: taking the extra step to perform the ‘designTreatmentsC()’ or ‘designTreatmentsN()’ on data disjoint from training makes the training data more exchangeable with test and avoids the issue that ‘vtreat’ may be hiding a large number of degrees of freedom in variables it derives from large categoricals.
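A minimal base R sketch of such a random, disjoint split follows. The cal/train/test names, the 30/40/30 proportions, the seed, and the placeholder data are all assumptions for illustration. The ‘vtreat’ calls are shown only as comments, in the same form used in the examples in this document.

```r
set.seed(2015)  # arbitrary seed, only for reproducibility

# Placeholder data standing in for a real data set.
d <- data.frame(x = sample(c('a', 'b', 'c'), 100, replace = TRUE),
                z = rnorm(100),
                y = rnorm(100),
                stringsAsFactors = FALSE)

# Shuffle the row indices once, then cut the permutation into three
# consecutive blocks so the resulting sets cannot overlap.
n <- nrow(d)
perm <- sample.int(n)
n_cal <- floor(0.3 * n)
n_train <- floor(0.4 * n)
idx_cal <- perm[seq_len(n_cal)]
idx_train <- perm[n_cal + seq_len(n_train)]
idx_test <- perm[(n_cal + n_train + 1):n]

dCal <- d[idx_cal, , drop = FALSE]      # design the treatment plan here
dTrain <- d[idx_train, , drop = FALSE]  # fit the model here
dTest <- d[idx_test, , drop = FALSE]    # evaluate the model here

# With vtreat loaded, the plan would be designed on dCal only and then
# applied to the other two sets, for example:
# treatments <- designTreatmentsN(dCal, c('x', 'z'), 'y')
# dTrainTreated <- prepare(treatments, dTrain, pruneSig = c())
# dTestTreated <- prepare(treatments, dTest, pruneSig = c())
```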

A trivial execution example (not demonstrating any cal/train/test split) is given below. Variables that do not move during hold-out testing are considered “not to move.”

library(vtreat)
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
   z=c(1,2,3,4,NA,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
head(dTrainC)
##      x  z     y
## 1    a  1 FALSE
## 2    a  2 FALSE
## 3    a  3  TRUE
## 4    b  4 FALSE
## 5    b NA  TRUE
## 6 <NA>  6  TRUE
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestC)
##      x  z
## 1    a 10
## 2    b 20
## 3    c 30
## 4 <NA> NA
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
## [1] "desigining treatments Wed Oct  7 10:14:48 2015"
## [1] "design var x Wed Oct  7 10:14:48 2015"
## [1] "design var z Wed Oct  7 10:14:48 2015"
## [1] "scoring treatments Wed Oct  7 10:14:48 2015"
## [1] "WARNING skipped vars: x"
## [1] "have treatment plan Wed Oct  7 10:14:48 2015"
print(treatmentsC)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Bayesian Impact Code'('x'->character->'x_catB')"
## 
## $treatments[[2]]
## [1] "vtreat 'Scalable pass through'('z'->numeric->'z_clean')"
## 
## $treatments[[3]]
## [1] "vtreat 'is.bad'('z'->numeric->'z_isBAD')"
## 
## 
## $vars
## [1] "z_clean" "z_isBAD"
## 
## $varMoves
## z_clean z_isBAD 
##    TRUE   FALSE 
## 
## $sig
##   z_clean   z_isBAD 
## 0.2601608 1.0000000 
## 
## $scoreFrame
##   varName origName varMoves PRESSRsquared psig       sig catPRSquared
## 1 z_clean        z     TRUE    -0.8237958    1 0.2601608    0.1524329
## 2 z_isBAD        z    FALSE     0.0000000    1 1.0000000    0.0000000
##        csig
## 1 0.2601608
## 2 1.0000000
## 
## $nmMap
## $nmMap[[1]]
## $nmMap[[1]]$new
## [1] "x_catB"
## 
## $nmMap[[1]]$orig
## [1] "x"
## 
## 
## $nmMap[[2]]
## $nmMap[[2]]$new
## [1] "z_clean"
## 
## $nmMap[[2]]$orig
## [1] "z"
## 
## 
## $nmMap[[3]]
## $nmMap[[3]]$new
## [1] "z_isBAD"
## 
## $nmMap[[3]]$orig
## [1] "z"
## 
## 
## 
## $outcomename
## [1] "y"
## 
## $meanY
## [1] 0.5
## 
## $ndat
## [1] 6
## 
## $skippedVars
## [1] "x"
## 
## attr(,"class")
## [1] "treatmentplan"
print(treatmentsC$treatments[[1]])
## [1] "vtreat 'Bayesian Impact Code'('x'->character->'x_catB')"
dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=TRUE)
head(dTrainCTreated)
##         z_clean     y
## 1 -3.864865e-01 FALSE
## 2 -2.108108e-01 FALSE
## 3 -3.513514e-02  TRUE
## 4  1.405405e-01 FALSE
## 5 -2.220446e-16  TRUE
## 6  4.918919e-01  TRUE
varsC <- setdiff(colnames(dTrainCTreated),'y')
# all input variables should be mean 0
sapply(dTrainCTreated[,varsC,drop=FALSE],mean)
##      z_clean 
## -1.94289e-16
# all slopes should be 1
sapply(varsC,function(c) { lm(paste('y',c,sep='~'),
   data=dTrainCTreated)$coefficients[[2]]})
## z_clean 
##       1
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=c(),scale=TRUE)
head(dTestCTreated)
##         z_clean
## 1  4.918919e-01
## 2  4.918919e-01
## 3  4.918919e-01
## 4 -2.220446e-16
# numeric example
dTrainN <- data.frame(x=c('a','a','a','a','b','b',NA),
   z=c(1,2,3,4,5,NA,7),y=c(0,0,0,1,0,1,1))
head(dTrainN)
##   x  z y
## 1 a  1 0
## 2 a  2 0
## 3 a  3 0
## 4 a  4 1
## 5 b  5 0
## 6 b NA 1
dTestN <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestN)
##      x  z
## 1    a 10
## 2    b 20
## 3    c 30
## 4 <NA> NA
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
## [1] "desigining treatments Wed Oct  7 10:14:48 2015"
## [1] "design var x Wed Oct  7 10:14:48 2015"
## [1] "design var z Wed Oct  7 10:14:48 2015"
## [1] "scoring treatments Wed Oct  7 10:14:48 2015"
## [1] "have treatment plan Wed Oct  7 10:14:48 2015"
print(treatmentsN)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Scalable Impact Code'('x'->character->'x_catN')"
## 
## $treatments[[2]]
## [1] "vtreat 'Scalable pass through'('z'->numeric->'z_clean')"
## 
## $treatments[[3]]
## [1] "vtreat 'is.bad'('z'->numeric->'z_isBAD')"
## 
## 
## $vars
## [1] "x_catN"  "z_clean" "z_isBAD"
## 
## $varMoves
##  x_catN z_clean z_isBAD 
##    TRUE    TRUE   FALSE 
## 
## $sig
##  x_catN z_clean z_isBAD 
##       1       1       1 
## 
## $scoreFrame
##   varName origName varMoves PRESSRsquared psig sig
## 1  x_catN        x     TRUE    -0.7200931    1   1
## 2 z_clean        z     TRUE    -0.4545128    1   1
## 3 z_isBAD        z    FALSE     0.0000000    1   1
## 
## $nmMap
## $nmMap[[1]]
## $nmMap[[1]]$new
## [1] "x_catN"
## 
## $nmMap[[1]]$orig
## [1] "x"
## 
## 
## $nmMap[[2]]
## $nmMap[[2]]$new
## [1] "z_clean"
## 
## $nmMap[[2]]$orig
## [1] "z"
## 
## 
## $nmMap[[3]]
## $nmMap[[3]]$new
## [1] "z_isBAD"
## 
## $nmMap[[3]]$orig
## [1] "z"
## 
## 
## 
## $outcomename
## [1] "y"
## 
## $meanY
## [1] 0.4285714
## 
## $ndat
## [1] 7
## 
## $skippedVars
## character(0)
## 
## attr(,"class")
## [1] "treatmentplan"
dTrainNTreated <- prepare(treatmentsN,dTrainN,
                          pruneSig=c(),scale=TRUE)
head(dTrainNTreated)
##       x_catN     z_clean y
## 1 -0.1785714 -0.41904762 0
## 2 -0.1785714 -0.26190476 0
## 3 -0.1785714 -0.10476190 0
## 4 -0.1785714  0.05238095 1
## 5  0.2380952  0.20952381 0
## 6  0.2380952  0.00000000 1
varsN <- setdiff(colnames(dTrainNTreated),'y')
# all input variables should be mean 0
sapply(dTrainNTreated[,varsN,drop=FALSE],mean) 
##        x_catN       z_clean 
## -2.379437e-17  4.757324e-17
# all slopes should be 1
sapply(varsN,function(c) { lm(paste('y',c,sep='~'),
   data=dTrainNTreated)$coefficients[[2]]}) 
##  x_catN z_clean 
##       1       1
# prepared frame
dTestNTreated <- prepare(treatmentsN,dTestN,
                         pruneSig=c())
head(dTestNTreated)
##      x_catN  z_clean
## 1 0.0000000 7.000000
## 2 0.2380952 7.000000
## 3 0.2380952 7.000000
## 4 0.2380952 3.666667
# scaled prepared frame
dTestNTreatedS <- prepare(treatmentsN,dTestN,
                         pruneSig=c(),scale=TRUE)
head(dTestNTreatedS)
##       x_catN   z_clean
## 1 -0.1785714 0.5238095
## 2  0.2380952 0.5238095
## 3  0.2380952 0.5238095
## 4  0.2380952 0.0000000
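
The “mean 0” and “slope 1” checks above are what one would expect if each scaled variable is its centered value multiplied by the slope of a single-variable linear fit of y on that variable. The base R sketch below applies that rescaling by hand to the ‘z’ column of the numeric example, assuming the missing value is first replaced by the mean of the non-missing values. It is an illustration of the idea, not the package's code, though on this toy data it reproduces the ‘z_clean’ column printed for ‘dTrainNTreated’.

```r
# z and y columns from the numeric example above.
z_raw <- c(1, 2, 3, 4, 5, NA, 7)
y <- c(0, 0, 0, 1, 0, 1, 1)

# Assumption: replace the missing value by the mean of the others.
z_clean <- ifelse(is.na(z_raw), mean(z_raw, na.rm = TRUE), z_raw)

# Slope of the single-variable linear fit of y on z_clean.
slope <- coef(lm(y ~ z_clean))[[2]]

# Center the variable and express it in y-units.
z_scaled <- slope * (z_clean - mean(z_clean))

print(round(z_scaled, 8))
print(mean(z_scaled))                # approximately 0
print(coef(lm(y ~ z_scaled))[[2]])   # approximately 1
```

Putting every variable in y-units with slope 1 means the magnitude of a scaled column reflects how much it moves the outcome, rather than the units the original column happened to be recorded in.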