BEST is a Decision Tree algorithm that lets the user impose a precise ordering on the partitioning process. As a statistician, I believe the data should speak for itself as much as possible, but guiding the algorithm can be helpful if the data set contains few observations or if we would like to incorporate some external expert knowledge about the structure of the data.
Here we will show how to use this feature to produce a Decision Tree on a data set containing missing values.
To begin, let us generate a simple data set:
set.seed(100)
n <- 1000
# Four predictors: two Gaussian and two uniform
X1 <- rnorm(n, 0, sd = 1)
X2 <- rnorm(n, 2, sd = 2)
X3 <- runif(n, 0, 1)
X4 <- runif(n, -2, 2)
# The response depends on X1 when X4 < 0.5 and on X3 when X4 > 0.5
Y <- 1*(X1<0)*(X4<0.5)+0*(X1>0)*(X4<0.5)+1*(X3>0.5)*(X4>0.5)+0*(X3<0.5)*(X4>0.5)
# Add some noise by flipping 150 randomly chosen labels
RY <- sample(n, 150)
Y[RY] <- 1 - Y[RY]
Now, let us make one important predictor partly missing:
X3[X3>0.5] <- NA
Data <- cbind(X1,X2,X3,X4,as.factor(Y))
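Since \(X_3\) is uniform on \([0,1]\) and is set to missing whenever it exceeds 0.5, roughly half of its values are now missing. A quick check (not part of the original script):
mean(is.na(X3)) # proportion of missing values in X3; should be close to 0.5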
Now that we have our data set with missing values, let us use BEST. To begin, let’s create a dummy variable indicating whether \(X_3\) is missing. Then we use the ForgeVA function to build the list that will guide BEST through the data partitioning process:
X5 <- is.na(X3)*1
NewData <- cbind(Data[,1:4],X5,Data[,ncol(Data)])
Training <- NewData[1:800,]
Valid <- NewData[801:900,]
Testing <- NewData[901:1000,]
d <- ncol(NewData) - 1 # number of predictors
VA <- BESTree::ForgeVA(d,5,3)
Let us quickly examine what ForgeVA does, since it might be the most confusing part of this package. The first input is the number of predictors, the second is the location of the gating variable, and the third is the location of the variable with missing values. The list looks like this:
VA
#> [[1]]
#> [1] 1 1 0 1 1
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 0
#>
#> [[2]][[2]]
#> [1] 0 0 0 0 0
#>
#> [[2]][[3]]
#> [1] 0 0 0 0 0
#>
#>
#> [[3]]
#> [[3]][[1]]
#> [1] 0
#>
#> [[3]][[2]]
#> [1] 0 0 0 0 0
#>
#> [[3]][[3]]
#> [1] 0 0 0 0 0
#>
#>
#> [[4]]
#> [[4]][[1]]
#> [1] 0
#>
#> [[4]][[2]]
#> [1] 0 0 0 0 0
#>
#> [[4]][[3]]
#> [1] 0 0 0 0 0
#>
#>
#> [[5]]
#> [[5]][[1]]
#> [1] 0
#>
#> [[5]][[2]]
#> [1] 0 0 0 0 0
#>
#> [[5]][[3]]
#> [1] 0 0 0 0 0
#>
#>
#> [[6]]
#> [[6]][[1]]
#> [1] 0.5
#>
#> [[6]][[2]]
#> [1] 0 0 1 0 0
#>
#> [[6]][[3]]
#> [1] 0 0 0 0 0
Here the first element ([[1]]) lists the variables usable at the beginning: every variable except the ones with missing values. The elements at location [d+1] in the list then represent the gating abilities of the individual predictors. Note in [[5+1]] that for the branch \(X_5 < 0.5\) we add the predictor \(X_3\): the threshold value 0.5 is stored in [[6]][[1]] and the variable added on \(X_5 < 0.5\) is indicated in [[6]][[2]].
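To make this structure concrete, here is a small sketch (the object name VA_manual is ours, for illustration only) that rebuilds the same list by hand following the layout just described; in practice you would simply call ForgeVA:
# Manual sketch of the list returned by ForgeVA(d, 5, 3), matching the output above
VA_manual <- vector("list", d + 1)
VA_manual[[1]] <- c(1, 1, 0, 1, 1)          # variables usable at the root: all but X3
for (i in 1:d) {
  # default for each predictor: no gating ability (threshold 0, nothing added on either branch)
  VA_manual[[i + 1]] <- list(0, rep(0, d), rep(0, d))
}
VA_manual[[5 + 1]][[1]] <- 0.5              # gating threshold on X5
VA_manual[[5 + 1]][[2]][3] <- 1             # on the branch X5 < 0.5, make X3 available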
Finally, let’s run BEST on the training set, prune it using the validation set, and check its accuracy on the test set:
Fit <- BESTree::BEST(Training, 10, VA)          # fit BEST using the availability list VA
PTree <- BESTree::TreePruning(Fit, Valid)       # prune using the validation set
Fit[[1]] <- PTree                               # store the pruned tree back in the fit
preds <- BESTree::MPredict(Testing[, 1:d], Fit) # predictions on the test predictors
BESTree::Acc(preds, Testing[, d+1])             # test-set accuracy
#> [1] 0.89
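As a final, purely illustrative sketch (not part of the original analysis), we could predict a single made-up observation whose \(X_3\) is missing; because its missingness indicator \(X_5\) equals 1, the gated tree should never need to test \(X_3\) along its path. This assumes MPredict accepts a one-row matrix of predictors, just like the test matrix above:
NewObs <- matrix(c(0.3, 1.5, NA, -1, 1), nrow = 1) # X1, X2, X3 (missing), X4, X5 = 1
BESTree::MPredict(NewObs, Fit)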