Here we will examine how AdaSampling works on the Wisconsin Breast Cancer dataset, brca, from the UCI Machine Learning Repository, which is included as part of this package. For more information about the variables, try ?brca. This dataset contains nine features, with a tenth column containing the class labels, malignant or benign.
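If the package has not yet been attached, it can be loaded first (a minimal sketch, assuming the AdaSampling package is installed from CRAN):
library(AdaSampling)   # provides adaSample(), adaSvmBenchmark() and the brca dataset
data(brca)             # load the dataset into the workspace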
head(brca)
#> clt ucs uch mad ecs nuc chr ncl mit cla
#> 1 8 10 10 8 7 10 9 7 1 malignant
#> 2 5 3 3 3 2 3 4 4 1 malignant
#> 3 8 7 5 10 7 9 5 5 4 malignant
#> 4 7 4 6 4 6 1 4 3 1 malignant
#> 5 10 7 7 6 4 10 4 1 2 malignant
#> 6 7 3 2 10 5 10 5 4 4 malignant
First, clean up the dataset to transform it into the required format: a numeric feature matrix with named rows, and a numeric class label vector coded as 1 for malignant and 0 for benign.
brca.mat <- apply(X = brca[,-10], MARGIN = 2, FUN = as.numeric)
brca.cls <- sapply(X = brca$cla, FUN = function(x) {ifelse(x == "malignant", 1, 0)})
rownames(brca.mat) <- paste("p", 1:nrow(brca.mat), sep="_")
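A quick sanity check of the transformed objects (a sketch using base R only) confirms the dimensions and the row names used later to define the positive and negative sets:
dim(brca.mat)              # 683 samples by 9 numeric features
head(rownames(brca.mat))   # row names p_1, p_2, ... used later for Ps and Ns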
Examining the class distribution shows that benign samples (0) outnumber malignant samples (1) by roughly two to one.
table(brca.cls)
#> brca.cls
#> 0 1
#> 444 239
brca.cls
#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
#> [246] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [281] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [316] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [351] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [386] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [421] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [456] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [491] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [526] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [561] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [596] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [631] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [666] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
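The same information can be expressed as proportions (a base-R sketch):
prop.table(table(brca.cls))   # roughly 65% benign (0) vs 35% malignant (1)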
In order to demonstrate how AdaSampling mitigates the effect of noisy class labels, we introduce noise into this dataset by randomly flipping a selected number of class labels. More noise is added to the positive observations: 40% of positive labels are flipped to negative, while 30% of negative labels are flipped to positive.
set.seed(1)
pos <- which(brca.cls == 1)
neg <- which(brca.cls == 0)
brca.cls.noisy <- brca.cls
brca.cls.noisy[sample(pos, floor(length(pos) * 0.4))] <- 0
brca.cls.noisy[sample(neg, floor(length(neg) * 0.3))] <- 1
Examining the noisy class labels confirms that noise has been added:
table(brca.cls.noisy)
#> brca.cls.noisy
#> 0 1
#> 406 277
brca.cls.noisy
#> [1] 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 1
#> [36] 1 0 1 0 0 1 0 0 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0
#> [71] 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
#> [106] 0 1 0 1 0 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1
#> [141] 0 1 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 1 0 0 1 0 0 0 1 1 1 0 0 1 0 0
#> [176] 1 1 1 0 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 0 0
#> [211] 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 0 0 0
#> [246] 0 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 1
#> [281] 0 1 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 1 1 0 1 0
#> [316] 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1
#> [351] 0 1 0 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0
#> [386] 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1
#> [421] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0
#> [456] 0 1 1 0 1 0 0 1 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0
#> [491] 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1
#> [526] 0 0 0 1 0 0 1 1 1 1 1 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0
#> [561] 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 1
#> [596] 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0
#> [631] 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [666] 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 1 0
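Cross-tabulating the noisy labels against the originals shows exactly how many labels were flipped in each direction (a base-R sketch; the counts follow from the 40% and 30% flip rates used above):
table(truth = brca.cls, noisy = brca.cls.noisy)
# off-diagonal cells: 133 benign samples relabelled as malignant,
# and 95 malignant samples relabelled as benign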
We can now run AdaSampling on this data. For more information, see ?adaSample.
Ps <- rownames(brca.mat)[which(brca.cls.noisy == 1)]
Ns <- rownames(brca.mat)[which(brca.cls.noisy == 0)]
brca.preds <- adaSample(Ps, Ns, train.mat=brca.mat, test.mat=brca.mat,
classifier = "knn", C= 1, sampleFactor = 1)
head(brca.preds)
#> P N
#> p_1 1.0000000 0.0000000
#> p_2 0.6666667 0.3333333
#> p_3 1.0000000 0.0000000
#> p_4 0.6000000 0.4000000
#> p_5 0.8000000 0.2000000
#> p_6 1.0000000 0.0000000
accuracy <- sum(brca.cls.noisy == brca.cls) / length(brca.cls)
accuracy
#> [1] 0.6661786
accuracyWithAdaSample <- sum(ifelse(brca.preds[,"P"] > 0.5, 1, 0) == brca.cls) / length(brca.cls)
accuracyWithAdaSample
#> [1] 0.9502196
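The same comparison can be made as a confusion table rather than a single accuracy figure (a sketch; the 0.5 threshold mirrors the calculation above):
ada.pred <- ifelse(brca.preds[, "P"] > 0.5, 1, 0)   # hard class calls from the P column
table(truth = brca.cls, predicted = ada.pred)       # rows: true labels, columns: AdaSampling calls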
The resulting table gives the prediction probability of both a positive ("P") and a negative ("N") class label for each row of the test set. To see how effective adaSample() is at removing noise, and to compare its performance against learning without resampling, we use the adaSvmBenchmark() function.
This procedure compares classification performance across four conditions: first, the original dataset with correct label information; second, the noisy dataset without AdaSampling; third, the noisy dataset with AdaSampling; and fourth, AdaSampling applied multiple times in the form of an ensemble learning model. Results are reported as sensitivity (Se), specificity (Sp) and F1 score.
adaSvmBenchmark(data.mat = brca.mat, data.cls = brca.cls.noisy, data.cls.truth = brca.cls, cvSeed=1)
#> Se Sp F1
#> Original 0.971 0.971 0.959
#> Baseline 0.748 0.975 0.831
#> AdaSingle 0.987 0.943 0.944
#> AdaEnsemble 0.987 0.957 0.956
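An ensemble similar to the AdaEnsemble condition can also be obtained directly from adaSample() by raising the C argument, which according to ?adaSample sets the number of classifiers in the ensemble. The call below is a sketch reusing the objects defined earlier, with C = 10 chosen purely for illustration:
brca.preds.ens <- adaSample(Ps, Ns, train.mat = brca.mat, test.mat = brca.mat,
                            classifier = "knn", C = 10, sampleFactor = 1)
# accuracy of the ensemble's hard calls against the true labels
sum(ifelse(brca.preds.ens[, "P"] > 0.5, 1, 0) == brca.cls) / length(brca.cls)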